Hitachi Vantara Pentaho Community Forums

Thread: File output with "force the enclosure" still does not work in 4.3

  1. #1

    Is this true? I can't seem to make it work. I have some fields that are '<empty>'.

    Also, what is "Disable the enclosure fix"?

    Thanks

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Sad, but it's true: The Text Output (CSV) step currently is not implemented correctly.

    You can avoid trouble by choosing unambiguous meta characters for separating and quoting values.

    "Force the enclosure around fields" should rather read "Always quote string fields", because that is what it does.
    This CSV feature was meant to help early spreadsheet programs distinguish numeric values from string values.

    Usually, if a field value contains the separator character, it gets quoted.
    If the field value also contains a quote character, the embedded quote characters are doubled.
    "Disable enclosure fix" deactivates this escape mechanism. It is ignored if "Force the enclosure" is checked.
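
    A minimal sketch of the quoting rules described above, in Python. It is not PDI's actual code: the function name `quote_field` and its parameters are hypothetical, and quoting a value that contains only the enclosure character (without the separator) is an assumption borrowed from common CSV practice.

```python
def quote_field(value, separator=",", enclosure='"', force=False, is_string=True):
    """Hypothetical helper illustrating the quoting rules from the post."""
    # "Force the enclosure" only affects string fields, as described above.
    forced = force and is_string
    # Assumption: a value containing the enclosure itself is also quoted.
    needs_quotes = forced or separator in value or enclosure in value
    if not needs_quotes:
        return value
    # Embedded enclosure characters are doubled before wrapping the value.
    return enclosure + value.replace(enclosure, enclosure * 2) + enclosure


print(quote_field("plain"))                 # left unquoted
print(quote_field("a,b"))                   # contains the separator
print(quote_field('say "hi", ok'))          # embedded quotes get doubled
print(quote_field("plain", force=True))     # forced for a string field
print(quote_field("42", force=True, is_string=False))  # numbers stay bare
```
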
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Claiming that "The Text Output (CSV) step currently is not implemented correctly" suggests that there is a correct way to generate CSV files but that's unfortunately not the case.

    The little standard that there is was written up on pages like this one:
    http://en.wikipedia.org/wiki/Comma-separated_values
    It suggests that the "proper" way (and really there is no proper way) to quote fields is to only quote String fields. This makes the most sense too since the other fields don't need quoting at all.

    So in the end what it comes down to is that IF some stupid developer were to change the current behavior, thousands of ETL developers would start screaming at that developer for changing a perfectly fine output format. And that means that you need to add new options to keep compatible behavior.

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Matt, with all due respect for your wonderful work on Kettle:

    Quote Originally Posted by MattCasters View Post
    Claiming that "The Text Output (CSV) step currently is not implemented correctly" suggests that there is a correct way to generate CSV files but that's unfortunately not the case.

    I am pretty old by current IT standards. When I started working in IT, with chain printers and punch card readers, comma-separated values were already there. Since it's a common concept, I never really missed a formal specification issued by an international standards body. I saw each and every CSV feature arrive, and each fell into place quite naturally. With the advent of localized number formats it became necessary to quote numbers, too, since the decimal separator in some countries is a comma. Because Java and Kettle use localization (thanks for that), it is a mistake to disregard Number fields when testing for embedded meta characters (the comma in this case). That's what I meant by "incorrectly implemented".
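
    To illustrate the localization point, here is a small Python sketch. The helper `format_decimal_comma` is hypothetical and merely stands in for a locale-aware number format that uses a decimal comma; it is not how Kettle formats numbers.

```python
import csv
import io


def format_decimal_comma(number, digits=2):
    """Hypothetical stand-in for a locale that uses ',' as decimal separator."""
    return f"{number:.{digits}f}".replace(".", ",")


amount = format_decimal_comma(1234.5)

# Written without quoting, the number splits into two fields on reading.
unquoted = ",".join(["widget", amount])
rows = list(csv.reader(io.StringIO(unquoted)))
print(unquoted, "->", rows[0])

# A writer that checks every field for the separator quotes the number.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerow(["widget", amount])
quoted = buffer.getvalue().strip()
print(quoted, "->", next(csv.reader(io.StringIO(quoted))))
```
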

    Quote Originally Posted by MattCasters View Post
    The little standard that there is was written up on pages like this one: http://en.wikipedia.org/wiki/Comma-separated_values
    Besides oral history I often refer to the documents from Robert Lynch (2001) and Paul Hsieh.

    Quote Originally Posted by MattCasters View Post
    It suggests that the "proper" way (and really there is no proper way) to quote fields is to only quote String fields. This makes the most sense too since the other fields don't need quoting at all.
    I object. See my remarks on localization, above.

    Quote Originally Posted by MattCasters View Post
    So in the end what it comes down to is that IF some stupid developer were to change the current behavior, thousands of ETL developers would start screaming at that developer for changing a perfectly fine output format. And that means that you need to add new options to keep compatible behavior.
    We are not very likely to witness further changes in the CSV format (I prefer to talk about "Delimited Text" format, CSV being a special case). This format has now been quite stable for years. You will only have to accommodate if some major player breaks the rules (like MS did).

    I don't think you should introduce new features into the CSV handling steps, but a small code change could make PDI even friendlier to the international user.
    I will open a Jira if nobody else does.

    All the best to you.

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    You have your viewpoint and experience. From my viewpoint, I have seen every possible messed-up text file format pass by.
    But you are wrong to argue it with me. Create a JIRA case with a reproduction case, ask for improvements, don't just complain about it in here. At least don't complain about being disappointed when you don't file a JIRA case and then don't see a change.

    Just remember that there are a LOT of people out there depending on a certain backward compatible behavior. So you can expect us to possibly add options or change default settings for the Text File Output step but please don't expect a change in the behavior of the current options.

    So thanks in advance for the JIRA case & the nice discussion.

    And... good luck with the punch cards ;-)

  6. #6

    Guys

    I'm not producing a CSV file. I need a tab-delimited file with double quotes surrounding text/string fields, and the step does this only partially: whenever a field is null or empty, it does not add quotes to that field. I like every property on the Text Output step dialog, but the implementation does not work as it is supposed to. By checking the box next to "force the enclosure", I expect all output fields to be surrounded with the selected enclosure character.

    I think this is the only thing that does not work in the Text Output step.

    I'm somewhat old. I remember starting up an IBM 360 with a typewriter instead of a monitor.
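
    A short Python sketch of the difference being described. The function `tab_row` and its `quote_empty` flag are hypothetical; they only contrast the output expected here with the output reported for null or empty fields, and do not reproduce PDI's code.

```python
def tab_row(values, enclosure='"', quote_empty=True):
    """Hypothetical helper: build one tab-delimited row of string fields."""
    cells = []
    for value in values:
        text = "" if value is None else str(value)
        if text == "" and not quote_empty:
            # Reported behavior: null/empty fields are written without quotes.
            cells.append("")
            continue
        cells.append(enclosure + text.replace(enclosure, enclosure * 2) + enclosure)
    return "\t".join(cells)


expected = tab_row(["abc", "", None])                     # every field enclosed
observed = tab_row(["abc", "", None], quote_empty=False)  # empties left bare
print(repr(expected))
print(repr(observed))
```
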

    @Matt

    Do you want me to open a Jira case?

    Thanks

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Yes, add a feature request for a checkbox to quote all fields, thereby making it clear that the other option only quotes String fields. <sigh>

  8. #8
    Join Date
    Jun 2012
    Posts
    5,534

    For your ongoing entertainment, here is an excerpt from the work of Robert Lynch:

    -------------------------------------------------------------------
    EMPTY DATA CONVENTIONS
    -------------------------------------------------------------------
    When data is 'empty' (either the empty string, or the numeric
    value '0', or FALSE if boolean), you have the option of either
    writing out:

    ""

    or writing nothing at all. Therefore, it is common to see CSV files
    that look like both of these examples:

    "","Thos.","","Aquinus","Esq.","Pros.forPope","","Somewhere..."

    ,Thos.,,Aquinus,Esq,Pros.forPope,,Somewhere...

    They're both equivalent, and do not violate the spirit of the CSV
    standard.
    https://svn.osgeo.org/metacrs/sr.org...eotiff/csv.txt
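
    As a quick check of that equivalence, Python's standard `csv` module (used here only as one example of a CSV reader, with shortened sample rows) parses both conventions to the same values:

```python
import csv
import io

# Shortened sample rows in the two styles from the excerpt above.
quoted_style = '"","Thos.","","Aquinus"'
bare_style = ",Thos.,,Aquinus"

quoted_fields = next(csv.reader(io.StringIO(quoted_style)))
bare_fields = next(csv.reader(io.StringIO(bare_style)))

print(quoted_fields)
print(bare_fields)
print(quoted_fields == bare_fields)
```
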

  9. #9

    I agree with you. I think my problem only concerns empty or null fields, even if they are defined as string fields in the step.

  10. #10

    @Matt

    There is one there already: http://jira.pentaho.com/browse/PDI-1605

    Do you want me to create a new one for 4.3? I think version 4.3 is on the right track with the check box for "Force the enclosure around fields?". It just needs to be implemented.

    Let me know

    Thanks

  11. #11
    Join Date
    Jun 2012
    Posts
    5,534

    Quote Originally Posted by MattCasters View Post
    So thanks in advance for the JIRA case & the nice discussion.
    You're welcome: http://jira.pentaho.com/browse/PDI-8348

    Quote Originally Posted by MattCasters View Post
    And... good luck with the punch cards ;-)
    You tried really hard to hit the bull's-eye with a single shot, then.
    For obvious reasons.

  12. #12
    Join Date
    Oct 2006
    Posts
    9

    Just replying to an old thread, because I am still having trouble writing (imho) correct CSV files.

    I'll stick to this standard:
    https://www.ietf.org/rfc/rfc4180.txt

    It looks like the Text File Output step (when writing CSV) does not enclose strings that contain line endings....
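
    For comparison, RFC 4180 says that fields containing line breaks should be enclosed in double quotes. A small Python sketch with the standard `csv` module (not PDI, and with made-up sample values) shows the output one would expect:

```python
import csv
import io

row = ["line one\nline two", "x"]  # made-up sample values

buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerow(row)
text = buffer.getvalue()
print(repr(text))  # the field with the embedded line break is enclosed

# Reading the text back yields a single record with the line break intact.
records = list(csv.reader(io.StringIO(text, newline="")))
print(records)
```
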
