Hitachi Vantara Pentaho Community Forums

Thread: Text file input incorrectly matching fields and values

  1. #1

    Default Text file input incorrectly matching fields and values

    I must be missing something here - I haven't found other examples of folks having trouble with this, and this would have to be something Pentaho can do accurately. I am using "Get File Names" and passing the value to the "Text file input" step, which reads in that file and passes it along to a step that renames the fields.

    I've had this running for a month or so now, hourly, and every once in a while the data gets out of sync with the field headers. The values for "End Date" somehow end up in "Instructor Name", "Part Name" ends up in "Reference Number", and so on. It's just a few fields, but unfortunately those are the ones we use the most in downstream transformations. It should be a super straightforward transformation, but it's been so inconsistent I might just have to pull it out.

    Anyone experienced this, or have ideas? Thanks!

    [Attachment: pentaho.jpg - screenshot of the transformation and the error]
    ---
    Data Integration version 6.1.0.1-198
    Report Designer version 6.1.0.1-196

  2. #2
    Join Date
    Apr 2008
    Posts
    4,686

    Default

    Are you doing your own field separation, or are you letting Text File Input do it?

    I've had issues when I'm doing my own field separation and there's a string field with a comma in it.
    Unfortunately, the forum software resizes images, so I can't read your screenshot.
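    To illustrate the failure mode (a Python sketch, not Kettle itself): naive splitting on the separator breaks any quoted field that contains a comma, which shifts every later value one column to the right - exactly the "values out of sync with headers" symptom above.

    ```python
    import csv
    import io

    line = 'John,"Smith, Jr.",2017-05-03\n'

    # Naive separation: splitting on every comma breaks the quoted field
    # and produces 4 fields where there should be 3.
    naive = line.strip().split(",")
    print(naive)   # ['John', '"Smith', ' Jr."', '2017-05-03']

    # A quote-aware parser keeps the embedded comma inside its field.
    proper = next(csv.reader(io.StringIO(line)))
    print(proper)  # ['John', 'Smith, Jr.', '2017-05-03']
    ```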

  3. #3

    Default

    Oh bummer...well it just shows me looking at the input file as PDI gives me an error. You can see that it is looking at the value of a text field and saying it can't turn it into a date, which is totally non-surprising since it's not a date. Yes, I think you are right that it's likely a separation issue - I recalled that another integration also had issues with character escaping with data from this vendor. I'm going to explore that route - thank you for the reply!

  4. #4
    Join Date
    May 2016
    Posts
    277

    Default

    I hate being forced to use a comma as a separator, and I never choose it; there's almost always some problem with a field containing one, especially given that in Spanish the comma is used as the decimal separator. Using a ";" is not as bad, but if you have a free-text field you're going to hit it some day. I've always wondered how the comma became the standard for .csv files...

  5. #5

    Default

    I have the option of tab-delimited files too (why didn't I choose that in the first place?). So I'm going to switch everything over to that - hopefully that's safe, even though we do have several free text fields for descriptions and such...

  6. #6
    Join Date
    Apr 2008
    Posts
    4,686

    Default

    Another thing to look out for is that your supplier may be providing bad data (don't ask -- I've seen it), where they sometimes send two separators instead of one.

    If that's what you're dealing with (rather than doing the separation yourself), then your best bet is to set up error handling that sends someone an email to review the file, fix it, and resubmit.
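    A minimal sketch of that kind of safeguard (Python, not PDI; the column count and the function name are assumptions for illustration): pass rows with the expected field count through, and collect the rest for a review report.

    ```python
    import csv

    EXPECTED_FIELDS = 120  # hypothetical column count for this vendor feed

    def split_good_and_bad(path, expected=EXPECTED_FIELDS):
        """Pass rows with the right field count through; collect the rest
        (with their line numbers) for a review email to the supplier."""
        good, bad = [], []
        with open(path, newline="") as fh:
            for lineno, row in enumerate(csv.reader(fh), start=1):
                (good if len(row) == expected else bad).append((lineno, row))
        return good, bad
    ```

    In PDI itself the equivalent would be enabling error handling on the input step and routing the error rows to a step that notifies someone, as suggested above.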

  7. #7

    Default

    Yes! Check out this line from my source data (from a vendor) that is erroring in Pentaho's input step. I've highlighted some of the values in red (one at the beginning, one a little back from the end) that I think could be causing this:

    "","","",,"","","","",,"",""
    ,"Yes","","","","","",,"",,"2.0","Schultz, firstname","usernameADMIN",,"203340","203339","","System Defaults","",,"Electronic",,"79478","","Specialized Knowledge & Applications","",,"3701",,"","","120.00","","","","","","1","No",,"","","Cornerstone Administration Division","Yes","Yes","Yes","Visible","","","","","","",,,,"",,"",,"","79478",,"120.00",,"","","","","","","","","","","",,"","22","75","53","22","Y","02/08/2017 01:26 PM","0.00","​This course will focus on a deeper understanding of the food & beverage industry and vertical markets such as Dairy, and Produce.","05/03/2017 12:00 PM","1.75","79478-1902","9221","75","1","da513313-7b5d-44b2-b21d-00b6e5da2a64","RSM US LLP","","05/03/2017 05:30 AM","05/03/2017 10:15 AM","Approved","TMC","Food & Beverage Industry 201 (79478)","Session",,,,,,"No","","","No","No","0.0000","0.0000","DA513313-7B5D-44B2-B21D-00B6E5DA2A64","05/03/2017 12:00 PM",,,"Hyatt Regency O'Hare","79478-1902-1","05/03/2017 10:15 AM","CST"

    So sometimes they send "","", for empty values and sometimes they send ,,,,. Is that the problem? Now that I've done a little reading and actually understand the difference between csv and txt, I would prefer csv, but I don't trust this vendor to send reliable csv files. Txt seems to work better in practice. Either way I think you are right gutlez that I should set up error handling.
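    For what it's worth, a quick check with Python's csv module (a stand-in here, not Kettle's parser, but it follows the same general quoting rules) suggests the two styles of empty field parse identically:

    ```python
    import csv
    import io

    # Both vendor styles of "missing value" - quoted "" and bare nothing -
    # come out of a standard CSV parse as the same empty string.
    quoted = next(csv.reader(io.StringIO('"","",""\n')))
    bare = next(csv.reader(io.StringIO(',,\n')))
    print(quoted)  # ['', '', '']
    print(bare)    # ['', '', '']
    ```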
    Last edited by katieldouglass; 03-04-2017 at 01:58 PM.

  8. #8
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Kettle doesn't choke on the inconsistent representation of missing field values shown in your sample line.

    I recall two conventions regarding the use of quotes as field enclosures, beyond the formal requirement (i.e. when the field or record separator is part of a value):

    • Data type signaling: text values are quoted, numbers are not. This is helpful for the external representation of spreadsheet data, where the values of a single column need not all have the same type.
    • Null value signaling: Empty strings are quoted, null values are not. This is helpful to preserve the distinction made in a database even in a simple export file.

    A field value like "120.00" tells me that the first convention doesn't matter in your case.
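    The second convention can be sketched with a small hand-rolled tokenizer (Python, purely illustrative; a standard CSV reader discards the quoting information that this distinction depends on, and `parse_line` is a hypothetical helper, not a Kettle API):

    ```python
    import re

    # Matches either a quoted field (with "" as an escaped quote)
    # or a bare, unquoted field.
    FIELD = re.compile(r'"((?:[^"]|"")*)"|([^,]*)')

    def parse_line(line):
        """Quoted empty ("") becomes '', unquoted empty becomes None (NULL)."""
        fields = []
        pos = 0
        while pos <= len(line):
            m = FIELD.match(line, pos)
            quoted, bare = m.group(1), m.group(2)
            if quoted is not None:
                fields.append(quoted.replace('""', '"'))
            else:
                fields.append(None if bare == "" else bare)
            pos = m.end() + 1  # skip the separator
        return fields

    print(parse_line('"",abc,,"x,y"'))  # ['', 'abc', None, 'x,y']
    ```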

    With Kettle you must realize that there are two steps for processing CSV files: CSV-File-Input and Text-File-Input with different feature sets.
    CFI only reads a single file, but will handle field values containing line breaks.
    Generally, TFI offers more control over the way a CSV file is processed.

    Kettle relies on the same number of fields in every row (even with TFI).
    If it encounters text in a Number field, that's bad, too.
    You can help to improve data quality on the producer side by generating reports, as gutlez proposed.

    Never try to pull through with bad data.
    So long, and thanks for all the fish.

  9. #9

    Default

    Thanks marabu, that is interesting to know. What I'm not understanding is this: whether I use CSV input or Text file input, both "","" and ,,, are being interpreted as <null>. Is there a setting I am missing?

    If the "","" vs ,,, distinction is not my issue, I am still confused about what the issue could be. It's an intermittent problem, and changing anything in these input steps means I'm changing things downstream - perhaps becoming my own worst enemy...

  10. #10
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    There's a setting KETTLE_EMPTY_STRING_DIFFERS_FROM_NULL in kettle.properties; the default is N.
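    For reference, overriding it would look like this (an illustrative kettle.properties excerpt; the property name is from the post above, the comment is mine):

    ```properties
    # kettle.properties
    # Y = treat "" (empty string) and NULL as different values; default is N
    KETTLE_EMPTY_STRING_DIFFERS_FROM_NULL=Y
    ```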

    As to your troubles, with a thorough sanity check in place you will find the cause, I'm sure.
    Maybe we could tell you more if we could see a data sample leading to those jumbled fields.
    Last edited by marabu; 03-05-2017 at 04:59 PM.
