Hitachi Vantara Pentaho Community Forums

Thread: CSV file input reads too many rows when reading in parallel

  1. #1
    Join Date
    Aug 2011
    Posts
    5

    CSV file input reads too many rows when reading in parallel

    Hi,

    I have a transformation that simply reads a CSV file using CSV file input.
    When I have only one instance of this step running, I can see in the debug log that the total number of rows read is the same as the number of lines in the csv file.

    If I read in parallel and set the number of instances to 2, it reads a few extra lines, which are duplicates. The higher the number of instances, the more duplicate rows are produced. Is this the correct behavior, and do I need to filter the duplicated rows out?

    Thanks
    Stephane

  2. #2
    Join Date
    Aug 2011
    Posts
    5

    I'm using 4.2.0.

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Let's just say it depends on the CSV file. Also try 4.2.1 (just released on SourceForge) because I remember a bug getting fixed.

  4. #4
    Join Date
    Aug 2011
    Posts
    5

    Matt,

    I just tried 4.2.1 and I am getting the same error.
    My CSV file has 152132 lines, and CSV file input reads 153455 rows, as shown below:
    CSV file input.0 - Finished processing (I=76131, O=0, R=0, W=76130, U=0, E=0)
    CSV file input.0 - Finished processing (I=77325, O=0, R=0, W=77325, U=0, E=0)

    Is there something I can do on my side to fix this? Could you also elaborate on what you mean by "Let's just say it depends on the CSV file"?

    Thanks much
    Stephane
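The figures in this post can be cross-checked with a little arithmetic. The sketch below only adds up the W (written) counters from the two log lines quoted above and compares the sum with the reported line count; the variable names are illustrative and nothing here comes from Pentaho itself.

```python
# Numbers copied from the post above.
csv_lines = 152132              # lines in the CSV file, as reported
rows_written = [76130, 77325]   # W= counters of the two step copies

total_rows = sum(rows_written)
extra_rows = total_rows - csv_lines

print(f"total rows written: {total_rows}")
print(f"extra rows: {extra_rows}")
```

The sum matches the 153455 rows reported, which leaves 1323 rows more than the file has lines.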

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    We've seen problems when there are errors in the CSV file itself, like unclosed quotes, newlines in a data field, carriage returns, that sort of thing.
    Figuring it out usually takes a while. However, if you had a few sample lines and a reproduction case, it would be much simpler to pin down.
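One way to find candidate lines for a reproduction case is to scan the raw file for the symptoms listed above. The following is a minimal sketch, not part of Pentaho: the function name `find_suspect_lines` and the two heuristics (an odd number of enclosure characters on a physical line, or a carriage return that is not part of the line ending) are assumptions chosen for illustration.

```python
from typing import List, Tuple


def find_suspect_lines(data: bytes, enclosure: bytes = b'"') -> List[Tuple[int, str]]:
    """Return (line_number, reason) pairs for physical lines that look risky.

    Hypothetical helper: it flags lines with an odd number of enclosure
    characters (unclosed quote, or a quoted field spanning several lines)
    and lines that contain a carriage return before the line ending.
    """
    suspects = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        # Ignore a trailing \r so Windows line endings are not flagged.
        body = line[:-1] if line.endswith(b"\r") else line
        if body.count(enclosure) % 2 == 1:
            suspects.append((number, "odd number of enclosure characters"))
        if b"\r" in body:
            suspects.append((number, "carriage return inside the line"))
    return suspects


# Small made-up sample: line 3 has an unclosed quote, line 4 a stray \r.
sample = b'id,agent\n1,"ok"\n2,"broken\n3,a\rb\n'
for line_number, reason in find_suspect_lines(sample):
    print(line_number, reason)
```

For a real file, the bytes could be read with `open(path, "rb").read()`. The flagged lines are only candidates; a line can pass both checks and still cause trouble.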

  6. #6
    Join Date
    Aug 2011
    Posts
    5

    You are right. I removed the user agent field from the file and everything works.
    Is there a quick solution to this, such as changing the data type from string to blob?
    In the meantime, I will try to determine which cases trigger this.
    Thanks

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Certain causes, like newlines in data fields, are known and indicated in the latest UI, but others... <sigh> text files... don't get me started.

  8. #8
    Join Date
    Aug 2011
    Posts
    5

    There is another showstopper, perhaps related.
    Reading a file with the following content using CSV file input

    Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17 Skyfire/2.0 operator="WIND Home"

    will spit out
    ozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17 Skyfire/2.0 operator="WIND Home


    The first character is truncated and the last one is missing. It seems to happen when double quotes are in the string.

    Any workaround you can think of?

    This happens with both 4.2.0 and 4.2.1
    Thanks

  9. #9
    Join Date
    Nov 1999
    Posts
    9,729

    It's likely not a CSV file. You probably need to be careful parsing this.
    There is a RegEx example in samples/transformations that can parse weblog files.
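To illustrate the regex approach outside of Pentaho, here is a minimal Python sketch that parses the sample line from post #8 without treating the double quotes as a CSV enclosure. It is not the sample transformation mentioned above; the pattern, the group names `agent` and `operator`, and the function name are assumptions made for this example.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: everything up to an optional trailing operator="..."
# is the agent string; the quoted value is captured without its quotes.
UA_PATTERN = re.compile(r'^(?P<agent>.*?)(?:\s+operator="(?P<operator>[^"]*)")?$')


def parse_user_agent_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Split one line into the agent part and the optional operator value."""
    match = UA_PATTERN.match(line.rstrip("\r\n"))
    return match.groupdict() if match else None


SAMPLE = (
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) "
    "AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17 "
    'Skyfire/2.0 operator="WIND Home"'
)

parsed = parse_user_agent_line(SAMPLE)
print(parsed["agent"])
print(parsed["operator"])
```

Because the quotes are matched explicitly by the pattern rather than interpreted as field enclosures, the leading "M" stays in place and the operator value comes through complete.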
