Hitachi Vantara Pentaho Community Forums

Thread: Exception reading line using NIO

  1. #1
    Join Date
    Sep 2008
    Posts
    4

    Default Exception reading line using NIO

    Hello all,

    Version 3.1.0 RC1, Build 771

    I am receiving the error described in this defect:

    http://jira.pentaho.com/browse/PDI-1519

    "
    Exception reading line using NIO"

    This occurs with a simple CSV Input step. I have not been able to pinpoint the exact record causing the error; running subsets of the data works fine. My data file is rather large, but I can zip it up and send it to anyone who would like an example of when this happens.

    I really like the product, and any guidance working through this would be greatly appreciated. Thanks in advance.
    Rich Herman
    Atlanta, GA

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Try one of these latest builds and confirm or deny in the JIRA case that your problem is fixed.

    ftp://download.pentaho.org/client/da...gration/3.1.0/

  3. #3
    Join Date
    Sep 2008
    Posts
    4

    Default

    Quote Originally Posted by MattCasters
    Try one of these latest builds and confirm or deny in the JIRA case that your problem is fixed.

    ftp://download.pentaho.org/client/da...gration/3.1.0/
    Thanks for the quick reply, I'll test it as soon as I can.
    Rich Herman
    Atlanta, GA

  4. #4
    Join Date
    Sep 2008
    Posts
    4

    Default

    After downloading build 818 I am now getting this error:
    Code:
    Unexpected error : 
    org.pentaho.di.core.exception.KettleFileException: 
    
    Exception reading line using NIO
    1000
    1000
    at org.pentaho.di.trans.steps.csvinput.CsvInput.readOneRow(CsvInput.java:607)
    at org.pentaho.di.trans.steps.csvinput.CsvInput.processRow(CsvInput.java:126)
    at org.pentaho.di.trans.step.BaseStep.runStepThread(BaseStep.java:2655)
    at org.pentaho.di.trans.steps.csvinput.CsvInput.run(CsvInput.java:691)
    Caused by: java.lang.ArrayIndexOutOfBoundsException: 1000
    at org.pentaho.di.trans.steps.csvinput.CsvInput.readOneRow(CsvInput.java:575)
    ... 3 more
    I can now reproduce this error consistently; if I change the NIO buffer size to a smaller value (say 1000), the error happens within the first 1600 rows. I made sure the encoding was set correctly and that there were proper line feed characters at the end of each row. The file does have a lot of columns (105), and I wonder if that may be causing the problem in some way.

    I am attaching the job, and the data file if anyone cares to repro the problem.

    Thanks in advance.

    Attached Files
    Rich Herman
    Atlanta, GA

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    The number of columns is never the issue. I bet it's a missing enclosure or something like that.
    Debugging these can be a serious pain.

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    By the way, you really should file a bug report with that information.

  7. #7
    Join Date
    Sep 2008
    Posts
    4

    Default

    Quote Originally Posted by MattCasters
    By the way, you really should file a bug report with that information.
    Thanks, I submitted it.
    Rich Herman
    Atlanta, GA

  8. #8
    DEinspanjer Guest

    Default

    I spent a little time looking at this and here is what I found:

    There is a bug in the CSV code. It was introduced in revision 8605, which was a fix for PDI-1519. With files of a certain size (I tried with a very small file and was unable to reproduce the error), records that end in an extra delimiter can cause the error you are seeing.

    I spent a little time trying to debug it but honestly, the code where the problem is happening is a little confusing and I don't have enough time to do it justice. I hope that maybe Matt or cboyden can take a quick look at it and wipe it out before 3.1 GA.

    You have a simple workaround though: either get rid of the extra delimiter at the end of all your records (you define only 105 fields, but since you have 105 delimiters, you technically have 106 fields), or define another field in your CSV input step named UNUSED or something.
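
    If cleaning the file is easier than defining a 106th field, a throwaway preprocessing pass can strip that trailing delimiter before the CSV input step ever sees it. This is only a sketch, not something that ships with Kettle; the file names and the comma delimiter are assumptions, so adjust them to match your data:
    Code:
    import java.io.*;

    // Illustrative only: copies input.csv to output.csv, dropping one trailing delimiter per record.
    public class StripTrailingDelimiter {
        public static void main(String[] args) throws IOException {
            String delimiter = ",";  // assumption: replace with the delimiter your file actually uses
            try (BufferedReader in = new BufferedReader(new FileReader("input.csv"));
                 PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output.csv")))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.endsWith(delimiter)) {
                        // Remove the extra delimiter so the record has exactly the 105 defined fields.
                        line = line.substring(0, line.length() - delimiter.length());
                    }
                    out.println(line);
                }
            }
        }
    }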

    Thanks for including the test case that got us this far. I'm also adding some notes to the bug you entered, http://jira.pentaho.com/browse/PDI-1715

  9. #9
    Join Date
    May 2006
    Posts
    4,882

    Default

    It definitely is funky code... and there's no code fix for it yet, either.

    In 3.1 there's going to be an extra CSV plugin step which will be able to read the specific file attached above.

    Regards,
    Sven

  10. #10

    Default

    I'm seeing this same error. I tried the suggested workaround of adding an "unused" field on the end, and I still got the error.

    The file that I'm working with has 16 million rows and is 5GB in size. The error occurs 13,222,397 rows into the file.

    I tried to read the file with a text input step instead of CSV input and I got an OutOfMemoryError at the same row.

    I tried to use PDI (Spoon) to extract just the problem row (and a few rows around it). I changed the delimiter so that I could read the entire row as one field, then added a sequence number and a filter on the sequence number. This crashed on the same row.

    Any other ideas on workarounds? I'm wondering if I should try writing a Perl script to extract the problem lines. Unfortunately this is a Windows box, so I'd have to install Perl.
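
    (As an aside: since a JRE is already on the box for PDI, a small byte-level extractor in Java would do the same job as a Perl script without installing anything. The sketch below is illustrative only; the row numbers and file names are placeholders.)
    Code:
    import java.io.*;

    // Illustrative: copies lines startLine..endLine (1-based) of a large file to extract.txt
    // by counting newline bytes, so it never has to hold a whole (possibly enormous) line in memory.
    public class ExtractLines {
        public static void main(String[] args) throws IOException {
            long startLine = 13222390L, endLine = 13222400L;  // example range around the problem row
            long lineNo = 1;
            try (InputStream in = new BufferedInputStream(new FileInputStream("bigfile.csv"));
                 OutputStream out = new BufferedOutputStream(new FileOutputStream("extract.txt"))) {
                int b;
                while ((b = in.read()) != -1 && lineNo <= endLine) {
                    if (lineNo >= startLine) {
                        out.write(b);
                    }
                    if (b == '\n') {
                        lineNo++;
                    }
                }
            }
        }
    }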

    --Jeff Wright

  11. #11
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Are you sure it's not some file encoding problem? It seems very strange that a Text File Input would run out of memory since it uses next to no memory.

  12. #12

    Default

    Quote Originally Posted by MattCasters
    Are you sure it's not some file encoding problem? It seems very strange that a Text File Input would run out of memory since it uses next to no memory.
    The transform with the Text File Input step was feeding a Select Values step and then two parallel sets of sort/output steps. Each sort was configured to use 40% of memory.

    For what it's worth, I'm running with -Xmx1024m, so perhaps the remaining 20% of the heap (roughly 200 MB) was not enough for Spoon and the other steps.

    I'm wondering if all the end of line characters disappear after 13 million lines or something silly like that.

    --Jeff Wright

  13. #13
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    If you don't have any hope of sorting the total data set in memory, just set the sort size (number of rows) to something reasonable like 100,000.
    That way you are certain to never run out of memory.

    To get the line from the data file, you can just do a

    Code:
    head -13222398 file.txt | tail -1
    That's if the file has a header; otherwise use one less. Windows users would need to install Cygwin first.

    Matt

  14. #14
    DEinspanjer Guest

    Default

    Nope, I don't think it is likely there would be some arbitrary limit where things started going bad for just the input steps. I deal with multi-million record files every day. The sort steps will certainly cause you some trouble though and you'll need to tune them appropriately.

    As far as getting an extract, you could easily do that via a sed one-liner:

    Code:
    sed -ne '13222350,13222450 p' [myfile.txt] > [extract.txt]

    If you are on Windows, you could download CSVed, a great little program for viewing delimited files. It uses a chunking memory mapping technique so it can load even very large files and not hang.

  15. #15

    Default

    Quote Originally Posted by DEinspanjer
    Nope, I don't think it is likely there would be some arbitrary limit where things started going bad for just the input steps.
    I meant I'm wondering if there is some corruption in my input data where the EOL characters disappear at a certain point.

    Thanks for the pointer about CSVed; I'd never heard of it before, and yes, I'm on Windows. It looks like a neat tool, but it won't open my file. It says the file is empty. Huh. Windows says the file is 5.03GB.

    I've also been unsuccessful at getting a custom Java program to read the file.

    By the way, in my experience all the Unix commands except Perl have line-length limitations built in, so you can't just use head or sed on files with arbitrarily long lines. But maybe that's been fixed in modern Linuxes.

    --Jeff Wright

  16. #16

    Default

    Turns out that partway through record number 13,222,397, the rest of my file is filled with NULL characters.

    I wound up writing a Java program to process it character by character. My first Java program used BufferedReader's readLine() method, and that threw an OutOfMemoryError while looking for the end of line.
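
    (Jeff's program isn't attached, but a character-by-character scan along those lines might look roughly like the sketch below; the file name and the run-length threshold are made up for illustration.)
    Code:
    import java.io.*;

    // Illustrative sketch: reads a file byte by byte and reports long runs of NUL (0x00) bytes.
    public class FindNulRuns {
        public static void main(String[] args) throws IOException {
            final long THRESHOLD = 1024;  // arbitrary: only report runs at least this long
            long offset = 0, runStart = -1, runLength = 0;
            try (InputStream in = new BufferedInputStream(new FileInputStream("bigfile.csv"))) {
                int b;
                while ((b = in.read()) != -1) {
                    if (b == 0) {
                        if (runLength == 0) runStart = offset;
                        runLength++;
                    } else {
                        if (runLength >= THRESHOLD) {
                            System.out.println("NUL run of " + runLength + " bytes at offset " + runStart);
                        }
                        runLength = 0;
                    }
                    offset++;
                }
                if (runLength >= THRESHOLD) {
                    System.out.println("NUL run of " + runLength + " bytes at offset " + runStart + " (runs to EOF)");
                }
            }
        }
    }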

    When reading with the Kettle CSV file input step I saw the same error message as described in the start of this thread.

    --Jeff Wright

  17. #17
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    The files we're testing with are typically larger than 2^31-1 bytes.
    Either way, there is no limit to the size of a file as far as Java is concerned.

    On Linux, most filesystems (ext3, ReiserFS, etc.) support very large files (into the TB range).
    On Windows, FAT32 has a 4GB size limit [ (2^32)-1 bytes ]; on NTFS that limit has been lifted for all practical purposes as well.
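
    (Purely as an illustration of that point: Java reports file sizes as a 64-bit long, so a file above 2^31-1 bytes is only a problem for the filesystem, not for the language. The file name below is a placeholder.)
    Code:
    import java.io.File;

    // Illustrative: File.length() returns a long, so a 5GB file reports its size correctly.
    public class FileSizeCheck {
        public static void main(String[] args) {
            File f = new File(args.length > 0 ? args[0] : "bigfile.csv");  // path is just an example
            System.out.println(f.getAbsolutePath() + " is " + f.length() + " bytes");
        }
    }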

    Cygwin is free (GPL) and installs in a few minutes. It's really worth doing if you're on Windows.

    Daniel, thanks for the "sed" trick, I knew there had to be a better way :-)

    Matt
