Hitachi Vantara Pentaho Community Forums

Thread: Handling errors in a badly formatted CSV file

  1. #1
    Join Date
    Feb 2007
    Posts
    9

    Handling errors in a badly formatted CSV file

    Hello,

    I am using PDI 4.2.1 on Debian Linux with the Sun JRE (Java 1.6).

    I have two very badly formatted CSV files.

    The first file is tab-separated, with no enclosure characters. I know that each line begins with a date and ends with two tab characters.
    Sometimes a newline character (\n) appears in the middle of a field.
    Right now I parse the file with a Text file input step and declare the first field as a Date. When the step hits one of these stray \n characters, the first field of the next line is no longer a date. This raises an error, and I just ignore it.
    How can I remove these stray \n characters so that I get all the data from my files?

    The second file is also tab-separated, with no enclosure characters. I know that each line begins with an integer.
    Sometimes a backslash followed by a newline (\ and then \n) appears in the middle of a field.
    Right now I parse the file with a Text file input step, with the Escape field set to \ and the first field declared as an Integer. When the step hits one of these sequences, the first field of the next line is no longer an integer. This raises an error, and I just ignore it.
    How can I remove these backslash-newline sequences?

    For these cases, should I use the User Defined Java Class step to read the file? Can I extend the Text file input step inside a User Defined Java Class step?

    Thanks

  2. #2
    Join Date
    Feb 2007
    Posts
    9

    OK, I found the solution.

    I used the Text file input step with the fixed file type, so that each whole line is read into a single field (named "line").

    Then I added a User Defined Java Class step with this code:
    // Text buffered from earlier lines that ended with a backslash.
    private String line_prev;

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        Object[] r = getRow();
        if (r == null) {
            // No more input. Note: text still buffered in line_prev is not emitted.
            setOutputDone();
            return false;
        }

        // Read the raw line from the input field "line".
        String line = get(Fields.In, "line").getString(r);
        if (line == null) {
            line = "";
        }

        if (line.endsWith("\\")) {
            // Continuation: drop the trailing backslash, buffer the text, emit nothing yet.
            line = line.substring(0, line.length() - 1);
            if (line_prev == null)
                line_prev = line;
            else
                line_prev = line_prev.concat(line);
        } else {
            if (line_prev != null) {
                // Prepend the buffered text to complete the record.
                get(Fields.Out, "line").setValue(r, line_prev + line);
                line_prev = null;
            }

            // Send the row on to the next step.
            putRow(data.outputRowMeta, r);
        }

        return true;
    }
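
The joining logic of the step code can also be tried outside PDI. What follows is only a minimal standalone sketch in plain Java, not PDI code: the class name `ContinuationJoiner` and its `join` method are hypothetical names made up for illustration. It assumes the physical lines have already been read into a list, and, unlike the step code, it also emits text that is still buffered when the input ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDI): joins every physical line that ends
// with a backslash to the line that follows it.
class ContinuationJoiner {

    static List<String> join(List<String> physicalLines) {
        List<String> records = new ArrayList<String>();
        StringBuilder pending = new StringBuilder();

        for (String line : physicalLines) {
            if (line.endsWith("\\")) {
                // Continuation: drop the trailing backslash and keep buffering.
                pending.append(line, 0, line.length() - 1);
            } else {
                // Complete record: buffered text (possibly empty) plus this line.
                records.add(pending.append(line).toString());
                pending.setLength(0);
            }
        }

        // Assumption: a dangling continuation at end of input is still worth keeping.
        if (pending.length() > 0) {
            records.add(pending.toString());
        }
        return records;
    }
}
```

For example, the two physical lines `1<TAB>foo\` and `bar<TAB>baz` come out as the single record `1<TAB>foobar<TAB>baz`.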

    Finally, I used the Split Fields step to split the line into the individual CSV fields.
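
The code above handles the second file (backslash before the newline). For the first file, where every record ends with two tab characters, a similar idea might work: keep appending physical lines until the buffered text ends with two tabs. The sketch below is plain Java under that assumption, not something tested in PDI; `TwoTabRecordJoiner`, `join`, and `fields` are hypothetical names. It would misjoin a record if a stray newline happened to fall directly after two consecutive tabs inside the record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDI): rebuilds records that are known to
// end with two tab characters but may contain stray newlines.
class TwoTabRecordJoiner {

    static List<String> join(List<String> physicalLines) {
        List<String> records = new ArrayList<String>();
        StringBuilder pending = new StringBuilder();

        for (String line : physicalLines) {
            pending.append(line);
            int n = pending.length();
            // The record is complete once the buffered text ends with two tabs.
            if (n >= 2 && pending.charAt(n - 1) == '\t' && pending.charAt(n - 2) == '\t') {
                records.add(pending.toString());
                pending.setLength(0);
            }
        }

        // Keep an incomplete trailing record rather than dropping it silently.
        if (pending.length() > 0) {
            records.add(pending.toString());
        }
        return records;
    }

    // Splits a record on tabs; the -1 limit keeps trailing empty fields.
    static String[] fields(String record) {
        return record.split("\t", -1);
    }
}
```

Here the stray newline is simply dropped, so the two halves of the broken field are glued together with nothing between them. If a space is preferred, it could be appended before each continuation line instead.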

    Thanks
