Hitachi Vantara Pentaho Community Forums

Thread: "Ranged" Stream Lookup

  1. #1

    Default "Ranged" Stream Lookup

    I am attempting to process a text file that comes to us from an external source.

    This file is a CSV file, but it has a hierarchical structure (think of nested XML, but without the XML bit).

    This means that each input line may have a different number of fields; however, the first two fields are always a line counter followed by a line type identifier.

    I have managed to come up with a scheme using a file input and a set of row filters to split this data into various streams. All good so far.

    Now the structure of the file is essentially a header record followed by a series of detail records related to the header, then a trailer record, then another header record and more detail records...

    What I want to end up with is a set of records of the detail data with one of the header fields added.

    So the input is somewhat like this...
    Code:
    0001, HEADER, XXXX
    0002, DTL, Field2, Field3, Field4
    0003, DTL, Field2, Field3, Field4
    0004, DTL, Field2, Field3, Field4
    0005, FOOTER
    0006, HEADER, YYYY
    0007, DTL, Field2, Field3, Field4
    0008, DTL, Field2, Field3, Field4
    0009, FOOTER
    and what I am hoping to end up with is...
    Code:
    XXXX, Field2, Field3, Field4
    XXXX, Field2, Field3, Field4
    XXXX, Field2, Field3, Field4
    YYYY, Field2, Field3, Field4
    YYYY, Field2, Field3, Field4
    I created a filter that gave me the line number and header value like so
    Code:
    0001, XXXX
    0006, YYYY
    and I was hoping to use this to look up the header value for each detail record based on the line number. The header value for any given detail line would be the value from that stream with the highest line number that is still lower than the detail record's line number.

    This is essentially the same as the way the VLOOKUP function in Excel works in its default (approximate match) mode.
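
    To make the rule concrete, here is a minimal plain-JavaScript sketch of the lookup I mean (the headers array and the lookupHeader function are just illustrative names, not PDI steps):
    Code:
    // Header lookup stream, sorted ascending by line number.
    var headers = [
        [1, 'XXXX'],
        [6, 'YYYY']
    ];

    // Return the header whose line number is the highest one still below
    // lineNo -- the same rule as VLOOKUP's approximate match.
    function lookupHeader(headers, lineNo) {
        var result = null;
        for (var i = 0; i < headers.length; i++) {
            if (headers[i][0] < lineNo) {
                result = headers[i][1];
            } else {
                break;
            }
        }
        return result;
    }

    // lookupHeader(headers, 3) returns 'XXXX'
    // lookupHeader(headers, 7) returns 'YYYY'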

    Can anyone suggest a way to achieve this?

    Thanks,
    Michael

  2. #2


    Not that easy, I'm afraid... one of the basic restrictions of PDI is that all rows flowing over one hop have to be of the same format.

    And no ranged lookup exists.

    The only way I can currently come up with is to use a Text File Input step reading each line as one big field, and then a JavaScript step to cut out the field after HEADER: skip the header line, parse the regular rows in JavaScript, and append the HEADER field to each of them.
    And this assumes you always have the same number of FieldX fields to output.
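
    Something like this might do it in the JavaScript step (a rough sketch only; I'm assuming the Text File Input step delivers each whole line in a single field called line, and that you expose the parsed variables as output fields):
    Code:
    // Old Rhino has no String.trim(), so strip whitespace by hand.
    function clean(s) { return s.replace(/^\s+|\s+$/g, ''); }

    // currentHeader keeps its value between rows: the script is compiled
    // once and re-run per row, so the bare "var" does not reset it.
    var currentHeader;

    var parts = ('' + line).split(',');   // coerce to a JS string first
    var rowType = clean(parts[1]);        // second field is the row type

    if (rowType == 'HEADER') {
        currentHeader = clean(parts[2]);  // remember the header value
    }

    // For DTL rows, expose currentHeader plus these as output fields;
    // HEADER and FOOTER rows get filtered out downstream.
    var field2 = clean(parts[2] || '');
    var field3 = clean(parts[3] || '');
    var field4 = clean(parts[4] || '');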

    Alternatively, after 3.1-GA there will be a new CSV plugin that supports a "varying" number of fields per line, but even then you would need a piece of JavaScript to tag the HEADER field onto the output.

    Regards,
    Sven

  3. #3


    Hmm, I've already got it handling a different number of fields per line. It works because I use a filter on the row type in the second column to split the different line types into different streams for processing.

    I need to read up on the JavaScript capabilities, because that is the only other way I could think of doing it (I'm new to this, and something in the JavaScript example made me think this might work).

    BTW, the number of fields for each specific type of row is fixed and does not change.


    Cheers

  4. #4


    I have solved my problem for now.

    The basic idea is to load the file as CSV with enough columns to cover the line type with the highest number of fields in the file. Fortunately (for me), this seems to handle shorter lines as well without complaining.

    The result is then passed through a JavaScript step that uses a variable to keep track of the value from the most recent header record (identified in the JavaScript by the row type in the second field). This value is added as a new column to the output.

    Code:
    // currentHeader keeps its value between rows: the script is compiled
    // once and re-run per row, so the bare "var" does not reset it.
    var currentHeader;

    // Seed a placeholder until the first HEADER row arrives.
    if (currentHeader == null) {
        currentHeader = 'DUMMY';
    }

    // Field_001 is the row type (second column) and Field_002 the header
    // value (third column); the CSV input here numbers fields from Field_000.
    if (Field_001 == 'HEADER') {
        currentHeader = Field_002;
    }
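    (The new column itself comes from listing currentHeader in the JavaScript step's output fields grid; the script only maintains the variable between rows.)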
    From there I pass the output through a series of filter steps that split out the different row types based on the record type field, and use a "Select Values" step to grab the correct number of columns from each stream.

    Seems to work well. I still have to do something else with the data after this initial splitting phase, but hopefully that will be a bit simpler now that I have the needed data on every row.


    Cheers.

  5. #5
    DEinspanjer Guest


    I have an idea of a way you might be able to do it without a JS step. Would you be able to attach whatever transformations you currently have parsing it into different streams, so I could try it out without having to rebuild what you've already done?

  6. #6


    Sample code attached. Very small and simple but demonstrates what I'm doing.
    Attached Files

  7. #7
    DEinspanjer Guest


    What do you think of this method of doing it?
    The main trick is that I use a filtered Cartesian join to insert the correct record header id into each of the detail records.
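
    In plain terms, the join-plus-filter amounts to this logic (an illustrative JavaScript sketch of what the steps compute, not the transformation itself):
    Code:
    // Pair every detail row with every header row (the Cartesian join),
    // keep only pairs where the header precedes the detail, and of those
    // take the closest (highest-numbered) preceding header.
    var headers = [[1, 'XXXX'], [6, 'YYYY']];
    var details = [[2, 'a'], [3, 'b'], [4, 'c'], [7, 'd'], [8, 'e']];

    var output = [];
    for (var d = 0; d < details.length; d++) {
        var best = null;
        for (var h = 0; h < headers.length; h++) {
            if (headers[h][0] < details[d][0] &&
                (best == null || headers[h][0] > best[0])) {
                best = headers[h];
            }
        }
        output.push([best[1], details[d][1]]);
    }
    // output: XXXX,a  XXXX,b  XXXX,c  YYYY,d  YYYY,e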

    Might be worth throwing a very large file at the two methods and seeing which performs better.

    NOTE: I used steps that are only available in Kettle 3.1-RC1 (namely the Switch/Case step). If you had to run on 3.0.x instead, you could swap that Switch/Case for a pair of cascading filter steps.
    Attached Files
    Last edited by DEinspanjer; 08-21-2008 at 07:41 AM.
