Hitachi Vantara Pentaho Community Forums
Results 1 to 2 of 2

Thread: Filling gaps in stream from previous row on specific fields only.

  1. #1
    Join Date
    Dec 2015

    Question Filling gaps in stream from previous row on specific fields only.

    Pentaho 5.3 EE

    Hi All,
    I am trying to figure out a solution but really struggling to resolve...

    I have a stream of data originally read from an excel spreadsheet. There are multiple spread sheets, I have the column meta data stored in a database so I know for each file which fields I am expecting to receive.
    I am able to validate and read the spreadsheet, which now needs converting to a flat (delimited) file.
    Some columns in the spreadsheet are NULL values.
    As part of the conversion process, I need to deal with NULL values.

    Some of these NULL values will be converted to blank values, others, defined in the database meta data tables, will be populated using values from the previous row.

    Example input stream:
    my_ref mycol1 mycol2 mycol2
    1 0001 A B C
    2 NULL D E F
    3 0005 G NULL H
    4 0007 I J NULL

    Example after processing:
    my_ref mycol1 mycol2 mycol2
    1 0001 A B C
    2 0001 D E F
    3 0005 G H
    4 0007 I J 0
    my_ref column is defined to be copied forward / populated from previous row.
    So, my_ref on row 2, would be set to the value from the previous row (0001).
    There could n number of rows in this state.
    There could be n number of columns in a row requiring to be copied forward.

    NULL values in mycol1 / mycol2 / mycol3 are not defined to be copied forward and so will be converted to blank values (depending on data type).

    I have found examples where all null values are populated from previous row values, but its the specific fields nature of my problem causing the complication.

    My thoughts on a solution transformation:
    • For each row, look at each field and retrieve its field name (normalise each row to get field name / type / value?).
    • If value is NULL, Use the field name to lookup the meta data to see if this field should be populated from previous row.
    • If row is to be populated from previous, determine previous row value.
    • If not, check data type and set to 0 / blank string etc depending on data type.

    I am not sure I can use the analytic query step as I do not know until run time what columns I will receive and so cannot set the steps meta data.
    Also, as the key identifer (my_ref) may be NULL, it is difficult to group the records.
    I don't think the analytic query step supports metadata injection?

    I am not sure I can use a javascript step as I need to do the lookup.

    I can retrieve the column meta before hand, so I know what fields I will receive and which should be populated from previous, but need to find a way to relate it to the incoming data in order to do the comparison.

    Hopefully the above makes sense...


  2. #2
    Join Date
    Jun 2012


    Some hints:

    • MS-Excel-Input sports a "repeat" feature per field, which you can use to eliminate missing key values.
    • MS-Excel-Input supports the injection interface, so you can configure its field list from your metadata catalog at runtime.
    • If you can't use default constraints at the receiving end (e.g. database table), you can use a User-Defined-Java-Class to iterate the fields of a row to provide defaults as governed by your metadata catalog.
    • Alternatively, while the If-Field-Value-Null step doesn't support injection, you still can modify a transformation via XSLT or Java.
    So long, and thanks for all the fish.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.