Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Find first non-null value scanning rows backwards from current row

  1. #1
    Join Date
    Jan 2016
    Posts
    14

    Question [Solved] Find first non-null value scanning rows backwards from current row

    Intro

    Hi there, this would be my first post on Pentaho Forums. I've been using PDI for some time now but the issue I'm facing now hasn't been yet approached by me in Kettle before. Actually, I haven't even stumbled upon such a thing in the past.

    Software


    I'm using Pentaho Data Integration 5.4

    Input data & explanation

    Input data from a file (simplified, there are more columns):

    Code:
        number       name
        1009       ProductA
        2150       ProductB
        3235       ProductC
                   ProductD
                   ProductE
        1234       ProductF
        7765       ProductG
        4566       ProductH
                   ProductI
        9907       ProductJ
    The issue is that I had an Excel file format xlsx which has the data with merged cells, and for one value of id there are 1..n rows of values.

    After converting that file to csv values for next rows (other than first) are missing, despite the one column which was not merged (see example id=3, id=6).

    I'm generating a sequence using step Add sequence, the input is sorted the way it was originally stored in a file.

    Steps to achieve the goal

    Basically what I need to do is:


    1. Find first non-null value that has sequence_number less than current_row.sequence_number
    2. Concatenate the value from field name to that matching row
    3. Keep scanning next rows with sequence_number higher than the last scanned


    As stated before, there can be 1..n rows of values for such case.

    Expected output

    Code:
        number       name
        1009       ProductA
        2150       ProductB
        3235       ProductC; ProductD; ProductE
        1234       ProductF
        7765       ProductG
        4566       ProductH; ProductI
        9907       ProductJ

    My approach


    I believe I'm able to do this in a loop, by using Analytic Query and calculating LAG(1) and then concatenating the column name for one row with null values and discarding other column values from null row - and then doing this in a loop (for like 20 times assuming this is maximum), but I do consider this a bad idea.

    There are probably better ways to achieve this result using for example Modified Java Script Value step with scanning the rows backward from current (based on sequence number), but I'm unaware of those functions, if they do exist.

    How can I achieve this using Modified Java Script Value step, or any other efficient way without using a loop for entire content of the file until there are no empty rows?

    As an additional question,
    Is there any place where there are docummented special functions for Java Script step that Pentaho uses? It'd probably be a lot easier if only I knew what I can do with existing functions whose existance I'm unaware of for now.
    Last edited by Ogrey; 01-18-2016 at 05:47 PM. Reason: Marking the thread as solved for future readers

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Don't look back, carry on instead ...
    Try the Repeat field option in Text-File-Input or MS-Excel-Input.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    My Approach:
    1) Leave file as Microsoft XLSX
    2) Using Microsoft Excel input with the Apache POI (Not the POI Streaming!) configuration, accept the columns as described. Set "Number" to have the attribute "Repeat" to "Y"
    3) Use Sort Rows on "Number"
    4) Use Group By to group on "Number", with the function "Concatenate strings with" ";"

    Three steps, final output:
    Code:
    number Names
    1009 ProductA
    1234 ProductF
    2150 ProductB
    3235 ProductC;ProductD;ProductE
    4566 ProductH;ProductI
    7765 ProductG
    9907 ProductJ
    @marabu ... Great minds...

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    @gutlez: He who uses few words will finish first
    So long, and thanks for all the fish.

  5. #5
    Join Date
    Jan 2016
    Posts
    14

    Default

    Thank you guys both for pointing me into the right direction to solve the case
    I have more than 1 column like number that has been merged in Excel, so below is the explanation for future readers.

    Here's how I approached this:
    1) Microsoft Excel Input (Apache POI)
    1.1) In Fields tab I've marked column Repeat as "Y" for every column but "Names"
    2) Sort rows - sorted my rows by columns "Number" and "Names" in that order. This way I'd get a valid sorting result for Group by step
    3) Group by
    3.1) In Group fields section I've put every column but "Names"
    3.2) In Aggregates section I've put "Names" as a Subject and for Type I've chosen "Concatenate strings separated by ,"

    Working as expected.

    I've also found another approach, but it requires more overhead, using Modified Java Script Value with if statement checking whether all columns but "Names" are not null and if it's true assigning values into variables declared for each column and later using them in Sort rows step and Group by step.

    Again, the second method requires more calculation, but I thought of mentioning it just for the purpose of learning :-)

    Edit: Is there a way to specify a different concatenation delimiter than "," ?
    Last edited by Ogrey; 01-18-2016 at 05:45 PM. Reason: adding a question

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by Ogrey View Post
    I've also found another approach, but it requires more overhead, using Modified Java Script Value with if statement checking whether all columns but "Names" are not null and if it's true assigning values into variables declared for each column and later using them in Sort rows step and Group by step.
    That way will be slower / less efficient...

    Quote Originally Posted by Ogrey View Post
    Edit: Is there a way to specify a different concatenation delimiter than "," ?
    Depending on the PDI version, you can select "Concatenate strings with" as the option (rather than "Concatenate strings with ,") and in the value column put a different character (like ";")

  7. #7
    Join Date
    Jan 2016
    Posts
    14

    Default

    Quote Originally Posted by gutlez View Post
    That way will be slower / less efficient...
    I've mentioned that :-) Wrote about it only to show that there is a different approach.

    Quote Originally Posted by gutlez View Post
    Depending on the PDI version, you can select "Concatenate strings with" as the option (rather than "Concatenate strings with ,") and in the value column put a different character (like ";")
    Somehow I missed that. Thanks.
    Last edited by Ogrey; 01-18-2016 at 07:06 PM.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.