Hitachi Vantara Pentaho Community Forums

Thread: How to clear result rows inside a transformation?

  1. #1
    Join Date
    Aug 2016
    Posts
    290

    Default How to clear result rows inside a transformation?

    Let's say that I need to generate/handle/validate some data (file names, dates, times) in separate steps before passing it to the core ETL transformation.

    The job has the following steps/transformations:

    Start --> T1 (Generate Result Rows) --> T2 (Validate Result Rows) --> core ETL --> Success

    T1 generates one or more result rows.
    T2 validates the result rows, say on date-time. Only rows that pass validation are copied back to the result rows.
    The core ETL then uses the result rows.

    The problem is that if no rows pass validation in T2, the core ETL receives all of the result rows from T1 instead of T2's (empty) output! How can one avoid this? How do you delete all result rows based on logic inside a transformation?

    Result rows are an enigma: highly useful, but hardly documented at all and very difficult to debug. It's a mystery to me how result rows work, which steps/jobs they pass between, and when the list is appended to or replaced.

    I mean, result rows are the only way to pass a list between job entries, yet it is confusing to work out how they behave and what the scope of the list is. Why hasn't this been documented?
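
    One way to at least see what's in them is to run the job from the Kettle Java API and print whatever ends up in the result rows. A rough sketch (assuming PDI 8.x; the .kjb path is made up):

        import java.util.Arrays;
        import java.util.List;
        import org.pentaho.di.core.KettleEnvironment;
        import org.pentaho.di.core.Result;
        import org.pentaho.di.core.RowMetaAndData;
        import org.pentaho.di.job.Job;
        import org.pentaho.di.job.JobMeta;

        public class DumpResultRows {
            public static void main(String[] args) throws Exception {
                KettleEnvironment.init();

                JobMeta jobMeta = new JobMeta("validate_job.kjb", null); // hypothetical path
                Job job = new Job(null, jobMeta);
                job.start();
                job.waitUntilFinished();

                // The Result object is what carries result rows between job entries.
                Result result = job.getResult();
                List<RowMetaAndData> rows = result.getRows();
                System.out.println("Result rows after job: " + rows.size());
                for (RowMetaAndData row : rows) {
                    System.out.println(Arrays.toString(row.getData()));
                }
            }
        }
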
    Last edited by Sparkles; 06-08-2018 at 06:57 AM.

  2. #2
    Join Date
    Aug 2016
    Posts
    290

    Default

    To answer myself:

    It is impossible to clear the result rows inside a transformation. Either the transformation outputs new result rows (in which case the previous result rows are replaced), or it passes the existing result rows through untouched if no new ones are created. To decide whether the existing result rows should be used, I had to set a variable inside the transformation (pass rows: yes/no), then in the parent job use a Simple evaluation entry to control the flow of the result rows into subsequent transformations. Not the most elegant solution!
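
    Expressed against the Kettle Java API, the check looks roughly like this (a sketch only: the PASS_ROWS variable and the file name are my own convention, and I'm assuming the Set Variables step in T2 makes the variable visible to the caller):

        import org.pentaho.di.core.KettleEnvironment;
        import org.pentaho.di.core.Result;
        import org.pentaho.di.trans.Trans;
        import org.pentaho.di.trans.TransMeta;

        public class PassRowsCheck {
            public static void main(String[] args) throws Exception {
                KettleEnvironment.init();

                // T2 ends with a Set Variables step that sets PASS_ROWS to Y or N.
                Trans t2 = new Trans(new TransMeta("T2_validate.ktr"));
                t2.execute(null);
                t2.waitUntilFinished();

                Result result = t2.getResult();
                if ("Y".equals(t2.getVariable("PASS_ROWS"))) {
                    System.out.println("Pass " + result.getRows().size() + " rows to the core ETL");
                } else {
                    System.out.println("Skip the core ETL: result rows are the stale list from T1");
                }
            }
        }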

  3. #3
    Join Date
    Apr 2012
    Posts
    253

    Default

    You need to use an intermediate file or database table. That's the only solution I've found, and for large datasets it's probably the best one.
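
    Sketched with plain JDBC (inside PDI you would of course use Table Output / Table Input steps instead; the staging table and the in-memory H2 database are just placeholders):

        import java.sql.Connection;
        import java.sql.Date;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class StagingTableSketch {
            public static void main(String[] args) throws Exception {
                try (Connection con = DriverManager.getConnection("jdbc:h2:mem:etl")) {
                    try (Statement st = con.createStatement()) {
                        // T2 equivalent: persist only the rows that passed validation.
                        st.execute("CREATE TABLE staging_validated(file_name VARCHAR(255), load_date DATE)");
                    }
                    try (PreparedStatement ins = con.prepareStatement(
                            "INSERT INTO staging_validated VALUES (?, ?)")) {
                        ins.setString(1, "input_2018-06-08.csv");
                        ins.setDate(2, Date.valueOf("2018-06-08"));
                        ins.executeUpdate();
                    }
                    // Core ETL equivalent: an empty table simply yields zero rows,
                    // unlike result rows, which fall through from the previous step.
                    try (Statement st = con.createStatement();
                         ResultSet rs = st.executeQuery("SELECT * FROM staging_validated")) {
                        while (rs.next()) {
                            System.out.println(rs.getString(1) + " " + rs.getDate(2));
                        }
                    }
                    // The cleanup the next post complains about:
                    try (Statement st = con.createStatement()) {
                        st.execute("DROP TABLE staging_validated");
                    }
                }
            }
        }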

  4. #4
    Join Date
    Aug 2016
    Posts
    290

    Default

    That would work, but it is also very messy to have external files or database tables for internal program logic. You would then need extra logic to create, populate and clean up these files/tables. With a simple variable you avoid all of that.

  5. #5
    Join Date
    Jul 2009
    Posts
    476

    Default

    There is a Data Validator step, https://wiki.pentaho.com/display/EAI/Data+Validator, which might help you. It supports error handling, so error rows can go to a separate step, such as a dummy step, instead of being passed with the good rows. I once played around with this step a little bit, but haven't used it in production.
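
    If you would rather code the check yourself, a User Defined Java Class step can do the same routing. A rough sketch of the step body (error handling must be enabled on the step's error hop; the load_date field, the pattern, and the error codes are made-up examples):

        // Pass rows with a parseable date; send everything else to the error hop.
        public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
                throws KettleException {
            Object[] r = getRow();
            if (r == null) {
                setOutputDone();
                return false;
            }
            if (first) {
                first = false;
                data.outputRowMeta = getInputRowMeta().clone();
            }
            String loadDate = get(Fields.In, "load_date").getString(r);
            if (loadDate != null && loadDate.matches("\\d{4}-\\d{2}-\\d{2}")) {
                putRow(data.outputRowMeta, r);  // good row: main output
            } else {
                putError(data.outputRowMeta, r, 1L,
                        "load_date failed validation", "load_date", "ERR_DATE");
            }
            return true;
        }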

  6. #6
    Join Date
    Aug 2016
    Posts
    290

    Default

    Problem is, if you filter all of them away, then the transformation produces no result rows at all. The next transformation in the parent job will then pick up the original, invalid result rows instead.
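
    My mental model of it, expressed against the Kettle Result class (an assumption pieced together from observed behaviour, not from any documentation):

        import java.util.ArrayList;
        import java.util.List;
        import org.pentaho.di.core.Result;
        import org.pentaho.di.core.RowMetaAndData;

        public class FallThrough {
            public static void main(String[] args) {
                Result result = new Result();

                // T1 copies rows to result: the list is replaced.
                List<RowMetaAndData> fromT1 = new ArrayList<RowMetaAndData>();
                fromT1.add(new RowMetaAndData()); // stand-in for a real row
                result.setRows(fromT1);

                // T2 filters every row away, so its "Copy rows to result" step
                // never fires and nothing calls setRows() again. The next job
                // entry still sees T1's rows:
                System.out.println(result.getRows().size()); // prints 1
            }
        }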

    This is one of those things that makes working with result rows really shady. You never know exactly how they are implemented, and it sure isn't documented anywhere.
    Last edited by Sparkles; 07-31-2018 at 10:59 AM.
