Hitachi Vantara Pentaho Community Forums

Thread: Force database commit after each row using single threader

  1. #1
    Join Date
    Nov 2013
    Posts
    5

    Post Force database commit after each row using single threader

    I've researched these forums extensively and tried seemingly every option mentioned, but I am unable to find a way to handle this use case. Any help would be greatly appreciated. Imagine a file with two rows:

    WidgetName,Color,Weight
    -------------------------
    WidgetA,Blue,<empty>
    WidgetA,<empty>,10

    I want to process this data and put it in a Widget table in an Oracle database, with schema:

    WidgetName,Color,Weight

    When processing the above file, what I want is one row in the database at the end that looks like this:

    WidgetA,Blue,10

    In my transform, for each row I look up whether a row for WidgetName (using just WidgetName as a key) already exists. If it does, I do an update; if it doesn't, I do an insert. Since Pentaho doesn't process rows sequentially by default, when I run the basic transform, two rows are inserted, presumably because the insert of the first row has not finished before the lookup for the second row happens. I know the overall logic in the transform is correct because if I split the two rows in my file into two separate files and process them one at a time, I get one row, as desired. This led me down the path of the Single Threader step.
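    In SQL terms, the per-row effect I'm after is essentially an Oracle MERGE followed by an immediate commit. A rough sketch, assuming a WIDGET table matching the schema above, with :name/:color/:weight standing in for the fields of the incoming row:

    -- Sketch only: widget and the :placeholders are assumed names.
    MERGE INTO widget w
    USING (SELECT :name   AS widgetname,
                  :color  AS color,
                  :weight AS weight
           FROM dual) s
    ON (w.widgetname = s.widgetname)
    WHEN MATCHED THEN
      UPDATE SET w.color  = NVL(s.color,  w.color),
                 w.weight = NVL(s.weight, w.weight)
    WHEN NOT MATCHED THEN
      INSERT (widgetname, color, weight)
      VALUES (s.widgetname, s.color, s.weight);
    COMMIT; -- this per-row commit is the part I can't get Pentaho to do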

    In my Single Threader attempt, I have a Single Threader step that reads from the file and sends each row to the inner (original) transformation one at a time. I know the one-at-a-time part is working because I put in a Delay step and this is reflected in the run time. However, single-threading still does not solve my problem; the exact same behavior exists. I can see by watching the database that no commit is executed until the entire transform is done. Even though the rows are processed one at a time, the second row still does not find the data inserted by the first row.

    I have tried decreasing the commit size of all my insert steps to 1, forcing autocommit to true, manually inserting a SQL step with an explicit COMMIT call after each insert, changing the cache values, and changing the initial transformation's "Make the transformation database transactional" setting, all to no avail.

    Does anyone have any idea how I might accomplish this use case?

    Thanks!

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    I would create a job with two transformations.
    Transformation 1 transforms the data, then uses Copy rows to result to pass them to Transformation 2, which is in charge of the insert/update into the database.

    This way all rows are processed first, then uploaded.
    -- Mick --

  3. #3
    Join Date
    Nov 2013
    Posts
    5

    Default

    Hi Mick,

    Thanks for the response. This seems like a workaround that I'm willing to try, but how to accomplish it still isn't clear to me. Even if I split out the database step, I'm still going to have the problem of having to do a lookup/merge on the result set at some point. I am not sure of the best way to store the data for that temporary step so that it can be easily merged later. Every path I can think of seems to add an enormous amount of extra work for something a database would naturally handle. I'd also have to re-specify merge/update logic for every possible combination that is already specified in the Update steps. Is there something I'm missing on how to do this easily? Everything I have read about merges and joins in Pentaho concerns merging multiple steps in a transform, not multiple rows in an input stream.

    Thanks,
    Curt

  4. #4
    Join Date
    Nov 2013
    Posts
    10

    Default

    Hello cawright00:

    I think your logic would work as a script, but Kettle doesn't work that way. What I would do is Merge Join the two rows on the key WidgetName, then fill in the <empty> values with a Modified Java Script Value step, and then insert the data into the database; a rough SQL equivalent is sketched below.

    When the number of rows per key is not known in advance, you have to use a script with a loop to solve the problem, and then it's simply not a job for Kettle.
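    A minimal SQL sketch of that join-and-fill, assuming the rows have been staged into a hypothetical staging_widget table and each key has exactly two rows:

    -- staging_widget is an assumed name; a.rowid < b.rowid pairs the
    -- two rows for each key exactly once.
    SELECT a.widgetname,
           COALESCE(a.color,  b.color)  AS color,
           COALESCE(a.weight, b.weight) AS weight
    FROM   staging_widget a
    JOIN   staging_widget b
      ON   a.widgetname = b.widgetname
      AND  a.rowid < b.rowid;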

  5. #5
    Join Date
    Jul 2009
    Posts
    476

    Default

    cawright00, look at the "Group by" step under the Statistics folder. In your example, you would group by the WidgetName field. In the Aggregates section, under Type, choose "Last non-null value." Note that you will need to sort the data by WidgetName (and possibly other fields like timestamps, if you have them) before the Group by step. After the Group by, you can insert/update your table.
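    In SQL terms the effect is roughly the query below (staging_widget is an assumed staging table; because each field in your example has at most one non-null value per key, MAX() collapses the rows the same way "Last non-null value" does):

    -- Rough SQL equivalent of the Group by step for this data.
    SELECT widgetname,
           MAX(color)  AS color,
           MAX(weight) AS weight
    FROM   staging_widget
    GROUP  BY widgetname;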

  6. #6
    Join Date
    Nov 2013
    Posts
    5

    Default

    All,

    Thanks for the tips, and sorry for not updating this thread earlier. I eventually got the Single Threader step to work. Something in my transform had become corrupted during all of my experimenting; once I started again from scratch and did what I thought was right, it worked as expected. I am now processing the rows transactionally, one at a time, guaranteeing that the results of each row are written before proceeding to the next one.

    Thanks,
    Curt
