Hitachi Vantara Pentaho Community Forums

Thread: Odd behavior with Merge Rows (diff): generates same deletes and inserts

  1. #1
    Join Date
    Sep 2015

    Question Odd behavior with Merge Rows (diff): generates same deletes and inserts

    Hi Folks,

    We've discovered that with our application, we get excellent performance using Merge Rows (diff) with Synchronize After Merge (as opposed to using the Insert/Update step to perform updates to the warehouse tables).

    It works great on many of our tables, but on some of them the Merge Rows (diff) step, when it should be recognizing everything as identical between the Reference and Compare rows, instead marks only about 1/4 of the rows as identical; for the other 3/4 it generates both a "new" and a "deleted" entry for each row. I've piped the new/deleted rows out to a table (with the new/deleted flag column) so I can check them--the two sets of rows (new/deleted) are exactly the same, at least as far as SQL Server is concerned.

    The Merge Rows (diff) step is followed by a Switch / Case that passes each of the flag types to Dummy steps (just so it's easy to see what happened in the Step Metrics display). So, here's an example of what I get (other steps omitted to focus on the odd behavior):

    Stepname             Read     Written
    Merge Rows (diff)    6115180  5351915
    Switch / Case        5351951  5351951
    new                  2294361  2294361
    deleted              2294361  2294361
    identical             763229   763229

    Now, in addition to having exactly the same number of new & deleted entries, as noted above the values in each pair of rows marked as new & deleted (piped out to a DB table so I can compare them) are visually, and via an SQL compare, exactly the same.

    If I run this transformation multiple times in a row, I do not always get precisely the same set of rows marked as both new & deleted, but it often does give me the same rows. However, if I restrict the rows returned from the reference and compare sources to those that got marked as new & deleted on a prior run, they all end up judged to be identical! Here's the Step Metrics when selecting just the rows whose keys ended up in the deleted pile on the prior run:

    Stepname             Read     Written
    Merge Rows (diff)    2294361  2294361
    Switch / Case        4588722  2294361
    new                        0        0
    deleted                    0        0
    identical            2294361  2294361

    If I were to guess what's happening when I run on the entire contents of the reference and compare sources, it appears that somehow it's getting off track (so that each compare row is not matching what should be the corresponding reference row). But, if I run on just a block of rows from before to after the ID where we started getting the bogus new/deleted pairs, it zips right past the point without any difficulty: the Merge Rows (diff) step judges all the rows to be identical.

    So, I'm pretty well stumped. Has anyone seen anything like this when working with millions of rows?
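    For concreteness, here's a rough sketch of what a merge-diff over two key-sorted streams does (a simplified model I wrote to reason about the symptom, not the actual Kettle code): walk both streams in lockstep, flagging matching keys as identical and unmatched keys as new or deleted. If the compare stream arrives out of key order, the lockstep walk skips past rows without matching them, and the same row comes out once as "deleted" and once as "new"--exactly what we're seeing.

    ```java
    import java.util.*;

    public class MergeDiffSketch {
        // Simplified merge-diff over two streams assumed sorted by an integer key.
        // Emits "identical", "new", or "deleted" per row (value comparison omitted).
        static List<String> diff(List<Integer> reference, List<Integer> compare) {
            List<String> flags = new ArrayList<>();
            int r = 0, c = 0;
            while (r < reference.size() || c < compare.size()) {
                if (r >= reference.size()) {
                    flags.add("new " + compare.get(c++));            // only in compare
                } else if (c >= compare.size()) {
                    flags.add("deleted " + reference.get(r++));      // only in reference
                } else {
                    int cmp = reference.get(r).compareTo(compare.get(c));
                    if (cmp == 0)      { flags.add("identical " + reference.get(r)); r++; c++; }
                    else if (cmp < 0)  { flags.add("deleted " + reference.get(r++)); }
                    else               { flags.add("new " + compare.get(c++)); }
                }
            }
            return flags;
        }

        public static void main(String[] args) {
            // Same data, both sorted: everything identical.
            System.out.println(diff(List.of(1, 2, 3), List.of(1, 2, 3)));
            // Same data, compare stream out of order: key 2 comes out as
            // both "deleted" and "new", the symptom described above.
            System.out.println(diff(List.of(1, 2, 3), List.of(1, 3, 2)));
        }
    }
    ```

    Note the step never "breaks" permanently: once the streams line up again, identical rows resume, which matches the mixed identical/new/deleted counts in the Step Metrics.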

    Some other facts for your contemplation:
    Pentaho version is
    The Reference data is from a SQL Server table (nothing fancy, it's just a regular table).
    The Compare data comes from a SQL Server table also, but then has some transformation steps applied:
    • A User-defined Java Class that converts UTC to the appropriate local timezone so we have both in the warehouse table
    • A series of Lookup steps to capture the warehouse keys from warehouse dimension tables
    • A step that checks a status value and sets another column to zero in 2 of the 3 cases
    • A Select values step to remove some excess fields that do not need to go into the warehouse table in question

    The only marginally complex step is the UTC to Local time conversion--and we use this all over in our environment--and besides, the data we're reading as the Reference source was created using that same UDJC.
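    For anyone curious, the conversion that UDJC performs is conceptually along these lines (a minimal sketch with java.time, assuming the step derives a local-time column from the stored UTC value; the actual UDJC code isn't shown here):

    ```java
    import java.time.*;

    public class UtcToLocal {
        // Interpret a naive datetime as UTC and convert it to the given zone,
        // so the warehouse row can carry both the UTC and local values.
        static LocalDateTime toLocal(LocalDateTime utc, ZoneId zone) {
            return utc.atZone(ZoneOffset.UTC)
                      .withZoneSameInstant(zone)
                      .toLocalDateTime();
        }

        public static void main(String[] args) {
            LocalDateTime utc = LocalDateTime.of(2016, 11, 22, 18, 30);
            // America/Chicago is on CST (UTC-6) in late November.
            System.out.println(toLocal(utc, ZoneId.of("America/Chicago"))); // 2016-11-22T12:30
        }
    }
    ```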

    Any insight welcomed! (I'll be sure to post back if I figure out anything.)

    My next steps are:
    • Check the release notes for the available 6.1 service packs for a mention of Merge/Sync
      (I found nothing obvious here, but will try on newer versions to see what happens.)
    • Dig into the source code for Merge Rows (diff) to see if there's anything I can figure out there
      (From the code, it appears the only way to get both a new & deleted entry for the same row is for the two streams not to be in exactly the same key order; but the odd thing is, the Merge Rows (diff) step does not reach a particular point and then stop marking any rows as identical.)

    Last edited by JCraig; 11-22-2016 at 07:22 PM.

  2. #2
    Join Date
    Apr 2008


    Can you post a really simple version of your transform that shows this happening?

    Two Data Grids pointing into a Merge Rows (diff) step, then into your Switch/Case and three Dummy steps?
    This will help us narrow down if it is a configuration issue, or something with PDI itself.

  3. #3
    Join Date
    Aug 2011


    Awaiting your example transformation, but here is some stuff to check that can result
    in different keys in the merge step:
    - If you order your stream inside the DB and have string keys, there might be issues related to string collation. One solution is to sort everything in PDI.
    - For string keys, also check that the encoding is the same on both sides.
    - For date/timestamp keys, check that you have the same precision on both sides. For example, one stream comes from a file with dates down to the millisecond, but once in the database the date is stored only to the second, so the values are not the same to PDI.
    - Don't use Number fields as keys; depending on the data source they are generally not comparable, because of floating-point representation.
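    The timestamp-precision point is easy to demonstrate (a small sketch with `java.sql.Timestamp`; the dates are made up for illustration): two values that represent the "same" moment to a human are different keys once one side has been truncated to whole seconds.

    ```java
    import java.sql.Timestamp;

    public class PrecisionCheck {
        public static void main(String[] args) {
            // A timestamp read from a file with millisecond precision...
            Timestamp fromFile = Timestamp.valueOf("2016-11-22 10:15:30.123");
            // ...versus the same instant after a database stored it to whole seconds.
            Timestamp fromDb = Timestamp.valueOf("2016-11-22 10:15:30.0");
            // To a merge comparison these are simply different values.
            System.out.println(fromFile.equals(fromDb)); // prints false
        }
    }
    ```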

  4. #4
    Join Date
    Sep 2015


    Thanks for replying, folks. Much appreciated.

    Can you post a really simple version of your transform that shows this happening?
    If only.... I have not been able to nail it down--although it did occur to me that it'd be interesting to see what happens if I use the same exact data source for both Reference and Compare.

    RE Mathias.CH's suggested items to check: our keys are MS SQL Server bigint values (PDI Integer). If there were a difference in date precision, I'd expect to see apparently unnecessary "changed" rows, but instead we get this bizarre deleted/new behavior.

    In looking at the Java code for the Merge step, the only way this would seem to happen would be if the two streams were somehow disrupted so that each of the rows in question hit the Merge step substantially out of order; I have not been able to figure out how that would happen. I'm trying an in-PDI sort; we'll see what that does.

  5. #5
    Join Date
    Sep 2015


    OK. So, if I put the sort in the transformation right before the Merge Rows step, I get all the rows marked identical. Unfortunately, that largely kills the speed advantage Merge/Sync has over the Insert/Update approach. But at least it's clear that something in the string of steps leading up to the Merge is disrupting the ordering of the rows--and apparently in a fairly substantial way (so that we end up with the deleted/new pairs). That explains the Merge Rows (diff) behavior; now I just need to figure out a way to deal with it that's faster than a complete re-sort in PDI.

    Thanks again for taking the time to reply!

    If I learn anything further that seems like it might be of use to anyone, I'll post it.

    Last edited by JCraig; 11-23-2016 at 03:05 PM.

  6. #6
    Join Date
    Sep 2015

    Talking Identity of the Problem: multiple paths to Merge Rows (diff) step

    I've attached a simple example that shows one way sorted data from Reference and Compare sources can become disordered.

    The critical point seems to be having multiple paths to the Merge step (illustrated below with part of the attached transformation). Note that there are two paths from the "Canceled/Postponed" Switch/Case step to the "Remove Extra Field" Select Fields step--so the rows are no longer guaranteed to be ordered as they enter the Merge Rows (diff) step. The other Merge/Sync transformations that have given us this same deleted/new-pairs behavior also have multiple paths between the sorted Compare data and the Merge step. I've been able to use a Java Expression step to substitute for the multiple paths in the case shown in the illustration. So, we're back in business!
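    To see why multiple paths break the ordering, here's a toy model (the split condition is hypothetical; real Kettle reads blocks of rows per hop rather than one at a time): rows leave the Switch/Case on two hops, and the downstream step reads its incoming hops round-robin, so the recombined order depends on how the rows were split, not on the sorted input order.

    ```java
    import java.util.*;

    public class MultiPathOrder {
        // Split a key-sorted stream onto two hops by a condition, then recombine
        // by reading the hops round-robin, as a downstream PDI step would.
        static List<Integer> splitAndRecombine(List<Integer> sortedInput) {
            Deque<Integer> pathA = new ArrayDeque<>(); // e.g. canceled/postponed rows
            Deque<Integer> pathB = new ArrayDeque<>(); // everything else
            for (int row : sortedInput) {
                if (row % 3 == 0) pathA.add(row); else pathB.add(row); // hypothetical condition
            }
            // Round-robin read from the two hops (chunk size 1 for illustration).
            List<Integer> out = new ArrayList<>();
            while (!pathA.isEmpty() || !pathB.isEmpty()) {
                if (!pathA.isEmpty()) out.add(pathA.poll());
                if (!pathB.isEmpty()) out.add(pathB.poll());
            }
            return out;
        }

        public static void main(String[] args) {
            // Same rows come out, but no longer in key order.
            System.out.println(splitAndRecombine(List.of(1, 2, 3, 4, 5, 6)));
        }
    }
    ```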

    [Attachment: MultiplePathsToMerge.PNG]

    Thanks again for the replies.

    Attached Files

  7. #7
    Join Date
    Jun 2012


    Splitting and merging of streams in Kettle is governed by certain rules. You can control splitting via the data movement option, i.e. copy or distribute rows to the output streams. The merging policy depends on the step; the default is to read chunks of rows (500) from every incoming stream in round-robin fashion. Append Streams and Sorted Merge, for example, behave differently.

    If there's a mesh in your graph, Sorted Merge will reestablish the previous order. Collations can be harmful too, but you already know that.
    So long, and thanks for all the fish.


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.