Hitachi Vantara Pentaho Community Forums

Thread: Transformation hangs with the "merge join" step

  1. #1
    Join Date
    Mar 2012
    Posts
    11

    Default Transformation hangs with the "merge join" step

    I ran into a problem with the “merge join” step: the transformation hangs under particular conditions.

    When one of the incoming streams of the "merge join" step delivers no rows, and the transformation setting "Nr of rows in rowset" is less than the total row count, the transformation stalls on the first rowset.

    To work around the problem we have to set a high value for “Nr of rows in rowset”, but I suspect that quickly leads to OOM problems.

    I attach a simple transformation to reproduce this problem: _ex_merge_join.ktr_ex_merge_join.ktr

    Regards

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    A step blocks when one output buffer is full (the row set size determines this). The consequence of this rule is that if you work hard enough on it, you can design a transformation with circular dependencies that can block.
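    In effect, each hop behaves like a bounded blocking queue. A minimal sketch of that rule, using Python's `queue` module to stand in for Kettle's row sets (this is an illustration, not PDI's actual internals):

```python
import queue

# A hop modeled as a bounded buffer; "Nr of rows in rowset" = 100.
hop = queue.Queue(maxsize=100)

for row in range(100):
    hop.put_nowait(row)        # the producing step fills the buffer

try:
    hop.put_nowait(100)        # one row too many: the step would block here
except queue.Full:
    print("output buffer full: the producing step blocks")
```

    Once the buffer is full, the producing step can make no progress until a downstream step consumes a row.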

  3. #3
    Join Date
    Mar 2012
    Posts
    11

    Default

    OK Matt. Thank you for the reply.
    I can understand that a transformation will block with some kind of circular dependency. But I tried to make my example as simple as possible:


    • I generate 30000 rows with a sequence number
    • I split the resulting flow into two flows by copying the rows
    • I leave flow1 unchanged, so all 30000 rows follow this flow
    • In flow2, I filter the rows that pass through. In this example, rows whose sequence number is less than 10000 are excluded from the flow
    • The two flows are then merged by a “merge join” step (on the generated sequence key)


    In the example, I set the option “Nr of rows in rowset” to 100. When I run the transformation, it blocks on the first 100 rows. If I change the filter in flow2 to allow some of the first 100 rows through, it works. Likewise, if I set the transformation to have a buffer of 30000 rows (whatever the filter), it works.
    Maybe I did not understand the intent of the “merge join” step or one of its prerequisites. But in this quite simple sample, I can’t see any kind of circular dependency that could block. Could you clarify this for me?

    In my use case, I found a workaround: writing the output of flow2 to a text file and merging the two flows in another transformation. Nevertheless, I suppose it should/could be done in one transformation.

    This transformation was run on a Win7/32-bit machine, and this behavior appears in both Kettle 3.2 and 4.2.1.

    Regards

  4. #4
    Join Date
    Sep 2009
    Posts
    810

    Default

    Could you post a sample file for us?

    Cheers
    Slawo

  5. #5
    Join Date
    Mar 2012
    Posts
    11

    Default

    The sample file is in the first post: _ex_merge_join.ktr (but the link got duplicated!)

    I attach it again


    _ex_merge_join.ktr

  6. #6
    Join Date
    Sep 2009
    Posts
    810

    Default

    Alright, let me try and explain,

    Setting the nr. of rows in rowset effectively sets the maximum number of rows a "hop" can hold, i.e. the rows buffered between steps.
    With that set to 100, the upper hop from add sequence holds a maximum of 100 rows before it blocks (and hence add sequence blocks, because it can't add rows to its output).
    Now, how come the upper hop gets filled up?

    1. the add sequence copies 100 rows to top and bottom hops
    2. the top hop gets filled up and blocks add sequence from adding more rows to the stream
    3. the bottom hop processes the 100 rows and sends them to nirvana, they never make it to merge join
    4. that means, merge join can't start consuming rows from its inbound hops. The upper one has 100 rows, but the lower one has 0

    So with merge join not consuming rows, and add sequence not adding any more to the output, you get a jam.
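    The four steps above can be sketched with bounded queues standing in for the hops and Python threads standing in for the steps (an illustration of the mechanism, not Kettle's actual implementation). Each hop times out instead of blocking forever, so the sketch terminates and shows where the jam happens:

```python
import queue
import threading

ROWSET = 100        # "Nr of rows in rowset"
TOTAL = 30000
FILTER_MIN = 10000  # the filter drops rows with seq < 10000

top = queue.Queue(maxsize=ROWSET)       # hop: add sequence -> merge join
bottom = queue.Queue(maxsize=ROWSET)    # hop: add sequence -> filter
filtered = queue.Queue(maxsize=ROWSET)  # hop: filter -> merge join

produced = 0

def add_sequence():
    global produced
    try:
        for seq in range(TOTAL):
            top.put(seq, timeout=2)     # copy to both hops; blocks when full
            bottom.put(seq, timeout=2)
            produced += 1
    except queue.Full:
        pass                            # jammed: the top hop never drains

def filter_step():
    try:
        while True:
            seq = bottom.get(timeout=2)
            if seq >= FILTER_MIN:       # never true for the first 10000 rows
                filtered.put(seq, timeout=2)
    except queue.Empty:
        pass                            # upstream stopped producing

def merge_join():
    try:
        while True:
            top.get(timeout=2)          # needs a row from *each* input,
            filtered.get(timeout=2)     # but 'filtered' stays empty: stuck
    except queue.Empty:
        pass

threads = [threading.Thread(target=f)
           for f in (add_sequence, filter_step, merge_join)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"rows produced before the jam: {produced}")  # ~100, far short of 30000
```

    The producer stalls after roughly one rowset: the top hop is full, the merge join is waiting on the still-empty filtered hop, and nothing can move.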

    PS: I find myself explaining the above quite often. I feel we maybe should try to detect that situation and issue a warning in PDI, or add a warning akin to the "rows must be sorted" to steps that accept input from multiple sources... care to open a jira for that?

    Cheers
    Slawo

  7. #7
    Join Date
    Mar 2012
    Posts
    11

    Default

    Thank you very much Slawo, your explanation is quite clear.

    So in this sample, it is the fact that the two streams depend on a single stream that blocks the process: “Merge Join” is waiting for rows in each buffer from the two incoming hops. One is full with 100 records and the second is empty with zero records. So it can’t consume any rows: it has to know the next key in each stream before discarding, joining or outer joining.

    And since the “sequence” step can no longer feed the upper branch because its buffer is full, it can’t feed the lower one either. I guess that if I added some extra steps to the upper branch, the “sequence” step could output more rows because there would be more buffers. But that is not a solution to this problem.
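    That intuition checks out in a small sketch (again bounded queues standing in for hops, not PDI code): each extra hop adds another buffer, so a chain of hops of size 100 can absorb roughly 100 rows per hop before the producer jams.

```python
import queue

# Two consecutive hops of 100 rows each, with an intermediate step that
# forwards one row at a time from the first hop into the second.
hops = [queue.Queue(maxsize=100), queue.Queue(maxsize=100)]

absorbed = 0
for row in range(1000):
    if hops[0].full():
        if hops[1].full():
            break                                   # the whole chain is jammed
        hops[1].put_nowait(hops[0].get_nowait())    # intermediate step forwards a row
    hops[0].put_nowait(row)
    absorbed += 1

print(absorbed)  # 200: one full rowset per hop
```

    More hops only delay the jam; they don't remove it.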

    So in my transformation, if I duplicate the “generate rows” and “sequence” steps, it will be OK because there is no longer a bottleneck.

    Well, that’s OK, but you don’t know it the first time you use this step! In fact we have to make sure that the two incoming streams are independent. IMHO, a warning in this dialog would be welcome.
    In your previous post, did you suggest that I open a ticket on JIRA for this feature?

  8. #8
    Join Date
    Sep 2009
    Posts
    810

    Default

    Yes, if you don't mind, please open a Jira to include a warning dialog regarding the above limitation for steps that consume more than one stream, and link to this forum thread. It's much appreciated.

    Best
    Slawo

  9. #9
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Said warning should already be present in current trunk (4.4).
    When you design a transformation with multiple inputs, it checks for a possible deadlock situation. I'll try it on the sample given. It's a WTF example but it at least can serve as a test case :-)
    Last edited by MattCasters; 03-20-2012 at 06:30 PM.

  10. #10
    Join Date
    Aug 2014
    Posts
    1

    Default

    Okay, I tested a few cases and the guideline is:
    Add a "Blocking Step" to the stream where no blocking happens, so that both streams block.
    If one stream is only partially blocking (e.g. a filter that drops the first 10k rows, with the rowset size set below 10k), add a blocker to both parallel streams: the non-blocking stream must be made blocking, and the partially blocking one must be made completely blocking, otherwise it causes problems once rows above 10k are generated.

    Hope this helps
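    The guideline above can be sketched as follows (a hedged illustration, not PDI code): a blocking step drains its whole input into an unbounded local buffer before forwarding anything, so the bounded hops upstream can never stay full while the other branch lags behind.

```python
import queue
import threading

ROWSET, TOTAL, FILTER_MIN = 100, 30000, 10000
top = queue.Queue(maxsize=ROWSET)       # hop to the upper branch
bottom = queue.Queue(maxsize=ROWSET)    # hop to the lower (filtered) branch
DONE = object()                         # end-of-stream marker

def add_sequence():
    for seq in range(TOTAL):            # copy each row to both branches
        top.put(seq)
        bottom.put(seq)
    top.put(DONE)
    bottom.put(DONE)

def blocking_step(src, out):
    """Drain src completely into an unbounded buffer, then hand it over."""
    buf = []
    while (row := src.get()) is not DONE:
        buf.append(row)
    out.extend(buf)

top_rows, bottom_rows = [], []
workers = [
    threading.Thread(target=add_sequence),
    threading.Thread(target=blocking_step, args=(top, top_rows)),
    threading.Thread(target=blocking_step, args=(bottom, bottom_rows)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Filter and merge join run only after both blockers hold everything,
# so the merge never starves on one input while the other hop is full.
joined = sorted(set(top_rows) & {r for r in bottom_rows if r >= FILTER_MIN})
print(len(joined))  # 20000: the join completes instead of jamming
```

    The trade-off is memory: the blocker buffers the whole stream, which is exactly why raising "Nr of rows in rowset" far enough also "fixes" the jam, and why both can run into OOM on large inputs.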
