Hitachi Vantara Pentaho Community Forums

Thread: 2.5 Merge Join behavior

  1. #1
    dward Guest

    Default 2.5 Merge Join behavior

I've run into a problem using a Merge Join (inner) step when one of the data streams is significantly smaller than the other. The step reaches the end of the first stream, finds the first non-matching row in the second stream, and then signals 'done'. (This is generally right, since we now have all possible output rows.)

However, the upstream steps in the larger data stream may not have finished processing all their rows. In my case, the table input step is still churning away, loading its 10k rows when the merge join stops. The DB step and the steps between it and the join all go to sleep, waiting for the downstream steps to accept more rows.

Shouldn't the join step accept all rows from both streams? Or can/should the 'done' propagate upstream to the other steps?
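To illustrate what I mean, here's a stripped-down sketch of a sorted inner merge join in plain Java (an illustration only, not the actual Kettle step code):

Code:
import java.util.Arrays;
import java.util.Iterator;

public class MergeJoinSketch {
    public static void main(String[] args) {
        // Stream one is much smaller than stream two; both are sorted on the
        // join key and, to keep the sketch short, assumed to have unique keys.
        Iterator<Integer> one = Arrays.asList(1, 2, 3).iterator();
        Iterator<Integer> two = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8).iterator();

        Integer a = one.hasNext() ? one.next() : null;
        Integer b = two.hasNext() ? two.next() : null;

        // For an inner join the loop can end as soon as EITHER stream is
        // exhausted: no further matches are possible. At that point rows
        // 4..8 of stream two, and the upstream steps producing them, are
        // still pending.
        while (a != null && b != null) {
            int cmp = a.compareTo(b);
            if (cmp == 0) {
                System.out.println("match: " + a);
                a = one.hasNext() ? one.next() : null;
                b = two.hasNext() ? two.next() : null;
            } else if (cmp < 0) {
                a = one.hasNext() ? one.next() : null;
            } else {
                b = two.hasNext() ? two.next() : null;
            }
        }
        // Stopping here without draining stream two is what leaves its
        // producers asleep on a full row buffer.
    }
}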

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    Default

I don't think it's accurate to say that 'done' is flagged when there is a single non-matching entry in the second stream.
However, since the streams are sorted on the join key, I can see how you would reach that conclusion.

So I ran a test and confirmed the problem.
We'll have to read all the rows, not just to stop the stall but also to make sure all the work before the merge join is effectively done.

    Thanks for catching this one. I think there have been a few reports about this, but you're the first one to detect the true cause :-)

    Matt

  3. #3
    Join Date
    May 2006
    Posts
    4,882

    Default

Do you have a small example? I would expect the join to end only upon reading the "end" of both streams.

    Regards,
    Sven

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Fixed in subversion for versions 2.5.2 and 3.0.0-RC2.

  6. #6
    dward Guest

    Default Thanks for the quick fix!

    Thanks for the quick turn-around on the fix!

    I have one small issue/question with the fixed code, however:

    Code:
/*
 * We can stop processing if any of the following is true:
 *   a) Both streams are empty
 *   b) First stream is empty and join type is INNER or LEFT OUTER
 *   c) Second stream is empty and join type is INNER or RIGHT OUTER
 */
if ((data.one == null && data.two == null) ||
    (data.one == null && data.one_optional == false) ||
    (data.two == null && data.two_optional == false))
{
    // Before we stop processing, we have to make sure that all rows from
    // both input streams are depleted! If we don't do this, the
    // transformation can stall.
    //
    while (data.one != null && !isStopped()) data.one = getRowFrom(meta.getStepName1());
    while (data.two != null && !isStopped()) data.two = getRowFrom(meta.getStepName2());

    setOutputDone();
    return false;
}
    Shouldn't setOutputDone() be called before the step tries to consume all the remaining input? Logically, it is done creating output at that point in time, and it would ensure that steps downstream from the join won't have to wait while all remaining input rows are produced upstream and then consumed here....
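In other words, reusing the same calls as the fix above, just reordered (illustration only):

Code:
// Signal 'done' downstream first: no more output rows will be produced.
setOutputDone();

// Then drain the remaining input so the upstream steps can finish.
while (data.one != null && !isStopped()) data.one = getRowFrom(meta.getStepName1());
while (data.two != null && !isStopped()) data.two = getRowFrom(meta.getStepName2());

return false;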

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

Yes, but then the upstream steps would still be running while the downstream steps have already finished.
    That somehow feels a bit counter-intuitive. In any case, does it really matter?

    Cheers,
    Matt

  8. #8
    dward Guest

    Default Doesn't really matter :-)

    Thanks for the reply Matt.

I realized that my fundamental problem was duplicating a stream and recombining it later (via a join, merge, or lookup step). Several of my attempts went awry and deadlocked when a wait for data on one of the branches backed up the entire transformation. Since then, I've seen a note or two from you in the forums suggesting this topology be avoided. :-)

    One last question, however. It appears that Merge Rows and Merge Join are both case-insensitive, while Stream and DB lookup are both case-sensitive. Is this correct? Any chance for an "Ignore Case" flag for these steps in the future?
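For what it's worth, I would imagine such a flag simply switching the key comparison, something like this (hypothetical, not the actual step code):

Code:
// Hypothetical "Ignore Case" flag: choose between case-sensitive and
// case-insensitive comparison of the String join/lookup keys.
int cmp = ignoreCase
        ? keyOne.compareToIgnoreCase(keyTwo)
        : keyOne.compareTo(keyTwo);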

    Dan

  9. #9
    Join Date
    Nov 1999
    Posts
    9,729

    Default

Yes, we will extend in this direction as much as possible. In 3.0 we built case-sensitivity support into the core API all over the place.

Case-insensitive database lookup is extremely tricky, however.
Certain databases always do case-insensitive lookups; others never do.
For the latter, you would have to add WHERE clauses like:

    Code:
      WHERE upper(code) = upper(?)
However, that would almost always ruin performance, since the optimizer can usually no longer use the indexes. It gets messy pretty quickly.
Very tricky to do right.
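An in-memory lookup is the easy case by comparison: you can normalize the keys once while the lookup map is built. A sketch of the idea (illustrative Java, not the actual Stream Lookup code):

Code:
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch: case-insensitive in-memory lookup by upper-casing the keys once
// at build time and again at probe time. Illustration only.
public class CaseInsensitiveLookup {
    private final Map<String, String> map = new HashMap<String, String>();

    public void put(String key, String value) {
        map.put(key.toUpperCase(Locale.ENGLISH), value);
    }

    public String get(String key) {
        return map.get(key.toUpperCase(Locale.ENGLISH));
    }
}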

Suggestions are always welcome, of course. Just because something is difficult doesn't mean it shouldn't be attempted :-)

    Matt
