Hitachi Vantara Pentaho Community Forums
Results 1 to 5 of 5

Thread: Infamous Merge Join

  1. #1

    Default Infamous Merge Join

    We are experiencing a problem with the merge join step in both 2.5.2 and 3.0.0. Because of the high number of columns and rows in one of the data sources, I don't think there is a way to avoid using the Merge Join step.

    In our scenerio, one data source is a database table, we are sorting the integer key DESC. We expect about 100,000 rows in this dataset. This source is defined as the First Step in the Merge Join.

    The second source is a very large and wide file (37 million rows, 3000+ columns), the file has been presorted on the integer key in descending order. This step is defined as the Second Step in the Merge Join.

    The join type is INNER, so we are expecting only rows to output from the merge join where they exist in both the file and table.

    The problem seems to be that both data sources have to have exactly the same key values.

    Test #1
    Took the first 2 records in the sorted file. Changed them to match the 2nd and 10th records returned from the database query in descending order

    KEY
    999281614
    999210036

    Ran the job and it didn't find the keys.

    Test #2
    Took the first 2 records in the sorted file. Changed them to match the 1st and 10th records returned from the database query in descending order

    Key
    999291401
    999210036

    Ran the job and it only found the first key.

    Test #3
    Ran the same test as Test #2 but used kettle 3.0

    Key
    999291401
    999210036

    Same results as Test 2 - it only found the first key.

    Is there some obvious mistake or is this a bug in the Merge Join Step? Should we be sorting ASC? Would another step work better?

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    > Should we be sorting ASC?

    Hi Chris, I just looked at the code and I strange as it might sound, I do think you should sort ascending.
    We could add an option to also do the desc case, but for now, go asc.

    As for the keys being different, that should be no problem. For safety it's best that the data types are identical to avoid implicit data conversion issues.
    However, if the data types of the keys are the same, there is no reason to think it wouldn't work, not even with 3000 columns ;-)

    I do think that version 3 would run ... euh... slightly faster in this situation.

    I'll keep an eye on this thread, let me know what happens.

    All the best,
    Matt

    P.S. Go here to get rid of most if not all bugs in v3.0.0 : http://forums.pentaho.org/showthread...180#post175180
    Last edited by MattCasters; 11-27-2007 at 07:04 AM.

  3. #3

    Default

    Can I tack on another item for merge join -
    When looking for output fields from merge join, I get the message I couldn't find any fields.
    This is something that did not work for me in the past. Now, I am using 2.5.2 and still have the same problem that I get the message I couldn't find any fields. Same happens when I check for output fields in the included sample file which leads me to think this is a bug that hasn't got any attention.
    It would be very helpful to fix this.

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    That issue is fixed in version 3 and is almost unfix-able in version 2.
    And indeed, I'm not giving it that much attention anymore.

    Matt

  5. #5

    Default Ascending in Merge Join

    Hi Matt,

    Ran some test data and ASC seems to work fine. Thanks for looking at this.

    Chris

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.