Hitachi Vantara Pentaho Community Forums
Results 1 to 10 of 10

Thread: Joining two streams - 3.2

  1. #1

    Default Joining two streams - 3.2

    I have two streams I'm trying to join

    S1:
    ID | name
    1 | one
    2 | two
    3 | three

    S2:
    ID & Value
    1 | A
    1 | B
    2 | C
    3 | D

    I need:
    ID | Name | Value
    1 | one | A
    1 | one | B
    2 | two | C
    3 | three | D

    I can do this with a DB lookup step, but the db call takes about a second each, and I have about a million rows, is there a Join step that I can do this with? I need to join both streams and where an ID is found multiple times in Stream 2, end up with 2 rows outputted.

    Thanks.

  2. #2
    Join Date
    Sep 2009
    Posts
    810

    Default

    Hi dilbert,

    I'm assuming these tables are in a different DB, so you cannot do a SQL join which is likely to be fastest.

    You could select stream1 and stream2 using the table input step (*important*: order output by ID on both steps) and merge them using the "Merge Join" step specifying S1 first, S2 second

    Make it an INNER if there's always at least one match in S2, or you'd like to drop the S1 row if there isn't
    Make it an LEFT OUTER if all rows from S1 should always pass, joining all applicable S2 rows, or a NULL row if there aren't any matches
    Make it an RIGHT OUTER if all rows from S2 should always pass, joining to all applicable S1 rows, or a NULL row if there aren't any matches
    Make it a FULL OUTER if you want to keep all rows from both S1 and S2, joining NULL rows on any side if there's no match

    Cheers

    Slawo

  3. #3

    Default

    Hi & thanks for the quick response.

    I was using the Merge Join step, both inputs are sorted on the ID column, and I do have the correct number of output rows, however... the value that is added from the second stream is always the last value that shows. Like this:

    I get:
    ID | Name | Value
    1 | one | B <- should be A
    1 | one | B
    2 | two | C
    3 | three | D

  4. #4

    Default

    So with more testing it appears I can get two separate results without changing anything... One time it gives me the correct results and the next it'll be wrong. This is probably why this made it to production...

  5. #5
    Join Date
    Sep 2009
    Posts
    810

    Default

    Hi dilbert,

    I'd try restarting spoon and maybe creating the merge join from scratch. Maybe something got confused on the key definitions. Just to be sure: you are using the ID field as the only join key on both streams, right?

    Which version of 3.2. are you using btw? Have you tried 3.2.5, which is the most recent bugfix release of 3.2, yet?

    Cheers

    Slawo

  6. #6

    Default

    The version of kettle I'm using is 3.2.0. The issue came to our attention from a different box, so I don't think that it's just an error with what is in memory. I'll download the latest version of 3.2 and see if I can reproduce it.

    Yes, the ID field is the only field that is joining both streams. It almost seems like the step is using cache'ing or something to try to do it quicker, but it doesn't work...

  7. #7

    Default

    Where do I get the latest version of 3.2? I looked here and all that is posted is the version I have which is 3.2.0, the : http://sourceforge.net/projects/pent...20Integration/

  8. #8
    Join Date
    Sep 2009
    Posts
    810

    Default

    If you are willing to build it yourself (and got the ant build system around) check it out at svn://source.pentaho.org/svnkettleroot/Kettle/branches/3.2.5 and type ant zip (in other words, build the zip target)

    There's also the possibility to get a 3.2.5 build here:
    http://www.cloud2land.com/pentaho-da...tle-snapshots/

    You'd have to register at the cloud2land blog first, though

    Cheers

    Slawo

  9. #9

    Default

    build 3.2.5 appears to have fixed the issue.

    I think it had to do with a race condition or something coming into the merge step.

    Testing: - 20 runs of the exact same script
    3.2.0 - 9/20 times returned correct results, yes, more than 1/2 it was wrong.
    3.2.5 - 20/20 times returned correct.

    Now I have to roll out 3.2.5 to all of our dev's and prod boxes...

  10. #10
    Join Date
    Sep 2013
    Posts
    28

    Default Joining two Text input files - Merge join?

    Hello

    I am trying to join or to union two text file inputs. I tried it with Merge join FULL OUTER but my result is not really satisfying and I think there is a better way to solve this. The step Stream lookup doesn't get me to my goal either. I attached the .ktr with the two text files.

    My goal and problem 1: merge the two inputs / it needs to include every dataset from both input1 and input2 / eventually the fields maincat_aggr need to be joined in the new field L0+L1 (double entries of maincat_aggr's need to be avoided).

    Problem 2: is there a single step after Merge join (or whatever solution step you have) that joins me the appropriate fields instead of these several concat-steps?

    Problem 3: my Text file output puts every single field in a next row. Why won't a dataset stay in one row? PLUS the datasets are full of spaces. What did I do wrong? Sorry, I am a novice.

    Many thanks for your help!
    Regards, zioso
    Attached Files Attached Files
    pdi-ce-4.4.0-stable
    timezone UTC/GMT -4 Stunde

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.