Hitachi Vantara Pentaho Community Forums
Results 1 to 13 of 13

Thread: JOINs

  1. #1

    Default JOINs

    Hi,
    I'm trying to do the join of two inputs in the most efficient way.
    It seems that I have three options:
    1) MergeJoin
    The problem with this is that my inputs are not previously sorted so I would have to add one SortRows for each of them.
    I'm afraid that in the case of the large data sets (millions of rows) this would be quite inefficient .
    2) Join Rows
    In this case I have to define conditions that are actually my join attributes.
    But I was wondering what in the case when I have more then one join attributes.
    On the forum I saw something like this:
    <compare>
    <condition>
    <negated>N</negated>
    <conditions>
    <condition>
    <negated>N</negated>
    <leftvalue>key1</leftvalue>
    <function>=</function>
    <rightvalue>fkey1</rightvalue>
    </condition>
    <condition>
    <negated>N</negated>
    <operator>AND</operator>
    <leftvalue>key2</leftvalue>
    <function>=</function>
    <rightvalue>fkey2</rightvalue>
    </condition>
    </conditions>
    </condition>
    </compare>
    Is it possible to have "conditions" with multiple "condition" inside the first "condition". I know it sounds weird but you know what I mean.
    3) StreamLookup
    Well in this case I define the keys according to which to look up (join attributes).
    However, I'm not sure what is the output of this step. Will this be an INNER join or what?
    And as I read somewhere if I don't provide the sorted inputs the step will sort them by itself so it seems exactly the same as the first option,
    right?


    Which of these three you think is the most suitable solution? And some help about the mentioned problems.

    Thanks for help.

  2. #2

    Default

    If anyone has any suggestion or advice, please share.
    Any help would be useful.
    Thanks

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    It's hard to answer since

    1) You didn't let us know what kind of data you're joining, where it is coming from
    2) You didn't let us know how the join should be done (inner, left/right/full-outer, ... a simple lookup perhaps?)
    3) you pasted non-related XML into the question

    HTH,
    Matt

  4. #4

    Default

    Hi Matt,
    Sorry if I was unclear.
    1) Well I assume that data can come from various sources (TableInput, XML, Excel etc.)
    2) Indeed I would prefer INNER joins but I would also like to know what are the best solutions for other (OUTER) joins
    3) The XML I pasted is the one I found on this forum and someone used it inside the JoinRows step as a filter condition, but it looked a little bit strange to me so I wanted to know if something like that is even possible to be in the FilterRows step.

    I hope I made it more clear now.

    Thanks.

  5. #5
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Hi.

    Lookup stream is a inner join.
    New data is appended to every record from stream A that matches with a record on stream B.

    Mick.

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    IF the data that needs to be joined is coming from a database, make sure to sort (if that is needed) on the database, it's faster to do it there.
    And no, for a "Stream Lookup" step, data does not need to be sorted.
    Last edited by MattCasters; 12-06-2011 at 06:59 AM.

  7. #7

    Default

    Thanks Matt and Mick, you helped a lot.

  8. #8
    Join Date
    Sep 2011
    Posts
    190

    Default

    The default behaviour of a stream lookup is an OUTER join, not inner. All stream records are replicated, supplemented by the lookup fields, or the supplied default value.

    Quote Originally Posted by Mick_data View Post
    Hi.

    Lookup stream is a inner join.
    New data is appended to every record from stream A that matches with a record on stream B.

    Mick.

  9. #9

    Default

    So in that case I would have to add a FilterRows step to filter out the ones with the default value for a new added field. That would be reasonable, right?
    Another thing, in the FilterRows step if I don't specify neither send_true_to nor send_false_to would kettle respect the order that I defined with hops? So if I have a hop from FilterRows step to some other step the flow of the rows that fulfill the condition will go in that direction by default, right?

  10. #10
    Join Date
    Jun 2011
    Posts
    102

    Default

    Hi,
    when using Filter rows step you have to define both true and false hop, in your case simply send "false" data to a dummy step.
    HTH,
    Andrea

  11. #11
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Either both steps or none at all.

  12. #12

    Default

    Filter step must be hopped to a true and a false step. The transformation will fail / not work otherwise. Best to do a verify (F11) first before running it and resolve for Error-4 (in red).

  13. #13
    Join Date
    Mar 2006
    Posts
    170

    Default

    Hey dat789,

    I think what Matt was trying to say is you can "in-line" or "short-circuit" a filter step by entering a "condition" like color = blue ... you DO NOT put anything in the 2 fields above that (send true data to step & send false data to step). If you leave both of those blank BUT DO attach the filter step to the next step via a "black" or "regular" hop then ONLY "true" data will flow past that step.

    In other words you set a condition in the filter step and perhaps a name but that's it. The hop will be black NOT Green or Red.

    Hope that helps!

    Kent

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.