Hitachi Vantara Pentaho Community Forums
Results 1 to 9 of 9

Thread: How to join constant columns from another step?

  1. #1
    Join Date
    Aug 2016
    Posts
    281

    Default How to join constant columns from another step?

    Is it possible to do something like a "Stream Value Lookup" which adds a handful of fields to a stream, but without looking up keys?

    Basically I can achieve this behavior by adding a dummy "key" to the main stream, then use this key to lookup the fields to be joined into every row. But it would be prettier if there was no need to add this dummy key, and just join the fields into each row without any key/lookup logic.

    Example:

    Let's say a main stream has fields A, B C and we get 2 rows coming in (1,2,3) and (4,5,6).

    A B C
    1 2 3
    4 5 6

    A different step looks up additional fields E, F from some separate source (db or whatever) and joins this row (x, y). The result is shown below.


    A B C D E
    1 2 3 x y
    4 5 6 x y

    This is basically a cartesian product? But I'm not sure if "Jon Rows (cartesian product)" is suitable for large amounts of data and if it needs to receive all rows before doing the join?

  2. #2
    Join Date
    Apr 2008
    Posts
    4,674

    Default

    There is a way to set up the "Join Rows" step for this... But I can't remember the right way.

    Since one data set has only one row, you want to join all the other rows against it. (This will cause the step to only wait on the 1 row, rather than on all the rows). The step documentation (https://wiki.pentaho.com/pages/viewp...pageId=8292150 ) suggests that Merge Join can be more performant.

    If you set up a test set of data (Eg. 10 rows x 3 rows), and alternate the "Main step to read from", you will see the join order switch (3 rows of join from 1st row of 10, 3 rows of join from 2nd, etc vs. 10 rows of join from 1st of 3, 10 rows of join from 2, etc)

  3. #3
    Join Date
    May 2016
    Posts
    272

    Default

    Or you could just use the Merge join step without any condition, or a dummy condition like 1=1. Provided that the second step only has one row (or you set up a Unique rows step previously to get only one row)
    Regards
    OS: Ubuntu 16.04 64 bits
    Java: Openjdk 1.8.0_131
    Pentaho 6.1 CE

  4. #4
    Join Date
    Aug 2016
    Posts
    281

    Default

    I tried the "Merge Join" step. I started with 2 "Generate Rows" steps. One of them only write one row. The other step continously writes a row every second.

    But the Merge Join step didn't output any rows. Also, it gave a warning that input rows needs to be sorted, this is not possible for large amount of rows. I didn't specify any keys and tried the different joins (INNER, LEFT, RIGHT, OUTER).

    The "Join Rows (cartesian product)" on the other hand did work as expected! "Main step to read from" should be the stream with big data", the stream with the single row is not specified in the step config. Now I'm curious how the performance is compared to stream lookup step.
    Last edited by Sparkles; 01-22-2019 at 07:20 AM.

  5. #5
    Join Date
    May 2016
    Posts
    272

    Default

    It seems strange the Merge join step didn't work, I have used it myself other times. Maybe you need to specify the keys for the join, even if they are dummy keys like 1=1. The warning about the input rows needed sorted always shows up, but in this case it's not necessary, since there's no real key to join.
    Regards
    OS: Ubuntu 16.04 64 bits
    Java: Openjdk 1.8.0_131
    Pentaho 6.1 CE

  6. #6
    Join Date
    Aug 2016
    Posts
    281

    Default

    It works if I disable "Never stop generating rows". But if I have 100000 rows, I want row nr 1 to be dispatched immediately, not wait for row nr 100000.

    It seems to me that Merge Join step is even worse than Stream Value Lookup in this case, because you're still polluting the stream with dummy key fields, but in addition the entire stream is effectively blocked until all rows have been read.

    I often find myself in situations where I need to do simple hacks like dummy key fields in order to get the expected behavior
    Last edited by Sparkles; 01-23-2019 at 08:22 AM.

  7. #7
    Join Date
    Apr 2008
    Posts
    4,674

    Default

    Quote Originally Posted by Sparkles View Post
    in order to get the expected behavior
    Considering that Merge Join is supposed to be able to do a number of different match joins, you would have to hold all rows in memory the whole time...
    You can't release any rows until you have all the rows from both sides if you're going to do a proper key-join, since the last row from one side might match the first row from the other side.

    I find that often when a step doesn't work the way I expect it to, I have to step back and look at what it is that the step is doing... at which point it often becomes clear what the "expected behavior" is.

  8. #8
    Join Date
    Aug 2016
    Posts
    281

    Default

    Sorry, I was not accurate in how I described it. What I meant was, I often have to do these simple hacks in order to get the behavior I want. I realize as you point out some of these steps are aimed at doing other things, like the join you describe. But every so often I can't find the right step to do simple things the way I want (without the hacks as mentioned).

  9. #9
    Join Date
    Apr 2008
    Posts
    4,674

    Default

    Quote Originally Posted by Sparkles View Post
    But every so often I can't find the right step to do simple things the way I want (without the hacks as mentioned).
    Yeah, I have the same issue sometimes...

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2017 Pentaho Corporation. All Rights Reserved.