Hitachi Vantara Pentaho Community Forums
Results 1 to 9 of 9

Thread: Table Input using the IN () operator

  1. #1
    Join Date
    Feb 2011
    Posts
    4

    Default Table Input using the IN () operator

    I have a step generating a stream of ids, and I want to look up a field in a database using those ids. Since the database table is very big, I do not want to read in the whole table to do a inner join (database lookup), and I also don't want to use database join because pinging the production database for every row puts too much unnecessary load to the server

    Ultimately I want to use the following SQL for a table input step, with ? replaced with a list of (i.e. comma-separated) ids from another step

    Code:
    select field from table1 where id IN (?)
    Any idea how I can achieve this?
    Thanks!
    Eric

  2. #2
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    You could use the DB lookup option, or you could use 'Select field from table1 where id=?' and execute for each input row, or you could combine your values into a single value and use the IN statement

    The last of these is the most difficult to do.
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  3. #3
    Join Date
    Nov 2008
    Posts
    143

    Default

    I Agree with gutlez in the difficulty, but it becomes easy if someone did it first and is willing to pass it on :-D

    Considering the limitations concerning number of values inside an IN clause, I must say that you have to take in mind how many values you plan on using there.
    If you're going to use plenty of values, you need to do some grouping.

    First, you'll need a job (J1) and two transformations (T1, T2).

    J1
    - simply calls T1 and then T1.
    T1
    - data input;
    - add sequence;
    - create groups (divide sequence by 255 and you get a group number
    - group by step 1, concatenates "in values" separated by a comma
    - group by step 2, concatenates values generated in the previous step separated by ") OR field IN ("
    - set the resulting query in a variable

    T2
    - table input with your query (select field from table where id IN (${variable.defined.in.T1})
    - whatever processing you need done

    And presto!

    Anyawys, check the attached transformation, it might give you better understanding of what I meant.

    Cheers,

    Renato
    Attached Files Attached Files

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    I don't think the solution as posted will work...

    It is reading Value1' or Field in 'Value2' or Field in ... 'ValueN' or Field in

    What needs to happen is that fields need to be wrapped correctly, and then connected.

    So Strings would need to be wrapped in ' and then connected by , and then put into the variable. ie. 'Value1','Value2','Value3'

    This would become a problem for maintainability, etc. and not something I would recommend be done.
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  5. #5
    Join Date
    Nov 2008
    Posts
    143

    Default

    I agree with you, but that's just a sample.
    I never said it worked "as is". It just gives an idea.

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I heard about one person that absolutely wanted to do the join in the database. He went as far as creating a string containing the IN statement, one question mark for every input row, in a separate transformation. That is set as a variable. (e.g. IN_STMT=?,?,?,?,?,?,?)
    In the transformation you then use SELECT * FROM foo WHERE bar in ($IN_STMT) and pass in the data.

    It's a little elaborate for my taste but hey, whatever works right? ;-)

  7. #7
    Join Date
    Nov 2008
    Posts
    143

    Default

    How about you give us an idea of what would match your taste, Matt?

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I'm not sure there is an ideal solution.

    There are different possibilities though...

    The data to be joined all comes from the same database

    In that case you keep it in the database and execute one SQL statement.

    The join data to come some text file

    Obviously, SQL was not designed to work with this, hence the IN() clause problem. So I would upload the data that you want to join into an intermediary (staging) table. From there on you can easily join. The other solution I gave above is OK but please remember that IN() statements are really slow performing on most databases. It's better if you can simply join the data.

    If you can't upload any data to a staging table, use the trick I gave earlier.

    The data to be joined comes from multiple database

    In this case it depends on the data volumes. For small data volumes you can use a join step, possibly load some data in memory: "Stream Lookup", "Database Lookup", "Database Join" for example.

    For large volumes you join with a "Merge Join" step. Make sure to have the databases join the data for you. This also works with databases that are notoriously slow for joining, like Progress and certain column stores.

    HTH,

    Matt

  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Matt,

    What is the throughput like on:
    Table Input 1
    Split Stream (Copy)
    1.1 Select Values (Keys Only)
    1.2 Unique Values (Of Keys)
    1.3 Table Input (Which would be the IN clause)
    2.1 Stream Lookup from Split Stream / Table Input (1.3)

    ---------------

    Ignore above.

    Realised after posting that that is a VERY SLOW form of the DB Lookup
    If I want to do the above, I would be better off using the DB Lookup step.
    Last edited by gutlez; 02-09-2011 at 12:40 PM.
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.