Hitachi Vantara Pentaho Community Forums
Results 1 to 14 of 14

Thread: Input from 2 different Database Connections

  1. #1

    Post Input from 2 different Database Connections

    Hi,

    What I am trying to do is get some ID numbers from a database on one connection.
    Then SELECT some rows from a table on another database connection using the values input from the first.

    Essentially I get :
    12345
    23456
    34567
    etc
    from the first connection,
    and then I want to select values from the second connection, which also has an ID field, if the ID is in the results from the first connection.

    FYI - The 2 IDs are both the same datatype and length.

    I realise that I could get the all the data in from both connections and then do a join,
    but I am dealing with millions of rows and that is far to time consuming for this application.

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    You can try to create a temp table in the second database, populated with the list of IDs from the first database, then fire a Join SQL command to be executed by the second database.
    This way, the join operation will be computed by the database and it should be able to deal with millions of rows.
    -- Mick --

  3. #3
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    If you can't create a temp table as Mick suggests, you can "chain" Table Input steps.

    I would recommend running a sort and Unique on the IDs between them (unless your table input is already pulling unique values)

    eg.

    Table Input 1: Select Distinct ID from Table1 (on connection DB#1)
    Table Input 2: Select otherdata1, otherdata2 from Table2 where ID=? (on connection DB#2, Insert Values from step "Table Input 1")


    Note that this *WILL* make a round-trip to DB#2 for EACH row in Table Input 1.
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Shouldn't we know what type of database connection we deal with?
    So long, and thanks for all the fish.

  5. #5

    Default

    Hi,

    I am not sure what step you are using to join get data from another table on the basis of id which you are retreving from first DB conection.
    If both table are in same database try to get it in single "Database input" as below:

    Select a,b,c...from table 2
    where id in (select id from table 1)...

    Definately if you are doing lookup, it will take time to retreve the data.
    Thanks,
    Vishal Kumar

  6. #6

    Post

    Thanks for your responses.
    I tried the solution offered by gutlez and it works.
    But far to slowly for what I need.

    And I wasn't able to figure out the solution offered by mick_data as I am relatively new at this sort of stuff.

    I have come up with another possible solution though.
    Essentially I am getting all of the ID's out of the second connection, concatenating them and enclosing them in brackets so I can use the "in" operator.
    Although this is still giving me an issue.
    The SELECT statement I am trying to use is:

    select * from text_defect_actions
    where defect_number in ?

    defect_number is a column in the table text_defect_actions.
    The ? is input from a java class that adds the brackets onto the string of ID's separated by ",".
    The attached picture shows the transformation I am trying to run.
    I have circled the problematic section in red.
    I know that the rest of the transformation works fine.

    Thanks
    Attached Images Attached Images  

  7. #7
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    The screenshot shows much more detail than we would expect from your problem description.
    Either you are overacting or you didn't tell enough of your story.
    Quote Originally Posted by marabu View Post
    Shouldn't we know what type of database connection we deal with?
    Sybase should support heterogenous queries, so what you described in your opening post essentially could be done with one single Table Input step.
    When assembling a value list to use with the IN operator, watch your statement size limit.
    So long, and thanks for all the fish.

  8. #8

    Default

    The screenshot is the entire transformation I am working on.
    Everything outside of the red circle definitely works when I get all of the data from the database.
    What I am trying to do is reduce the amount of rows I have to get out of that database, because it is very slow.
    FYI it is a JDBC connection using the Vortex Driver (This was set up by someone else).
    The other connection is an ODBC connection to a Sybase Database, which is much quicker.
    So I am trying to find out what records need updating by getting the records that have been changed in the Sybase database, and getting only the corresponding records out of the JDBC databse.

    As for the statement size, worst case scenario it would be 350 characters so I didn't think that would be an issue?

  9. #9
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    VORTEX is just the driver, not the database.
    You don't want to connect to Sybase via ODBC.
    Install the correct JDBC driver with Kettle.
    350 characters in your SELECT statement should cause no worries at all, limits often are 32k or 64k nowadays.
    Such a small set of values makes me wonder, why you should have performance problems anyway.
    If you are sure the SELECT statement is responsible for the poor performance, you should have a look at the execution plan of your query to find out what really happens.
    So long, and thanks for all the fish.

  10. #10
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by IAS_Tyler View Post
    Essentially I am getting all of the ID's out of the second connection, concatenating them and enclosing them in brackets so I can use the "in" operator.
    Although this is still giving me an issue.
    The SELECT statement I am trying to use is:

    select * from text_defect_actions
    where defect_number in ?
    You can't do it that way... If you want to do it that way, you have to use variables, and take on quoting yourself.

    Now then...
    Take as much of the "Sort Rows 3 -> Unique Rows 2 -> Set Field value to Constant -> Group By -> Select Values -> Format" as you can and push it down to your DB in the Sybase Input step.

    For Performance Testing - to figure out where the bottleneck is, take JUST the parts in the red circle, and then feed that to a Text File Output step. See how that does.
    Kettle will slow rows down depending on both upstream and downstream steps, so if the bottleneck is in the Insert / Update step, the table inputs could be being slowed down.

    @marabu Two Table inputs since the "Sybase Input" and "Defect ?? Input" are different DBs
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  11. #11
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Quote Originally Posted by gutlez View Post
    @marabu Two Table inputs since the "Sybase Input" and "Defect ?? Input" are different DBs
    If Sybase doesn't support "heterogenous queries", you are right, gutlez.
    So long, and thanks for all the fish.

  12. #12

    Default

    I have no choice in connections.
    They have been set up by my boss and I don't have access to change them.

    The SELECT statement isn't the performance issue, it's the amount of incoming data.
    So I am trying to use the data input into the Sybase database from another transformation, which takes far less time, to reduce the number of records returned here.
    I can, and have, used a join to get both lots of data out of the first database, but this is far to slow. (10+ Minutes).

    Essentially I am getting from the Sybase all of the IDs (Defect_Numbers) that have changed today.
    Then trying to get the records out of the defect text input step that have an ID which in in the results of the Sybase Query.

    What is boils down to is I get a Syntax error on my 2nd select statement.
    The statement is:

    select * from text_defect_actions
    where defect_number in ?;

    The error is:

    2013/12/03 10:04:59 - defect text input.0 - ERROR (version 4.4.0-stable, build 17588 from 2012-11-21 16.02.21 by buildguy) : Unexpected error
    2013/12/03 10:04:59 - defect text input.0 - ERROR (version 4.4.0-stable, build 17588 from 2012-11-21 16.02.21 by buildguy) : org.pentaho.di.core.exception.KettleDatabaseException:
    2013/12/03 10:04:59 - defect text input.0 - ERROR (version 4.4.0-stable, build 17588 from 2012-11-21 16.02.21 by buildguy) : An error occurred executing SQL:
    2013/12/03 10:04:59 - defect text input.0 - ERROR (version 4.4.0-stable, build 17588 from 2012-11-21 16.02.21 by buildguy) : select * from text_defect_actions
    where defect_number in ?;


    2013/12/03 10:04:59 - defect text input.0 - ERROR (version 4.4.0-stable, build 17588 from 2012-11-21 16.02.21 by buildguy) : executeQuery exception: (pos: 58 '...t_number in ^?;') - syntax error

    And the ? is the string containing the IDs.
    Last edited by IAS_Tyler; 12-02-2013 at 07:28 PM.

  13. #13
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Quote Originally Posted by IAS_Tyler View Post
    select * from text_defect_actions where defect_number in ?;
    The question mark is a placeholder for a value expression in a (soon to be prepared) SQL statement.
    You can't use a SQL variable for a value list, use text substitution instead.
    That's what gutlez already tried to tell you.
    Use a separate transformation to build the value list including paratheses and store it in a variable (Set Variables).
    Use the variable in a different transformation: SELECT * FROM text_defect_actions WHERE defect_number in ${YOUR_VARIABLE_NAME_HERE};
    So long, and thanks for all the fish.

  14. #14

    Default

    I managed to figure that out just before you replied. :P
    Thanks everyone for your help.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.