Hitachi Vantara Pentaho Community Forums

Thread: How to deal with thousands of tables in different connections

  1. #1

    How to deal with thousands of tables in different connections


    I'm dealing with a problem that is very complex (to me). My problem is:

    1. Look up a unique ID from 1000 tables located on different servers (there's a config file listing these MySQL database connections).
    2. After getting these IDs, look up another 2000 tables located on other servers, and match them against the IDs from the first step.
    3. Insert the results back into the second database (they go back to where they came from).

    My first idea was to scan all the tables in the first step, serialize the data into a file, and de-serialize it when processing the second step. Unfortunately, the de-serialization step doesn't support regular expressions, and only one serialization file is allowed, so I can't combine the 1000 serialized files.

    Ideas please; this is very urgent. My boss doubts that Kettle can handle this, but Kettle is the only tool I can use.


    Before the lookup in the first step, we need to consult a directory table that lists where the 1000 tables can be fetched from. The directory table holds the connection information; in other words, until I look it up, I have no idea where to fetch each table from.

    In order to reduce memory consumption, we cannot load all 2000 tables at once in step 2; we have to process them one by one. So the main procedure is:

    1. Get ONE table from the 2000 tables (mentioned in step 2).
    2. Look it up in the directory table to get the connection information.
    3. Using that connection information, look up the target table to get the final records.
    4. Group and apply other operations to these records.

    My harder question is how we can set up the connection without looking up the directory table twice, since a database lookup could be very slow.

    Thanks for your help; this is very urgent.
    Last edited by xiawinter; 03-02-2009 at 04:10 AM. Reason: ideas not complete.

  2. #2


    Hi Samuel,

    I guess your question is a little bit odd, but I'm sure it can be done.
    Since, as you mention, the slowest part of the whole exercise is the remote databases, you would probably be best served by a job that iterates over the various databases and tables.

    There is a sample in "samples/jobs/process all tables" that is similar.
    I guess what is needed on top of that is:

    • Read the database, source and target table, and ID information from another source (you should have that): change "Get list of tables.ktr"
    • Set all the variables you need (table name, field name, database server, port, ...) for both source and target: change "set variables.ktr"
    • Do whatever you need to do as far as retrieving and storing the ID is concerned in "Process one table.ktr". Make sure to create a database connection that works with variables (generic or not).

    That's pretty much all you need to do to get it to work.
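    The variable-based connection in the last bullet can be pictured outside Kettle as plain string substitution: each ${VAR} placeholder in the connection settings is replaced with the value set by the "set variables" transformation. A minimal sketch (the variable names here are examples, not reserved Kettle names):

```python
# Illustration of ${VAR}-style variable expansion in a connection
# setting, similar in spirit to how Kettle resolves variables in a
# variable-based database connection. Names are illustrative only.
import re

def substitute(template, variables):
    # Replace every ${NAME} occurrence with its value from the dict.
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: str(variables[m.group(1)]), template)

url_template = "jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME}"
variables = {"DB_HOST": "db-a.example.com", "DB_PORT": 3306, "DB_NAME": "shard_a"}
print(substitute(url_template, variables))
# jdbc:mysql://db-a.example.com:3306/shard_a
```

    Setting new variable values before each run of "Process one table.ktr" is what lets the same transformation hit a different server and table on every iteration.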


  3. #3


    Matt, thanks very much for your quick reply.

    Yes. The problem is so complex that I have never encountered anything like it before.

    I followed the sample job in Kettle, and I can now iterate over the different connections.

    Two questions have been on my mind for a long time:

    1. Why do we have to add ${VARIABLE} to the node name?
    2. Can a job have several "Copy rows to result" steps in different transformations? Do these steps conflict with each other across transformations?



Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.