Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Is there any way to look up data in the different tables?

  1. #1

    Default Is there any way to look up data in the different tables?

    Hi All,

    I've found that database lookup doesn't support variables, so it can only be quired in a table.

    I now have 12 tables each year (tables divided by month), and each table has around 20 million or more records. These tables are stored in the central database.

    My job now is transferring data from the local databases to the central one, but before importing data into any table, I need to make sure the imported data is not in the central database. So I wanna make a database lookup in each month table.

    As you know, I have 12 tables for each year, and I have several years data (at least 3 year), but the data on the local database might have each month data (36 months), though data volume can be very low (might only 4-5 records for some months, and 400-500 for the others, and 400,000 at most in a month).

    I thought I could make a database lookup in each month table, but database lookup doesn't support variables, therefore, I can't find the right table.

    Any solution to this should be greatly appreciated.

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    Also not in v2.5, variables are available in db lookup in v3 but at the moment that version is still in destruction/construction phase.

    I don't see a quick way out.

    Regards,
    Sven

  3. #3

    Default

    ohhh. this is not a good news to me...

    Let's give up database lookup now. Is there any solution to my problem?

    I'd like to make it much more simple:

    to check whether the central database has the data which going to be imported from local database, but the tables on central database has been spited by month, so data are distrusted. Never consider union all tables, which should be 2 large for looking up.

    I thought extracting data from central database (input table supports variables), but the problem here is that even if there's only one record in a month, I have to extract all 20 million data from the central database, which is really time wasting.

    waiting for other solutions, and specially thanks to Sven. Lots of problems got solved by you.
    I don't see a quick way out.
    Quote Originally Posted by sboden View Post
    Also not in v2.5, variables are available in db lookup in v3 but at the moment that version is still in destruction/construction phase.

    I don't see a quick way out.

    Regards,
    Sven

  4. #4
    Join Date
    May 2006
    Posts
    4,882

    Default

    In input table you can use variables... and you can change the query to only extract the ones you need ... use input table and if the data is not too big load it into a stream lookup step.

    Regards,
    Sven

  5. #5

    Default

    Thanks, Sven.

    The problem is that my input table is quite large, 20 million records for each table (I need to check a table each time).

    It might not take much time extract only Key Field, but if I only have one record in a month, I still have to extract 20 million records from the central database.... that's really amazing !

    expecting newer solution...
    Quote Originally Posted by sboden View Post
    In input table you can use variables... and you can change the query to only extract the ones you need ... use input table and if the data is not too big load it into a stream lookup step.

    Regards,
    Sven

  6. #6
    Join Date
    Jun 2007
    Posts
    9

    Default create a journal table

    I think the right approach to your issue if synchronizing databases is not to check whether the data is already in the central db but to KNOW which data was loaded and which has not been yet.
    You can accomplish this by having a journal table for ETL and store there last id/timestamp of loaded records per table. This approach takes some design efforts but imho this is the only fast and scalable way to load data in your case.
    -slava.

  7. #7

    Default

    Quote Originally Posted by slava123 View Post
    I think the right approach to your issue if synchronizing databases is not to check whether the data is already in the central db but to KNOW which data was loaded and which has not been yet.
    You can accomplish this by having a journal table for ETL and store there last id/timestamp of loaded records per table. This approach takes some design efforts but imho this is the only fast and scalable way to load data in your case.
    -slava.
    Thanks for your reply, Slava.

    I don't think we're doing database synchronizing, but to import data from local databases to the central databases. We can't import data directly to the central database is repeated records are not allowed in the central database.

    The volume is too large for the central database, so we split the database by month, and we've already had 3 years (36 tables) data.

    The local database got data from other source. the source data can be distributed from 3 years ago to today, and we should save each month data in each table.

    If we could use database lookup, the easiest e way is to check each record in local DB existing in central database. if not, insert the record into that month table. Now database lookup doesn;t support variables, so I can't find the tables I need.

    I believe Kettle can finish this task with your helps, .

    Thanks for all your quick replies!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.