Hitachi Vantara Pentaho Community Forums

Thread: Read news from a db - Pentaho Kettle.

  1. #1
    Join Date
    Dec 2014
    Posts
    7

    Default Read news from a db - Pentaho Kettle.

    Hi, I'm working with Pentaho and I have a question:

    How can I "read the news" (new and changed records) from a table in a database?
    For example, I copy table "A" from database "1" to table "B" in database "2" with a transformation using the 'Table Input' and 'Table Output' steps. But when I insert a new record or update a record in table "A", I want Pentaho to read those changes and apply them to table "B".

    Does that make sense?

    Thanks

  2. #2
    Join Date
    Jul 2009
    Posts
    476

    Default

    You will need to identify those new or changed rows in the source table ("A") yourself, and create some kind of method to identify how far along you are with processing them into the target table ("B"). You can accomplish this in different ways. One way is to use created_at/updated_at timestamp columns in the source table to identify new/changed rows, and then to create a separate tracking table to record the latest timestamps that you processed, so you don't re-process the same rows over and over. There are other ways, too.
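
    As a rough illustration of that tracking-table idea (the table and column names here are only examples, not something PDI prescribes), the tracking table could be as simple as one row per source table you load:

        -- Hypothetical tracking table, kept in the target database ("2")
        CREATE TABLE load_tracking (
            table_name      VARCHAR(64) PRIMARY KEY,  -- e.g. 'table_a'
            last_updated_at TIMESTAMP NOT NULL        -- latest timestamp already processed
        );

        -- Seed it once, before the first load
        INSERT INTO load_tracking (table_name, last_updated_at)
        VALUES ('table_a', TIMESTAMP '1970-01-01 00:00:00');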

  3. #3
    Join Date
    Dec 2014
    Posts
    7

    Default

    Quote Originally Posted by robj
    You will need to identify those new or changed rows in the source table ("A") yourself, and create some kind of method to identify how far along you are with processing them into the target table ("B"). You can accomplish this in different ways. One way is to use created_at/updated_at timestamp columns in the source table to identify new/changed rows, and then to create a separate tracking table to record the latest timestamps that you processed, so you don't re-process the same rows over and over. There are other ways, too.
    robj, thanks for your answer. I had thought about that option, since the tables in the database already have "created_at" and "updated_at" fields.
    Can you tell me more about the method you are referring to?

    Sorry, I don't speak English very well!
    Thanks!!

  5. #5
    Join Date
    Jul 2009
    Posts
    476

    Default

    If your table "A" in database "1" has created_at and updated_at columns, then your initial load into table "B" in database "2" would find the latest created_at/updated_at timestamp in table "A", record that in a tracking table, probably in database "2", and then load all rows. Each subsequent incremental load, or "news" as you referred to it, would select only the rows with a timestamp greater than the timestamp in your tracking table, load those rows, note the latest timestamp in the incremental load, and update the timestamp in the tracking table.
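
    A minimal sketch of the queries involved, assuming the tracking table is named load_tracking with a last_updated_at column (those names are just examples) and that updated_at is also populated when a row is first created. In PDI you could run the first query in one Table Input step and feed its result into a second Table Input step via "Insert data from step", so the ? below is replaced with the tracked timestamp:

        -- In the target database: read the high-water mark
        SELECT last_updated_at FROM load_tracking WHERE table_name = 'table_a';

        -- In the source database: read only new or changed rows
        SELECT *
          FROM table_a
         WHERE updated_at > ?;

        -- After the load: move the high-water mark forward to the latest
        -- timestamp seen in this run (passed in as a parameter or variable)
        UPDATE load_tracking
           SET last_updated_at = ?
         WHERE table_name = 'table_a';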

    This method has some weaknesses related to the created_at/updated_at values. If your server clock is rolled back into the past for a time synchronization adjustment, that can be a problem. If a transaction creates a row in table "A" but doesn't commit it until much later, you might miss that row, because by the time it becomes visible it already carries an old timestamp. Daylight Saving Time can be a problem if your timestamps use local time without specifying the time zone, especially when you "fall back."

    You might be able to mitigate these weaknesses by selecting all rows in table "A" whose timestamp is greater than or equal to a point slightly earlier than your tracking timestamp. For example, if your tracking timestamp is 2014-12-24 07:52:00, then you might select all rows in table "A" that have a timestamp greater than or equal to 2014-12-23 07:52:00. You'll end up re-processing a lot of rows, but it sounds like that shouldn't change your results.
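
    Under the same assumptions as the sketch above, that overlap only changes the comparison in the incremental query, for example (interval syntax varies between databases, and you could just as well compute the shifted timestamp in PDI before passing it in):

        -- Re-read everything from the 24 hours before the high-water mark,
        -- accepting that some rows will be processed again
        SELECT *
          FROM table_a
         WHERE updated_at >= ? - INTERVAL '1' DAY;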

    If you're only loading data on a daily basis, it might make sense just to load "yesterday's" data.

    One other thing, if you need to capture *changes* to rows in table "A" into table "B", you might want to use the Insert/Update step, even though it is slower.

    Your design really depends on your requirements.

  6. #6
    Join Date
    Dec 2014
    Posts
    7

    Default

    Quote Originally Posted by robj
    If your table "A" in database "1" has created_at and updated_at columns, then your initial load into table "B" in database "2" would find the latest created_at/updated_at timestamp in table "A", record that in a tracking table, probably in database "2", and then load all rows. Each subsequent incremental load, or "news" as you referred to it, would select only the rows with a timestamp greater than the timestamp in your tracking table, load those rows, note the latest timestamp in the incremental load, and update the timestamp in the tracking table.

    This method has some weaknesses related to the created_at/updated_at values. If your server clock is rolled back into the past for a time synchronization adjustment, that can be a problem. If a transaction creates a row in table "A" but doesn't commit it until much later, you might miss that row, because by the time it becomes visible it already carries an old timestamp. Daylight Saving Time can be a problem if your timestamps use local time without specifying the time zone, especially when you "fall back."

    You might be able to mitigate these weaknesses by selecting all rows in table "A" whose timestamp is greater than or equal to a point slightly earlier than your tracking timestamp. For example, if your tracking timestamp is 2014-12-24 07:52:00, then you might select all rows in table "A" that have a timestamp greater than or equal to 2014-12-23 07:52:00. You'll end up re-processing a lot of rows, but it sounds like that shouldn't change your results.

    If you're only loading data on a daily basis, it might make sense just to load "yesterday's" data.

    One other thing, if you need to capture *changes* to rows in table "A" into table "B", you might want to use the Insert/Update step, even though it is slower.

    Your design really depends on your requirements.
    robj, thanks for your answer again.

    Insert/Update is very slow because the tables have many records. So, I get the rows where "updated_at" is greater than the timestamp in "last_updated_table", but how do I tell which rows need to be inserted and which need to be updated? How do I separate those rows and then apply each action? And if a row is an update, which fields do I update?

    thanks!!!

  7. #7
    Join Date
    Jul 2009
    Posts
    476

    Default

    The Insert/Update step is slow because for every row in the stream, it does a select to see if the row already exists in the destination table. If the row exists, then the step updates it. If the row does not exist, then the step inserts it. So the step requires two round trips between the PDI server and the database server, one to check whether the row exists, and the other to apply it. For a small number of rows, this is OK, but if you are processing a lot of rows, then you will notice that it is slow.
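
    Roughly, the SQL the Insert/Update step issues for each stream row looks like this (the key and columns here are hypothetical):

        -- Round trip 1: does this key already exist in the target table?
        SELECT name, updated_at FROM table_b WHERE id = ?;

        -- Round trip 2, if the row was found: update it
        UPDATE table_b SET name = ?, updated_at = ? WHERE id = ?;

        -- Round trip 2, if the row was not found: insert it
        INSERT INTO table_b (id, name, created_at, updated_at)
        VALUES (?, ?, ?, ?);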

    The "Table Output" step is much faster than Insert/Update, but it automatically inserts all of the rows in your stream into the destination table. If some of those rows exist in your destination table already, and you have a table constraint to enforce uniqueness on that key, you will have errors.

    The Bulk Loader steps, which are database-specific, are *MUCH* faster than the Table Output step, but they also automatically insert every row in the stream to the table, and they could have an issue with a constraint on the destination table as well.

    What you decide to do depends on your requirements. One possible idea is to load all of the stream data into a separate staging table in the destination database. Then you could use an Execute SQL Script step against the destination database to run an update statement on the rows that already exist in the final destination table, and an insert statement on the rows that don't exist there yet. There are probably other ways to do this as well.
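
    A sketch of that staging idea, assuming the stream has already been loaded into a staging table named stage_b in the destination database (names are illustrative, and the UPDATE ... FROM form is PostgreSQL-style; other databases use MERGE or a correlated UPDATE instead):

        -- 1) Update rows that already exist in the final table
        UPDATE table_b AS b
           SET name = s.name,
               updated_at = s.updated_at
          FROM stage_b AS s
         WHERE b.id = s.id;

        -- 2) Insert rows that are not in the final table yet
        INSERT INTO table_b (id, name, created_at, updated_at)
        SELECT s.id, s.name, s.created_at, s.updated_at
          FROM stage_b AS s
         WHERE NOT EXISTS (SELECT 1 FROM table_b b WHERE b.id = s.id);

        -- 3) Empty the staging table for the next run
        TRUNCATE TABLE stage_b;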
