Hitachi Vantara Pentaho Community Forums
Results 1 to 9 of 9

Thread: Dimension lookup of SCD type 2 performing slow while loading RDS table.

  1. #1
    Join Date
    Mar 2014
    Posts
    181

    Default Dimension lookup of SCD type 2 performing slow while loading RDS table.

    Hello all,

    I am using Pentaho 5.0.1 community edition and I have table input step to read all the data from a local postgres database at the company headquaters and the next step is a dimension lookup/update step performing SCD type 2 where the order status and shipping status changes, inserts the data into a postgres RDS instance..

    To read 300,000 records of data from the local company database, and insert into an amazon RDS instance, it takes almost 2 hours.

    I have attached the DDL of the table in amazon RDS instance and the Kettle transformation. Also attached are the logs for test run I did for 2000 records.

    How can I make this better to improve the performance?

    What suggestions do you give me?

    Thanks,

    Ron
    Attached Files Attached Files
    Last edited by Ron256; 05-29-2014 at 07:38 PM. Reason: Attaching log file

  2. #2
    Join Date
    Apr 2008
    Posts
    4,687

    Default

    Some initial feedback:
    1) Turn down the logging. Logging slows your transform down as it writes to the logging constantly.
    2) Simplify your SQL - you appear to be doing casts of casts of subselects... This will slow your data send down.
    3) Put PDI as close to your destination DB as you can.


    I'm sure that others may have more concrete hints... This is just general feedback.

  3. #3
    Join Date
    Mar 2014
    Posts
    181

    Default

    Quote Originally Posted by gutlez View Post
    Some initial feedback:
    1) Turn down the logging. Logging slows your transform down as it writes to the logging constantly.
    2) Simplify your SQL - you appear to be doing casts of casts of subselects... This will slow your data send down.
    3) Put PDI as close to your destination DB as you can.


    I'm sure that others may have more concrete hints... This is just general feedback.

    I just enabled logging for the purpose of showing you what is going on, but when i ran it, logging was on basic or minimal.

    Thanks for the quick reply.

    Thanks,

    Ron

  4. #4
    Join Date
    Jul 2009
    Posts
    474

    Default

    I looked at your attachments. It looks like you are doing SCD Type 2 updates when the order status or shipping status changes, but if anything else changes, you just update the current row. That's OK, just conveying my understanding of what you are doing.

    The 2 hours it takes for 300,000 records translates into 41 records per second. That's not very good, but network latency might be slowing you down. For each row, the Dimension Lookup/Update step has to query the database to see if the table has a row with the dimension key, and then it inserts or updates the table, as appropriate, with the row from your stream. That's two round trips between your PDI server and Amazon RDS per row, even if you commit infrequently, so 41 records per second is 82 round trips per second, or 164 network trips per second, which is roughly 6 milliseconds per network trip. 6 ms sounds reasonable, although I'm not a networking expert, so you could be network-bound.

    Try running your transformation with a local Postgres database on your PDI box, and see how long it takes. I'm betting it will be much faster. That would at least tell you the problem, although not the solution.

    The other thing I noticed is that your index starts with the dimension key, src_ordsgsku_id, which is good, but it doesn't include your version field, order_version. I *think* (but am not sure) that when the dimension lookup first checks to see if the dimension row exists, it looks for the latest version only. So if you have an order that goes through 20 status changes (probably an exaggeration), and send a new row to the Dim Lookup Step, it will query Postgres for the last matching row. Postgres will then use its index, but it will grab all 20 rows from the index and then have to examine them to find the latest version. So your PG server is doing many times more I/O than it needs to. If your index has order_version field in second position, then I *think* your database I/O will go down. However, the network latency issue is probably much more significant than this one.

    HTH

  5. #5

    Default

    You could also try to perform the dimension update with an H2 embedded database. Very very fast. The main drawback of H2 is that you can't cluster PDI--only one JVM can use the database at a time.

    At our company we populate large SCD dimensions in a two-step process. We build a staging dimension using a fast, low-latency database backend, then copy new rows from this dimension to the data warehouse (the "real" dimension table) in a separate transformation. The main reason we do this is that our data warehouse backend, while excellent at data aggregation and analytic queries, is useless for fast, low-latency single-row lookups.

  6. #6
    Join Date
    Mar 2014
    Posts
    181

    Default

    Hi Robj,

    I actually see what you are saying. Thanks for your input. BTW the lines I run my transformations with a local postgress database within the local PDI installation where I work. It took 30 minutes to load the 300,000 records of data into my PSA.

    I will also go ahead and create an index on the order_version field then run it again to see the performance.

    Thanks,

    Ron

  7. #7
    Join Date
    Mar 2014
    Posts
    181

    Default

    Jsturm,

    Can you please throw some steps on how the H2 embedded database works with the dimension update step.
    Is it possible to give me some hints using my case as an example?

    Thanks,

    Ron

  8. #8
    Join Date
    Apr 2008
    Posts
    4,687

    Default

    Local DB also meets with suggestion #3 from my post...
    If you put PDI (or Carte) on your RDS node, directly next to the DB, it may help with performance as well.

  9. #9
    Join Date
    Mar 2014
    Posts
    181

    Default

    Hi Gutlez,

    Yes, it does!

    Thanks,

    Ron

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.