Hitachi Vantara Pentaho Community Forums

Thread: correct expectation for SCD update step?

  1. #1

    Default correct expectation for SCD update step?

    Hello,

    I'm trying to see if I can improve the speed of my dimension lookup/update step.

    I've created a transformation that takes data dumped into a text file from my operational system and uses the Dimension lookup/update step to do a Kimball Type 1 update on my entities dimension. This is a list of customers and vendors. Each customer or vendor has a unique integer entity_id, which is the sole primary key for the table.

    The dimension table has 63,245 rows. The input stream is the same size, as I have just been overwriting the table while trying to improve performance. (The table actually has one extra row, the dummy zero-key row that PDI needs.) There are 16 fields including the primary key, entity_id. Most fields are strings (the biggest are 40 characters; most are smaller), four are Integers, and one is a Date.

    The best performance I get on the update step is 39 r/s. This step seems to be the bottleneck, as only one step precedes it: a Select values step that sets the metadata type (Integer, Date, etc.) for each field. At any given moment while the transformation runs, the Select values step is 10,000 rows ahead of the Dimension lookup/update step. I'm guessing that this maximum gap is fixed by the 'Nr of rows in rowset' setting in the transformation settings.

    I have also turned off logging when launching the transformation, hoping that would help.

    My question is whether this rate of 39 r/s is what I should expect or whether there is something I can do to improve the speed. It takes about 50 minutes to complete the punch-through update of 63k rows.

    I am using PDI 3.2.0-RC1 and Postgres 8.3.

    Thank you!

  2. #2
    Join Date
    Feb 2009
    Posts
    296

    Default

    How is your database handling it? It might be the actual bottleneck as the Dimension step just talks to the database...
    And yes: you can expect more.
    Fabian,
    doing ETL with his hands bound on his back

  3. #3

    Default How do I check how my database is handling it?

    Do you mean the SQL that my database is running?

    When I check for the current running query, I see the following:

    UPDATE "public".entities_practice\r

    SET vendor_customer_status = $1\r
    , home_branch = $2\r
    , price_branch_override = $3\r
    , ship_branch_override = $4\r
    , "name" = $5\r
    , address_line_1 = $6\r
    , address_line_2 = $7\r
    , city = $8\r
    , "state" = $9\r
    , zipcode = $10\r
    , country = $11\r
    , phone_number = $12\r
    , website = $13\r
    , inside_salesrep = $14\r
    , outside_salesrep = $15\r
    , outside_salesrep_name = $16\r
    , is_bill_to = $17\r
    , is_ship_to = $18\r
    , bill_to_id = $19\r
    , creation_date = $20\r
    , number_in_chain = $21\r
    , customer_status = $22\r
    , default_price_class = $23\r
    WHERE technical_key = $24

    I'm not sure how the technical_key, version, date_from, and date_to fields are being handled, since they aren't listed in the update statement. Perhaps they are ignored because there is no change?
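
    For reference, this is roughly how I pulled that statement out of the database (a minimal sketch; pg_stat_activity and its current_query column are the standard Postgres 8.3 statistics view, and the filter just skips idle backends):

    SELECT procpid, query_start, current_query
    FROM pg_stat_activity
    WHERE current_query <> '<IDLE>';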

  4. #4

    Default Got it

    Once I thought about that SQL statement for a moment, I realized that since the update statement uses the technical key field in the WHERE clause, it would probably help to define an index on that field.
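
    For anyone else hitting this, the statement I ran was along these lines (a sketch only; the index name here is just illustrative):

    -- index the surrogate key used in the step's WHERE clause
    CREATE INDEX entities_practice_technical_key_idx
        ON "public".entities_practice (technical_key);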

    Voilà! It's averaging 374 r/s now.

    Thanks, Fabian!

  5. #5
    pstoellberger Guest

    Default

    There are two options to improve the performance:

    - Caching
    In PDI 3.2 there is a checkbox "Enable the Cache". Enable that and define the number of cached rows (I think the default is 5,000), then start by twiddling with this cache size. (You will probably run out of memory if you set the cache size too big.)

    - Increase the number of copies of that step
    Right-click on the Dimension lookup/update step > "Change Number of Copies to Start" and twiddle with the value there (I usually start by trying 3 copies).

    good luck

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    With a little bit of luck I can make the cache pre-load work on updates as well in one of the next releases. For tiny SCDs like the one described above, that would be optimal.

  7. #7
    Join Date
    Feb 2009
    Posts
    296

    Default

    I'm not too sure about that. Please consider adding an extra option for the update cache, as there might be a number of jobs running in parallel trying to work on the same dimension.
    That would break if you go with cached updates, right?
    Fabian,
    doing ETL with his hands bound on his back

  8. #8
    Join Date
    Feb 2009
    Posts
    296

    Default

    Quote Originally Posted by Matthew Pugsley View Post
    Do you mean the SQL that my database is running?
    No, I was thinking more along the lines of "How is your database coping?" There might be an I/O or CPU bottleneck on the database side of things.
    And seeing how much the index did for you, I'd go check that database even more. Check whether it is sitting idle ;-)
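
    For a quick look from the database side, something like this shows whether the updates are still doing sequential scans on the dimension table (a rough sketch; pg_stat_user_tables is the standard Postgres statistics view, and the table name is the one from your UPDATE):

    SELECT relname, seq_scan, idx_scan, n_tup_upd
    FROM pg_stat_user_tables
    WHERE relname = 'entities_practice';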
    Fabian,
    doing ETL with his hands bound on his back

  9. #9
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Parallel dimension updates are a challenge in any case. Even without caching there are database connections and transactions to consider. The way I was going to tackle that was by applying a partitioning algorithm across the parallel update copies: during pre-load, each copy would only read its own natural-key partition, making sure the caches don't overlap. That would work just fine.
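
    Conceptually, each parallel copy would pre-load only its own slice of the dimension, something like this (a minimal sketch of the idea, not the actual step implementation; 3 copies assumed, copy 0 shown):

    -- copy 0 of 3 pre-loads only the natural keys assigned to it
    SELECT *
    FROM "public".entities_practice
    WHERE MOD(entity_id, 3) = 0;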
