Hitachi Vantara Pentaho Community Forums

Thread: Dimension lookup/update very slow

  1. #1
    Join Date
    Sep 2011
    Posts
    24

    Question Dimension lookup/update very slow

    Hi,
    I am using the Dimension Lookup/Update step, but the load is very slow.
    The source table has 13,000 records and the load is taking over an hour.
    I am attaching the transformation.
    Any suggestions?

    build version : 5.0.1-stable
    Attached Files
    Last edited by odivon; 12-17-2015 at 10:45 AM.

  2. #2
    Join Date
    Jul 2009
    Posts
    476

    Default

    A few ideas:

    (1) Do you have an index on the dimension table's car_id_cargo column, which is your key field? Each time the step processes a stream row, it has to look for existing database rows in the table based on the key.
    (2) Does your input step only select distinct values? I'm guessing it does, but if not, you might end up processing the same values multiple times.
    (3) Are you sure you want to use "Insert" as your Type of dimension update for denominacao? I think "Punch through" is more likely what you want, but I'm not sure.
    (4) Since your table will have 13,000 rows, you might specify 0 as the cache size (which caches all rows) instead of 5000 in your Dimension Lookup/Update step.
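    Point (1) is easy to see for yourself. A hedged sketch, using Python's built-in sqlite3 as a stand-in for the real database (the table and column names are borrowed from the thread, not the poster's actual schema):

    ```python
    import sqlite3

    # Mock-up of the dimension table from the thread, in an in-memory SQLite DB.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE dim_cargo (car_sk_cargo INTEGER, car_id_cargo INTEGER, denominacao TEXT)")
    cur.executemany("INSERT INTO dim_cargo VALUES (?, ?, ?)",
                    [(i, i, f"cargo {i}") for i in range(13000)])

    # Without an index, every natural-key lookup the step issues is a full table scan.
    plan_before = cur.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM dim_cargo WHERE car_id_cargo = ?", (42,)).fetchall()

    cur.execute("CREATE INDEX idx_dim_cargo_nk ON dim_cargo (car_id_cargo)")

    # With the index, the same lookup becomes an index search.
    plan_after = cur.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM dim_cargo WHERE car_id_cargo = ?", (42,)).fetchall()

    print(plan_before[-1][-1])  # e.g. "SCAN dim_cargo"
    print(plan_after[-1][-1])   # e.g. "SEARCH dim_cargo USING INDEX idx_dim_cargo_nk ..."
    ```

    The exact plan text varies by SQLite version, but the scan-versus-index-search difference is the point: the step fires one of these lookups per stream row.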

  3. #3
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by robj View Post
    A few ideas:

    (1) Do you have an index on the dimension table's car_id_cargo column, which is your key field? Each time the step processes a stream row, it has to look for existing database rows in the table based on the key.
    (2) Does your input step only select distinct values? I'm guessing it does, but if not, you might end up processing the same values multiple times.
    (3) Are you sure you want to use "Insert" as your Type of dimension update for denominacao? I think "Punch through" is more likely what you want, but I'm not sure.
    (4) Since your table will have 13,000 rows, you might specify 0 as the cache size (which caches all rows) instead of 5000 in your Dimension Lookup/Update step.
    ------------------
    Thanks robj,


    (1) There is no index on the table.
    (2) There really are equal values but with different IDs; does that affect anything?
    (3) I need "Insert" because I want to create new versions without changing the previous versions.
    (4) OK, changed.


    Still getting 2 rows per second into the output table, very slow.
    Last edited by odivon; 12-18-2015 at 07:36 AM.

  4. #4
    Join Date
    Jul 2009
    Posts
    476

    Default

    I would start by creating an index on the car_id_cargo column, which is your natural key. You should create another index on the car_sk_cargo column, which you are using as the technical key, particularly because you have "Use table maximum + 1" selected to create the key. (Sequences or auto increment fields are better.)

    Every time a row passes through this step, PDI has to look for a matching natural key (car_id_cargo), which will be faster if you have an index on that column. When it needs to create a new row, PDI will run a select on the table to find the largest technical key (car_sk_cargo), which will also be faster with an index. Or you can change the step to use a sequence or auto increment field.
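    The "table maximum + 1" behavior roughly amounts to the following. A hedged sketch (Python with SQLite standing in for the real DB; the helper function is made up to approximate what the step does per new row):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE dim_cargo (car_sk_cargo INTEGER, car_id_cargo INTEGER)")

    def insert_with_max_plus_one(natural_key):
        # Approximation of "Use table maximum + 1": an extra aggregate query
        # per new row -- a table scan unless car_sk_cargo is indexed.
        (current_max,) = cur.execute("SELECT MAX(car_sk_cargo) FROM dim_cargo").fetchone()
        next_sk = (current_max or 0) + 1
        cur.execute("INSERT INTO dim_cargo VALUES (?, ?)", (next_sk, natural_key))
        return next_sk

    for nk in (100, 200, 300):
        last_sk = insert_with_max_plus_one(nk)

    print(last_sk)  # 3
    # A sequence or auto-increment column lets the database hand out next_sk
    # itself, removing the extra query (and its scan) entirely.
    ```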

  5. #5
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by robj View Post
    I would start by creating an index on the car_id_cargo column, which is your natural key. You should create another index on the car_sk_cargo column, which you are using as the technical key, particularly because you have "Use table maximum + 1" selected to create the key. (Sequences or auto increment fields are better.)

    Every time a row passes through this step, PDI has to look for a matching natural key (car_id_cargo), which will be faster if you have an index on that column. When it needs to create a new row, PDI will run a select on the table to find the largest technical key (car_sk_cargo), which will also be faster with an index. Or you can change the step to use a sequence or auto increment field.
    --------
    thanks again robj,
    I added the indexes and it is still slow; I also changed the field to auto increment.

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    How "close" is PDI to the DB?
    Every Dimension Update row is going to incur 2 round-trips to the DB, and will only update the cache of the dimension as the rows are used. So if it's a long distance (say 10 or 15 ms) to the DB, it adds up quite quickly.

  7. #7
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by gutlez View Post
    How "close" is PDI to the DB?
    Every Dimension Update row is going to incur 2 round-trips to the DB, and will only update the cache of the dimension as the rows are used. So if it's a long distance (say 10 or 15 ms) to the DB, it adds up quite quickly.
    ------
    I didn't quite understand your question. The input table is in the same place, and reading runs at 7,000 r/s while output runs at 2 r/s.
    PS: I am performing an initial load; the target table is empty.
    Thanks gutlez

  8. #8
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Table Input doesn't need to really round-trip to the database... It sends a single query to the DB, and the DB will continue to send data back, as fast as the client can handle it.

    If you are on the PDI server, and you ping the DB server, what is the response time? You can double that time, and add a buffer, and have a pretty good guess as to how long it will take to insert rows into the Dimension.

    Also... Is the bottleneck really, truly at the Dimension Lookup, or is it somewhere else? PDI propagates back-pressure through a transformation: a slow step downstream can back rows up into earlier steps, making one step look like the cause when it's really a different one.
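    The "double the ping and add a buffer" rule of thumb above amounts to simple arithmetic. A hedged back-of-the-envelope sketch (the 1.25 overhead factor is an arbitrary illustrative buffer, and 79 ms is the latency odivon reports later in the thread):

    ```python
    def estimated_load_seconds(rows, ping_ms, round_trips_per_row=2, overhead_factor=1.25):
        """Rough lower bound on load time when every row pays network latency:
        rows x round-trips x latency per trip, padded by a buffer factor."""
        return rows * round_trips_per_row * (ping_ms / 1000.0) * overhead_factor

    # 13,000 rows at 79 ms per round trip:
    minutes = estimated_load_seconds(13000, 79) / 60.0
    print(round(minutes))  # ~43 minutes from latency alone
    ```

    This ignores server-side work entirely, so it's a floor, not a prediction; the point is that per-row round trips dominate at this latency.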

  9. #9
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by gutlez View Post
    Table Input doesn't need to really round-trip to the database... It sends a single query to the DB, and the DB will continue to send data back, as fast as the client can handle it.

    If you are on the PDI server, and you ping the DB server, what is the response time? You can double that time, and add a buffer, and have a pretty good guess as to how long it will take to insert rows into the Dimension.

    Also... Is the bottleneck really, truly at the Dimension Lookup, or is it somewhere else? PDI propagates back-pressure through a transformation: a slow step downstream can back rows up into earlier steps, making one step look like the cause when it's really a different one.
    ----------------
    Hi gutlez,
    The response time to both the source and destination DB is 79 ms. I have attached the execution grid showing the read speed of 9,703 r/s while the destination (Dimension Lookup/Update) runs at 4 r/s. Even if I double the read time, it should reach 4,000 to 5,000 rows per second.
    The bottleneck is the Sort rows step (15 r/s) and especially the Dimension Lookup/Update (4 r/s). I already removed the Sort rows step, but it is just as slow.
    I'm out of ideas here...
    Thanks!
    Attached Images

  10. #10
    Join Date
    Feb 2011
    Posts
    152

    Default

    I too am interested in this as I have only been able to get about 4 r/s using an indexed integer field as the lookup value.

  11. #11
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by dnrickner View Post
    I too am interested in this as I have only been able to get about 4 r/s using an indexed integer field as the lookup value.
    -----------------
    The input table is PostgreSQL and the output table is Informix. I changed the output to MySQL and it is still slow (5 r/s).

  12. #12
    Join Date
    Feb 2011
    Posts
    152

    Default

    I am actually copying from Oracle and going to MSSQL at the same rate as you. I have often wondered if it was simply a driver issue, but since we are talking to different RDBMS systems, it makes me think that it is not a driver issue.

  13. #13
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    It really *REALLY* depends on your configuration.
    There's a minimum of 2 round-trips for Dim Lookup/Update, and depending on configuration, it may make more, or even trigger a full table scan. That can be VERY costly (in terms of speed).

  14. #14
    Join Date
    Feb 2011
    Posts
    152

    Default

    I understand what you are saying about configuration. In my scenario, both PDI and the database are on the same physical server. I have no network between them. I have the lookup field in a clustered index. I do not know what else I can change to speed it up further. I have even tried changing the commit size up to 50,000 and still no performance gain. The database is in simple logging mode too.

    Suggestions?

  15. #15
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Check the query plan that is being generated by your DB. It might not be indexed on the right fields (especially if you are using SCD2), thereby causing table-scans.

    It's also possible that dirty reads could be causing slow updates to the index, which then slows down the lookup.

    Lots of options of things that could be going on. But you shouldn't expect a really high throughput on that step.

    What is your cache set to? Do you have the preload cache configured? Are you getting a lot of updates to the dimension?
    So many variables!

  16. #16
    Join Date
    Sep 2011
    Posts
    24

    Default

    I put the input and output in the same MySQL database, and it still runs at 4 r/s.

    gutlez,
    I've tried with and without the cache; only one column is being updated, but this is the initial load of the dimension.
    Last edited by odivon; 12-23-2015 at 07:48 AM.

  17. #17
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    As I mentioned - there are a LOT of variables that affect the speed at which rows will be processed in the Dimension Update step.
    MySQL is actually one of those variables, as it likes to use page locks, rather than row locks. So when you update one row, it locks the whole page of data, slowing everything else down in the process.

    I can't tell you what speed to expect, but I can tell you not to expect 1000 r/s - even 500 r/s is likely high.

  18. #18
    Join Date
    Sep 2011
    Posts
    24

    Default

    The problem: I have already tested on PostgreSQL, Informix, and MySQL, cached and non-cached, with and without an index, and never get more than 4 r/s. This delay blocks any project.

    gutlez,
    you mentioned a preload cache; what is that, and how do I set it?
    Last edited by odivon; 12-23-2015 at 01:42 PM.

  19. #19
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by odivon View Post
    you mentioned a preload cache, what would that be? as I set?
    In PDI 5.4, in the Dimension Lookup/Update step configuration page, there is an option to cache rows. Directly beneath it, another checkbox: "Pre-load cache" which is only enabled if "Update Dimension" is unchecked.

    In theory, you could use filtering to find rows that do not exist in your dimension, and then only update those rows.
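    That filtering idea can be sketched outside PDI like this. A hedged sketch in plain Python, where sets stand in for a one-time SELECT of the existing car_id_cargo values and for the incoming stream:

    ```python
    # One-time pre-load of the keys already in the dimension
    existing_keys = {10, 20, 30}       # stands in for: SELECT car_id_cargo FROM dim_cargo

    # Incoming stream rows, possibly with repeats
    incoming = [10, 20, 40, 50, 20, 40]

    # Route only genuinely new keys to the slow insert/update path,
    # de-duplicating along the way (see the earlier point about distinct values).
    new_rows, seen = [], set(existing_keys)
    for key in incoming:
        if key not in seen:
            new_rows.append(key)
            seen.add(key)

    print(new_rows)  # [40, 50]
    ```

    Only the two new keys ever reach the expensive path; everything already in the dimension is served from the pre-loaded cache.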

    I've had really good luck with the dimension update, but I worked really closely with my DBA to tune the tables.

  20. #20
    Join Date
    Aug 2011
    Posts
    360

    Default

    Hi,

    What are the steps connected after the Dimension Lookup/Update? Maybe the bottleneck is after the step.
    Moreover, is your dimension supposed to have a lot of entries (or frequently changing, if SCD2)?
    If so, the Dimension Lookup/Update step is not the recommended way to update dimensions, because it's slow compared to other methods.

  21. #21
    Join Date
    Sep 2011
    Posts
    24

    Post

    Quote Originally Posted by gutlez View Post
    In PDI 5.4, in the Dimension Lookup/Update step configuration page, there is an option to cache rows. Directly beneath it, another checkbox: "Pre-load cache" which is only enabled if "Update Dimension" is unchecked.

    In theory, you could use filtering to find rows that do not exist in your dimension, and then only update those rows.

    I've had really good luck with the dimension update, but I worked really closely with my DBA to tune the tables.
    ---------------
    gutlez,
    with "Update Dimension" unchecked and "Pre-load cache" enabled, the load was very fast, 1.8 s at 7,600 r/s, but nothing was written to the target table...
    still trying...

  22. #22
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by Mathias.CH View Post
    Hi,

    What are the steps connected after the Dimension Lookup/Update? Maybe the bottleneck is after the step.
    Moreover, is your dimension supposed to have a lot of entries (or frequently changing, if SCD2)?
    If so, the Dimension Lookup/Update step is not the recommended way to update dimensions, because it's slow compared to other methods.
    --------------
    Hi Mathias.CH, thanks.
    I have attached the transformation steps and the comparison fields (only 1) in the Dimension Lookup/Update.
    Any suggestions?
    Attached Images

  23. #23
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by odivon View Post
    with the "Update Dimension" unchecked
    Which means that you're telling PDI not to update the table.

    Start *REALLY* simple.
    Table Input to Dimension Update.
    Work up from there.

    You were advised that using the "Table Max + 1" configuration will do a table scan - this will destroy your throughput on a standard RDBMS. On an in-memory DB, it might not matter.

    Indexes are CRITICAL to having this work well. As is working closely with your DBA to make sure that the table settings are tuned correctly.
    Last edited by gutlez; 12-23-2015 at 03:25 PM.

  24. #24
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by gutlez View Post
    Which means that you're telling PDI not to update the table.

    Start *REALLY* simple.
    Table Input to Dimension Update.
    Work up from there.
    gutlez thanks,
    I already added an index, and already removed it.
    I already used a sequence with the auto-increment field, and I have also used maximum + 1.
    I enabled the cache, then disabled it.
    Nothing works...
    The input table reads the 13,000 records in 2 s, so I don't think the problem is there.
    No solution...
    -----
    Attached is one more try, kept simple (as requested), with only input and output, using the auto increment field; as you can see, it's still 4 r/s.
    Attached Images

  25. #25
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by odivon View Post
    Attached I put one more try with the simple (as requested) with only input and output and using the auto increment field and as you can see still 4 r/s.
    So work with your DBA to determine what is happening at the DB level.
    Without access to the system, it's really difficult to say what is going on.

    One thing to note: r/s is calculated from the time the transform started. So if it took 20 s for the first row to reach the Dimension Lookup, and 6.4 s to process the 118 rows, you get an r/s of 4.5, but the step is actually processing at about 18 r/s.

    With the way PDI works, Table Input won't return 13,000 rows in 2 s if Dim Lookup is the next step and is causing the problem... It will only return the first 10,000 (the default row set size between steps) before blocking.
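    The r/s arithmetic above works out as follows (a quick sketch; the function name is made up):

    ```python
    def apparent_vs_actual_rps(rows, startup_seconds, processing_seconds):
        """PDI's r/s counter divides by elapsed time since the transform started,
        so a slow ramp-up upstream deflates the number for every step."""
        apparent = rows / (startup_seconds + processing_seconds)
        actual = rows / processing_seconds
        return apparent, actual

    # 118 rows, 20 s before the first row arrives, 6.4 s of actual processing:
    apparent, actual = apparent_vs_actual_rps(118, 20.0, 6.4)
    print(round(apparent, 1), round(actual, 1))  # 4.5 18.4
    ```

    So a step showing 4 r/s in the grid may be processing considerably faster once rows actually start flowing.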

  26. #26
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by gutlez View Post
    So work with your DBA to determine what is happening at the DB level.
    I have done everything possible with the DBA: added indexes, put the input and output tables in the same database, etc.
    Nothing worked; what actually happens is a delay of 1h30m to load 13,000 records, and this blocks any project with PDI.
    I will try another way to handle SCD without using the Dimension Lookup step.

  27. #27

    Default

    Quote Originally Posted by odivon View Post
    I will try another way to handle SCD without using the Dimension Lookup step.
    The easy way to manage a Type II SCD is to include "cargo" in your fact table.

  28. #28
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by emicarnicero View Post
    The easy way to manage a type II SCD is including "cargo" in your fact table
    I didn't understand; what should I do?

  29. #29

    Default

    Hi,

    You can try the Insert/Update step instead of Dimension Lookup, and set auto increment in the table. I think it will work.

    Regards,
    Rushikesh

  30. #30
    Join Date
    Sep 2011
    Posts
    24

    Default

    Quote Originally Posted by RushikeshNasare View Post
    Hi,

    You can try the Insert/Update step instead of Dimension Lookup, and set auto increment in the table. I think it will work.

    Regards,
    Rushikesh
    thanks Rushikesh,

    I don't know if "Insert/Update" solves my problem the way "Dimension Lookup" would, but I ran the test and not much changed: 8 r/s.
    I am installing a local DB on my machine to test, and will then report the result.
