Hitachi Vantara Pentaho Community Forums

Thread: speed of the dimension lookup component

  1. #1

    Default speed of the dimension lookup component

    When implementing slowly changing dimensions with the above component, I see that it is quite slow when the number of columns involved in the update is relatively large. I track about 60 columns, and the throughput of the component is about 10 rows per second, which on a 4-CPU system I find extremely slow for a table with 50K rows in it (and I seem to cache it all, according to the settings). 600 comparisons per second is not that heavy if done in memory - how can I optimize this? Thanks.

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    There's not really that much to optimize... comparisons take time, you probably can't start up multiple copies, ...

    I don't see an easy fix.

    Regards,
    Sven

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    The quick and dirty fix is to create an option that reads all rows of data from the target table into memory and then works with that.
    There are obviously scalability concerns with this, but perhaps data partitioning could help us there.
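
    Purely as an illustration of that idea, an in-memory cache of the dimension might look roughly like the sketch below (hypothetical code, not what ships in Kettle): preload the dimension rows keyed by their natural key and answer lookups from the map instead of the database.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of an in-memory dimension cache keyed by natural key.
    class DimensionCacheSketch {
        private final Map<String, Object[]> rowsByNaturalKey = new HashMap<String, Object[]>();

        // Load every row of the target dimension table once, up front.
        void preload(List<Object[]> allDimensionRows, int naturalKeyIndex) {
            for (Object[] row : allDimensionRows) {
                rowsByNaturalKey.put(String.valueOf(row[naturalKeyIndex]), row);
            }
        }

        // Later lookups are answered from the map and never hit the database.
        Object[] lookup(String naturalKey) {
            return rowsByNaturalKey.get(naturalKey);
        }
    }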

    I'm thinking about putting that in the 3.1 release somewhere.

    Matt

  4. #4

    Default

    So, how does the lookup work at the moment? I can see that if I don't update the dimension I get very fast lookups (as the id is just one indexed column), which means that the database query time is reasonable. What changes in the case of an update? Is it just a matter of getting the full record back based on the key, with the comparison then done in a loop over all the columns, or is it done via some kind of SQL (in which case I could understand it being slow if you put 60 columns in the where clause...)?

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    We basically look up the correct dimension entry for the supplied date.
    So we get back that record and compare it to the input. Obviously, we let the database do the lookup work.
    For us to be sure we get the correct record back, the database has to do some work.
    I'm not sure there is a way around that.

    What can happen is that the RDBMS gets confused by the many where clauses and fails to pick the correct index (on the natural key) for the lookup.
    I've seen that happen on occasion. Usually, on platforms like Oracle, it takes some "analyze table" statements to get that in order (it depends on the RDBMS).
    However, that should be the exception; usually an index on the natural key(s) is enough to get good performance.

    After the lookup is done, we compare with the input data and then do one of the following:
    • nothing when the values are identical
    • an update of the dimension entry with the corresponding technical/surrogate key
    • an insert for a new version of the dimension entry and an update of the previous dimension entry (to change the date range)
    So depending on the data there might or might not be a lot of work that needs to be done in "update" mode.
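
    For readers following along in the code, here is a minimal Java sketch of that per-row decision; the class and helper names are made up to mirror the description above and are not the actual DimensionLookup implementation:

    import java.util.Arrays;

    // Hypothetical illustration of the three outcomes in "update" mode.
    class ScdDecisionSketch {

        // 'current' is the dimension record found by the natural-key/date lookup,
        // 'incoming' is the input row; both are plain Object arrays here.
        void handleRow(Object[] incoming, Object[] current) {
            if (Arrays.equals(incoming, current)) {
                return;                      // identical: nothing to do
            }
            if (onlyPunchThroughChanged(incoming, current)) {
                updateInPlace(incoming);     // update the existing entry via its surrogate key
            } else {
                closeOldVersion(current);    // new version: adjust the old entry's date range...
                insertNewVersion(incoming);  // ...and insert the new version of the entry
            }
        }

        // Stubs standing in for the real database work.
        boolean onlyPunchThroughChanged(Object[] a, Object[] b) { return false; }
        void updateInPlace(Object[] row) { }
        void closeOldVersion(Object[] row) { }
        void insertNewVersion(Object[] row) { }
    }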

    HTH,

    Matt

  6. #6

    Default

    Well, as I said before, if I don't choose to update the dimension but just look it up, then it works 10 or more times faster. In this case I can safely assume that the lookup itself is not the bottleneck. It's what you do with the record after you fetch it into memory that counts. I had a look at DimensionLookup.java and I can only assume that this (row 376 in the 3.0 GA code): "cmp = v1.compare(valueData1, v2, valueData2);" (or the whole loop) is painfully slow. There must be a way to do that better. For example, hash the values and compare just the hashes of the two records. Correct me if I'm wrong, but we only have to detect a diff on a row basis, not per column: in case of even a single change we would have to insert the whole row anyway (along with punching through the columns that are marked for it, but that doesn't require change detection).
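
    A rough sketch of that hashing idea (entirely hypothetical, and it glosses over the punch-through columns): keep a single hash per dimension row and compare hashes instead of 60 individual values.

    import java.util.Arrays;

    // Hypothetical sketch: keep one hash per dimension row and detect a change by
    // comparing the stored hash with a hash of the incoming tracked values.
    class RowHashSketch {

        // One hash over all tracked column values of a row.
        static int rowHash(Object[] trackedValues) {
            return Arrays.hashCode(trackedValues);
        }

        // 'storedHash' would be an extra column in the dimension table, computed
        // once when the version was written. A collision would hide a real change,
        // so a full compare on equal hashes would still be needed to be safe.
        static boolean rowChanged(Object[] incomingTrackedValues, int storedHash) {
            return rowHash(incomingTrackedValues) != storedHash;
        }
    }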

    P.S. This obviously needs to be profiled more closely, but in case it turns out the actual update/insert is slow, could batch processing be used instead (say, sending records in batches of the commit size setting)?

    Any other comments?

  7. #7

    Default

    Another update. A quick test reveals this: a 10K dimension lookup of a table with 25 columns runs at 690 r/s. A lookup with update on the same table, with no changes in those columns, runs at 60 r/s even though no additional database operations are involved (as there are no changes in the input dimension).

    So I commented out the comparison in DimensionLookup and just put "cmp = 0;", which had almost no effect on the speed. Then I commented out the whole loop, leaving this on top:
    boolean insert=false;
    boolean identical=true;
    boolean punch=false;

    Speed is back up to 460 r/s, which looks much better. If it's not the comparison, then what is so slow in that loop? Back to you guys.

  8. #8
    Join Date
    May 2006
    Posts
    4,882

    Default

    lol ... the loop is not that big... if compare doesn't cause the slowness you should be able to figure out what else is slow

    Regards,
    Sven

  9. #9

    Default

    I did - it's the logRowlevel!!! There is no if (log.isRowLevel()) in front of it, it seems...
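
    For anyone reading along, the fix being described here boils down to wrapping the row-level logging call in a check, roughly like this (a self-contained sketch with stand-in names, not the actual DimensionLookup code):

    // Hypothetical, self-contained illustration of the guard; 'log' and
    // 'logRowlevel' stand in for the step's logger and logging helper.
    class RowLevelLoggingSketch {
        interface LogChannel { boolean isRowLevel(); }

        private final LogChannel log;

        RowLevelLoggingSketch(LogChannel log) { this.log = log; }

        private void logRowlevel(String message) { System.out.println(message); }

        void reportComparison(String fieldName, Object before, Object after) {
            // Guarded like this, both the string concatenation and the call are
            // skipped when row-level logging is switched off; unguarded, they run
            // for every compared value of every row.
            if (log.isRowLevel()) {
                logRowlevel("Field [" + fieldName + "] : [" + before + "] -> [" + after + "]");
            }
        }
    }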

  10. #10
    Join Date
    May 2006
    Posts
    4,882

    Default

    try again

  11. #11

    Default

    Try what exactly?

  12. #12
    Join Date
    May 2006
    Posts
    4,882

    Default

    Guessing... 90% sure it's in the lookup of the fields at the start of the lookup, which, if you have 60, will do a lot of comparisons. That part can probably be rewritten to pre-process the fields... I would suggest you create a JIRA for it.
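
    For what it's worth, the kind of pre-processing meant here might look roughly like the following sketch (entirely hypothetical, not the Kettle code): resolve the position of each field name once, before the first row, and use the cached indexes on every row instead of searching by name each time.

    import java.util.List;

    // Hypothetical sketch of resolving field positions once instead of per row.
    class FieldIndexCacheSketch {
        private int[] fieldIndexes;

        // Called once, before the first row is processed.
        void init(List<String> rowFieldNames, String[] lookupFieldNames) {
            fieldIndexes = new int[lookupFieldNames.length];
            for (int i = 0; i < lookupFieldNames.length; i++) {
                fieldIndexes[i] = rowFieldNames.indexOf(lookupFieldNames[i]); // name search happens only here
            }
        }

        // Called per row: plain array indexing, no name searches.
        Object[] extractLookupValues(Object[] row) {
            Object[] values = new Object[fieldIndexes.length];
            for (int i = 0; i < fieldIndexes.length; i++) {
                values[i] = row[fieldIndexes[i]];
            }
            return values;
        }
    }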

    Regards,
    Sven

  13. #13

    Default

    I'm telling you, it's the logging. I already tested it, and it now works 30 times faster if I just put "if (log.isRowLevel())" in front of logRowlevel... Someone should commit that...
    Last edited by len0x; 12-11-2007 at 07:21 AM.

  14. #14
    Join Date
    May 2006
    Posts
    4,882

    Default

    That I can do in a couple of hours

    Regards,
    Sven

  15. #15

    Default

    Great! Hopefully it's not too late for 3.0.1.

  16. #16
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    nope, it's not ;-)

  17. #17
    Join Date
    Aug 2007
    Posts
    151

    Default So, did it get in??

    Guys,

    After all this discussion, did this fix get in? The dimension lookup/update still looks slow (in PDI 3.1), so?...

    Best regards,


    Fábio
