Hitachi Vantara Pentaho Community Forums
Results 1 to 5 of 5

Thread: Lookup multiple date fields against date dimension

  1. #1
    Join Date
    Dec 2017
    Posts
    10

    Default Lookup multiple date fields against date dimension

    For our FACT table row, we need to look up seven different date fields on the date dimension (~86K rows). What would be the best way to do these multiple look ups against the same dataset ? The default seems to have seven DB look up steps. Thanks in advance for any insights.

    I also found this open issue regarding caching across hops, which could have been helpful.
    http://jira.pentaho.com/browse/PDI-6316

  2. #2
    Join Date
    Jul 2009
    Posts
    476

    Default

    There are at least two ways you can do it, probably more. One is to use multiple Dimension Lookup/Update steps, where each step handles a different date field. Another is to use a Table Input step that selects the dimension key and date value from your date dimension, and then connect that to multiple Stream Lookup steps in your main row stream. Each Stream Lookup step would handle a different date field from your row source.

    I use both methods, and they both work fine. It's hard to say which is more efficient in general, or if the difference is even that significant. If your dates in all of those fields tend to cluster around the same values, such as dates in the recent past or the recent future, then it might be overkill to use a Table Input step/Stream Lookup steps solution, because the Table Input step will have to select *all* of the dates in the date dimension table. On the other hand, it's only doing that select once.

    My suggestion would be to just use multiple Dimension Lookup/Update steps first, and then if they are proven to cause a performance issue that needs to be fixed, then maybe try Table Input with Stream Lookups.

    There is an example of the Stream Lookup step in the PDI samples/transformations folder, called "Stream Lookup - basics.ktr".

  3. #3
    Join Date
    Dec 2017
    Posts
    10

    Default

    Thanks Robj. I will proceed with the 7 lookups against the datedim and see what performance we get. If needed will then try the stream suggestion.
    1. Nice performance optimization hint about the dates clustering around a certain time frame. Thank you. I will check if that is indeed the case and adjust my look up data set.
    2. Couple follow up questions-
    Should I enable cache on the datedim on each of these seven look up steps? It seems it will try to cache 7 sets of the table (sounds like a lot). Or rather leave the caching unchecked?
    What is considered a decent size of rows to cache when doing lookups - 1000, 10k, 50k ?

  4. #4
    Join Date
    Jul 2009
    Posts
    476

    Default

    Generally, if you decide to fiddle with the cache setting, you'd want it to be large enough to hold all of the rows you expect to look up. So if you have a date dimension, and you expect that your data will need to look up dates only from the past year, then a cache of 400 would be enough. If you expect to look up more dates, then increase the cache size.

    On my version of PDI, when you create the Dimension Lookup/Update step, it defaults the cache size to 5000, which is reasonable for most cases. I don't believe the step gobbles up that memory before it actually needs it for the cache, so if it needs fewer than 5000 rows, it won't waste memory.

    I would suggest just designing your transformation to work, and then optimize as needed.

  5. #5
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    You can also normalize the data first, then do your Dim lookup...

    ID | Date1 | Date2 | Date3 | Date4 | Date5 | Date6 | Date7 |
    1 | 2017-12-01 | 2017-12-02 | 2017-12-03 | 2017-12-04 | 2017-12-05 | 2017-12-06 | 2017-12-07

    Becomes:
    ID | DateType | Date
    1 | 1 | 2017-12-01
    1 | 2 | 2017-12-02
    1 | 3 | 2017-12-03
    1 | 4 | 2017-12-04
    1 | 5 | 2017-12-05
    1 | 6 | 2017-12-06
    1 | 7 | 2017-12-07

    Which will increase your probability of hitting your cache at least once

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.