Hitachi Vantara Pentaho Community Forums

Thread: Type II dimension unexpected behavior / possible bug

  1. #1
    Join Date: Apr 2012 | Posts: 3

    Type II dimension unexpected behavior / possible bug

    Step: Dimension lookup/update
    Version: 4.2.0

    I'm getting an unexpected result when using a Stream Datefield. The problem is that I'm unable to guarantee the order in which source rows are processed.

    Suppose I have 3 rows of source data processed in this order for the same lookup value:
    1. a row with a date of '2012-01-01'
    2. a row with a date of '2012-01-10'
    3. a row with a date of '2012-01-05'

    When it gets to processing the 3rd row, the step finds the current dimension row whose date_from and date_to straddle 2012-01-05, which is the dimension row created for the first row above. It sets that row's date_to to the new row's date_from, but sets the new row's date_to to 2199-12-31 instead of 2012-01-10. The end result is 2 dimension rows for the same lookup value with overlapping date_from/date_to ranges.
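
    A minimal sketch of what I think is happening (a hypothetical Python simulation, not the actual step code; the field names and the way the first version's date_from gets set are assumptions on my part):

    from datetime import date

    FAR_FUTURE = date(2199, 12, 31)
    dim = []  # dimension rows: {'tk', 'pk', 'date_from', 'date_to'}

    def scd2_update(pk, event_date):
        # Find the version whose [date_from, date_to) straddles the stream date.
        current = next((r for r in dim
                        if r['pk'] == pk
                        and r['date_from'] <= event_date < r['date_to']), None)
        if current:
            current['date_to'] = event_date  # close the straddled version
        # The new version is always opened with the far-future end date,
        # even when a later version already exists -- hence the overlap.
        dim.append({'tk': len(dim) + 1, 'pk': pk,
                    'date_from': event_date, 'date_to': FAR_FUTURE})

    for d in [date(2012, 1, 1), date(2012, 1, 10), date(2012, 1, 5)]:
        scd2_update(pk=1, event_date=d)

    # After the 3rd row: tk=2 covers 2012-01-10..2199-12-31 and
    # tk=3 covers 2012-01-05..2199-12-31 -- overlapping ranges.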

    Thanks,
    Chris

  2. #2
    Join Date: Nov 1999 | Posts: 9,729

    Out-of-order dimension updates are simply not supported.

  3. #3
    Join Date: Apr 2007 | Posts: 2,010

    Yes, that sounds slightly strange. It almost sounds like you already have dimensional data coming in on the stream? I've never come across that; for me each row is always unique in all my dimension update scenarios. An interesting one, this!

  4. #4
    Join Date: Nov 1999 | Posts: 9,729

    It happens a lot with shoddy data quality, for example when you get data from multiple systems with different clocks. In those cases the "date of last update" is often incorrect (or plain wrong, using a different definition of "changed", or whatever). In all these cases it's best to leave the field empty so the system date is used.

  5. #5
    Join Date: Apr 2012 | Posts: 3

    The idea is that we have massive amounts of data, with datetimes going back many years, that would take months to load. For business reasons we wanted to start by loading current data and then slowly back-load the older data. Thanks for the quick response, though. I can work around it if it's something that's just not supported.

  6. #6
    Join Date: Nov 1999 | Posts: 9,729

    I guess you could theoretically insert a new surrogate key between 2 existing versions of the same natural key and split the validity range:

    tk  pk  from        to
    1   1   -oo         +oo

    then

    tk  pk  from        to
    1   1   -oo         2011/12/31
    2   1   2011/12/31  +oo

    then

    tk  pk  from        to
    1   1   -oo         2010/05/14
    3   1   2010/05/14  2011/12/31
    2   1   2011/12/31  +oo


    The difficulty would be to:
    1. Detect that a row needs to be inserted between 2 others.
    2. Modify 3 rows instead of 2.
    Performance might take a serious nosedive unless you want to load complete data sets into memory.
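
    Something like this (a hypothetical in-memory Python sketch of the split; the backfill function and field names are made up for illustration, with date.min/date.max standing in for -oo/+oo):

    from datetime import date

    def backfill(dim, pk, event_date, next_tk):
        # 1. Detect the row whose [date_from, date_to) covers the new date.
        straddled = next(r for r in dim
                         if r['pk'] == pk
                         and r['date_from'] <= event_date < r['date_to'])
        old_to = straddled['date_to']
        straddled['date_to'] = event_date     # shrink the straddled range
        dim.append({'tk': next_tk, 'pk': pk,  # 2. insert the new version
                    'date_from': event_date,  #    inside the old range,
                    'date_to': old_to})       #    keeping its upper bound

    dim = [{'tk': 1, 'pk': 1, 'date_from': date.min, 'date_to': date.max}]
    backfill(dim, pk=1, event_date=date(2011, 12, 31), next_tk=2)
    backfill(dim, pk=1, event_date=date(2010, 5, 14), next_tk=3)
    # dim now matches the third table above, with no overlapping ranges.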

    Then there are obviously also considerations with regard to fact tables, aggregates and so on.

    But... despite everything it still is an interesting idea.
