Hitachi Vantara Pentaho Community Forums
Results 1 to 14 of 14

Thread: Date of new records with SCD type 2

  1. #1

    Default Date of new records with SCD type 2

    Hi

    I noticed that when I use the 'Dimension lookup/update' step for SCD type 2, the 'Stream Datefield' is used as I'd expect when a new version of a dimension entry is created (End date of previous version = 'Stream Datefield' and Start date of new version = 'Stream Datefield').

    However, when a completely new record (= no previous version existing in the Dimension table) is added by 'Dimension lookup/update', the 'Stream Datefield' is not used, instead, the Start date of the new record is set to what I entered in the 'Min. year' field.

    Is this a bug or am I misunderstanding somerthing? This is 2.5.0.

    Thanks,
    Tobias

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    It's a misunderstanding. What do you think min year field is then for?

    Regards,
    Sven

  3. #3

    Default

    Aktually, I don't know what the 'Min' field is for anymore. What I thought it is for is this:

    It's a lower boundary for new records, for when you first fill a newly created dimension. For dimensions where the 'Start date' of an entry is 'upon detection', you might not know this for preexisting records. For those records that newly arrive _after_ you have created and initally filled the dimension, a current timestamp will be used for 'Start date'. So the 'Min' field is of no importance, after you have done the initial fill of the dimension. Kind of like a 'default' setting in an SQL database.

    But it doesn't seem to work like this, the min field is used for all new records. So basically, if I set Min to 2007, all my records start at January 1st, 2007. No possibility to have them start at 'now' or at June 1st, 2007.

    What I expected is two things. The ability for new records to have 'now as 'Start date', as well as to use a stream field for 'Start date'. The latter is nice since many operational databases use a 'created' field in their tables, which indicates when a record was created in the database. This seems like a perfect fit for 'Start date'.

    Thanks,
    Tobias

  4. #4
    Join Date
    May 2006
    Posts
    4,882

    Default

    From memory that's not how it works. The date in the stream is considered the date one of the values in the row changed (so the end data of the previously valid row). So for a new fresh row, there's no previous row and then the min date is used as start of the range.

    You can try to make a bugtracker for it, but it will most probably be rejected. It would break other people's use.

    Regards,
    Sven

  5. #5

    Default

    Quote Originally Posted by sboden View Post
    You can try to make a bugtracker for it, but it will most probably be rejected. It would break other people's use.
    That's not a problem. To respect the current behaviour, we don't change the behaviour of the 'Stream field', we only rename it to 'Stream Updated Row Start field' or something like that.
    The Min and Max field are changed as follows. A possibility is added to use a field from the stream. If a stream field is used, new records get that as Start/End date. If a 4-digit yyyy value is used, new records get yyyy-01-01 as Start/End date (don't break existing behaviour, although this is now nothing more than convenience functionality).

    Out of interest, if such a change would be accepted, whould it be implemented in 3.0.0? Would it also be backported to 2.5.1?

    Thanks,
    Tobias

  6. #6
    Join Date
    May 2006
    Posts
    4,882

    Default

    That's not a problem. To respect the current behaviour, we don't change the behaviour of the 'Stream field', we only rename it to 'Stream Updated Row Start field' or something like that.
    The Min and Max field are changed as follows. A possibility is added to use a field from the stream. If a stream field is used, new records get that as Start/End date. If a 4-digit yyyy value is used, new records get yyyy-01-01 as Start/End date (don't break existing behaviour, although this is now nothing more than convenience functionality).
    Feels like a car with 5 wheels... but maybe other developers see it otherwise.

    Out of interest, if such a change would be accepted, whould it be implemented in 3.0.0? Would it also be backported to 2.5.1?
    2.5.1 normally not... was to be used as a "fix" only version (although some other things have gotten in as well).
    For 3.0 ... depends on need, availability, and size of the change. My 8-ball says "chances are slim".

  7. #7

    Default

    Quote Originally Posted by sboden View Post
    Feels like a car with 5 wheels... but maybe other developers see it otherwise.
    Well, if you want to retain backwards compatibility and at the same time add conflicting new features, that's what you get.

    Quote Originally Posted by sboden View Post
    2.5.1 normally not... was to be used as a "fix" only version (although some other things have gotten in as well).
    For 3.0 ... depends on need, availability, and size of the change. My 8-ball says "chances are slim".
    So, if I want to have it (and keep in the future) in a production release, I'd have to implement it in 3.0.0 and then backport it to 2.5.0? It's not like it would be a lot of work (I only looked at the code of 2.5.0 so far), but if you can tell me already that it will not be included in the official release, I won't do it.

    Thanks,
    Tobias

  8. #8
    Join Date
    May 2006
    Posts
    4,882

    Default

    So, if I want to have it (and keep in the future) in a production release, I'd have to implement it in 3.0.0 and then backport it to 2.5.0? It's not like it would be a lot of work (I only looked at the code of 2.5.0 so far), but if you can tell me already that it will not be included in the official release, I won't do it.
    Raise a JIRA tracker, check with Matt

    Regards,
    Sven
    Last edited by sboden; 08-16-2007 at 06:28 AM.

  9. #9
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    If it's not broken, why would we fix it? The ]-oo,+oo[ date range was only made configurable to allow SCD to work on certain limited databases.

    On the other question...

    The date range [from_date,to_date[ is not about the time of change, it's about the validity of the record in the dimension.
    And since you don't know the end-date of validity when you create a new version of the dimension entry, you HAVE to put it at +oo (or end of year specified). Otherwise the algorithm wouldn't work as it should. (during fact table updates, you would not find an appropriate key)

    Please note that if you want the "stream date" field (timestamp) to be always present and updated in the dimension, that this is possible. I suggest using the "update last version" option for that one.

    Matt

  10. #10

    Default

    I begin to see where our views of the situation differ.

    What I wanted to do is (mis)use the 'Start Date' used for SCD to also hold the information when a record was created in the operational system. If this is not done, and the 'Start Date' field is used only for SCD, it doesn't matter whether its value is 'some date in the past' or 'exact creation date of operational record'. New fact rows will not be issued before the 'exact creation date' anyway.

    By implementing my change, SCD would still work as supposed. However I now don't think any longer that this mixing of functionality makes sense. I will use an extra field in my DImension to indicate 'creation date'. You can close PDI-52.

    Thanks,
    Tobias

  11. #11
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Hi Tobias,

    I agree that sometimes these things can be subtle.
    However, let me clarify purely from a purely practical viewpoint that you can indeed run into serious trouble if you start mixing up the concepts.

    One particular case of the bad vibes I had was with a (pre-Kettle) customer that had timestamps in the source table. In certain cases (not as rare as I would have liked) the time stamp for date of creation was invalid. That particular date value was all over the place and the ranges started to interfere with the fact loading process in a very negative way. (because we used it to limit the dimension entry validity date range) Soon after that the dimension loading was compromised.
    Slowly but certainly the dimensions and facts became "dirty".

    Unfortunately for us lowly data warehouse designers, bad data quality is the rule, not the exception. It feels like especially the timestamps are subject to error in a world where more and more systems update the same data in the source databases.

    I strongly believe that keeping the 2 subjects apart can protect us (to a certain degree) from data quality mayhem. In fact, if you think that the changed date time stamp is in the least or not always accurate, I would recommend not to use it at all and use the system date instead. (& put the time stamp in a separate field in the dimension entry)

    Thank you for your question & feedback!

    Matt

  12. #12

    Default

    Roth, I ran into the exact same challenge of wanting to use 'now' for start-date, and ended up using the same solution you did - create a field for creation-date to pass to the Step.

    It's reasonable, but if two people identified a problem, might be worthwhile to look at adding a 'now' to the Dimension steps START_DATE -- with the caveat of Locale time issues.

    Edit: re-reading Matt's mention of system date - as in the database system date or the ETL/pentaho system date? Could configure a new 'now' feature to ask the database server for now vs the ETL system.

    This goes under usability for new-comers without impacting expert use.
    Last edited by dhartford; 08-16-2007 at 08:56 AM.

  13. #13

    Default

    timestamp vs date.

    I have no comment. My suggestion was based on date, not timestamp. I'm new to modelling and rely on academic/Kimball information and still building up experience.

  14. #14
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Edit: re-reading Matt's mention of system date - as in the database system date or the ETL/pentaho system date? Could configure a new 'now' feature to ask the database server for now vs the ETL system.
    I don't think it matters all that much. The validity range is going to be slightly off in the best case anyway. I've never seen a use-case where it was relevant.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.