Hitachi Vantara Pentaho Community Forums

Thread: identifying records by content of all fields

  1. #1

    I have a data warehouse which stores customer data and order data. The customer data is fairly stable but does change over time. Customer records are large, so I do not wish to store duplicates. So far, we have addressed this by creating a hash field, calculated in Java across all data fields of the record. This is carried out in the main application, and for various reasons that code is not particularly accessible or directly usable for my current purposes. Has anybody tried a similar operation within Kettle? If so, how? And is there any code available to save me reinventing the wheel?

    Thanks, in advance,
    Tim

  2. #2

    Hi Tim,

    Hash codes are very useful in junk dimensions, for example when you have no unique identifier for the customer information and the customer data is simply attached to each order.
    We do support that in the Combination Lookup/Update step.

    In normal slowly changing dimensions, using a hash is less common practice, although you could consider it if you have a lot of fields in the natural key.
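
    For reference, hashing a record "across all data fields" could look roughly like the sketch below. This is only an illustration under assumptions, not the code from Tim's application and not what the Combination Lookup/Update step does internally: the class name RowHash, the method hashFields, the choice of SHA-256, and the null and length markers are all hypothetical. The fields must always be passed in the same order for the hash to be comparable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: builds one hash from every field of a record.
public class RowHash {

    public static String hashFields(Object... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (Object field : fields) {
                if (field == null) {
                    // Marker byte so a null field differs from an empty string.
                    md.update((byte) 0);
                } else {
                    byte[] bytes = field.toString().getBytes(StandardCharsets.UTF_8);
                    md.update((byte) 1);
                    // Length prefix so ("ab", "c") and ("a", "bc") hash differently.
                    md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                    md.update(bytes);
                }
            }
            // Render the digest as lowercase hex for storage in a text column.
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Made-up customer fields, purely for illustration.
        String hash = hashFields("Tim", "Example Street 1", null, 12345);
        System.out.println(hash);
    }
}
```

    A row whose hash already exists in the dimension can then be treated as a duplicate and skipped. Inside Kettle, logic like this would have to be adapted to one of the scripting steps; how exactly depends on the version you run.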

    Personally, I think it's OK to have a fair amount of duplicated data in a data warehouse. The goal is not to normalize the data; the goal is to make reporting and analysis easier.

    All the best,

    Matt
