Hitachi Vantara Pentaho Community Forums

Thread: Seeking sanity in an insane dataset

  1. #1
    Join Date
    Dec 2009

    Seeking sanity in an insane dataset

    We have a table which tracks a variety of events in one of our systems and is replicated to our data warehouse. The table averages 25,000 new records and 10,000 updates per day. The complexity comes from a couple of lovely details:

      • The records are transactional but include a foreign key to a smaller dataset
      • The main table currently contains 19,000,000 records
      • There is no primary key or unique set of fields in the main table
      • The updates can impact any record in the main table
      • Records have a created date and a last updated date
      • Records get archived out of the main table based on the foreign key
      • Archived records need to be retained in the data warehouse
      • The data warehouse table currently contains 23,000,000 records

    Making things a little easier, there are several fields in the record that do not get updated, but not enough to define a unique record.

    For some time we have simply been copying over the entire main table; foreign keys that have gone missing from the main table are then added back into the data warehouse table.

    The process is embarrassingly inefficient and growing more so each day.

    Is it me, or is this not really possible to resolve accurately?

    Any suggestions?

  2. #2



    I'm probably about to be publicly stupid here, but why can't you log the last import time somewhere and then just select where created > lastImport or lastUpdated > lastImport?
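    Something like this rough sketch. The table and column names are made up, and sqlite stands in for the real source database purely to show the shape of the delta query:

    ```python
    import sqlite3

    # Hypothetical schema standing in for the real events table.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE events (
        fk INTEGER, payload TEXT, created TEXT, last_updated TEXT)""")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
        (1, "untouched row", "2010-01-01", "2010-01-01"),
        (2, "updated row",   "2010-01-01", "2010-01-05"),
        (3, "new row",       "2010-01-06", "2010-01-06"),
    ])

    # Logged after each successful run, e.g. in a one-row control table.
    last_import = "2010-01-02"

    # Only rows created or touched since the last run: the daily ~35k rows
    # instead of the full 19M.
    delta = conn.execute(
        "SELECT fk, payload FROM events WHERE created > ? OR last_updated > ?",
        (last_import, last_import),
    ).fetchall()
    ```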

    Mark, Cloud2Land

  3. #3
    Join Date
    Mar 2006


    Hi, I too will jump in and no doubt make light of a tough situation, BUT...

    If the main table has audit stamp columns on it, why can't you query the main table for all records with a create or update datetime later than the last time you processed the table? In theory that should give you all the records that need to be loaded into the warehouse, even if you have to join those records via the FK to the smaller dataset table.

    Now all you have to do is figure out whether you are inserting or updating these records. I'm hoping that new records would have an update timestamp equal to their create timestamp (or no update timestamp at all), while updated records would have an update timestamp later than the create timestamp; that is how I would tell new from existing.
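    Roughly what I mean, with made-up delta rows, and assuming new rows arrive with last_updated equal to created while updates bump only last_updated:

    ```python
    # Hypothetical delta pulled since the last run.
    delta = [
        {"fk": 2, "created": "2010-01-01", "last_updated": "2010-01-05"},
        {"fk": 3, "created": "2010-01-06", "last_updated": "2010-01-06"},
    ]

    # New rows: never touched since creation -> insert into the warehouse.
    inserts = [r for r in delta if r["last_updated"] == r["created"]]
    # Touched rows: updated after creation -> update the existing row.
    updates = [r for r in delta if r["last_updated"] > r["created"]]
    ```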

    Now, a bigger question: is the DBA who allowed or designed the main table with no PK still working there? If so, time to raise hell!

    My 2 cents and no doubt over simplified.



  4. #4
    Join Date
    Dec 2009


    I am a big fan of making light of this particular situation, so feel free to do so.

    We are pushing for a primary key, but I do not think it will happen any time soon.

    The difficulty now (we just got the update date field added) is not so much finding the new/updated records in the main table; it is identifying the original record in the data warehouse table. The table allows exact duplicates, although for the life of me I cannot see how they accomplish anything. The main performance issue is finding the foreign keys that have been deleted from the main table, and it gets worse the more data we retain.
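    To make the problem concrete, the deleted-key check boils down to a set difference over the distinct foreign keys on each side (the key values below are purely illustrative); pulling just the distinct keys is far cheaper than comparing full rows:

    ```python
    # Illustrative key values only.
    main_fks = {101, 102, 104}                 # SELECT DISTINCT fk FROM main table
    warehouse_fks = {101, 102, 103, 104, 105}  # SELECT DISTINCT fk FROM warehouse table

    # Keys present in the warehouse but gone from the main table,
    # i.e. the ones that were archived.
    archived = warehouse_fks - main_fks
    ```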

    Does anyone have a suggestion about a Pentaho step that would be more efficient than a database query on this volume of data?


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.