Hitachi Vantara Pentaho Community Forums
Results 1 to 3 of 3

Thread: Insert New Rows in Data Warehouse Table

  1. #1
    Join Date
    Apr 2008
    Posts
    123

    Default Insert New Rows in Data Warehouse Table

    What is the best transformation step to build a datawarehouse table with. Specifically, I need to add data each day to a datawarehouse table. Should I use Insert/Update? or Table Output? or............something else? and how do I make sure I am getting all the records since the last entry in the table?

    Thanks - Kinda Rookie, but you are going to save me a lot of time........

  2. #2
    Join Date
    Dec 2005
    Posts
    531

    Default

    Lot of questions and little input. ;-)

    Well, what is a DWH Table? Are we talking dimension tables, fact tables or how do you define a DWH table?

    PDI has two steps specifically for various flavors of dimension tables, the Dimension lookup/update and Combination lookup/update step.

    For fact tables, especially when performance is an issue I would prefere the table output, over the Insert/Update. If you really need to do updates on the fact table, you can determin those changes beforehand and deal with them sepereately. If performance is not so much an issue, Insert/Update is an option, too, but only if you are expecting updates.

    If you have a look in the docs folder of your PDI you'll find a nice paper that should help you getting started: Building-data-warehouses-using-open-source-technologies.pdf Although it uses older screenshots the concepts should still be valid.

    Regarding the last question you'll find some input here: http://forums.pentaho.org/showthread...&highlight=cdc

    I find it easiest to remeber something like a timestamp either when you last ran the transformation or from the source, with which you can determine what's a new record in your source. But you have to make sure, that you really get all records and that there is no overlap.

    Hope this get's you on the right track

  3. #3
    Join Date
    Apr 2008
    Posts
    123

    Default

    The idea of using Table Output works for me. How do I use a DATETIME column to check the existence of the values I am about to insert and block the ones that already exist to prevent duplicates?

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.