Hitachi Vantara Pentaho Community Forums

Thread: Migration

  1. #1
    Join Date
    Jun 2013
    Posts
    4

    Default Migration

    Hello,

    I just downloaded this ETL tool and wanted to learn more about it.
    We have an EDW, and we currently do our historification in PL/SQL, in an Oracle DB.
    It is getting bigger and bigger, and processing everything takes more and more time... I wanted to see whether this kind of tool could do it in an easier/faster way.
    We have 300+ tables to historify; in PL/SQL it is easy to do, but heavy for the system. We have one schema for source tables, one for staging, one for the DWH, and one for datamarts.
    How can each table be historified in an easy way? Are there some examples to follow?

    Thanks in advance for your help.

  2. #2
    Join Date
    Jul 2009
    Posts
    476

    Default

    There are lots of good examples of jobs and transformations in the "samples" subdirectory. The word "historify" is unclear. If you're talking about keeping historical versions of all rows in a table, then I think Oracle has some built-in features to handle that. If you're talking about slowly-changing dimensions in a star schema, then the Dimension Lookup/Update step can handle that. See http://wiki.pentaho.com/display/EAI/...+Lookup-Update.
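
    If it helps to see the idea in SQL terms, here is a rough sketch of what a Type 2 update boils down to (dim_customer, stg_customer, their columns, the sequence, and :load_date are all invented for illustration; NULL-safe comparisons are omitted for brevity):

        -- 1. Close out the current version of rows whose attributes changed.
        UPDATE dim_customer d
        SET    d.valid_to = :load_date, d.current_flag = 'N'
        WHERE  d.current_flag = 'Y'
        AND    EXISTS (SELECT 1 FROM stg_customer s
                       WHERE  s.customer_id = d.customer_id
                       AND   (s.name <> d.name OR s.city <> d.city));

        -- 2. Insert a new current version for changed and brand-new rows.
        INSERT INTO dim_customer (customer_key, customer_id, name, city,
                                  valid_from, valid_to, current_flag)
        SELECT dim_customer_seq.NEXTVAL, s.customer_id, s.name, s.city,
               :load_date, DATE '9999-12-31', 'Y'
        FROM   stg_customer s
        WHERE  NOT EXISTS (SELECT 1 FROM dim_customer d
                           WHERE  d.customer_id = s.customer_id
                           AND    d.current_flag = 'Y');

    The Dimension Lookup/Update step maintains the technical key and the date-range columns for you, so you configure this rather than hand-code it.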

    PDI by itself won't run faster than PL/SQL. In fact, it's more likely that your PL/SQL would run faster, but any method's speed depends on design and tuning. In my opinion, the chief benefit of PDI over PL/SQL is that PDI visualizes the data flows, so they're easier to understand and maintain, especially for a team of people.

    Other people probably have other opinions...

  3. #3
    Join Date
    Jun 2013
    Posts
    4

    Default

    Quote Originally Posted by robj View Post
    There are lots of good examples of jobs and transformations in the "samples" subdirectory. [...]
    Hi, thanks for your answer.
    Let me be clearer. You are right that the good thing with PL/SQL is that we can customize whatever we want, but by now the number of processes is so high that it is not easy to keep a global view.
    We follow the Inmon methodology: we load the DB sources (all tables of each system) into our first layer, in the staging environment we calculate the differences (new rows, updates, or deletes), and in the data warehouse environment we apply the modifications identified in staging.
    From the data warehouse, we build datamarts in a star/snowflake schema.
    I wanted to know if there are some examples that do what we are doing... I think it is pretty basic.
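
    To give an idea, here is a simplified version of the difference we compute for one table (CUSTOMER and its columns are just example names; the real statements are generated per table):

        -- Rows that are new or changed in the fresh source extract.
        INSERT INTO stg_customer_delta (customer_id, name, city, change_type)
        SELECT customer_id, name, city, 'UPSERT'
        FROM  (SELECT customer_id, name, city FROM src_customer
               MINUS
               SELECT customer_id, name, city FROM stg_customer);

        -- Rows that disappeared from the source (other columns stay NULL).
        INSERT INTO stg_customer_delta (customer_id, change_type)
        SELECT customer_id, 'DELETE'
        FROM  (SELECT customer_id FROM stg_customer
               MINUS
               SELECT customer_id FROM src_customer);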

    Thanks for your help.

  4. #4
    Join Date
    Jul 2009
    Posts
    476

    Default

    Which part of the process is causing pain, diffing the data warehouse data or cranking out datamarts? Those are two very different processes.

    PDI has a lot of steps that are well suited to populating dimensions and fact tables a la Kimball. I could see PDI being more useful there, particularly if you have lots of dimension and fact tables, snowflaked dimensions, and/or slowly-changing dimensions. PDI might be *much* better than PL/SQL if the business rules for creating your star schemas need to be flexible and adaptable.

    The versioning in your ODS/staging environments sounds like a more "brain-dead" kind of process, where you just ingest the fresh data, compare with existing data, and apply inserts, updates and deletes while retaining history. Once you figure out this process for one table, you've essentially figured it out for all of them. I did something similar with Postgres and PL/pgsql stored procedures, which are analogous to Oracle and PL/SQL. If the change-capturing logic at this level is really pretty simple, i.e. *any* change to the data is captured the same way, then I'd lean towards keeping the PL/SQL procedures for this part.
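
    As a rough sketch of that generalization (the SRC_/STG_/DLT_ naming convention and all names are invented), a metadata-driven loop can generate the same diff for every table:

        -- Apply the same MINUS-based diff to every source table, driving the
        -- SQL off the data dictionary. Assumes each DLT_ delta table mirrors
        -- the columns of its SRC_ source table.
        DECLARE
          v_cols VARCHAR2(4000);
        BEGIN
          FOR t IN (SELECT table_name FROM user_tables
                    WHERE  table_name LIKE 'SRC\_%' ESCAPE '\') LOOP
            -- Build this table's column list from the dictionary.
            SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY column_id)
            INTO   v_cols
            FROM   user_tab_columns
            WHERE  table_name = t.table_name;

            -- New or changed rows go into the corresponding delta table.
            EXECUTE IMMEDIATE
              'INSERT INTO ' || REPLACE(t.table_name, 'SRC_', 'DLT_') ||
              ' SELECT ' || v_cols || ' FROM ' || t.table_name ||
              ' MINUS SELECT ' || v_cols || ' FROM ' ||
              REPLACE(t.table_name, 'SRC_', 'STG_');
          END LOOP;
        END;
        /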

    Just my virtual 2 cents...

  5. #5
    Join Date
    Jun 2013
    Posts
    4

    Post

    Quote Originally Posted by robj View Post
    Which part of the process is causing pain, diffing the data warehouse data or cranking out datamarts? [...]
    Hi, thanks for your answer.
    What I want to do is the following:
    I am retrieving all the source systems for which we have to keep a history. So for each table I want to:
    - load the data from the source,
    - compare it with the data in staging (maybe with Pentaho we will not need to store the differences in the staging layer?),
    - update/insert the differences in the DWH.
    How can I do this in Pentaho Data Integration?
    Are there some examples or tutorials already available? Which steps should I use to reproduce what we do today?

    Thanks for your help!

  6. #6
    Join Date
    Jul 2009
    Posts
    476

    Default

    I'll mention again that there are lots of good examples of jobs and transformations in the "samples" subdirectory, under your PDI installation directory. The home page for PDI documentation is http://wiki.pentaho.com/display/EAI/...+Documentation. The links under the "Reference" section for "Pentaho Data Integration Steps," http://wiki.pentaho.com/display/EAI/...egration+Steps, and "Pentaho Data Integration Job Entries," http://wiki.pentaho.com/display/EAI/...on+Job+Entries, are what I use the most.

    I still doubt that you'll see any performance benefit from using PDI to maintain your ODS/staging level of the data warehouse, where you are holding historical copies of your data. I used to be an Oracle DBA, and if you have good PL/SQL stored procedures that use dynamic SQL properly, then I don't know that you can do any better. You might use unlogged tables, direct-path loads, checksums and/or other stuff to speed up the PL/SQL-based process of comparing current data vs. your historical information, but even then, the process will inevitably take longer as your history builds up.
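
    As one example of the checksum idea (STANDARD_HASH is available in Oracle 12c and later, with ORA_HASH as an older alternative; the table names are invented), you can compare one hash per row instead of every column individually:

        -- Find changed rows by comparing a single hash per row.
        SELECT s.customer_id
        FROM   src_customer s
        JOIN   stg_customer t ON t.customer_id = s.customer_id
        WHERE  STANDARD_HASH(s.name || '|' || s.city)
            <> STANDARD_HASH(t.name || '|' || t.city);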

    For the second part of the process, where you spin out your historical data into star schema data marts, you can use a last-changed-date column in your historical data to grab changes to update your dimensions, or you can do full loads. Same goes for your fact tables.
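
    For instance (hist_customer, etl_run_log, and their columns are invented names), the incremental pull for a dimension can be as simple as:

        -- Read only rows touched since the last successful dimension load.
        SELECT customer_id, name, city
        FROM   hist_customer
        WHERE  last_changed > (SELECT MAX(last_run_date)
                               FROM   etl_run_log
                               WHERE  job_name = 'LOAD_DIM_CUSTOMER');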

    The documentation page that I listed above has links to books about PDI. I don't own any of them personally, but you might take a look at them. The Pentaho Kettle Solutions book author list includes Matt Casters, who is the principal developer of PDI, and a couple of other guys who have deep experience with the product, and it looks pretty comprehensive. The other books might be useful as well.
