Hitachi Vantara Pentaho Community Forums
Results 1 to 3 of 3

Thread: How to do a difference of imported data

  1. #1
    Join Date
    May 2013
    Posts
    2

    Default How to do a difference of imported data

    Hello All,

    I'm quite new in Kettle usage and I found it very useful and I want first to thank you for the job done.

    I have a usecase that sounds to me quite standard. I want to load on a regular basis a set of data which represents the latest view of a stock. The thing I would like to do is to get the difference between 2 imports and show what has changed.
    My first idea is to load each time the sotck in the same table with a column importDate (set through a Get System Data).

    For the moment, I'm doing (but it doesn't work) a transformation that :
    1/ Get the latest 2 dates of import : Table input with :
    Code:
    select top 2 importDate  from MySotck
    group by importDate
    order by importDate desc
    2/ Switch in 2 parallels branches to get the data for each date of import and send the date to next step
    3/ Doing a sort and then a Merge Diff (which is really heavy)
    4/ I would like then to export (in file or table) the differences

    I hope I'm clear enough. Here is my transformation which does not work

    To summary my problems:
    - how to send the different dates in parameter to each branch ?
    - how to get the difference only between the two imports?

    Do you have any idea or do you know any example that have the same behaviour?

    Thanks for your help,
    Gerald

  2. #2
    Join Date
    Jul 2009
    Posts
    476

    Default

    The Merge Rows (diff) step could work for you, but why not let the database do the work for you instead? If you have the two import dates from the table input step, then you could follow with a "Database join" step that does something like this:

    Code:
    select stock, price, min(import_date), max(import_date) from your_table where import_date in (?,?)
     group by stock, price
     having count(*)=1
    where the ? parameters are your two import dates to compare. Each row of output will be a stock whose price only appeared in one of the import dates, which means either the stock didn't exist in the other import date, or the price changed. You could add whatever other columns you wanted to track besides price. Having the database do this for you will probably be much faster than the Merge Rows (diff) step.

  3. #3
    Join Date
    May 2013
    Posts
    2

    Default

    Thanks for the quick answer.
    It seems to be a good solution. I'll try that and keep you posted.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.