PDA

View Full Version : copy only records that reflect changed data



linden
04-20-2009, 12:06 PM
Hello forum,

I need to copy data that is stored in the source system like (year, month, attribute_1, attribute_2):
2001, 1, A, X
2001, 2, A, X
2001, 3, A, Y
2001, 4, A, X
2001, 5, A, X

The destination dataset should only include records that have changed (not including year and month columns) In this example this would result in:
(year, month, attr1, attr2)
2001, 1, A, X
2001, 3, A, Y
2001, 4, A, X

I have no experience using Spoon. Can this be done in a Spoon transformation?

daeschbacher
04-20-2009, 12:27 PM
Your second dataset can be created by querying the first dataset. After you have that query use the 'table input' step to query your database using the query you wrote then use a 'table output' step to write it to a new table.

linden
04-20-2009, 02:47 PM
What would that query look like?

sboden
04-21-2009, 02:17 AM
You want "Change data capture". For the moment (v3.1) there are no official CDC steps available in PDI. They would be pretty database specific as well. So the best you can do is a DIY:

1) Use the database. In DB2 you could DPROPR to keep track of the changes for you, in Oracle with streams, ... use these database tools and query the resulting "change" data
2) Implement your own cdc, but this would require source database changes... e.g. putting triggers on the tables to detect when a certain row has changed.
3) Sandbox your data. Instead of changing your source database, copy the database tables to your own area and every run apply the database changes of the original database.

Regards,
Sven

linden
04-21-2009, 09:18 AM
Thanks for you reply sboden.

It is rather disappointing though, that on my first attempt I have to revert to sollutions outside the ETL tool. That makes it difficult to defend the use of the tool.

sboden
04-21-2009, 09:39 AM
Well... name 1 ETL tool with free CDC functionality... ;)

DataStage e.g. has CDC functionality but as a separately sold license (they bought a company datamirror before which did databases cdc).
Something to do CDC (per database type) could be made in PDI, but those things are very database dependent and pretty labor intensive to make. So no-one found the time or the need to actually create them.

Regards,
Sven

linden
04-21-2009, 10:26 AM
I was hoping an (any) ETL tool would enable me to significantly reduce the blackbox scripting produced in the past. With the first relatively simple scenario I try out, I have to create a script myself, and my (Python) programming is no good at all! Allmost took me half a day, I am ashamed to say.

I now have a script that collects all source data and compares each row in this dataset with the destination table. If a row is missing in the destination data an insert is performed. Could I perhaps use a JavaScript step from within Spoon to do the same? Then at least the code would be accesible and a visible part of the whole process.

sboden
04-21-2009, 11:57 AM
:D ... Every ETL tool I've worked with requires some "black box" scripting at 1 point or the other... usually in different places for different tools.

For what you want to do there are other things out there of course. Closest to CDC in PDI:
a) - Read your target from database sorted to file (most would use a seperate database/table for this).
b) - Read your source with a table input, your target from file. Use a merge row and you get something as a poor men's CDC
c) filter rows and do the correct action... and then there's the "hole" of having delete/inserts on the same key in the same run (which can be problematic as PDI runs transformation steps in parallel).

But what I most of the time see is some kind of sandboxing where the data is duplicated to another database, and each night all changes are applied to it (with the cdc's from the database in e.g. a separate table, maintained by the database tool).

Regards,
Sven

P.S. I don't see an easy way how you could it with javascript.

daeschbacher
04-21-2009, 03:08 PM
Here is the MySQL query for your example:

select t2.Year, t2.Month, t2.attr1, t2.attr2 from changedData as t1 left join changedData as t2 on t1.Year = t2.Year and t1.Month = (t2.Month-1) where ((t1.attr1 != t2.attr1) or (t1.attr2 != t2.attr2))

this returns:

2001, 3, A, Y
2001, 4, A, X

If you want it to include "2001, 1, A, X" simply do a union with a query that returns the earliest entry from the table.

This would be:

select t2.Year, t2.Month, t2.attr1, t2.attr2 from changedData as t1 left join changedData as t2 on t1.Year = t2.Year and t1.Month = (t2.Month-1) where ((t1.attr1 != t2.attr1) or (t1.attr2 != t2.attr2)) UNION (select * from changedData order by Year, Month limit 0, 1) order by Year,Month

I know that is the lazy way to do it but don't really feel like writing it correctly right now.

linden
04-22-2009, 02:58 AM
Thanks daeschbacher (http://forums.pentaho.org/member.php?u=56341)!

I had a feeling the problem could be solved using only SQL but I couldn't figure out how exactly. Forgot how to think in sets.

I can use an SQL statement like this in a Spoon table input step wich would solve my problem

daeschbacher
04-22-2009, 09:18 AM
I just realized the query I gave you won't cross years. There is a solution but I don't have the time to write it out right now.

DEinspanjer
04-25-2009, 09:33 PM
I'm sure this wouldn't cover every use case of CDC, but you could use a Group By step to get the exact output you ask for above.

Group by expects the data to be sorted according to the keys it is grouping on. Whenever it detects a difference in one of the key values from what the previous row's key values were, it generates a new group.

You can use this to your advantage if your data is sorted by year/month but you are grouping on attr1/attr2. Every time one of those values change, it will generate a new group. Then you just output the year/month using the aggregate "First Value".

Attached is an example.

DEinspanjer
04-25-2009, 10:06 PM
The way I normally do something like this in SQL is with an identity column that you can use to control the self join. I've attached a sample of that transformation.

If you know that you'll only have one row per year/month, and you will not have any missing years or months, then you could use this SQL to wrap over the year and skip the identity column:

SELECT *
FROM cdc_test a
LEFT JOIN cdc_test b ON (
(a.year = b.year AND a.month - 1 = b.month)
OR (a.year - 1 = b.year AND a.month = 1 AND b.month = 12)
)
WHERE a.attribute_1 <> b.attribute_1
OR a.attribute_2 <> b.attribute_2
OR b.id IS NULL

MattCasters
04-27-2009, 03:39 AM
Just for completeness I would like to add that I'm working with the folks from SQLStream (http://www.sqlstream.com/) to add their real-time capabilities to PDI. This integration will allow for transformations that never end and give you access to changes the moment they occur. The results of that work should appear in the course of the next couple of months.