PDA

View Full Version : Business Problem - replace a table



kettle_anonymous
04-14-2006, 10:22 AM
Hey all,
Here is a business problem.


A 'main' database contains all the records for MYTABLE.
A 'cache' database, that renames some fields and modifies some of the content of the fields from the 'main' database to be appropriate for a particular application.



The 'cache' version of MYTABLE needs to be relatively high-availability, I can't have it disappear for very long at all.



The problem - the 'cache' version is just that, a cache, and needs to be duplicated (not updated, duplicated to remove records as well) from the main -- including the minor modifications of the content of some fields.



How can you solve this with kettle/spoon transformation design?



Thanks,
-D

MattCasters
04-14-2006, 10:54 AM
Hi D,

First let me say that Kettle is not a replication tool. I guess there are more specialized replication tools on the market that handle this kind of problem better and easier. (although they might be somewhat more expensive ;-))
(Kettle is a data integration tool hence the name-change to Pentaho Data Integration)

But... especially for this purpose the latest dev snapshot does contain a Delete step that allows you to replicate data for a table T as follows:

1) Extract information into table T_NEW
2) Keep table extraction from previous generation (prev day for example) in T_OLD
3) Put both data through the Merge step with T_OLD as reference: this will flag the records as: new, deleted, modified or identical.
4) Filter the rows and send the deleted rows to the Delete step and the other to a Insert/Update step, both agains table T
5) Truncate table T_OLD, copy the data from T_NEW to T_OLD and truncate table T_NEW. (Use a job for that: only do this last step when the previous ran OK)

There you have your replication. Never tested it though. Someone promised me he would write a paper on it, but if anyone else could post the results here, I would be delighted.
The interesting part of it is that this system should be able to keep data in sync between any 2 different database types. :-)
The bad part is that it takes up a lot of resources as we can't use timestamps in the source table to detect deleted records. (because the record is gone)

All the best & good luck!

Matt