Yesterday, I covered how you can do an initial “replication” of data from MySQL to DynamoDB and how this can improve performance, and save storage space. The follow on question becomes:

That’s Great Nick. But how do I do keep my data up to date?

We’ve got data in our Airline Performance dataset through 31-DEC-2007. I loaded 1 year, all of 2007, for the previous example. What happens when the FAA publishes their 2008 January results, and we’ve loaded the new months worth of data into MySQL?


select count(*) from otp.ontime; 8061223
select count(*) from ontime where FlightDate > ‘2007-12-31′; 605765
select count(*) from ontime where FlightDate DATE ‘2007-12-31′; 605765
In other words, let’s select from MySQL any records whose date is beyond what we have currently (2007-12-31).

select count(*) from FASTER.”ontime”; 8061223
select count(*) from FASTER.”ontime” where “FlightDate” > DATE ‘2007-12-31′; 605765
While the DynamoDB INSERT statement was running, the following SQL was being run on MySQL.

show processlist shows a SQL session with the following SQL:
SELECT * FROM `ontime` WHERE `FlightDate` > DATE ‘2007-12-31′;
A single SQL statement (insert into select * from table where date > last time) has you up to date for reporting! Long term we may look to work with Tungsten to be able to keep our data up to date using replication bin log records but for now, this simple pull based approach.