
Data vs Metadata : Kettle 3.0



MattCasters
05-16-2007, 10:30 PM
A few weeks ago we started work on Kettle (http://kettle.pentaho.org) 3.0.
One of the notable changes is a redesign (http://forums.pentaho.org/showthread.php?t=53268) of the internal data engine.
More to the point, we’re aiming for a strict separation of data and metadata.
The main reasons for doing so are to reduce object allocation and to let us extend the metadata without a performance impact.
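To make the idea concrete, here is a minimal sketch of what separating data from metadata can look like. The class and method names (RowSketch, FieldMeta, RowLayout, getString) are hypothetical and chosen for illustration only; they are not the Kettle 3.0 API. The point is that field names and types live in one shared object per stream, while each row is just a bare array of values.

```java
import java.util.Arrays;
import java.util.List;

final class RowSketch {
    /** Hypothetical per-field metadata: created once per stream, not once per value. */
    static final class FieldMeta {
        final String name;
        final Class<?> type;

        FieldMeta(String name, Class<?> type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Hypothetical row layout, shared by every row flowing between two steps. */
    static final class RowLayout {
        final List<FieldMeta> fields;

        RowLayout(FieldMeta... fields) {
            this.fields = Arrays.asList(fields);
        }

        /** Returns the position of a field in the row array, or -1 if unknown. */
        int indexOf(String fieldName) {
            for (int i = 0; i < fields.size(); i++) {
                if (fields.get(i).name.equals(fieldName)) {
                    return i;
                }
            }
            return -1;
        }

        /** Interprets a raw row through the shared layout. */
        String getString(Object[] row, String fieldName) {
            return String.valueOf(row[indexOf(fieldName)]);
        }
    }

    public static void main(String[] args) {
        // One layout object describes every row in the stream.
        RowLayout layout = new RowLayout(
                new FieldMeta("id", Long.class),
                new FieldMeta("name", String.class));

        // Each row is a bare Object[]; it carries no names, types or formats.
        Object[] row1 = {1L, "Alice"};
        Object[] row2 = {2L, "Bob"};

        System.out.println(layout.getString(row1, "name"));
        System.out.println(layout.getString(row2, "id"));
    }
}
```

With this layout, adding a new attribute to FieldMeta costs nothing per row, and the only allocation per row is the value array itself.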
We anticipated a performance gain here and there, and initial test code suggested we could hope for a 15-20% increase in performance.
Who could have guessed that reading this file (http://kettle.pentaho.org/svn/Kettle/trunk/experimental_test/org/pentaho/di/run/textfileinput/customers-random-100k.txt), sorting it and writing it back to disk would turn out to be more than 5 times faster? (4.83 seconds!!!) Granted, we are also taking the opportunity to do a full code review, and if we spot a possible performance problem, we fix it too.
We’re also making a library of test-cases to run performance and regression tests against version 2.5 code. The result with comparison (speedup calculation) is posted here (http://kettle.pentaho.org/svn/Kettle/trunk/experimental_test/org/pentaho/di/run/RunResults-Matt-20070516.txt).
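For readers who want to relate "x" factors to percentages, here is a small sketch of how such a speedup could be computed. It assumes the speedup is simply the old run time divided by the new run time; the class name, method names and timings below are hypothetical and are not taken from the posted results file.

```java
final class Speedup {
    /** Speedup factor: how many times faster the new run is than the old one. */
    static double factor(double oldSeconds, double newSeconds) {
        if (oldSeconds <= 0 || newSeconds <= 0) {
            throw new IllegalArgumentException("run times must be positive");
        }
        return oldSeconds / newSeconds;
    }

    /** Converts a factor to "percent faster", so x1.15 corresponds to 15%. */
    static double percentFaster(double factor) {
        return (factor - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical timings, for illustration only.
        double oldSeconds = 10.0;
        double newSeconds = 4.0;

        double f = factor(oldSeconds, newSeconds);
        System.out.println(String.format("x%.2f (%.0f%% faster)", f, percentFaster(f)));
    }
}
```

Under this convention, "15% faster" and "x1.15" describe the same result, and "twice as fast" is x2, or 100% faster.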
One of the nice things about the code changes is that although they will break the API, it’s a Good Thing(TM) for the project in the long run. It will give us breathing room to keep innovating in the future. Speeding up steps by between 15% and 1700% is a good start for the first 20 steps that are converted. It’s also nice to see a lot of test-cases double or triple in speed:

Select Values: x9 (Random select), x6 (Delete field), x5.5 (Change metadata)
Calculations: x1.8 - x3.6
Add sequence: up to x3
Table Output: x1.15 - x2
Add constant values: x5
Filter rows: x1.5
Row generator: x2.5 - x3 (up to 1.2M empty rows/s)
Sort rows: x1.15 - x5
Stream Lookup: x1.24 - x1.36 (and this step already got some serious tuning in the past)
Table Input: x1.2
Text File Input: x1.6 - x3.4
Text File Output: x1.75 - x2.9

On top of that, memory usage should have been reduced as well, although we don’t have any exact figures on that.
Until next time,
Matt



More... (http://www.ibridge.be/?p=49)