Parallel CSV reader



MattCasters
04-01-2008, 11:40 AM
I almost forgot I wrote the code a while back. Someone asked me about it yesterday, so I dusted the parallel CSV reader code off this morning and here are the results:
http://www.kettle.be/images/parallel-csv-reading.png
This test reads a generated file with 10M customer records (919169988 bytes) in 18.3 seconds, which works out to about 50MB/s. Obviously, my poor laptop disk can't deliver at that speed, so these test results were obtained with the help of the excellent Linux caching system :-)
In any case, the caching system simulates a faster disk subsystem.
On my computer, the system doesn't really scale linearly (especially since, in this case, the OS uses up some CPU power too), but the speedup is noticeable: from 25.8 to 18.3 seconds (about 29% less time, or roughly a 1.4x speedup).
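For anyone who wants to check the numbers, here is a small sketch of the arithmetic in Python. The byte count, row count and timings are taken from the post; treating 25.8 seconds as the single-reader time and 18.3 seconds as the parallel time is my reading of the text, and the helper name throughput_mb_s is made up for illustration.

```python
# Figures quoted in the post.
FILE_BYTES = 919_169_988   # size of the generated customer file
ROWS = 10_000_000          # number of customer records
T_SINGLE = 25.8            # seconds, assumed to be the single-reader run
T_PARALLEL = 18.3          # seconds, assumed to be the parallel run


def throughput_mb_s(n_bytes, seconds):
    """Throughput in decimal megabytes per second."""
    return n_bytes / seconds / 1_000_000


speedup = T_SINGLE / T_PARALLEL          # how many times faster
time_saved = 1 - T_PARALLEL / T_SINGLE   # fraction of wall-clock time saved

print(f"single:   {throughput_mb_s(FILE_BYTES, T_SINGLE):.1f} MB/s")
print(f"parallel: {throughput_mb_s(FILE_BYTES, T_PARALLEL):.1f} MB/s")
print(f"speedup:  {speedup:.2f}x, time saved: {time_saved:.0%}")
```

With two CPUs a perfectly linear result would be a 2x speedup, so 1.4x leaves room that the OS and the rest of the pipeline are eating into.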
The interesting thing is that if you have more CPUs at your disposal (both SMP and clustered setups work) you can probably make it scale to the full extent of your disk speed.
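To give an idea of how a file can be read by several readers at once, here is a minimal sketch in Python. It is not the Kettle code, just one common way to do it, under these assumptions: the file is split into equal byte ranges, each reader owns the lines that start inside its range, and no field contains an embedded newline. The names byte_ranges, read_chunk and parallel_read are made up for the example, and the naive split on commas ignores quoting.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size, n_readers):
    """Split [0, size) into at most n_readers contiguous byte ranges."""
    step = max(1, -(-size // n_readers))  # ceiling division, never 0
    return [(s, min(s + step, size)) for s in range(0, size, step)]


def read_chunk(path, start, end):
    """Return the rows whose first byte lies in [start, end)."""
    rows = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read to the next newline. This skips
            # the tail of a line owned by the previous reader, or just the
            # newline itself if a line happens to begin exactly at start.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            # Naive field split: assumes no quoted commas or newlines.
            rows.append(line.rstrip(b"\r\n").decode("utf-8").split(","))
    return rows


def parallel_read(path, n_readers):
    """Read the file with n_readers workers, keeping the row order."""
    ranges = byte_ranges(os.path.getsize(path), n_readers)
    with ThreadPoolExecutor(max_workers=n_readers) as pool:
        parts = pool.map(lambda r: read_chunk(path, *r), ranges)
        return [row for part in parts for row in part]
```

Threads keep the sketch short, but in CPython they will not spread the parsing work over several CPUs; swapping in a process pool (or separate machines, each given its own byte range) would be the way to get real parallelism out of this layout.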
In the case where lazy conversion is disabled (classical database loading situations come to mind), you can see the read performance increase from around 75k rows/s to around 100k rows/s:
http://www.kettle.be/images/parallel-csv-reading-no-lc.png
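For readers unfamiliar with the term, the idea behind lazy conversion is to keep each field as raw bytes and only convert it to a real data type when something actually asks for the value. The Python sketch below illustrates that idea only; the LazyField class and its attributes are hypothetical and do not mirror the Kettle classes.

```python
class LazyField:
    """Holds the raw bytes of one field and converts them on first access."""

    def __init__(self, raw, convert):
        self.raw = raw            # bytes exactly as read from the file
        self._convert = convert   # e.g. int, float, or a date parser
        self._value = None
        self._done = False

    @property
    def value(self):
        # Convert once, on demand, then reuse the cached result.
        if not self._done:
            self._value = self._convert(self.raw.decode("utf-8"))
            self._done = True
        return self._value


field = LazyField(b"42", int)
# No conversion has happened yet; a step that only passes the field
# along (for example, writing it back to a text file) can use field.raw.
print(field.raw)
# The conversion cost is paid only here, when the typed value is needed.
print(field.value + 1)
```

When every field has to be converted anyway, as in a typical database load, there is nothing to defer, which is why that scenario is measured separately.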
In both scenarios, both CPUs in my machine are 100% utilized (or at least very close to that number), so I have high hopes that this system scales beyond 2 CPUs as well.
You can find this feature in the latest SVN builds (revision 7049 and up) or in the upcoming 3.1.0 release.
Feel free to let us know how you like this new performance enhancement!
Until next time,
Matt


More... (http://www.ibridge.be/?p=101)