Sort transformation is a bottleneck



aixendri
03-29-2006, 12:51 AM
Hi all,

In my transformation I have a sort step (I have to sort the records because the next step is a group by) that processes only 28 records/second.

The machine I use has two processors, but I think this step must not be distributed, so that the order stays correct.

Is there some trick to make it run faster?

Thanks in advance,

Albert.

MattCasters
03-29-2006, 12:59 AM
Hi Albert,

What we really should do is allow 2 sorted streams to be merged.
That way we could split the load over 2 CPUs. (if the JVM thinks that is the fastest way to go)
In a sense, I guess you could use the Merge functionality, never really thought of using it for that purpose though.
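
To make the idea of merging 2 sorted streams concrete, here is a minimal Java sketch. It is not Kettle code: the class `SortedMerge` and its `merge` method are hypothetical names, and it assumes each half of the data has already been sorted separately (for example by one copy of the sort step per CPU).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of Kettle.
final class SortedMerge {

    // Merges two inputs that are each already sorted according to 'order'.
    static <T> List<T> merge(List<T> left, List<T> right, Comparator<? super T> order) {
        List<T> out = new ArrayList<>(left.size() + right.size());
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            // "<=" takes the left row first on ties, which keeps the merge stable.
            if (order.compare(left.get(i), right.get(j)) <= 0) {
                out.add(left.get(i++));
            } else {
                out.add(right.get(j++));
            }
        }
        // At most one of the inputs still has rows; append them unchanged.
        while (i < left.size()) out.add(left.get(i++));
        while (j < right.size()) out.add(right.get(j++));
        return out;
    }
}
```

The merge itself is a single pass over both inputs, so the expensive part (the two sorts) is what could be spread over the 2 CPUs.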

What also greatly helps is the code I committed to development yesterday, allowing you to increase the sort buffer. The default is now a puny 5000 rows, which is just awful in a lot of cases. Imagine sorting 5M rows: then you'd have 1000 temporary files open on disk, and that's just stupid really.

So, without further ado, grab that kettle.jar from Documentation / Development packages and update your 2.3.0 or 2.2.2 distribution. I know the build system is down, but I uploaded a new file a few minutes ago.

Thanks for the feedback, please let us know how you're doing!

Matt

aixendri
03-29-2006, 02:08 AM
Hello Matt,

I have got the new kettle.jar and edited my transformation.
About the merge function: I added it, but the validation process told me that it is "not yet implemented", so I deleted it and left the transformation as it was originally.

About the sort, I changed the "max number" from 5000 to 50000, and now we see this behavior in the log:

The speed is now about 44 rec/s, but I think it could be better:
...
2006/03/29 12:36:46 - Java Script Value.0 - linenr 25000
2006/03/29 12:36:46 - Sort rows.0 - Starting quickSort algorithm...
2006/03/29 12:36:47 - Java Script Value.1 - linenr 25000
2006/03/29 12:36:52 - Sort rows.0 - QuickSort algorithm has finished.
2006/03/29 12:47:45 - Sort rows.0 - Linenr 50000
...

The QuickSort seems to have finished at 12:36:52, but all the steps remain locked until 12:47:45, so during these 10 minutes the speed goes down. Is this normal?

About the "max number" in the sort step, I will increase it, because I am now processing about 100000 records but I have to process about 900000 ;-)

I will continue testing it.

Regards,

Albert.

MattCasters
03-29-2006, 02:18 AM
Hi Albert,

The problem is that once the rows are sorted in memory, they need to be stored on disk. This serialisation process is expensive both in terms of CPU and I/O.
Also, unlike database systems, Kettle needs to store the complete row (not just the key), as the row itself is not persisted anywhere; it only lives in memory.
Not only that, but after it saves the rows in temp files on your local disk (probably not too fast either), it needs to read the data back and de-serialize it back into Java objects. This is also costly in CPU and I/O.
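
The cycle described above (sort a buffer in memory, serialize the complete rows to a temp file, then read everything back and merge) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the actual Sort rows implementation: rows are plain strings, `ExternalSortSketch` and all of its members are made-up names, and the result is collected in a list only to keep the example short, where a real step would pass rows on to the next step.

```java
import java.io.*;
import java.nio.file.Files;
import java.util.*;

// Illustrative sketch only; this is not Kettle's Sort rows code.
final class ExternalSortSketch {

    // Sorts 'rows' while holding at most 'bufferSize' unsorted rows in memory.
    static List<String> sort(Iterator<String> rows, int bufferSize) {
        if (bufferSize < 1) throw new IllegalArgumentException("bufferSize must be >= 1");
        try {
            List<File> runs = new ArrayList<>();
            List<String> buffer = new ArrayList<>(bufferSize);
            while (rows.hasNext()) {
                buffer.add(rows.next());
                if (buffer.size() == bufferSize) {   // buffer full: one more temp file
                    runs.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) runs.add(spill(buffer));
            return mergeRuns(runs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sorts one buffer in memory, then serializes every complete row to disk.
    private static File spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        File run = Files.createTempFile("sort-run", ".tmp").toFile();
        run.deleteOnExit();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(run)))) {
            out.writeInt(buffer.size());
            for (String row : buffer) out.writeUTF(row);
        }
        return run;
    }

    // One open reader per temp file, holding the next unread row of that file.
    private static final class Run {
        final DataInputStream in;
        int remaining;
        String head;

        Run(File file) throws IOException {
            in = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
            remaining = in.readInt();
            advance();
        }

        // De-serializes the next row, or sets head to null when the file is exhausted.
        void advance() throws IOException {
            head = remaining-- > 0 ? in.readUTF() : null;
        }
    }

    // Repeatedly takes the smallest head row among all temp files.
    private static List<String> mergeRuns(List<File> files) throws IOException {
        PriorityQueue<Run> heap = new PriorityQueue<>(Comparator.comparing((Run r) -> r.head));
        for (File file : files) {
            Run run = new Run(file);
            if (run.head != null) heap.add(run); else run.in.close();
        }
        List<String> sorted = new ArrayList<>();
        while (!heap.isEmpty()) {
            Run run = heap.poll();
            sorted.add(run.head);
            run.advance();
            if (run.head != null) heap.add(run); else run.in.close();
        }
        return sorted;
    }
}
```

Every row is written once and read once in addition to being sorted, and the merge keeps one reader open per temp file, which is why both the I/O cost and the number of temp files matter.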

The only advice I can give you is to adjust the sort buffer so that you end up with at most 20-30 temp files.
Another lesson learned is that if you source the data from a database, then the data is already stored in there and can be sorted faster by the database itself.
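
The arithmetic behind the 20-30 temp files guideline is simple enough to sketch in Java. The class and method names below are hypothetical, and the sketch assumes that every full sort buffer ends up as exactly one temp file, which matches the "5M rows with a 5000 row buffer gives 1000 files" figure mentioned earlier in the thread.

```java
// Hypothetical helper for illustration only; not part of Kettle.
final class SortBufferSizing {

    // Number of temp files, assuming one file per (possibly partial) buffer.
    static long tempFiles(long totalRows, long bufferRows) {
        return (totalRows + bufferRows - 1) / bufferRows;   // ceiling division
    }

    // Smallest buffer size in rows that keeps the file count at or below maxFiles.
    static long bufferRowsFor(long totalRows, long maxFiles) {
        return (totalRows + maxFiles - 1) / maxFiles;       // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(tempFiles(5_000_000, 5_000));    // prints 1000
        System.out.println(bufferRowsFor(900_000, 30));     // prints 30000
        System.out.println(bufferRowsFor(900_000, 20));     // prints 45000
    }
}
```

Under that assumption, the 900000 records mentioned above would need a sort buffer of roughly 30000 to 45000 rows to stay within 20-30 temp files, provided the JVM has enough memory to hold that many rows.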

Hope this explains it a bit.

All the best,
Matt