View Full Version : Why is spoon so slow?

02-13-2006, 04:45 AM
Attachment: Graphic_view.gif (http://forums.pentaho.org/archived_att/kettle/Graphic_view.gif) I have created a transformation like the one in the attached picture. It processes ca. 50000 rows from the main table (bottom left). The whole transformation takes ca. 2 hours to complete ... can that be right?
Is kettle 2.2.2 really that slow?

02-13-2006, 05:30 AM
I have seen configurations in which Spoon handles this amount of records in a single second.

In this case I really suggest you add a couple of indexes to speed up the JOIN steps.
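To illustrate why indexes on the join columns matter (a toy SQLite sketch, not Kettle itself; the table and index names are made up), an index turns a full-table scan per lookup into an index search:

```python
import sqlite3

# Toy example: show how an index changes the query plan for a join key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO dim VALUES (?, ?)",
                 [(i, f"name{i}") for i in range(10000)])

# Without an index: the plan scans the whole table for every lookup.
detail_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM dim WHERE id = ?", (42,)
).fetchall()[0][3]
print(detail_before)  # e.g. "SCAN dim"

# With an index on the join key: the plan searches the index instead.
conn.execute("CREATE INDEX idx_dim_id ON dim (id)")
detail_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM dim WHERE id = ?", (42,)
).fetchall()[0][3]
print(detail_after)  # e.g. "SEARCH dim USING INDEX idx_dim_id (id=?)"
```

The same principle applies to the database behind Kettle's JOIN steps: without an index, every probe is a scan.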
Look at the last column of the log view while the transformation is running.
The three figures shown are:
- priority
- number of records in the input buffer
- number of records in the output buffer

So it makes sense to look at steps with a lot of input and little output.
These are the steps you need to optimise.



02-13-2006, 12:04 PM
Did you set the commit size parameter in your output step to something other than zero?

If not, try setting the commit size to a few hundred or a few thousand rows. I have seen databases where this brought a dramatic speed increase.
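The idea behind a non-zero commit size can be sketched in plain Python with SQLite (this is an illustration of batched commits, not Kettle's actual output step; `COMMIT_SIZE` and the table name are made up):

```python
import sqlite3

COMMIT_SIZE = 500  # illustrative value, matching the thread

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions explicitly
conn.execute("CREATE TABLE out_table (id INTEGER, val TEXT)")

rows = [(i, f"val{i}") for i in range(2000)]
conn.execute("BEGIN")
for n, row in enumerate(rows, start=1):
    conn.execute("INSERT INTO out_table VALUES (?, ?)", row)
    if n % COMMIT_SIZE == 0:
        # Commit a whole batch at once instead of one row at a time.
        conn.execute("COMMIT")
        conn.execute("BEGIN")
conn.execute("COMMIT")

count = conn.execute("SELECT COUNT(*) FROM out_table").fetchone()[0]
print(count)  # 2000
```

Committing per batch amortises the per-transaction overhead (log flushes, locking) over hundreds of rows, which is where the dramatic speedups come from on transactional engines.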

How many rows are you writing?



02-14-2006, 12:40 AM
Attachment: Graphic_view2.gif (http://forums.pentaho.org/archived_att/kettle/Graphic_view2.gif) I have attached a new image of my transformation.
When I preview part A (ca. 10000 rows), it takes 20.7 sec to process.
When I preview part B (ca. 50000 rows), it takes 464.8 sec to process.
It must therefore be the last join step that takes the time (ca. 8000 sec).
When the data reaches the join step, my CPU sits at 100% for the rest of the run ... is that normal?

I have tried a commit size of 500, but it does not help ... I do not think MyISAM supports commit ... I am writing the result data to a MySQL table.

02-14-2006, 10:46 PM

I guess the performance is slow because "Join Rows" performs a cartesian product of the 10000 x 50000 rows.
The serialisation and de-serialisation of the rows take a lot of I/O and CPU.
The one thing that will make a big difference is to use a Stream Lookup, or to set the max cache size to 15000. Also, always set the "main step to read from" to the largest stream (the one with the most rows to expect).
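The difference between the two approaches can be sketched in plain Python (this is not Kettle internals; the stream contents are made up to match the row counts in the thread):

```python
# A cartesian product of the two streams would compare every pair:
# 10000 * 50000 = 500 million comparisons.
# A stream lookup instead builds a hashtable on the small stream and
# probes it once per row of the large stream: 10000 + 50000 operations.

lookup_stream = [(i, f"name{i}") for i in range(10000)]   # small stream
main_stream = [(i % 10000, i) for i in range(50000)]      # large stream

# One pass to build the hashtable on the small stream ...
hashtable = {key: name for key, name in lookup_stream}

# ... and one pass over the large stream to probe it.
joined = [(row_id, hashtable[key]) for key, row_id in main_stream]
print(len(joined))  # 50000
```

This is why choosing the largest stream as the main step to read from matters: the small stream is the one that gets held in memory.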

Another option I had not considered before: sort both streams on the join keys and use a Merge Rows step, then keep only the identical and changed rows. It's a bit of a hack, but it should work.
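The merge idea relies on both inputs being sorted: a single linear pass can then match the two streams. A minimal sketch of that pass (illustrative data and names, not Kettle's Merge Rows implementation):

```python
# Two streams of (key, value) rows, both sorted on the join key.
a = [(1, "a"), (3, "b"), (5, "c")]
b = [(1, "x"), (2, "y"), (3, "z"), (5, "w")]

# Classic sorted-merge: advance whichever side has the smaller key.
i = j = 0
matches = []
while i < len(a) and j < len(b):
    if a[i][0] == b[j][0]:
        matches.append((a[i][0], a[i][1], b[j][1]))
        i += 1
        j += 1
    elif a[i][0] < b[j][0]:
        i += 1
    else:
        j += 1

print(matches)  # [(1, 'a', 'x'), (3, 'b', 'z'), (5, 'c', 'w')]
```

Each stream is read exactly once, so the cost is linear in the total row count (plus the cost of sorting), instead of quadratic as with a cartesian product.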

The problem is that I don't know what you're doing with the "Join Rows" step; it could also be a lookup, and in that case a Stream Lookup might even be faster thanks to its in-memory hashtable.

Hope this helps!

All the best,

02-15-2006, 01:43 AM
Attachment: Graphic_view3.gif (http://forums.pentaho.org/archived_att/kettle/Graphic_view3.gif) That did the trick ...

I made a version without the "Database Join" and used "Join Rows" instead ... then substituted the "Join Rows" with "Stream Lookup", and voila ...

Total processing time reduced from ca. 8000 sec to 109 sec ...