"Sort Rows" step: explanation for free memory threshold?
Roland.Bouman
02-23-2009, 03:02 PM
Hi!
I was just reading http://wiki.pentaho.com/display/EAI/Sort+rows.
It says:
Free memory threshold (in %)
If the sort algorithm finds that it has less available free memory than the indicated number, it will start to page data to disk.
Can anybody explain to me how "paging to disk" relates to the TMP files mentioned for the "Sort directory" and "TMP-file prefix" properties? As I understand it now, the tmp files are created when the buffer can't hold all your rows. But it's a bit cloudy how I should understand this "paging to disk" bit.
TIA,
sboden
02-24-2009, 02:14 AM
I think it's supposed to be a little "cloudy". As I remember, when you use the % free setting, the sort step checks how much free memory you have (JVM-wise) and uses up that memory before swapping out. So it's the point at which the sort step will write to disk.
Regards,
Sven
fabianS
02-24-2009, 02:22 AM
You already figured it out. Paging to disk means writing the data that does not fit into memory to the HDD.
And the threshold defines how much memory (of the Java VM I believe) you dedicate to the sorting.
DEinspanjer
02-24-2009, 11:52 AM
If you have a set number of rows defined to keep in memory (for instance, 5000), the step will take every five thousand rows, sort them, write them to a tmp file, then when there are no more rows, it will do a merge sort on all those temp files.
If you have a % free value set instead, the step will evaluate the amount of free memory the JVM has every 1000 rows. If the % drops below your defined threshold, it will write the current set of sorted records out to a tmp file, freeing up memory for the next batch of rows. Note that it doesn't seem to do a System.gc(), so if you set your threshold low, you might end up spooling a lot of small files to disk until the JVM gets around to freeing some memory. That is just an assumption though. I haven't tested the step to see how it behaves in that situation. If you set the logging level to Detailed, you'll see messages from the step indicating what the free memory is before and after every spooling.
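To make the two modes concrete, here is a minimal Java sketch of the behaviour described above. It is not the Kettle source: the class and method names (ExternalSortSketch, sort, merge, freeMemoryPercent) are made up for illustration, in-memory lists stand in for the tmp files, and the rows are plain integers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only; hypothetical names, not the Kettle "Sort rows" code.
class ExternalSortSketch {

    // Cursor over one sorted run, ordered by the value it currently points at.
    private static final class Cursor implements Comparable<Cursor> {
        final List<Integer> run;
        int pos = 0;
        Cursor(List<Integer> run) { this.run = run; }
        int head() { return run.get(pos); }
        @Override public int compareTo(Cursor other) {
            return Integer.compare(head(), other.head());
        }
    }

    // Share of the JVM's maximum heap that is unused at this moment, in percent.
    static double freeMemoryPercent() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return 100.0 * (rt.maxMemory() - used) / rt.maxMemory();
    }

    // sortSize > 0: close a run every sortSize rows.
    // sortSize <= 0: check free memory every 1000 rows and close a run if it is below thresholdPct.
    static List<Integer> sort(List<Integer> rows, int sortSize, int thresholdPct) {
        List<List<Integer>> runs = new ArrayList<>(); // stands in for the tmp files
        List<Integer> buffer = new ArrayList<>();
        long rowsSeen = 0;
        for (int row : rows) {
            buffer.add(row);
            rowsSeen++;
            boolean spill;
            if (sortSize > 0) {
                spill = buffer.size() >= sortSize;
            } else {
                spill = rowsSeen % 1000 == 0 && freeMemoryPercent() < thresholdPct;
            }
            if (spill) {
                Collections.sort(buffer);
                runs.add(buffer);          // "write the sorted rows to a tmp file"
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {
            Collections.sort(buffer);
            runs.add(buffer);
        }
        return merge(runs);
    }

    // K-way merge of the sorted runs using a min-heap of cursors.
    static List<Integer> merge(List<List<Integer>> runs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<Integer> run : runs) {
            heap.add(new Cursor(run)); // every run is non-empty by construction
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c); // re-insert with its new head value
            }
        }
        return out;
    }
}
```

Calling sort(rows, 5000, 0) corresponds to the fixed row count case, and sort(rows, 0, 25) to a 25% free memory threshold. In the second case the number and size of the runs depend on what the JVM reports at the time, so they will differ from run to run.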
Roland.Bouman
02-24-2009, 03:47 PM
Hi guys!
Thank you all for providing your insight. So what was not clear to me is that these configuration options are exclusive.
Maria Roldan
12-15-2009, 02:09 PM
It's not clear to me either.
If you have a set number of rows defined to keep in memory (for instance, 5000), the step will ...
If you have a % free value set instead, the step will ...
According to Daniel, the options should be exclusive:
I shouldn't be allowed to set both a sort size and a threshold.
Is that right? Maybe I didn't understand correctly.
thanks!
mc
DEinspanjer
12-15-2009, 11:34 PM
It is rather convoluted code, and I haven't tried to actually step through the code to make sure I fully understand it. What I can see from reading is this:
1. It seems that the author intended the two options to be mostly exclusive, but they didn't really ensure the fact.
2. It seems that if you set both options, the memory percentage option will never take effect until the num rows option has fired at least once. There is a guard block around the first calculation of free memory that won't execute if the num rows option is greater than 0. However, there is a secondary calculation that happens after writing a segment to disk, and once that calculation has happened, I believe that the percentage threshold would start working.
3. Your safest bet would be to treat the options as exclusive and set whichever one you don't wish to use to zero.
4. If my interpretation is correct, we should probably get a bug filed on that, but I'm not sure that a bug would be warranted without proof. :/
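For what it's worth, the interaction described in point 2 can be modelled like this. This is only one reading of that description, not the actual step code: SpillDecisionSketch, addRow and the injected freeMemoryPct supplier are all hypothetical names, and the real step may differ in detail.

```java
import java.util.function.IntSupplier;

// Hypothetical model of the behaviour described in point 2; not the Kettle source.
class SpillDecisionSketch {
    private final int sortSize;              // rows per tmp file; 0 disables the option
    private final int thresholdPct;          // free memory threshold in %; 0 disables the option
    private final IntSupplier freeMemoryPct; // injected so the logic can be exercised in a test
    private int lastFreePct = -1;            // -1 means "never measured"
    private int buffered = 0;

    SpillDecisionSketch(int sortSize, int thresholdPct, IntSupplier freeMemoryPct) {
        this.sortSize = sortSize;
        this.thresholdPct = thresholdPct;
        this.freeMemoryPct = freeMemoryPct;
    }

    // Registers one incoming row; returns true if that row triggers a write to disk.
    boolean addRow() {
        buffered++;
        // First measurement: guarded, so it is skipped whenever a row count is set.
        if (sortSize <= 0 && thresholdPct > 0) {
            lastFreePct = freeMemoryPct.getAsInt();
        }
        boolean byRows = sortSize > 0 && buffered >= sortSize;
        boolean byMemory = thresholdPct > 0 && lastFreePct >= 0 && lastFreePct < thresholdPct;
        if (byRows || byMemory) {
            buffered = 0;
            // Secondary measurement after writing a segment: not guarded.
            lastFreePct = freeMemoryPct.getAsInt();
            return true;
        }
        return false;
    }
}
```

With both options set, for example new SpillDecisionSketch(3, 50, () -> 10), the first two rows do not spill even though free memory is reported below the threshold. The third row spills because of the row count, and only from then on does the percentage check take effect.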
Maria Roldan
12-16-2009, 04:13 AM
Thanks, Daniel.
I appreciate your explanation, it is really useful. I'll go for the third.
At least, you're sure about this option, right?
I can't imagine in which circumstances to choose one or the other. Any tip?
As to the second, Matt?
Thanks,
mc
DEinspanjer
12-16-2009, 10:08 AM
At least, you're sure about [the third] option, right?
Yes. I've used both options for a while and I know they behave properly when configured exclusively.
I can't imagine in which circumstances to choose one or the other. Any tip?
I can imagine a scenario where you might have a lot of rows that are usually small but sometimes very large (outliers). In this case, if you wanted to break up the sort buckets into even chunks like every million rows, then it might also be desirable to configure a memory threshold to keep a bucket with several large outlier rows from running out of memory.
There is a similar scenario when configuring log file rollover in an application such as Apache. Frequently, people will configure it to roll over every day or hour, but they will also configure it to roll over if the file grows to a certain size limit like two gigs.
Of course, in that world, these files are durable and persistent. In this world, the sort files are transient so the use case doesn't hold up as well.
One case that I can think of that might be much more useful would be an absolute size limit rather than pct free.
That way you could configure it to serialize one million rows or two gigs' worth, whichever comes first. That could be very important if your file system doesn't handle larger files, and you can't do it with the current options.
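A rough sketch of what such a "whichever comes first" limit could look like. This option does not exist in the step; RunLimitSketch, maxRows and maxBytes are invented names, and the per-row byte size is assumed to be known to the caller.

```java
// Hypothetical "rows or bytes, whichever comes first" limit; not an existing Sort rows option.
class RunLimitSketch {
    private final long maxRows;  // close the current tmp file after this many rows...
    private final long maxBytes; // ...or once this many bytes have been buffered
    private long rows = 0;
    private long bytes = 0;

    RunLimitSketch(long maxRows, long maxBytes) {
        this.maxRows = maxRows;
        this.maxBytes = maxBytes;
    }

    // Records one row of the given serialized size.
    // Returns true when the current run should be written out, and resets the counters.
    boolean add(long rowBytes) {
        rows++;
        bytes += rowBytes;
        if (rows >= maxRows || bytes >= maxBytes) {
            rows = 0;
            bytes = 0;
            return true;
        }
        return false;
    }
}
```

For the case above that would be new RunLimitSketch(1_000_000L, 2L * 1024 * 1024 * 1024). A handful of very large outlier rows would then close the run early on the byte limit, while ordinary rows would close it on the row count.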
Maria Roldan
12-16-2009, 08:50 PM
Thanks, Daniel.
You explained it clearly,
mc
Maria Roldan
12-17-2009, 04:21 PM
I was thinking of another scenario. If you intend to run the transformation on different machines, the % option would be more "portable". The % always works; the sort size may work on your machine, and may cause problems on another with less memory.
mc
MattCasters
12-18-2009, 03:26 AM
While that is true, Maria, please keep in mind that on a Java Virtual Machine, you never know exactly how much memory you have available. That is because Java performs automatic memory cleanup operations called garbage collections. As such, the % free trick only works reliably on small transformations and/or when there is only one sort going on.
I agree with Daniel that we should create a JIRA case for this to harden the step a bit more.
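A small Java sketch of why the reading is unreliable: the free percentage is measured right after creating some short-lived garbage and again after requesting a collection. FreeMemorySketch and freePercent are illustrative names, and the way the percentage is derived from the Runtime methods is an assumption, not necessarily how the step computes it.

```java
// Illustrative only: shows that the "free memory" reading depends on when garbage is collected.
class FreeMemorySketch {

    // Unused share of the maximum heap, in percent, as seen at this instant.
    static double freePercent() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long max = rt.maxMemory();
        return 100.0 * (max - used) / max;
    }

    public static void main(String[] args) {
        // Allocate short-lived arrays that become garbage immediately.
        for (int i = 0; i < 200; i++) {
            byte[] junk = new byte[256 * 1024];
            junk[0] = 1;
        }
        double before = freePercent();
        System.gc(); // only a request; the JVM decides whether and when to collect
        double after = freePercent();
        System.out.printf("free before gc: %.1f%%, after gc: %.1f%%%n", before, after);
    }
}
```

The two numbers will vary between JVMs, heap settings and runs. The point is only that memory already eligible for collection may still count as used when the threshold is checked.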
Maria Roldan
12-18-2009, 06:31 AM
Thanks!
Sort rows options are now clear enough to me,
mc