View Full Version : Stream lookup with Bigint generates huge CPU load
When I use the Stream lookup step with a (MySQL) bigint datatype the CPU load goes through the roof, much more so than with other datatypes.
Is this normal? Apparently, it's very expensive to generate a hash value for a bigint?
jbleuel
02-21-2012, 04:46 AM
I assume you mean BigNumber (what is internally a BigDecimal).
Have a look at http://azagorneanu.blogspot.com/2011/07/javamathbigdecimal-performance.html
there are numbers for tests of Double (2ms) against BigDecimal (309ms to 767ms), so your finding seems correct.
Please let us know, if this is similar to what you found.
Thanks,
Jens
Thank you very much for your reply.
As a matter of fact, I was referring to BigInt; MySQL makes a distinction between an Int (32bit) and a BigInt (64bit) (http://dev.mysql.com/doc/refman/5.0/en/integer-types.html).
Although I would imagine a MySQL BigInt should be mapped to a Java long primitive datatype, apparently this isn't the case resulting in a similar performance penalty as you described in Double vs BigDecimal.
jbleuel
02-21-2012, 07:25 AM
It simply depends if it is signed or not :-)
A Java long is always signed, so a MySQL bigint signed will fit into this. But an unsigned will not fit into this, this is why we map to a BigDecimal / BigNumber (see http://jira.pentaho.com/browse/PDI-316).
When you change your DDL to a signed, it will be mapped to Integer / Long.
If this is not an option and you are sure it will not get into an overflow, you may change the data type within the transformation.
HTH,
Jens
Thank you so much Jens, I will look into it!
jbleuel
02-21-2012, 08:09 AM
It would be great to get feedback about the performance differences you got with the number of rows.
Thanks,
Jens
I would love to give you feedback about the performance difference but so far I haven't got any consistent results.
The flowrate starts high but deteriorates quickly, perhaps due to memory limitations. After that the flowrate varies wildly.
I've tried to forcibly CAST both the main flow and lookup flow to either a signed or unsigned but I can't tell any difference in performance. Then there's the fact that the MySQL CAST/Convert is very limited, for example I can't tell nor specify whether it generates a 32 bit int or 64 bit long :-)