
How to set JVM memory parameters for the nodes?



iShotAlex
04-16-2013, 08:52 AM
Hi,

I read here http://pedroalves-bi.blogspot.com.es/2013/02/pentaho-bigdata-101-to-bit-more.html


After several days changing the plugin and debugging the origin of the problem, I finally discovered that by default mapreduce tasks run with a maximum memory of -Xmx200m [...] that value was clearly insufficient to run the transformation [...] So do yourself a favor - increase the available memory on the cluster.


In Kettle I can set the JVM memory parameters in spoon.bat, but how do you do so for the MapReduce tasks?

Thanks!

mattb_pdi
04-16-2013, 09:31 AM
There is a property you can set on the Hadoop cluster in mapred-site.xml called "mapred.child.java.opts", which is where the -Xmx option goes:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>

Check this StackOverflow post for more info:

http://stackoverflow.com/questions/8464048/out-of-memory-error-in-hadoop
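
For reference, a minimal mapred-site.xml sketch, assuming a Hadoop 1.x / CDH MRv1 cluster and assuming you want different heap sizes for map and reduce tasks. The per-task-type properties mapred.map.child.java.opts and mapred.reduce.child.java.opts are not mentioned above; where your Hadoop version supports them, they override the generic mapred.child.java.opts:

<!-- generic fallback heap size for child task JVMs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>

<!-- assumed per-task-type overrides; only applicable where the Hadoop version supports them -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>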

MattCasters
04-16-2013, 09:40 AM
What Matt said and correct me if I'm wrong...

http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Memory+Management


Users can choose to override default limits of Virtual Memory and RAM enforced by the task tracker, if memory management is enabled. Users can set the following parameters per job:

mapred.task.maxvmem (int): A number, in bytes, that represents the maximum Virtual Memory task-limit for each task of the job. A task will be killed if it consumes more Virtual Memory than this number.

mapred.task.maxpmem (int): A number, in bytes, that represents the maximum RAM task-limit for each task of the job. This number can be optionally used by Schedulers to prevent over-scheduling of tasks on a node based on RAM needs.

Values can be passed through the user defined settings in the Pentaho MapReduce job entry.
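
As a purely illustrative sketch (the byte values below are placeholders, not recommendations), the name/value pairs entered under the User Defined settings of the Pentaho MapReduce job entry could look like this:

mapred.task.maxvmem = 2147483648    (2 GB virtual memory limit per task, in bytes)
mapred.task.maxpmem = 1073741824    (1 GB RAM limit per task, in bytes)
mapred.child.java.opts = -Xmx1024m  (heap size for each child task JVM)

These properties are added to the configuration of the job that Pentaho submits, so they apply to that job rather than changing the cluster-wide defaults.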

iShotAlex
04-24-2013, 08:42 AM
Hi all,

Thank you very much for your replies.

We changed the mapred.child.java.opts property to -Xmx2048m (it turned out it was already set to 1024m). Unfortunately we're still facing very serious performance issues.

I have a simple Hive query that takes approx. 3 minutes to run. I reproduced it with PDI and it takes several hours. Both the Map and Reduce phases are excruciatingly slow.

I've been developing with Kettle for several years and I'm confident that both the Mapper and Reducer transformations are well designed. In any case, as I said, they're very simple (the reducer is a mere group-by).

I thought it could be a memory issue, but now it seems it isn't. Could PDI really be this slow? I find it hard to believe. The cluster itself is working fine, since Hive jobs run great.

Any idea of what could cause this?

(running PDI 4.4.0 on a 6-node CDH4 cluster)

Thanks again for all your help!

Regards,

Alex