Recently I have been using Pentaho Kettle 5.2.0 with Apache community Hadoop 2.2.0, not a commercial distribution (Cloudera or Hortonworks). But when I follow the steps in the wiki,
an error comes out. After I run the job "aggregate_mr.kjb", the libraries in plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-2.2.0/lib are supposed to be uploaded to HDFS (/opt/pentaho/mapreduce/). The error is "/opt/pentaho/mapreduce/ ... is not a valid dfs filename". Do you know why?
Or do you know whether Pentaho Kettle can work well with Apache community Hadoop 2.x?
Thank you very much!

Deploy the connection between Kettle and Hadoop-2.2.0

1. Download open source Kettle from


2. Decompress Kettle to any directory; the extracted folder is named data-integration.

3. In data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations, make a copy of the cdh47 directory

and rename the copy to hadoop-2.2.0.

4. Delete all jar files in hadoop-2.2.0/lib/client.

Then copy all jar files under $HADOOP_HOME/share/hadoop to data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-2.2.0/lib/client.
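The jar swap in step 4 can be sketched as a shell snippet. The KETTLE_HOME and HADOOP_HOME values below are assumptions, not from the guide; adjust them to your install:

```shell
# Assumed install locations -- adjust to your environment
KETTLE_HOME=${KETTLE_HOME:-/opt/data-integration}
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop-2.2.0}
SHIM="$KETTLE_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-2.2.0"

# Step 4: drop the shim's bundled client jars and replace them
# with the jars that ship with the cluster itself
if [ -d "$SHIM/lib/client" ] && [ -d "$HADOOP_HOME/share/hadoop" ]; then
  rm -f "$SHIM"/lib/client/*.jar
  find "$HADOOP_HOME/share/hadoop" -name '*.jar' \
    -exec cp {} "$SHIM/lib/client/" \;
fi
```

The directory checks make the snippet a no-op when the paths differ, so it is safe to run and re-run while adjusting the variables.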

5. Delete the jar packages in hadoop-2.2.0/lib/pmr.
Then copy zookeeper-3.4.5.jar from $HADOOP_HOME/share/hadoop into hadoop-2.2.0/lib/pmr.

Also copy all HBase-related jar packages from the cdh51 shim (that is, every jar except zookeeper-3.4.5-cdh5.1.0.jar) into the pmr directory.
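A sketch of step 5 under the same assumed paths. Only the zookeeper swap is scripted; the HBase-related jars from the original shim have to be copied back by hand, since which jars those are depends on the shim you started from:

```shell
# Assumed install locations -- adjust to your environment
KETTLE_HOME=${KETTLE_HOME:-/opt/data-integration}
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop-2.2.0}
PMR="$KETTLE_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-2.2.0/lib/pmr"

if [ -d "$PMR" ] && [ -d "$HADOOP_HOME/share/hadoop" ]; then
  # Clear the pmr directory, then bring in the Apache zookeeper jar
  rm -f "$PMR"/*.jar
  find "$HADOOP_HOME/share/hadoop" -name 'zookeeper-*.jar' \
    -exec cp {} "$PMR/" \;
  # The HBase-related jars from the cdh51 shim (everything except
  # zookeeper-3.4.5-cdh5.1.0.jar) must still be copied back manually.
fi
```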

6. Copy all files in $HADOOP_HOME/etc/hadoop to data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-2.2.0
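Step 6 as a snippet, again with assumed install paths:

```shell
# Assumed install locations -- adjust to your environment
KETTLE_HOME=${KETTLE_HOME:-/opt/data-integration}
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop-2.2.0}
SHIM="$KETTLE_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-2.2.0"

# Copy the cluster's site configuration (core-site.xml, hdfs-site.xml, ...)
# into the shim so Kettle talks to the right cluster
if [ -d "$HADOOP_HOME/etc/hadoop" ] && [ -d "$SHIM" ]; then
  cp "$HADOOP_HOME"/etc/hadoop/* "$SHIM/"
fi
```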

7. Modify data-integration/plugins/pentaho-big-data-plugin/plugin.properties

Set active.hadoop.configuration=hadoop-2.2.0
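After the edit, the relevant line of the file should read:

```properties
# data-integration/plugins/pentaho-big-data-plugin/plugin.properties
active.hadoop.configuration=hadoop-2.2.0
```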

8. Put ojdbc14.jar into data-integration/lib.

This location is compulsory. If the jar sits in any other directory, an error that Oracle cannot be connected will be reported when Kettle's Sqoop step imports Oracle data.

Meanwhile, there must be no ojdbc14.jar in any other subdirectory of data-integration; otherwise an error will be reported.
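A quick check for stray copies (KETTLE_HOME is again an assumed path): every path printed by the find below is a duplicate ojdbc14.jar that has to be removed.

```shell
# Assumed install location -- adjust to your environment
KETTLE_HOME=${KETTLE_HOME:-/opt/data-integration}

# ojdbc14.jar must exist only in data-integration/lib;
# print any copies found anywhere else under the install
if [ -d "$KETTLE_HOME" ]; then
  find "$KETTLE_HOME" -name 'ojdbc14.jar' -not -path "$KETTLE_HOME/lib/*"
fi
```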