02-25-2012, 12:34 PM
I have a single-node Hadoop cluster configured to run on an IP address (instead of localhost) on CentOS Linux. I was able to run the example MapReduce job correctly, which tells me the Hadoop setup appears to be fine. I have also added a couple of data files to HDFS under the "/data" folder, and they are visible through the "dfs" command:

bin/hadoop dfs -ls /data

I am trying to connect to this HDFS from PDI/Kettle. In the HDFS file browser, if I enter the connection parameters incorrectly, e.g. a wrong port, it says it cannot connect to the HDFS server. If I instead enter all the parameters correctly (server, port, user, password) and click 'Connect', it gives no error, meaning it is able to connect. But the file list shows only "/". It doesn't show the data folder. What could be going wrong?

I've already tried the following:
1) running chmod 777 on the data files using "bin/hadoop dfs -chmod -R 777 /data"
2) using root and also the hdfs Linux user in the PDI file browser
3) adding the data files in some other location
4) re-formatting HDFS several times and adding the data files again
5) copying the hadoop-core jar file from the Hadoop installation to PDI's extlib

None of these made the files show up in the PDI browser, and I cannot see anything in the PDI log either...

02-26-2012, 04:31 PM
If you're not running vanilla Apache Hadoop 0.20.2 you'll need to replace the vanilla Hadoop core jar in PDI with the one from your cluster. Hope this helps!
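
A quick way to tell whether PDI's jar already matches the one from the cluster is a byte-for-byte comparison. This is only a sketch: `same_jar` is a made-up helper name, and both paths in the usage line are placeholders for wherever the jars live on your machine.

```shell
# Hypothetical helper: report whether two jar files are byte-identical.
# $1: the hadoop-core jar from the cluster, $2: the hadoop-core jar inside PDI.
# A missing or unreadable file is also reported as "different".
same_jar() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For example, `same_jar /path/to/cluster/hadoop-core.jar /path/to/pdi/hadoop-core.jar` (both paths assumed). If it prints "different", PDI is still using a jar other than the cluster's.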

Update: just re-read your post. Make sure you don't have multiple hadoop-core jars anywhere in your PDI installation path.
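
One way to check for stray copies is to search the whole PDI directory tree. A minimal sketch, assuming the jars follow the usual hadoop-core*.jar naming; `find_hadoop_core_jars` is a made-up helper name and the install path in the usage line is a placeholder.

```shell
# Hypothetical helper: list every hadoop-core jar under a directory tree.
# $1: root of the PDI installation (assumed to be a readable directory).
find_hadoop_core_jars() {
    find "$1" -type f -name 'hadoop-core*.jar' | sort
}
```

For example, `find_hadoop_core_jars /path/to/data-integration` (path assumed). More than one line of output means more than one copy is present in the PDI installation.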

02-27-2012, 11:03 AM
If Jordan's suggestion does not help and you do not have two versions of the Hadoop jar in PDI's extlib or any of the subfolders such as extlib/pentaho, try connecting without a username/password. Leave those fields blank.

Also, what distribution and version of Hadoop are you using?