View Full Version : HDFS connection problem



Luke2
09-25-2013, 01:00 PM
I am a new Pentaho user, trying to connect to HDFS running in a VM.

I am running Pentaho Data Integration PDI-5.0.0.1-x64 (Kettle - Spoon 5.0.0 GA, build 5.0.0.1) on a Windows 7 PC.

HDFS is running in an Ubuntu (12.04.1 LTS) VM under VirtualBox (4.2.18).
The VM network setting is Bridged Adapter, Promiscuous Mode (Allow All).

I can use a web browser to examine the VM's HDFS at
http://192.168.25.130:50070/dfshealth.jsp
and also
http://192.168.25.130:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/

I can put files into HDFS, and confirm that they are there using the browser.

But I cannot connect to HDFS using Pentaho; I get the "Unable to connect to HDFS server" error.

I've tried the Hadoop Copy Files job entry, using the Browse button on the Open File dialog window.
Settings:
Look In: HDFS
Server: 192.168.25.130
User ID: ubuntu
Port: 8020 (I have tried several others)
Password: ****
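
To isolate whether the problem is PDI or the network, a bare-bones connectivity check against the namenode from plain Java might help. This is only a minimal sketch using the stock Hadoop 1.x FileSystem API; the class name is made up, and the host/port are just the values from my settings above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client straight at the VM's namenode (same address used in Spoon)
        conf.set("fs.default.name", "hdfs://192.168.25.130:8020");
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.25.130:8020"), conf);
        // List the root directory to prove the RPC connection works
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}

(Compile and run it with the same jars from hadoop-101/lib/client on the classpath; if this also fails, the problem is the network or the namenode config rather than PDI.)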

I understand that the PDI Hadoop interface only supports Hadoop 0.20 out of the box, so I followed the instructions here
http://funpdi.blogspot.com/2013/03/pentaho-data-integration-44-and-hadoop.html
to create a new plugin under "hadoop-configurations", called "hadoop-101", next to the existing "hadoop-20".

The hadoop-101/lib/client directory contains:

commons-cli-1.2.jar
commons-codec-1.4.jar (NEW)
commons-configuration-1.6.jar (NEW)
commons-el-1.0.jar
commons-httpclient-3.0.1.jar
commons-logging-1.1.1.jar
commons-net-1.4.1.jar
hadoop-core-1.0.1.jar (NEW)
kfs-0.3.jar
hsqldb-1.8.0.10.jar
jets3t-0.7.1.jar
oro-2.0.8.jar
xmlenc-0.52.jar
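
For anyone following the same write-up: if I understand it correctly, the new configuration also has to be made the active one. In my copy of PDI 5.0 that is a one-line change in plugins/pentaho-big-data-plugin/plugin.properties (assuming your layout matches):

active.hadoop.configuration=hadoop-101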

In the hadoop-101 plugin directory, the core-site.xml has:

<property>
<name>fs.default.name</name>
<value>hdfs://192.168.25.130:8020</value>
</property>

And in the VM's core-site.xml it is:

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>

The VM is using IP 192.168.25.130
Hadoop version 1.0.1
Java version 1.6.0_24
Accumulo version 1.4.2

The VM was originally an Accumulo example from
http://blog.sqrrl.com/post/40578606670/quick-accumulo-install

Thank you!

P.S. I have a MacBook, and I will try to run HDFS and Pentaho Data Integration on the same box, to eliminate network issues. Unfortunately, the Hadoop version on the MacBook is 1.2.1, so I will have to try to set up another hadoop-configurations plugin.

Luke2
09-30-2013, 09:48 AM
I have still had no success getting Pentaho to connect to HDFS, and I am trying to eliminate a Hadoop plugin version mismatch as the cause.

Has anyone had success connecting Pentaho to HDFS running inside the Yahoo Hadoop Tutorial VirtualBox VM?

The tutorial VM is located here:
http://developer.yahoo.com/hadoop/tutorial/module3.html#vm
which has Hadoop 0.18.0 and Java 1.6.0_07, running on Ubuntu 8.04.1.

The VirtualBox VM is running on 192.168.25.135 (network: bridged adapter; promiscuous mode: allow all). The host PC firewall is off. I can ssh into the VM from the host PC, and I can browse the name node from the host PC at
http://192.168.25.135:50070/dfshealth.jsp

But attempting to use the Pentaho Hadoop File Input step always results in the "Unable to connect to HDFS server" error.

Can the default Pentaho hadoop-20 plugin be used with hadoop 0.18.0?

Thank you

ceverett
10-01-2013, 06:40 PM
Hello,

I believe the problem may be with the configuration on your VM. Below you have:

"And in the VM it is:

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>"

You will need to change the value entry to <value>hdfs://[vm-ip]:8020</value> in order for HDFS to listen externally on the VM.
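
For example, with the IP from your first post, the property in the VM's core-site.xml would become:

<property>
<name>fs.default.name</name>
<value>hdfs://192.168.25.130:8020</value>
</property>

(The namenode needs to be restarted after the change for it to take effect.)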

Hope that helps

Luke2
10-02-2013, 09:10 AM
Yes, it was just the "localhost" in the core-site.xml; when I changed it to use the actual IP address, it worked.
thanks

