Linking up PDI to Hadoop



Jasper
11-08-2010, 06:47 AM
Hi,

I am trying to set up a working Hadoop / Pentaho integrated environment. I managed to launch my single-node Hadoop cluster on an Ubuntu server image.

Now I have to connect it to a Pentaho 3.7 RC environment which is on another server (image).

I found some useful information here:

http://wiki.pentaho.com/download/att...op_pentaho.pdf

In this document one prerequisite is "a Hadoop cluster already properly configured and tested". My question is whether that includes the installation of Hive as well.

Is Hive involved when a PDI client is working with the PDI instance installed directly on the name node?

cboyden
11-08-2010, 11:11 AM
Hi Jasper,

Hive is not involved when PDI sends a Map/Reduce task to a Hadoop cluster; it only comes into play when you access Hive through JDBC.
The Copy Files and Text File Input/Output steps work against HDFS (standard on a Hadoop node).
The Job Executor steps use the Map/Reduce functionality (standard on a Hadoop node).

All Hive interaction is handled as standard JDBC access and, as of yet, there are no dedicated Hive steps.
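For illustration, this is roughly what that JDBC access looks like from plain Java (a minimal sketch, assuming a Hive server running on its default Thrift port 10000 and the driver class Hive ships at the moment; the hostname "hadoop-vm" is made up and the Hive/Hadoop client jars need to be on the classpath):

==================================================
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Hostname, port and database are examples; adjust to your cluster
        Connection con = DriverManager.getConnection("jdbc:hive://hadoop-vm:10000/default", "", "");
        Statement stmt = con.createStatement();
        // List the tables just to prove the connection works
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}
==================================================

That is essentially what PDI does under the hood when it talks to Hive.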

I am not sure if this answers your entire question; please let me know if you need more information.

-Curtis

jganoff
11-08-2010, 11:17 AM
Additionally, if you'd like to work with Hive you'll need to set it up separately. You may want to start here: http://wiki.apache.org/hadoop/Hive/GettingStarted#Installation_and_Configuration.

Jasper
11-08-2010, 11:19 AM
Hi Curtis,

Thanks. I will go ahead and install Hive anyway after I have installed the Linux PDI on the name node. Hive will be of good use later, I think.

Another question: is the distributable of the Linux PDI client here
http://www.sendspace.com/file/f6mhgu

still good? I could not get the Linux client from the standard download page at www.pentaho.com/hadoop.

cboyden
11-08-2010, 11:39 AM
The sendspace file is from early on in the Hadoop beta program. You will definitely want to use the latest, RC1 at this time.

What problem are you having with the www.pentaho.com/hadoop link? You really should be able to get the file and trial license that way.

-Curtis

Jasper
11-08-2010, 04:53 PM
Well, the other day I downloaded the whole BI Suite; I filled out the form and received a confirmation email. When I returned today to get the Linux PDI client, I got a message that the Linux client is much more complicated, needs special support, and that you should download the Windows client instead; and, worst of all, granting the license for the Linux client needs two working days of evaluation by Pentaho staff.

Anyway, when I fill out the form again, I'm not redirected to the download page as before; I just have to fill it out again and again. It's probably that I have already "used up" my evaluation by downloading the BI Suite.

Jasper
11-11-2010, 09:02 AM
Hi,
Now I have a PDI client set up on the Hadoop namenode and another client to connect to it.

Is the PDI communication all through port 22, or do you have to open other ports as well? (like 9000, which I saw somewhere during the installation of PDI on the Hadoop side)

jganoff
11-11-2010, 10:10 AM
Now I have a PDI client set up on the Hadoop namenode and another client to connect to it.


In order to execute transformations through Hadoop, not just on a machine configured as a Hadoop node, you need to install the Pentaho for Hadoop Distribution (PHD) archive into each node's Hadoop installation. This is not a PDI client but the actual PDI execution engine that will allow you to submit transformations as Map/Reduce jobs.

Have you filled out the evaluation request for Pentaho Data Integration for Hadoop 4.1 RC1 from http://www.pentaho.com/download/? This is the preferred path to take, as the installation process for PDI for Hadoop has improved since the beta program.

Jasper
11-11-2010, 10:56 AM
Hi,

Yes, I proceeded according to this doc: http://wiki.pentaho.com/download/att...op_pentaho.pdf

So I copied the necessary files to /lib and /libext within the /libkettle dir, along with making the manual changes to hadoop.sh. Would that be enough?

I had a pseudo-distributed Hadoop cluster up and running, but after these steps from the doc my namenode does not start anymore (error: "FSNamesystem initialization failed"), so I am not there yet.

Jasper
11-12-2010, 07:30 PM
@jganoff
By "into each node's Hadoop installation" you mean each datanode right ?

Would this also work on a Cloudera CDH2 pseudo-distributed cluster? How many PHD installation would that require?

jganoff
11-15-2010, 10:02 AM
By "into each node's Hadoop installation" you mean each datanode right ?
Essentially, yes. They can be referred to as tasktrackers as well since a node is not required to host data (be part of the HDFS).



Would this also work on a Cloudera CDH2 pseudo-distributed cluster? How many PHD installations would that require?
We've verified PDI 4.1.0-RC1, and the soon-to-be-released PDI 4.1.0-GA, against CDH3 Beta 3. The PDI 4.1.0 release contains a Pentaho Hadoop Node Distribution which is designed to simplify the process of installing PDI into each Hadoop tasktracker node.

Jasper
11-15-2010, 06:54 PM
The PDI 4.1.0 release contains a Pentaho Hadoop Node Distribution which is designed to simplify the process of installing PDI into each Hadoop tasktracker node.

That would be very nice. Until then, are the instructions that were given in this http://wiki.pentaho.com/download/att...op_pentaho.pdf

document (which seems to be offline now) still valid to make it work, or are there new instructions?

Jasper
11-17-2010, 07:38 PM
Hi, I have a CDH3 distribution running now, installed the PHD on the (single) Hadoop node, installed the license, etc. I've gone through all the steps. All the daemons restarted and I can see them listening on the default Hadoop ports 50030 to 50070. I can even inspect HDFS folder contents through a web browser on the client side.
But still, on the client PDI side (not on the CDH3), I can't see any 'heartbeat' of the PHD on the CDH3 side, and my simple Copy Files job just doesn't work (logging says "folder does not exist").
A few questions:
-Is there any command I can run on CDH3 to prove the PHD is running OK?

-On which ports on CDH3 is the PHD actually listening? I see ports 9000 and 9010 being used in the sample Hadoop jobs in PDI 4.1.0. Are these the standard ports (what happened to 22?), or were these configured this way incidentally during development testing? 9000 was not listed when I ran a netstat command on the CDH3, so I am confused.

cboyden
11-18-2010, 12:53 PM
Jasper,

One possibility is the file permissions on the HDFS. You will want to make sure for testing that all users have rw access to the HDFS location you are copying the file to.
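As a quick check from the namenode you could inspect the folder and, for testing, open it up completely; the path /user/training below is just an example:

hadoop fs -ls /user/training
hadoop fs -chmod -R 777 /user/training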
Can you attach the Job that contains the Copy Files entry to this thread, please?

-Curtis

cboyden
11-18-2010, 12:58 PM
Port 9000 should be HDFS. I think this is the default, but I am not positive.
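The HDFS port is whatever is configured in the fs.default.name property in core-site.xml on the namenode; a sketch of a default-style entry (values are examples):

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>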

cboyden
11-18-2010, 01:06 PM
I get the following message (which sounds similar to yours) if I try to put a file in a non-existing folder.
Try creating the folder first and see if that helps.

2010/11/18 12:02:40 - Hadoop Copy Files - ERROR (version TRUNK-SNAPSHOT, build 14388 from 2010-11-15 10.39.15 by tomcat) : Folder hdfs://hadoop-vm1.pentaho.com/folder/test/test does not exist !
2010/11/18 12:02:40 - Hadoop Copy Files - ERROR (version TRUNK-SNAPSHOT, build 14388 from 2010-11-15 10.39.15 by tomcat) : Destination folder does not exist!
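If the destination really does not exist yet, creating it up front from the namenode should get you past this (the path below just mirrors the one in the log above):

hadoop fs -mkdir /folder/test/test
hadoop fs -ls /folder/test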

Jasper
11-20-2010, 03:44 PM
Hi,

I upgraded everything to GA now, so I have:

-CDH3 + the new Pentaho Hadoop Distribution + 2 licenses installed on the Hadoop side
-4.1.0 EE GA client & server + 2 licenses on the client side

Both sides work perfectly in isolation, but I still have no proof that the client side managed to do anything on the Hadoop side. I can't find where things go wrong, and the logging messages give me no clues.

I can browse the HDFS from the client side fine via the Cloudera webapp, so I am sure that the folder I am trying to delete is there (see screenshot).

The very simple job (Delete_folder_in_Hadoop) to delete/empty an existing HDFS folder doesn't work. The job completes and returns "Folder X already deleted", but that is not the case. The second screenshot of the PDI logging is interesting because it even responds with "Folder X already deleted" when in that run I had entered a wrong IP for the Hadoop cluster. So I would say the logging is not accurate.

I have tried the following syntax (and more) for the folder to delete in the "Delete folder" step:
-hdfs://192.168.160.131:9000/user/training/grep_output
-hdfs://192.168.160.131:22/user/training/grep_output
-hdfs://192.168.160.131:8022/user/training/grep_output
-hdfs://192.168.163.131/user/training/grep_output
-hdfs://hadoop:hadoop@192.168.163.131:9000/user/training/grep_output
-hdfs://training:training@192.168.163.131:9000/grep_output

Is the port number necessary? Do you need to enter Hadoop user/PW for CDH3?

The other job (Copy_from_to_Hadoop.kjb) also doesn't work.

Jasper
11-21-2010, 04:26 AM
2010/11/18 12:02:40 - Hadoop Copy Files - ERROR (version TRUNK-SNAPSHOT, build 14388 from 2010-11-15 10.39.15 by tomcat) : Folder hdfs://hadoop-vm1.pentaho.com/folder/test/test does not exist !
2010/11/18 12:02:40 - Hadoop Copy Files - ERROR (version TRUNK-SNAPSHOT, build 14388 from 2010-11-15 10.39.15 by tomcat) : Destination folder does not exist!

No port numbers needed?

jganoff
11-22-2010, 09:56 AM
...
I have tried the following syntax (and more) for the folder to delete in the "Delete folder" step:
-hdfs://192.168.160.131:9000/user/training/grep_output
-hdfs://192.168.160.131:22/user/training/grep_output
-hdfs://192.168.160.131:8022/user/training/grep_output
-hdfs://192.168.163.131/user/training/grep_output
-hdfs://hadoop:hadoop@192.168.163.131:9000/user/training/grep_output
-hdfs://training:training@192.168.163.131:9000/grep_output

Is the port number necessary? Do you need to enter Hadoop user/PW for CDH3?

The other job (Copy_from_to_Hadoop.kjb) also doesn't work.

It looks like the folder permissions set on hdfs://192.168.163.131/user/training are 755. Please make sure full write access is set on any folder you intend to write to or delete from PDI (chmod 777 is a sledgehammer approach to fixing your issue). PDI 4.1.0 GA doesn't support HDFS user-level security at this time.

You can set the correct permissions by issuing the command:

hadoop fs -chmod 777 /user/training

For more information on HDFS shell see the Hadoop File System Shell Guide (http://hadoop.apache.org/common/docs/r0.20.2/hdfs_shell.html#chmod).



No port numbers needed?
Port numbers are not required if you're using the HDFS default port (9000).

Jasper
11-22-2010, 01:03 PM
Jordan,

Funny you say that. I already changed the whole tree in HDFS to 777 after I posted the thread. But it made no difference.

When you say "PDI 4.1.0 GA doesn't support HDFS user-level security at this time." , how can you be sure which user is using Hadoop via the PDH?

Still I have this question how you can determine on the Hadoop side that the PDH is working correctly? This is just to eliminate possibilities because obviously mine doesn't work from the PDI client side.

Jasper
11-23-2010, 07:00 PM
It looks like 9090 is the right port for PDI to connect to. The moment I launch my PDI job (copy files from local to HDFS), the job now at least seems to actually do something. It stalls, and this error pops up at the same time in the log of the namenode process on CDH3:

===========================================
Exception in thread "pool-3-thread-2" java.lang.OutOfMemoryError: Java heap space
at org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:296)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:203)
at org.apache.hadoop.thriftfs.api.Namenode$Processor.process(Namenode.java:1166)
at org.apache.hadoop.thriftfs.SanerThreadPoolServer$WorkerProcess.run(SanerThreadPoolServer.java:277)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
==================================================

$HADOOP_HEAPSIZE is set to 1000 MB in the hadoop-env.sh file. The VM has 2000 MB of RAM.
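For reference, that setting lives in hadoop-env.sh and is given in MB; a sketch of the relevant line (changing it requires restarting the Hadoop daemons):

# $HADOOP_HOME/conf/hadoop-env.sh
export HADOOP_HEAPSIZE=1000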

Mark
11-25-2010, 10:40 PM
Hi Jasper,

I have this running on Cloudera's VMware virtual machine image of CDH. The default user, on this distribution at least, is the "training" user. In order to get the Hadoop side of PDI working, I needed to install the licenses for the "hadoop" user as well (su hadoop with password hadoop). The hostname on this virtual machine is "training-vm.local". For good measure I replaced all occurrences of "localhost" in the config files (core-site.xml and mapred-site.xml) in $HADOOP_HOME/conf with "training-vm.local".

Cheers,
Mark.

Mark
11-25-2010, 10:53 PM
I just noticed your latest post :-) Are you sure that port 9090 is correct for HDFS on your copy of CDH? I have CDH 0.3.3 and it came configured with HDFS on 8022 and the job tracker on 8021.
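One quick way to see which ports your copy of CDH is actually configured with is to grep the config files (a sketch, assuming they live in $HADOOP_HOME/conf; adjust the path for your layout):

grep -A 1 -E "fs.default.name|mapred.job.tracker" $HADOOP_HOME/conf/core-site.xml $HADOOP_HOME/conf/mapred-site.xml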

Cheers,
Mark.

Jasper
11-26-2010, 08:04 PM
Hi Mark,

You just made my day! Thanks a lot.

My licenses were OK (installed for both users). Changing "localhost" into "training-vm.local" in both XMLs and using ports 8022/8021 made the job work partially. But you were right: port 9090 is not meant for HDFS.

So I am now able to do some basic copy jobs via PDI. However, when the job arrives at the Map/Reduce job executor, it still fails with the error "unknown host: training-vm.local":

==================================================

2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : unknown host: training-vm.local
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : java.net.UnknownHostException: unknown host: training-vm.local
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.Client.getConnection(Client.java:849)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.Client.call(Client.java:719)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : $Proxy19.getProtocolVersion(Unknown Source)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:105)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1373)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1385)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.FileSystem.get(FileSystem.java:191)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:463)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:567)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
==========================================================


It looks to me that the JobTracker is not actually listening on port 8021 (but it should be). Here is the output from the CDH3 side:

==========================================================
hadoop@training-vm:/usr/lib/hadoop$ jps
23779 SecondaryNameNode
23971 Jps
21378 TaskTracker
23619 DataNode
23493 NameNode
21261 JobTracker
hadoop@training-vm:/usr/lib/hadoop$ netstat -plten | grep java
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp6 0 0 :::9090 :::* LISTEN 1001 181234 23493/java
tcp6 0 0 :::36579 :::* LISTEN 1001 162027 21261/java
tcp6 0 0 :::50020 :::* LISTEN 1001 182068 23619/java
tcp6 0 0 :::50857 :::* LISTEN 1001 179556 23779/java
tcp6 0 0 :::50090 :::* LISTEN 1001 181248 23779/java
tcp6 0 0 :::9290 :::* LISTEN 1001 163143 21261/java
tcp6 0 0 :::50060 :::* LISTEN 1001 163447 21378/java
tcp6 0 0 :::57229 :::* LISTEN 1001 177902 23493/java
tcp6 0 0 :::50030 :::* LISTEN 1001 162963 21261/java
tcp6 0 0 192.168.91.128:8021 :::* LISTEN 1001 162037 21261/java
tcp6 0 0 :::50070 :::* LISTEN 1001 180717 23493/java
tcp6 0 0 192.168.91.128:8022 :::* LISTEN 1001 177912 23493/java
tcp6 0 0 :::48057 :::* LISTEN 1001 178637 23619/java
tcp6 0 0 :::50010 :::* LISTEN 1001 180889 23619/java
tcp6 0 0 :::50075 :::* LISTEN 1001 181249 23619/java
tcp6 0 0 :::34014 :::* LISTEN 1001 182073 23619/java
tcp6 0 0 127.0.0.1:33086 :::* LISTEN 1001 164419 21378/java
hadoop@training-vm:/usr/lib/hadoop$ netstat - a | grep 8022
tcp6 0 0 192.168.91.128%14:45557 192.168.91.128%819:8022 ESTABLISHED
tcp6 0 0 192.168.91.128%146:8022 192.168.91.128%81:45557 ESTABLISHED
hadoop@training-vm:/usr/lib/hadoop$ netstat - a | grep 8021
hadoop@training-vm:/usr/lib/hadoop$ netstat - a | grep 8021
hadoop@training-vm:/usr/lib/hadoop$
===================================================================


So it's 1 down, still 1 to go.

Jasper
12-06-2010, 04:41 PM
Hi,

Well, finally I got everything to work, and I have to say the Pentaho part was never the problem. The problem was the network connectivity between my VMware image and the local PDI client on the host machine.

A few lessons learned:

-There is no single default port for Hadoop HDFS and Map/Reduce. You set them yourself in the core-site.xml and mapred-site.xml files on the namenode. With the CDH3 distribution it's 8022 / 8021 respectively.
-CDH3 out of the box is not configured to accept remote calls from the PDI client on those ports. You have to change "localhost" in core-site.xml and mapred-site.xml to "<some_hostname>.local" so it can be resolved over the network (see the sketch below).
-Install the Hadoop licenses for both the "training" and the "hadoop" user on the CDH3 side.
-Anything you push to the cluster via the PDI client gets the user of the PDI client as owner in HDFS. The local Hadoop users 'training' and 'hadoop' are not involved, but if you want to use their HDFS folders you have to chmod 777 them first.
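A sketch of what those two config changes end up looking like on the namenode, assuming the training-vm.local hostname and the 8022/8021 ports mentioned above:

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://training-vm.local:8022</value>
</property>

mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>training-vm.local:8021</value>
</property>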

Good luck