Hitachi Vantara Pentaho Community Forums

Thread: Linking up PDI to Hadoop

  1. #21


    It looks like 9090 is the right port for PDI to connect to. Now when I launch my PDI job (copy files from local to HDFS), it at least seems to do something: it stalls, and at the same moment this error pops up in the log of the namenode process on CDH3:

    ===========================================
    Exception in thread "pool-3-thread-2" java.lang.OutOfMemoryError: Java heap space
    at org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:296)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:203)
    at org.apache.hadoop.thriftfs.api.Namenode$Processor.process(Namenode.java:1166)
    at org.apache.hadoop.thriftfs.SanerThreadPoolServer$WorkerProcess.run(SanerThreadPoolServer.java:277)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
    ==================================================

    HADOOP_HEAPSIZE is set to 1000 MB in the hadoop-env.sh file, and the VM has 2000 MB of RAM.
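
    In hindsight (see post #24 below), the stack trace in org.apache.hadoop.thriftfs suggests 9090 is the namenode's thrift plugin rather than the HDFS IPC port, and the OutOfMemoryError likely comes from TBinaryProtocol misreading non-thrift traffic as a huge string. If you genuinely do need more daemon heap, a minimal sketch, assuming the stock $HADOOP_HOME/conf layout:

    ===========================================
    # Sketch: raise the Hadoop daemon heap in $HADOOP_HOME/conf/hadoop-env.sh
    # (the value is in MB; 1500 still leaves headroom on a 2000 MB VM).
    export HADOOP_HEAPSIZE=1500
    # Then restart the namenode for the new heap size to take effect.
    ===========================================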
    Last edited by Jasper; 11-24-2010 at 03:57 PM.

  2. #22


    Hi Jasper,

    I have this running on Cloudera's VMware virtual machine image of CDH. The default user, on this distribution at least, is the "training" user. To get the Hadoop side of PDI working I needed to install the licenses for the "hadoop" user as well (su hadoop, password hadoop). The hostname on this virtual machine is "training-vm.local". For good measure I replaced all occurrences of "localhost" in the config files (core-site.xml and mapred-site.xml) in $HADOOP_HOME/conf with "training-vm.local".
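
    A minimal sketch of that replacement, assuming the stock paths on the CDH VM (sed keeps a .bak copy of each original, so you can roll back):

    ===========================================
    # Swap every "localhost" for the VM's hostname in both config files.
    cd $HADOOP_HOME/conf
    sudo sed -i.bak 's/localhost/training-vm.local/g' core-site.xml mapred-site.xml
    # Restart the Hadoop daemons afterwards so the new addresses take effect.
    ===========================================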

    Cheers,
    Mark.

  3. #23


    I just noticed your latest post :-) Are you sure that port 9090 is correct for HDFS on your copy of CDH? I have CDH 0.3.3 and it came configured with HDFS on 8022 and the job tracker on 8021.
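
    The quickest way to check is the VM's own config, something like:

    ===========================================
    # The configured addresses (and therefore the ports) live here;
    # -A 1 also prints the <value> line following each <name>.
    grep -A 1 'fs.default.name' $HADOOP_HOME/conf/core-site.xml
    grep -A 1 'mapred.job.tracker' $HADOOP_HOME/conf/mapred-site.xml
    ===========================================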

    Cheers,
    Mark.

  4. #24


    Hi Mark,

    You just made my day! Thanks a lot.

    My licenses were OK (installed for both users). Changing "localhost" to "training-vm.local" in both XML files and using ports 8022/8021 got the job partially working. You were right: port 9090 is not meant for HDFS.

    So I am now able to do some basic copy jobs via PDI. However, when the job arrives at the MapReduce job executor, it still fails with the error "unknown host: training-vm.local":

    ==================================================

    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : unknown host: training-vm.local
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : java.net.UnknownHostException: unknown host: training-vm.local
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.Client.getConnection(Client.java:849)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.Client.call(Client.java:719)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : $Proxy19.getProtocolVersion(Unknown Source)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:105)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1373)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1385)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.FileSystem.get(FileSystem.java:191)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:463)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:567)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
    2010/11/27 00:57:08 - Hadoop Transformation Job Executor - ERROR (version 4.1.0-GA, build 14380 from 2010-11-09 17.25.17 by buildguy) : org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    ==========================================================


    It looks to me like the JobTracker is not actually listening on port 8021 (but it should be). Here is the output from the CDH3 VM:

    ==========================================================
    hadoop@training-vm:/usr/lib/hadoop$ jps
    23779 SecondaryNameNode
    23971 Jps
    21378 TaskTracker
    23619 DataNode
    23493 NameNode
    21261 JobTracker
    hadoop@training-vm:/usr/lib/hadoop$ netstat -plten | grep java
    (Not all processes could be identified, non-owned process info
    will not be shown, you would have to be root to see it all.)
    tcp6 0 0 :::9090 :::* LISTEN 1001 181234 23493/java
    tcp6 0 0 :::36579 :::* LISTEN 1001 162027 21261/java
    tcp6 0 0 :::50020 :::* LISTEN 1001 182068 23619/java
    tcp6 0 0 :::50857 :::* LISTEN 1001 179556 23779/java
    tcp6 0 0 :::50090 :::* LISTEN 1001 181248 23779/java
    tcp6 0 0 :::9290 :::* LISTEN 1001 163143 21261/java
    tcp6 0 0 :::50060 :::* LISTEN 1001 163447 21378/java
    tcp6 0 0 :::57229 :::* LISTEN 1001 177902 23493/java
    tcp6 0 0 :::50030 :::* LISTEN 1001 162963 21261/java
    tcp6 0 0 192.168.91.128:8021 :::* LISTEN 1001 162037 21261/java
    tcp6 0 0 :::50070 :::* LISTEN 1001 180717 23493/java
    tcp6 0 0 192.168.91.128:8022 :::* LISTEN 1001 177912 23493/java
    tcp6 0 0 :::48057 :::* LISTEN 1001 178637 23619/java
    tcp6 0 0 :::50010 :::* LISTEN 1001 180889 23619/java
    tcp6 0 0 :::50075 :::* LISTEN 1001 181249 23619/java
    tcp6 0 0 :::34014 :::* LISTEN 1001 182073 23619/java
    tcp6 0 0 127.0.0.1:33086 :::* LISTEN 1001 164419 21378/java
    hadoop@training-vm:/usr/lib/hadoop$ netstat -a | grep 8022
    tcp6 0 0 192.168.91.128%14:45557 192.168.91.128%819:8022 ESTABLISHED
    tcp6 0 0 192.168.91.128%146:8022 192.168.91.128%81:45557 ESTABLISHED
    hadoop@training-vm:/usr/lib/hadoop$ netstat -a | grep 8021
    hadoop@training-vm:/usr/lib/hadoop$ netstat -a | grep 8021
    hadoop@training-vm:/usr/lib/hadoop$
    ===================================================================
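
    For what it's worth, the empty greps at the end are probably just name resolution: without -n, netstat maps port numbers to service names where /etc/services has an entry, so "8021" may not appear literally even though the -plten output above shows it in LISTEN state on 192.168.91.128. A numeric re-check, as a sketch on the same VM:

    ===========================================
    # -n keeps addresses and port numbers numeric, so the grep matches literally
    netstat -an | grep 8021
    netstat -an | grep 8022
    ===========================================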


    So it's one down, one to go.
    Last edited by Jasper; 11-26-2010 at 08:06 PM.

  5. #25

    Finally

    Hi,

    Well, I finally got everything to work, and I have to say the Pentaho part was never the problem. The problem was the network connectivity between my VMware image and the local PDI client on the host machine.

    A few lessons learned (a consolidated sketch follows the list):

    - There is no default port for Hadoop HDFS and MapReduce. You set them yourself in the core-site.xml and mapred-site.xml files on the namenode. On the CDH3 distribution they are 8022 and 8021 respectively.
    - CDH3 out of the box is not configured to accept remote calls from the PDI client on the ports mentioned. You have to change "localhost" in core-site.xml and mapred-site.xml to "<some_hostname>.local" so it can be resolved over the network.
    - Install the Hadoop licenses for both the "training" and the "hadoop" user on the CDH3 VM.
    - Anything you push to the cluster via the PDI client gets the user of the PDI client as its owner in HDFS. The local Hadoop users "training" and "hadoop" are not involved, but if you want to use their HDFS folders you have to chmod 777 them first.
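
    A consolidated sketch of those fixes; the hostname and IP are the ones from this thread, so adjust them to your own VM:

    ===========================================
    # 1. On the host machine running PDI: make the VM's hostname resolvable
    #    (IP as shown in the netstat output above).
    echo "192.168.91.128  training-vm.local" | sudo tee -a /etc/hosts

    # 2. On the CDH3 VM: bind HDFS and the JobTracker to the real hostname.
    #    After the edit, the configs should effectively contain:
    #      core-site.xml:   fs.default.name    = hdfs://training-vm.local:8022
    #      mapred-site.xml: mapred.job.tracker = training-vm.local:8021
    sudo sed -i.bak 's/localhost/training-vm.local/g' \
        $HADOOP_HOME/conf/core-site.xml $HADOOP_HOME/conf/mapred-site.xml

    # 3. Open up the HDFS home folders PDI should be able to write into.
    sudo -u hadoop hadoop fs -chmod 777 /user/training /user/hadoop
    ===========================================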

    Good luck
