Hitachi Vantara Pentaho Community Forums

Thread: Hadoop File Input, basic question

  1. #1

    Default Hadoop File Input, basic question

    I am trying to connect to Hadoop/HDFS (pseudo-distributed, running in a VM) using the Hadoop File Input step. I need to browse to a file on HDFS and select it in the Open File window that pops up, but I am unable to see any file or folder names on HDFS.


    I have tried a few different port numbers, with and without a user ID, without success, and I made sure the filter is set to 'All Files'. The firewall is disabled on the Hadoop VM.


    If I use port 50070 or 50075 (I can reach Hadoop's web interfaces on these ports), I see only '/' listed.
    If I try 8020, which I think is the HDFS default, I get "Unable to connect to HDFS Server". An arbitrary port number (say, 12000) gives the same error.


    I am not sure which port number I should be using. I am attaching my core-site.xml and hdfs-site.xml; please let me know if there are other config files I should attach to provide further information.

    Thanks for any direction with this!
    Attached Images
    Attached Files
    Last edited by acimi; 12-28-2010 at 12:22 PM.

  2. #2
    Join Date: Aug 2010 | Posts: 87

    Default

    According to your core-site.xml, "fs.default.name" is set to "hdfs://user1-vm:8020". That's the address HDFS is hosted at, so you need to be able to open a socket connection to "user1-vm:8020". My guess is that your network configuration is preventing user1-vm from being resolved properly on the server. A simple fix could be to make sure /etc/hosts lists your hostname with its actual IP address, not 127.0.1.1, which Ubuntu has recently switched to (see the sketch after the links below). For more information on the topic, see these mailing list entries:

    http://www.leonardoborda.com/blog/12...ubuntu-debian/
    http://lists.debian.org/debian-boot/.../msg01047.html
    http://lists.debian.org/debian-boot/.../msg00938.html
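
    For illustration only (the IP below is a placeholder; substitute the VM's actual address), /etc/hosts on the VM would look something like this, and you can verify resolution from the PDI machine:
    Code:
    # /etc/hosts on user1-vm; 192.168.56.101 is a placeholder for the VM's real IP
    127.0.0.1         localhost
    192.168.56.101    user1-vm

    # From the PDI machine, check that the name resolves to the real IP, not 127.0.1.1
    getent hosts user1-vm
    ping -c 1 user1-vm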

  3. #3

    Default

    Unfortunately, this didn't help.

  4. #4
    Join Date: Aug 2010 | Posts: 87

    Default

    Can you open a socket connection to user1-vm:8020 from the machine on which PDI is failing to connect to HDFS?
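
    For example, something like this from the PDI machine should succeed (assuming telnet or a netcat with -z support is installed; the host and port come from your fs.default.name):
    Code:
    # Either of these should connect if the NameNode is reachable on port 8020
    telnet user1-vm 8020
    nc -zv user1-vm 8020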

  5. #5

    Default

    Yes, I can.
    If there are any other ideas, I'd be willing to try them out! Thanks.
    Last edited by acimi; 12-29-2010 at 12:41 PM.

  6. #6
    Join Date: Aug 2010 | Posts: 87

    Default

    What version of Hadoop are you trying to connect to? PDI for Hadoop is currently only compatible with 0.20.x.

  7. #7

    Default 0.20.2+737

    Code:
    user1@user1-vm:~$ hadoop version
    Hadoop 0.20.2+737
    Subversion  -r 98c55c28258aa6f42250569bd7fa431ac657bdbd
    Compiled by root on Mon Oct 11 17:21:30 UTC 2010
    From source with checksum d13991fbc138e18f3b7eb8f60ee708dd

  8. #8

    Default

    Hi acimi,

    I advise you to use something like netstat or Wireshark to see what is really going on when your PDI client reaches out to Hadoop; for example:
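
    Just a rough sketch (adjust the interface and port to your setup); run these on the machine where PDI is running while it tries to browse HDFS:
    Code:
    # List TCP connections to the NameNode port while PDI is browsing HDFS
    netstat -tn | grep 8020

    # Or capture the traffic itself (needs root); interface and port are assumptions
    tcpdump -i any port 8020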

  9. #9
    Join Date: Aug 2010 | Posts: 87

    Default

    It seems like the Java API has subtle changes related to org.apache.hadoop.fs.FileSystem. For PDI to be able to talk to that specific build of Hadoop, you need to replace all hadoop-core*.jar files in the PDI installation directory with the hadoop-core jar that ships with the version of Hadoop you're using.

    I found the Hadoop CDH Beta 3 hadoop-core jar at /usr/lib/hadoop-0.20/hadoop-core-0.20.2+737.jar.
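
    For example, to see which hadoop-core jars your PDI installation currently bundles (the install path below is just a placeholder):
    Code:
    cd /path/to/data-integration       # placeholder for your PDI install directory
    find . -name 'hadoop-core*.jar'    # lists the bundled jars to replace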

  10. #10

    Default That did it.

    Thanks, jganoff.

  11. #11

    Default

    Quote Originally Posted by jganoff:
    Seems like the Java API has subtle changes related to org.apache.hadoop.fs.FileSystem. In order for PDI to be able to talk to that specific build of Hadoop you need to replace all hadoop-core*.jar files in the PDI installation directory with the hadoop-core jar that ships with the version of hadoop you're using.

    I found the Hadoop CDH Beta 3 hadoop-core jar at /usr/lib/hadoop-0.20/hadoop-core-0.20.2+737.jar.
    Now my Pentaho-to-Hadoop CDH3b3 integration is down after upgrading from CDH2. Hadoop is up again, and so is Hive, but the connectivity between PDI and Hadoop is down.

    So, if I understand this correctly, you mean replacing $HADOOP_HOME/lib/hadoop-core-0.20.0.jar with a copy of hadoop-core-0.20.2+737.jar from $HADOOP_HOME/? So not a symlink or anything?

  12. #12
    Join Date: Aug 2010 | Posts: 87

    Default

    A symlink may work. Try it out and let me know how that goes. Thanks!

  13. #13

    Default

    Actually, I didn't read your post quite right. I did replace $HADOOP_HOME/lib/hadoop-core-0.20.0.jar with the +737 jar on all the Hadoop nodes (it couldn't hurt, since those jars come from the PHD package and remain after the CDH2 to CDH3b3 upgrade), but in fact you have to replace the same jars in data-integration/libext/hive and data-integration/libext/pentaho on your PDI client and/or server machine(s) to make it work...
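
    To spell that out as commands (a sketch only; the PDI install path is a placeholder and I'm assuming the bundled jars match hadoop-core*.jar as above):
    Code:
    cd /path/to/data-integration    # placeholder for your PDI install directory
    # Swap the bundled hadoop-core jars for the CDH3b3 build in both locations
    for dir in libext/hive libext/pentaho; do
        rm "$dir"/hadoop-core*.jar
        cp /usr/lib/hadoop-0.20/hadoop-core-0.20.2+737.jar "$dir"/
    done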
