Hitachi Vantara Pentaho Community Forums

Thread: Linking up PDI to Hadoop

  1. #1

    Default Linking up PDI to Hadoop

    Hi,

    I am trying to set up a working Hadoop / Pentaho integrated environment. I managed to launch my single-node Hadoop cluster on an Ubuntu server image.

    Now I have to connect it to a Pentaho 3.7 RC environment, which is on another server (image).

    I found some useful information here:

    http://wiki.pentaho.com/download/att...op_pentaho.pdf

    In this document one prerequisite is "a Hadoop cluster already properly configured and tested". My question is whether that includes the installation of Hive as well.

    Is Hive involved when a PDI client is working with the PDI instance installed directly on the namenode?
    Last edited by Jasper; 11-08-2010 at 10:08 AM. Reason: new insights

  2. #2
    Join Date
    Mar 2008
    Posts
    140

    Default

    Hi Jasper,

    Hive is not involved when PDI is sending a Map/Reduce task to a Hadoop cluster; it only comes into play when you access Hive through JDBC.
    The Copy Files and Text File Input/Output steps utilize HDFS (standard on a Hadoop node).
    The Job Executor steps use the Map/Reduce functionality (standard on a Hadoop node).
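
    To make the HDFS side concrete, here is a minimal sketch of the kind of HDFS access those steps perform under the hood, written against the stock Hadoop client API. The namenode host and port are assumptions; check fs.default.name in your core-site.xml:

        import java.net.URI;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsListing {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // hdfs://namenode:9000 mirrors fs.default.name; adjust to your cluster
                FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
                // List the root directory, the same kind of call a Copy Files step makes
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }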

    All Hive interaction is masked as standard JDBC access and, as yet, does not have any specific steps called out.
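
    As a rough illustration, and assuming a Hive server listening on its default port 10000, that JDBC access looks like any other JDBC source. The connection details and table name below are placeholders:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveJdbcSketch {
            public static void main(String[] args) throws Exception {
                // The JDBC driver shipped with the Hive 0.x releases of the time
                Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
                Connection con = DriverManager.getConnection(
                        "jdbc:hive://localhost:10000/default", "", "");
                Statement stmt = con.createStatement();
                // my_table is a placeholder; substitute one of your own Hive tables
                ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10");
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
                con.close();
            }
        }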

    I am not sure if this answers your entire question; please let me know if you need more information.

    -Curtis

  3. #3
    Join Date
    Aug 2010
    Posts
    87

    Default

    Additionally, if you'd like to work with Hive, you'll need to set it up separately. You may want to start here: http://wiki.apache.org/hadoop/Hive/G..._Configuration.

  4. #4

    Default

    Hi Curtis,

    Thanks. I will go ahead and install Hive anyway after I have installed the Linux PDI on the namenode. Hive will be of good use later, I think.

    Another question: is the distributable of the Linux PDI client here
    http://www.sendspace.com/file/f6mhgu

    still good? I could not get the Linux client from the standard download page at www.pentaho.com/hadoop.

  5. #5
    Join Date
    Mar 2008
    Posts
    140

    Default

    The sendspace file is from early on in the Hadoop beta program. You will definitely want to use the latest version, RC1 at this time.

    What problem are you having with the www.pentaho.com/hadoop link? You really should be able to get the file and trial license that way.

    -Curtis
    Last edited by cboyden; 11-08-2010 at 11:46 AM. Reason: Received information on the sendspace.com file

  6. #6

    Default

    Well, the other day I downloaded the whole BI Suite; I had to fill out the form and received a confirmation email. When I returned today to get the Linux PDI client, I got a message that the Linux client is much more complicated and needs special support, that you should download the Windows client instead, and, worst of all, that granting the license for the Linux client takes two working days of evaluation by Pentaho staff.

    Anyway, when I fill out the form again, I'm not redirected to the download page as before; I have to fill it out again and again. It's probably that I have already "used up" my evaluation by downloading the BI Suite.
    Last edited by Jasper; 11-08-2010 at 05:11 PM.

  7. #7

    Default

    Hi,
    Now I have a PDI client set up on the Hadoop namenode and another client to connect to it.

    Is all the PDI communication through port 22, or do you have to open other ports as well? (Like :9000, which I saw somewhere during the installation of PDI on the Hadoop side.)
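
    For reference, the :9000 is typically the HDFS endpoint from fs.default.name in core-site.xml, and Map/Reduce jobs are submitted to the mapred.job.tracker endpoint (often :9001) from mapred-site.xml; port 22 is only SSH, which the Hadoop client RPC does not use. A small sketch that prints what a cluster is actually configured with, where the conf file paths are assumptions:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;

        public class ShowClusterEndpoints {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // Conf file locations are assumptions; point these at your install
                conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
                conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));
                // e.g. hdfs://localhost:9000, the port PDI needs for HDFS access
                System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
                // e.g. localhost:9001, the port used to submit Map/Reduce jobs
                System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
            }
        }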

  8. #8
    Join Date
    Aug 2010
    Posts
    87

    Default

    Quote Originally Posted by Jasper
    Now I have a PDI client set up on the Hadoop namenode and another client to connect to it.
    In order to execute transformations through Hadoop, not just on a machine configured as a Hadoop node, you need to install the Pentaho for Hadoop Distribution (PHD) archive into each node's Hadoop installation. This is not a PDI client but the actual PDI execution engine that will allow you to submit transformations as Map/Reduce jobs.
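
    To illustrate why it has to be on every node: the classes of a Map/Reduce job execute inside task JVMs on the cluster's worker nodes, not on the machine that submits the job. A bare-bones submission with the plain Hadoop 0.20 API shows the shape of this; the input and output paths are placeholders:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.lib.IdentityMapper;
        import org.apache.hadoop.mapred.lib.IdentityReducer;

        public class IdentityJob {
            public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(IdentityJob.class);
                conf.setJobName("identity-example");
                // These classes run inside task JVMs on the worker nodes, which is
                // why an engine like PHD must be present on every node as well
                conf.setMapperClass(IdentityMapper.class);
                conf.setReducerClass(IdentityReducer.class);
                // TextInputFormat (the default) produces LongWritable/Text pairs
                conf.setOutputKeyClass(LongWritable.class);
                conf.setOutputValueClass(Text.class);
                FileInputFormat.setInputPaths(conf, new Path("/in"));   // placeholder
                FileOutputFormat.setOutputPath(conf, new Path("/out")); // placeholder
                JobClient.runJob(conf);
            }
        }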

    Have you filled out the evaluation request for Pentaho Data Integration for Hadoop 4.1 RC1 from http://www.pentaho.com/download/? This is the much-preferred path to take, as the installation process for PDI for Hadoop has improved since the beta program.
    Last edited by jganoff; 11-11-2010 at 10:16 AM.

  9. #9

    Default

    Hi,

    Yes, I proceeded according to this doc: http://wiki.pentaho.com/download/att...op_pentaho.pdf

    So I copied the necessary files to /lib and /libext within the /libkettle dir, along with making the manual changes to hadoop.sh. Would that be enough?

    I had a pseudo-distributed Hadoop cluster up and running, but after these steps from the doc my namenode does not start anymore (error: "FSNamesystem initialization failed"), so I am not there yet.
    Last edited by Jasper; 11-11-2010 at 11:39 AM.

  10. #10

    Default

    @jganoff
    By "into each node's Hadoop installation" you mean each datanode right ?

    Would this also work on a Cloudera CDH2 pseudo-distributed cluster? How many PHD installation would that require?
