View Full Version : Pentaho Data Integration with Hadoop

02-01-2012, 01:23 PM
Hi all,

I am evaluating Pentaho's integration with Hive. Before proceeding further, I would like to clarify a few questions:
1) At what stage is Pentaho's integration with Hive? Is it production-ready?
2) Does Pentaho use the standard Hive JDBC interface, or has Pentaho customized it? Also, does Pentaho support Hadoop security?
3) Is there a white paper on the scalability of Pentaho with Hadoop data?

Any pointers would be very helpful,


02-02-2012, 10:25 AM
Hi Ram,

1) Hive support is production ready. You can define metadata models backed by Hive as well as direct JDBC connections.
2) We have implemented missing functionality in the Apache Hive JDBC driver and contributed these changes back to Apache. These changes were included in the Hive 0.8.0 release. To support previous releases, we include our own Apache Hive 0.7.0 JDBC driver that has been tested with Hive 0.7.0 and 0.7.1.
3) None as of yet; however, the Pentaho MapReduce implementation leverages the Hadoop ecosystem to provide scalability. This is a topic we plan to address with case studies and white papers.
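Since the answer above describes direct JDBC connections, here is a minimal sketch of what such a connection looks like through the standard Hive JDBC interface of that era. The host, port, and database in the URL are placeholders for your own cluster; the driver class shown is the Hive 0.7.x/0.8.x (HiveServer1) one, and the snippet simply reports a missing driver jar rather than failing:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) {
        // Placeholder URL -- substitute your sandbox host, port, and database.
        String url = "jdbc:hive://sandbox:10000/default";
        try {
            // Driver class name used by the Hive 0.7.x/0.8.x JDBC driver.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                // Print each table name returned by Hive.
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        } catch (ClassNotFoundException e) {
            // The Hive JDBC jar (and its dependencies) must be on the classpath.
            System.out.println("Hive JDBC driver not on classpath: add the hive-jdbc jar and its dependencies");
        } catch (SQLException e) {
            System.out.println("Could not connect: " + e.getMessage());
        }
    }
}
```

The same driver and URL are what a PDI database connection of type Hive uses under the covers, which is why driver-version mismatches between PDI and the cluster matter.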

Hope this helps!

Best Regards,

02-22-2012, 02:05 PM
I have Pentaho Data Integration on my laptop (Windows).
I have Hadoop and Hive on my Linux sandbox, to which I connect using Cygwin. I already have SSH keys in place.
I am able to log in to my Linux sandbox and access the HDFS file system, where the path for my Hive tables exists.
On my sandbox, my local Linux file system and the HDFS file system are one and the same (/user/dist/hive is the path where all my Hive tables are located).

Now I would like to access the Hive data on my sandbox from the Spoon IDE.
I followed the procedure mentioned in

but I am still unable to see the files on my sandbox in the Spoon GUI.

Can anybody give me a more detailed, illustrated step-by-step procedure?
Is there any requirement that Pentaho be installed on a Linux box instead of a Windows machine?

Thanks a lot.

02-22-2012, 02:11 PM
Hi Siva,

One common issue is a mismatch between the Hadoop core jar that PDI ships with and the Hadoop core jar your cluster is running. To rule this out, replace the hadoop-core-*.jar in PDI with the one from your cluster, then try to access HDFS again.
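The swap itself is just delete-then-copy. The sketch below shows the two steps in Java; the temp directories and jar version numbers are purely illustrative stand-ins for your actual PDI lib directory and your cluster's Hadoop installation (the real paths depend on your PDI version and cluster layout):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class JarSwapSketch {
    public static void main(String[] args) throws IOException {
        // Throwaway dirs standing in for the real locations (illustrative only).
        Path pdiLib = Files.createTempDirectory("pdi-lib");        // where PDI keeps its Hadoop jar
        Path clusterLib = Files.createTempDirectory("cluster-lib"); // your cluster's Hadoop install
        Files.createFile(pdiLib.resolve("hadoop-core-0.20.2.jar"));         // jar PDI shipped with
        Files.createFile(clusterLib.resolve("hadoop-core-0.20.205.0.jar")); // jar the cluster runs

        // Step 1: remove the bundled hadoop-core jar from PDI.
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(pdiLib, "hadoop-core-*.jar")) {
            for (Path jar : jars) Files.delete(jar);
        }
        // Step 2: copy in the cluster's hadoop-core jar.
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(clusterLib, "hadoop-core-*.jar")) {
            for (Path jar : jars) Files.copy(jar, pdiLib.resolve(jar.getFileName()));
        }
        // PDI's lib dir now holds the cluster's jar.
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(pdiLib, "hadoop-core-*.jar")) {
            for (Path jar : jars) System.out.println(jar.getFileName());
        }
    }
}
```

In practice you would do the equivalent with `rm` and `cp` (or `del` and `copy` on Windows) against the real directories, then restart Spoon so the new jar is picked up.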

Hope this helps!