Hadoop integration in Pentaho CE



ashokriwaria
05-05-2011, 03:30 AM
Hi,

I am working on establishing communication with a Hadoop/Hive data warehouse in Pentaho BI. Can someone guide me on how that can be achieved?

I tried to define the connection properties in the jdbc.properties file and create a cube specific to my Hive database, but it throws an error that the schema name is not bound:

12:53:01,941 ERROR [Logger] Error: Pentaho
12:53:01,957 ERROR [Logger] misc-org.pentaho.platform.plugin.services.connections.mondrian.MDXConnection: MDXConnection.ERROR_0002 - Invalid connection properties: PoolNeeded=false; dataSource=pyramid; Provider=mondrian; Catalog=solution:steel-wheels/analysis/pyramid.mondrian.xml; DynamicSchemaProcessor=mondrian.i18n.LocalizingDynamicSchemaProcessor; Locale=en_US
org.pentaho.platform.api.data.DatasourceServiceException: javax.naming.NameNotFoundException: Name pyramid is not bound in this Context
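
For reference, the entries I added to pentaho-solutions/system/simple-jndi/jdbc.properties look roughly like the ones below (host, port and credentials are placeholders, and I am assuming the old HiveServer1-style JDBC driver):

# simple-jndi datasource entry for the "pyramid" connection the cube refers to
# (placeholder host/port/credentials; driver assumed to be the HiveServer1 JDBC driver)
pyramid/type=javax.sql.DataSource
pyramid/driver=org.apache.hadoop.hive.jdbc.HiveDriver
pyramid/url=jdbc:hive://localhost:10000/default
pyramid/user=
pyramid/password=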


Anybody who can direct me in this regard would be a great help.

thanks,
Ashok Riwaria

Jasper
05-05-2011, 06:07 PM
To my knowledge you can't connect to Hadoop and Hive with the CE of PDI. Not sure about the CEs of the other modules, but it would surprise me if it were any different there...

ashokriwaria
05-06-2011, 05:45 AM
Is it possible to use Hadoop/Hive in the analysis view and Analyzer in the Enterprise Edition? If yes, can someone assist me with the same?

Ashok

jtcornelius
05-06-2011, 08:05 AM
There has been some great work done by the community to identify, and in some cases work around, areas where Mondrian's SQL generation runs into features Hive does not support (examples: http://jira.pentaho.com/browse/MONDRIAN-789 and http://jira.pentaho.com/browse/PDI-4355). That said, we do not officially support the use of Mondrian on top of Hive at this point. Beyond the known areas of incompatibility, it is probably not a great idea at this time anyway. Mondrian and Analyzer are designed to let users freely explore data, issuing SQL queries under the covers all the while. The latency of Hive queries at this point is not an ideal match for that use case. We would recommend using PDI/AgileBI to stage subsets of the data from Hadoop (even extracted in a summarized fashion using a Hive query) and building your Analysis Cubes from that. PDI EE provides a simple way to set this up and even schedule the extracts on a recurring basis. A rough sketch of the kind of summarizing extract I mean is below.
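
To make that concrete, the aggregation you would push down to Hive looks roughly like this. This is a minimal sketch in plain JDBC against a HiveServer1 endpoint rather than a PDI transformation, and the host, table and column names are made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSummaryExtract {
    public static void main(String[] args) throws Exception {
        // HiveServer1 JDBC driver; the hive-jdbc jar and its dependencies must be on the classpath.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection hive = DriverManager.getConnection(
                "jdbc:hive://hadoop-node:10000/default", "", "");

        // Let Hive/MapReduce do one big aggregation pass up front, instead of
        // Mondrian firing many interactive queries at it while a user explores.
        Statement stmt = hive.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT product_line, territory, SUM(sales) AS total_sales "
              + "FROM orderfact GROUP BY product_line, territory");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2)
                    + "\t" + rs.getDouble(3));
        }
        rs.close();
        stmt.close();
        hive.close();
    }
}

In PDI the same thing is just a Table input step with that query, so you would not normally hand-code it; the point is that the expensive query runs once, at staging time, not while someone is clicking around in Analyzer.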

hth, jake

pmalves
05-06-2011, 09:40 AM
I'll translate what jake said:

IF YOU TRY IT YOU'LL DIE OF OLD AGE AFTER 2 ANALYSES

jtcornelius
05-06-2011, 09:41 AM
Thanks for the translation... clearly I've been in management for too long ;)

ashokriwaria
05-06-2011, 09:46 AM
Thanks Jake/Pedro

So can you please suggest an alternate path or solution? Or is this targeted for some future release of Pentaho?

Ashok

pmalves
05-06-2011, 09:57 AM
It's not about Pentaho, it's about Hive.

I know Nick Goodman from LucidDB / DynamoBI has a project that puts LucidDB in front of Hadoop for fast analysis; try to get info there.

jtcornelius
05-06-2011, 12:07 PM
Pedro's suggestion is a good one. Otherwise, I would just use PDI to query a slice of data out of Hadoop via Hive, load it into an RDBMS like LucidDB/MySQL/etc., and then build your cubes/dashboards on that. PDI EE can help you schedule the extract/load if you want to refresh the data periodically.
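
If you want to see what that extract/load amounts to outside of PDI, a bare-bones JDBC version would look something like the sketch below. Host names, credentials and the table/column names are placeholders, and the Hive driver assumed is the old HiveServer1 one; a PDI transformation with a Table input and a Table output step does the same job with no code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveToMySqlLoad {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Class.forName("com.mysql.jdbc.Driver");

        Connection hive = DriverManager.getConnection(
                "jdbc:hive://hadoop-node:10000/default", "", "");
        Connection mysql = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/marts", "pentaho_user", "password");

        // Pull an aggregated slice out of Hive...
        Statement extract = hive.createStatement();
        ResultSet rs = extract.executeQuery(
                "SELECT product_line, territory, SUM(sales) FROM orderfact "
              + "GROUP BY product_line, territory");

        // ...and batch-insert it into the relational table the cube points at.
        PreparedStatement load = mysql.prepareStatement(
                "INSERT INTO sales_summary (product_line, territory, total_sales) "
              + "VALUES (?, ?, ?)");
        while (rs.next()) {
            load.setString(1, rs.getString(1));
            load.setString(2, rs.getString(2));
            load.setDouble(3, rs.getDouble(3));
            load.addBatch();
        }
        load.executeBatch();

        load.close();
        rs.close();
        extract.close();
        mysql.close();
        hive.close();
    }
}

Point your Mondrian schema at the MySQL datasource and the interactive queries never touch Hive; scheduling something like this (or the equivalent PDI job) is how you keep the data refreshed.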