Hive Connection on Windows - Configuration of Data Source with JDBC driver



fgelbal
08-12-2010, 11:56 AM
Hello,

I am new to Pentaho, and I want to connect it to our Hadoop/Hive environment. I am using a Windows client. Can you please guide me on how to configure the connection?

We have our own JDBC driver to connect to Hive, which I want to use. (This driver works fine with the SQuirreL SQL Client.) I found Linux configuration instructions on this page, but there are none for Windows:
http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface
Do you know how to do this in a Windows environment?

I would like to use all reporting and dashboard capabilities of Pentaho with our own data in the Hadoop/Hive environment.

Thanks in advance,

- firat gelbal

jtcornelius
08-13-2010, 08:44 AM
Hello Firat,

Thanks for posting. One of our presales engineers wrote a blog post and recorded a video on how to define a connection in the BI Server that may help:

http://blog.pentaho.com/2010/07/23/part-2-configuring-server-side-data-connections-even-easier/

If you want to use Pentaho Report Designer, you will need to:
1) drop your JDBC driver JAR into <install-dir>\report-designer\lib\jdbc before starting Report Designer
2) when defining your connection, use the 'Generic database' option and manually provide your connection URL and the JDBC driver class name, as in the example below
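To sanity-check the driver and connection URL outside of the design tools, a minimal JDBC test along the lines of the example on the Hive wiki should work. The host, port, and database below are placeholders for your environment:

// Minimal connectivity check for the Hive JDBC driver.
// "localhost:10000" and "default" are placeholders -- adjust for your cluster.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSmokeTest {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // List the tables visible through the connection.
    ResultSet rs = stmt.executeQuery("SHOW TABLES");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}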

Also, I'm pleased to share that we are wrapping up the first set of our own contributions to the Hive JDBC driver to make it more compatible with the Pentaho BI Suite. As part of this, we have also added Hive as a selectable option in the connection dialog. Later this week I will be posting our cut of the driver with an updated drop of PDI so that you can test the integration with the common Connections dialog. You can follow the progress of the beta program, access downloads, etc., by going to:

http://wiki.pentaho.com/display/EAI/PDI+and+Hadoop+Integration

I'm very interested in getting your feedback on any problems you run into using Hive and the Hive JDBC driver with Pentaho as we are actively contributing to the Apache project to make the driver more robust.

Thanks!
Jake

fgelbal
08-16-2010, 11:09 AM
Hello Jake,

Thanks for the prompt response. The blog link was indeed helpful. I set up the connection, and Pentaho can communicate with our Hive server now.

However, I cannot go any further, as Pentaho's Metadata Editor fails to get the metadata from our Hive server. It displays the table names, but fails while 'importing tables'. Here is the error message:

org.pentaho.di.core.exception.KettleDatabaseException:
Couldn't get field info from [SELECT * FROM calendar]

Method not supported

at org.pentaho.di.core.database.Database.getQueryFieldsFallback(Database.java:2315)
at org.pentaho.di.core.database.Database.getQueryFields(Database.java:2251)
at org.pentaho.di.core.database.Database.getQueryFields(Database.java:1928)
at org.pentaho.di.core.database.Database.getTableFields(Database.java:1923)
at org.pentaho.pms.ui.MetaEditor.importTableDefinition(MetaEditor.java:3384)
at org.pentaho.pms.ui.MetaEditor.importMultipleTables(MetaEditor.java:3334)
at org.pentaho.pms.ui.MetaEditor.newConnection(MetaEditor.java:2799)
at org.pentaho.pms.ui.MetaEditor$6.handleEvent(MetaEditor.java:490)
at org.eclipse.swt.widgets.EventTable.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Widget.sendEvent(Unknown Source)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Unknown Source)
at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
at org.pentaho.pms.ui.MetaEditor.readAndDispatch(MetaEditor.java:851)
at org.pentaho.pms.ui.MetaEditor.main(MetaEditor.java:3589)
Caused by: java.sql.SQLException: Method not supported
at org.apache.hadoop.hive.jdbc.HivePreparedStatement.getMetaData(HivePreparedStatement.java:132)
at org.pentaho.di.core.database.Database.getQueryFieldsFallback(Database.java:2308)
... 13 more



Perhaps it's better if I wait until you wrap up your Hive JDBC driver. Can you please update me when the new driver is ready, along with 'how to install' guidelines?

Finally, can you advise me on how to start a project in Pentaho? I checked the knowledge base, but the articles I found mainly involve using the demo data. What I want is to use our own data by retrieving it from our Hive servers. I assume the first step should be implementing the metadata, but as I said, the Metadata Editor throws an error during import, and I cannot use Schema Workbench as there is no metadata to work with. I would be happy to get advice on how to evaluate Pentaho with our own data.

Thanks and best,

- firat

jtcornelius
08-17-2010, 10:14 AM
Thanks for the feedback, Firat. We just ended our first sprint on updating the Hive JDBC driver to be more compatible with Pentaho. I will be posting a copy of the driver later this week with guidelines on what works and what doesn't. Currently, the biggest support gap we're working through is the lack of support in Hive for prepared statements. Look for more info from me in the next couple of days on the availability of the updated driver.
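For background on the 'Method not supported' error: Metadata Editor asks the driver for field information via PreparedStatement.getMetaData() before the query is ever executed, and HivePreparedStatement does not implement that method (that is the call at the bottom of your stack trace). One possible fallback, sketched here under my own assumptions rather than as the actual patch, is to execute the query and read the column information from the ResultSet instead:

// Sketch only: obtain column metadata by executing the query, since
// HivePreparedStatement.getMetaData() throws "Method not supported".
// Wrapping the query in a LIMIT 0 subquery (Hive requires the alias) is
// an assumption about how to avoid pulling real rows.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class HiveFieldProbe {
  public static ResultSetMetaData probeFields(Connection con, String sql)
      throws Exception {
    Statement stmt = con.createStatement();
    // Run a cheap variant of the query just to learn the column layout.
    ResultSet rs = stmt.executeQuery(
        "SELECT * FROM (" + sql + ") probe LIMIT 0");
    return rs.getMetaData();
  }
}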

-Jake

fgelbal
08-18-2010, 08:20 AM
Thanks for the support, Jake. Our Senior Database Developer has made a quick fix for this, so the Metadata Editor was able to get the data (I don't know how). But there are now two new major issues:

1- The Metadata Editor does not generate inner joins properly: Hive requires joins to be written in the FROM clause, while Metadata Editor puts them in the WHERE clause. This is strange, as the SQL it generates for outer joins is all in the FROM clause.

I see that this was reported earlier in the discussion forums but got no replies. I posted it again:

http://forums.pentaho.com/showthread.php?77921-Metadata-Editor-cannot-properly-do-Inner-Join&p=242483

2- Our fact table is partitioned on a date field (which is basically a calendar_id that represents a day in the calendar table). However, because of how it is configured in Hive, this date field is not exposed as part of the table schema, so Metadata Editor does not see it.
I am not sure if this is an issue yet; I tried a workaround, but I cannot test it because of the first issue.

I will be happy to get as much help as I can with these issues.

Good luck and success.

- firat

jganoff
08-20-2010, 09:47 AM
Hello Firat,


1- The Metadata Editor does not generate inner joins properly: Hive requires joins to be written in the FROM clause, while Metadata Editor puts them in the WHERE clause. This is strange, as the SQL it generates for outer joins is all in the FROM clause.

Great feedback. We've just started our next sprint focusing partly on hardening the Hive driver. We are in the process of integrating the Hive-specific JOIN syntax into Metadata Editor to fix this. You can follow the progress here: http://jira.pentaho.com/browse/PDI-4452.
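To illustrate the difference (the table and column names here are made up), Metadata Editor currently emits the implicit-join form, which Hive rejects:

SELECT f.amount, c.day
FROM fact_table f, calendar c
WHERE f.calendar_id = c.calendar_id

whereas Hive only accepts the explicit JOIN ... ON syntax in the FROM clause:

SELECT f.amount, c.day
FROM fact_table f JOIN calendar c ON (f.calendar_id = c.calendar_id)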


2- Our fact table is partitioned on a date field (which is basically a calendar_id that represents a day in the calendar table). However, because of how it is configured in Hive, this date field is not exposed as part of the table schema, so Metadata Editor does not see it.
I am not sure if this is an issue yet; I tried a workaround, but I cannot test it because of the first issue.

What is your workaround? How to best show the partition columns is something we're still thinking about.
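For context on why the column is hidden: in Hive, partition columns are declared in the PARTITIONED BY clause of the table's DDL rather than in the regular column list, so metadata calls that read only the column list may not report them. A made-up example:

-- Hypothetical table: the partition column calendar_id lives in
-- PARTITIONED BY, not in the column list, so some metadata calls
-- do not report it alongside the regular columns.
CREATE TABLE fact_table (
  amount DOUBLE,
  product_id INT
)
PARTITIONED BY (calendar_id STRING);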

-Jordan

fgelbal
08-23-2010, 03:48 AM
Hello Jordan,

Thanks for the prompt response and, more importantly, for taking action to solve these issues. Please let me know when the new driver is available. Unfortunately, I cannot use Pentaho for now.

As for the partitioning workaround: in the Metadata Editor, I added a new column to the fact table, and in the 'Calculation -> Formula' part of the Column Settings I simply put the physical name of the partition column (see the example below).
So for me, simply displaying the partition column as part of the table would already be good. Or you might put it directly under 'Connection' (beside the tables), perhaps with a different icon?
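For example, assuming the physical partition column is named calendar_id, the new column's formula is simply the physical column name:

calendar_id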

Good luck and success with the work.

- firat