
View Full Version : Is it possible to load data in parallel into one table using Hadoop on PDI?



afancy
01-02-2011, 12:03 PM
Currently, PDI on Hadoop only outputs data into HDFS, using a Dummy step as the output.
Instead of saving the data into HDFS, is it possible to load the data directly into a table in the database, e.g. loading data in parallel into the same fact table using Hadoop? Thanks!

jganoff
01-04-2011, 10:33 AM
That's entirely possible by using the Table Output step, just as you would in any other transformation. The output steps defined in the Hadoop Transformation Job Executor step are still required to designate which step's output should be passed on as the output of the Mapper or Reducer, but you can do (almost) anything you want in the transformation.
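
If it helps, here is a minimal sketch (untested, and the .ktr file path is made up) of running such a mapper transformation locally with the Kettle Java API, just to confirm the Table Output step is wired up correctly before handing the transformation to the Hadoop Transformation Job Executor:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

// Sketch: run the mapper transformation locally to verify the Table Output step
// before using it inside the Hadoop Transformation Job Executor.
public class RunMapperTransLocally {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                        // load step and database plugins
        TransMeta transMeta = new TransMeta("mapper-with-table-output.ktr"); // hypothetical path
        Trans trans = new Trans(transMeta);
        trans.execute(null);                             // no command line arguments
        trans.waitUntilFinished();
        if (trans.getErrors() > 0) {
            System.err.println("Transformation finished with errors");
        }
    }
}

Running it locally (or just previewing it in Spoon) takes Hadoop out of the equation while you verify the database side of the transformation.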

afancy
01-05-2011, 09:06 AM
Thanks!

But I found a problem running the transformation on Hadoop: the transformation cannot access the database when it runs on Hadoop. For example, when I fetch a sequence from the database, it always throws an exception. Could you advise? Thanks.

java.io.IOException: org.pentaho.di.core.exception.KettleException:
We failed to initialize at least one step. Execution can not begin!


at org.pentaho.hadoop.mapreduce.GenericTransMap.map(SourceFile:188)
at org.pentaho.hadoop.mapreduce.GenericTransMap.map(SourceFile:22)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.pentaho.di.core.exception.KettleException:
We failed to initialize at least one step. Execution can not begin!


at org.pentaho.di.trans.Trans.prepareExecution(Trans.java:740)
at org.pentaho.hadoop.mapreduce.GenericTransMap.map(SourceFile:39)
... 5 more

jganoff
01-05-2011, 10:55 AM
Do you have the required database driver and any accompanying jars needed to access your database in the $HADOOP_HOME/lib directory of each node?
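
If you're not sure whether the driver is visible to the task JVMs, a rough check (the class name below assumes the standard PostgreSQL JDBC driver) is to try loading the driver class with the node's Hadoop lib directory on the classpath:

// Minimal classpath check: succeeds only if the JDBC driver jar is on the classpath.
public class DriverCheck {
    public static void main(String[] args) {
        try {
            // org.postgresql.Driver is the standard PostgreSQL JDBC driver class.
            Class.forName("org.postgresql.Driver");
            System.out.println("JDBC driver found on classpath");
        } catch (ClassNotFoundException e) {
            System.out.println("JDBC driver NOT found: " + e.getMessage());
        }
    }
}

Compile it and run it on each node with something like: java -cp ".:$HADOOP_HOME/lib/*" DriverCheck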

afancy
01-12-2011, 05:20 AM
Hi,
About the previous exception: after I copied the PostgreSQL JDBC jar into hadoop/lib, the problem was solved.

Now I am running into a lot of problems using Pentaho Hadoop to load data into the database. One of them is the following:

I have implemented a transformation that inserts data into the database, and this transformation is used as the mapper for PDI Hadoop. However, I found that the database connection is opened and closed for every row (see details at http://dpaste.de/qJ3R/).

PDI Hadoop does not expose the configure() and close() methods of the Mapper and Reducer, which execute only once, before the first and after the last call to map() or reduce(). So I suspect that the code establishing the database connection was placed inside the map and reduce functions, so that the connection is opened and closed for every row.
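
For comparison, here is a rough sketch (the class, connection settings, table and column names are all made up for illustration; this is not PDI's actual code) of what a hand-written mapper against the old mapred API looks like when the connection is opened once per task in configure() and released in close(), rather than once per row:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustration only: a mapper that keeps one JDBC connection per map task.
// configure() runs once before the first map() call, close() once after the last.
public class FactTableLoadMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private Connection connection;
    private PreparedStatement insert;

    @Override
    public void configure(JobConf job) {
        try {
            Class.forName("org.postgresql.Driver");
            // Connection settings would normally come from the JobConf; hard-coded here for brevity.
            connection = DriverManager.getConnection(
                    "jdbc:postgresql://dbhost:5432/warehouse", "etl", "secret");
            connection.setAutoCommit(false);
            insert = connection.prepareStatement(
                    "INSERT INTO fact_table (dim_key, measure) VALUES (?, ?)");
        } catch (Exception e) {
            throw new RuntimeException("Could not open database connection", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, NullWritable> output,
                    Reporter reporter) throws IOException {
        try {
            String[] fields = value.toString().split("\t");
            insert.setLong(1, Long.parseLong(fields[0]));
            insert.setDouble(2, Double.parseDouble(fields[1]));
            insert.addBatch();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    public void close() throws IOException {
        try {
            insert.executeBatch();
            connection.commit();
        } catch (SQLException e) {
            throw new IOException(e);
        } finally {
            try { connection.close(); } catch (SQLException ignored) {}
        }
    }
}

A real loader would also flush the batch periodically inside map() instead of only in close(), but the point is that the connection lives for the whole task, not for a single row.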