View Full Version : Joining large files via MapReduce with Kettle



better
05-22-2013, 09:52 AM
I have two very large files that are table exports from an RDBMS (tab-delimited), and the files are stored in HDFS. I need to "join" these two files on one of the "columns".

If you are familiar with Pig, you know that it is possible to use MapReduce to join files within HDFS. I was wondering if there is a way from within Kettle to join files using MapReduce. I know you can use a Join Rows step, but I don't think that uses the power of the Hadoop cluster to do the work. I also know that you could layer the files under HBase, but since I'm using this for ETL, the files won't be static.

Any pointers would be appreciated.

-Barry

mattb_pdi
05-22-2013, 10:22 AM
Hemal Govind has a great blog post about this:

http://hgovind.wordpress.com/2013/04/25/how-to-join-big-data-sets-in-using-mapreduce-and-pdi/
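The approach described there is the classic reduce-side join: mappers tag each record with its source file and emit the join key; the shuffle groups both files' records by key; the reducer cross-joins the two sides. Here is a minimal, self-contained sketch of that idea in plain Python (illustrative only — not Kettle's or Pig's actual implementation; the sample data, tab-delimited layout, and key positions are hypothetical assumptions):

```python
# Illustrative sketch of a reduce-side join, the pattern MapReduce-based
# joins use. Not Kettle's actual implementation; data is hypothetical.
from collections import defaultdict
from itertools import product

def map_phase(records, source_tag, key_index):
    """Mapper: emit (join_key, (source_tag, fields)) for each record."""
    for rec in records:
        fields = rec.split("\t")
        yield fields[key_index], (source_tag, fields)

def reduce_phase(grouped):
    """Reducer: for each key, cross-join the records from the two sources."""
    for key, tagged in grouped.items():
        left = [fields for tag, fields in tagged if tag == "A"]
        right = [fields for tag, fields in tagged if tag == "B"]
        for l, r in product(left, right):
            yield key, l + r

def join(file_a_lines, file_b_lines, key_a=0, key_b=0):
    # The "shuffle": group all tagged records by join key.
    grouped = defaultdict(list)
    for key, tagged in map_phase(file_a_lines, "A", key_a):
        grouped[key].append(tagged)
    for key, tagged in map_phase(file_b_lines, "B", key_b):
        grouped[key].append(tagged)
    return list(reduce_phase(grouped))

# Example: join two tab-delimited "files" on their first column.
a = ["1\tAlice", "2\tBob"]
b = ["1\tNYC", "3\tLA"]
print(join(a, b))  # only key "1" appears on both sides
```

On a real cluster the grouping is done by Hadoop's shuffle across the nodes, which is what gives this pattern its scalability over a single-machine Join Rows step.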