View Full Version : Push down data integration workload to MR job

12-09-2010, 05:12 PM

I wonder if it would be possible to connect and join a second input in a mapper transformation that gets translated into an executable Hadoop MR jar.

Between the Injector and the Dummy (Output), can you join with a Hadoop File Input branch, or can PDI only handle a single linear flow from Injector to Dummy (Output)? I already tried a 'branched' flow like in the attached ktr. The job that calls this mapper fails, but I am not sure whether it is the extra file input that causes this or something else.

Hive is able to pick up two files, join them, and translate this into an MR job, so is PDI capable of this too? It would be nice to manage this in PDI only.

01-07-2011, 03:56 PM
Older and wiser now.

It is possible to do a join with a second input stream in the mapper, in addition to the Injector, as in the attached ktr file. The problem is that this mapper turned out to be horribly slow. I hope to find a way to tune this back to acceptable lead times...
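What the branched mapper transformation effectively does on each task is a map-side hash join: the smaller input is loaded fully into memory once, and every record streamed in from the main input probes it. A minimal pure-Java sketch of that pattern (class, method, and sample data names are illustrative, not PDI or Hadoop APIs; when the small side is re-read per task from HDFS instead of being cached locally, this is also where the slowness tends to come from):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {

    /**
     * Joins each (key, value) record from the large streamed input against
     * a small lookup table held completely in memory, mimicking what a
     * mapper-side join does for every input split.
     */
    public static List<String> join(Map<String, String> smallLookup,
                                    List<String[]> largeStream) {
        List<String> out = new ArrayList<>();
        for (String[] record : largeStream) {
            String key = record[0];
            String side = smallLookup.get(key);   // O(1) probe per record
            if (side != null) {                   // inner join: drop misses
                out.add(key + "," + record[1] + "," + side);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Small input: loaded once per mapper. Re-reading this from HDFS
        // for every task is the usual cause of a "horribly slow" join.
        Map<String, String> lookup = new HashMap<>();
        lookup.put("nl", "Netherlands");
        lookup.put("us", "United States");

        // Large input: streamed record by record, as the Injector would.
        List<String[]> stream = new ArrayList<>();
        stream.add(new String[] {"nl", "42"});
        stream.add(new String[] {"de", "7"});   // no match, filtered out
        stream.add(new String[] {"us", "13"});

        for (String row : join(lookup, stream)) {
            System.out.println(row);           // prints nl,42,Netherlands
        }                                      // then   us,13,United States
    }
}
```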

01-07-2011, 05:10 PM
Your wordcount-mapper-plus.ktr is a valid transformation. We've been working on some performance optimizations and have a solution that will be included in a future release.

02-01-2011, 06:56 PM
Does this solution involve using the distributed cache option in Hadoop?

There should be a way to leverage the distributed cache functionality from the PDI Hadoop steps...
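For reference, this is how a hand-written Hadoop job of that era (old `mapred` API) would use the distributed cache to ship the small join input to every task node once, instead of re-reading it per task. Whether PDI exposes this from its Hadoop steps is exactly the open question here, so treat this as a plain-Hadoop job-configuration sketch, not a PDI feature; the file path is made up:

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupSketch {

    public static void configureJob(JobConf conf) throws Exception {
        // Registered once at job-submission time; Hadoop copies the file
        // to every node's local disk before the map tasks start.
        DistributedCache.addCacheFile(new URI("/lookup/countries.csv"), conf);
    }

    public static Path[] localCopies(JobConf conf) throws Exception {
        // Called from the mapper's configure(): returns local filesystem
        // paths of the cached files, readable with plain java.io instead
        // of an HDFS round-trip per task.
        return DistributedCache.getLocalCacheFiles(conf);
    }
}
```

The mapper would then build its in-memory lookup table from the local copy in `configure()`, which is what makes the map-side join fast.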