Hitachi Vantara Pentaho Community Forums

Thread: Push down data integration workload to MR job

  1. #1



    I wonder whether it is possible to connect and join a second input in a mapper transformation that is translated into an executable Hadoop MR jar.

    Between the Injector and the Dummy (Output) steps, can you join with a Hadoop File Input branch, or can PDI only handle a single linear flow from Injector to Dummy (Output)? I already tried a 'branched' flow like the one in the attached ktr. The job that calls this mapper fails, but I am not sure whether the extra file input causes this or something else.

    Hive is able to pick up two files, join them, and translate this into an MR job, so is PDI capable of this too? It would be nice to manage this in PDI alone.
    Last edited by Jasper; 12-09-2010 at 05:17 PM.

  2. #2


    Older and wiser now.

    It is possible to do a join with a second input stream in the mapper, in addition to the injector, as in the attached ktr file. The problem is that this mapper turned out to be horribly slow. I hope to find a way to tune this back to acceptable lead times.

  3. #3


    Your wordcount-mapper-plus.ktr is a valid transformation. We've been working on some performance optimizations and have a solution that will be included in a future release.

  4. #4


    Does this solution involve using the distributed cache option in Hadoop?

    There should be a way to leverage the distributed cache functionality from the PDI Hadoop steps...
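
    For readers unfamiliar with the idea, here is a rough sketch of the map-side join pattern that the distributed cache is typically used for: the small side of the join is loaded into memory once per mapper, and each streamed record is then matched with a dictionary lookup. This is plain Python for illustration only, not PDI or Hadoop code. The function names (load_lookup, map_side_join) and the tab-delimited record layout are assumptions made up for this example.

    ```python
    from typing import Dict, Iterable, Iterator, Tuple


    def load_lookup(lines: Iterable[str]) -> Dict[str, str]:
        """Read the small side of the join once (assumed format: key<TAB>value)."""
        lookup: Dict[str, str] = {}
        for line in lines:
            key, _, value = line.rstrip("\n").partition("\t")
            if key:
                lookup[key] = value
        return lookup


    def map_side_join(
        records: Iterable[str], lookup: Dict[str, str]
    ) -> Iterator[Tuple[str, str, str]]:
        """Stream the large side and emit only records whose key is in the lookup."""
        for line in records:
            key, _, payload = line.rstrip("\n").partition("\t")
            if key in lookup:
                # Inner join: one in-memory lookup per record, no second file scan.
                yield key, payload, lookup[key]


    # Hypothetical sample data standing in for the cached file and the mapper input.
    lookup = load_lookup(["a\tapple\n", "b\tbanana\n"])
    joined = list(map_side_join(["a\t1\n", "c\t3\n", "b\t2\n"], lookup))
    ```

    In a real Hadoop job the lookup file would be shipped to each task node through the distributed cache and read in the mapper's setup phase. Whether the PDI steps expose this is exactly the open question here.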
