Hitachi Vantara Pentaho Community Forums

Thread: How to partition the input data?

  1. #1
    Join Date
    Dec 2010
    Posts
    10

    Default How to partition the input data?

    I read the document "Using Hadoop with Pentaho Data Integration", which says to use the "Injector" step to pull data from the Hadoop cluster. But we know that when several tasks run in parallel on Hadoop, each task processes a chunk of the data set rather than the full set. The document, however, says nothing about data partitioning, which makes it seem as if the "Injector" returns the full input data set to each task. Is that true?
    For example, suppose I have a CSV file containing one line, and I run two map tasks in parallel and one reducer to count the frequency of each word:
    "Hello World"

    If the "Injector" returns the same line to both tasks, then
    Map Task 1: key, value = (Hello, 1), (World, 1)
    Map Task 2: key, value = (Hello, 1), (World, 1)

    The reducer combines these results, so the final result is (Hello, 2), (World, 2), which is obviously incorrect.

    So my question is: how is the input data partitioned and read by each task?

  2. #2
    Join Date
    Aug 2010
    Posts
    87

    Default

    Quote Originally Posted by afancy View Post
    I read the document "Using Hadoop with Pentaho Data Integration", which says to use the "Injector" step to pull data from the Hadoop cluster. But we know that when several tasks run in parallel on Hadoop, each task processes a chunk of the data set rather than the full set. The document, however, says nothing about data partitioning, which makes it seem as if the "Injector" returns the full input data set to each task. Is that true?
    The Injector step is simply the entry point for the transformation. Transformations used as mappers and reducers are exactly that: mappers and reducers. We rely on Hadoop to partition the data as it would for any other job, and we feed the input from Hadoop into the Injector step of the transformation. From there the transformation executes, and any data received by the Output step is passed back to Hadoop as the output of that mapper or reducer.
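
    For comparison, here is a rough sketch of what an equivalent hand-written Hadoop mapper looks like in plain Java (the class name WordCountMapper is hypothetical; PDI wraps your transformation for you instead of requiring code like this). The point is that Hadoop calls map() once per record of the mapper's own input split, so no mapper ever sees the full data set:

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Hadoop calls map() once per record of this mapper's input split
        // (one line of text with the default TextInputFormat), so the rows
        // fed through the Injector step never include lines that belong to
        // another mapper's split.
        public class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE); // emit {word, 1}
                }
            }
        }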

    Quote Originally Posted by afancy View Post
    For example, suppose I have a CSV file containing one line, and I run two map tasks in parallel and one reducer to count the frequency of each word:
    "Hello World"

    If the "Injector" returns the same line to both tasks, then
    Map Task 1: key, value = (Hello, 1), (World, 1)
    Map Task 2: key, value = (Hello, 1), (World, 1)

    The reducer combines these results, so the final result is (Hello, 2), (World, 2), which is obviously incorrect.

    So my question is: how is the input data partitioned and read by each task?
    In your example, Hadoop partitions the data automatically on its way from the source to the mappers; with the default text input format, each mapper receives whole lines from its own input split. This is how your example will execute if your file contains the single line "Hello World", with 2 mappers and 1 reducer:

    1. Mapper 1 receives "Hello World".
      Mapper 2 is waiting for data.
    2. Mapper 1 then emits two key-value pairs: {Hello, 1} and {World, 1}.
      Mapper 2 is still waiting; there is no second line for it to process.
    3. The reducer receives {Hello, 1}, sums the values for that key, and emits {Hello, 1}.
    4. The reducer receives {World, 1}, sums the values for that key, and emits {World, 1}.
      * Values emitted by the reducer are written to disk in a file named part-00000, where the 00000 identifies the reducer that handled the data.
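
    For illustration, a hand-written reducer for this job would look roughly like the sketch below (plain Java; the class name WordCountReducer is hypothetical, and with PDI your reducer transformation plays this role). The shuffle phase groups all {word, 1} pairs by key before the reducer runs, which is why the sums come out correct:

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // reduce() is called once per distinct word, with the list of all
        // counts emitted for that word across every mapper.
        public class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum)); // e.g. {Hello, 1}
            }
        }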


    Hope this helps,

    Jordan

  3. #3
    Join Date
    Dec 2010
    Posts
    10

    Default

    Thanks!
    So, if my understanding is correct, the integration of Pentaho with Hadoop works like the picture below. Is that right? Thanks.

    [Attachment: pentahoHadoop.jpg]

  4. #4
    Join Date
    Aug 2010
    Posts
    87

    Default

    You got it! Some of the minor details (such as how the master node is actually the central distributor/coordinator of the data) are missing, but that's the overall picture.
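
    For completeness, here is a minimal driver sketch in plain Java (class names hypothetical, matching the mapper and reducer sketches above) showing where that coordination begins: the client submits a job description, and the master node splits the input and schedules the map and reduce tasks across the cluster.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        // The submitted Job description is what the master node uses to
        // split the input and schedule mapper/reducer tasks on the workers.
        public class WordCountJob {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCountJob.class);
                job.setMapperClass(WordCountMapper.class);
                job.setReducerClass(WordCountReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setNumReduceTasks(1); // one reducer -> one part-00000 file
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }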
