How to partition the input data?



afancy
12-14-2010, 08:05 AM
I read the document "Using Hadoop with Pentaho Data Integration", which says to use the "Injector" step to pull data from the Hadoop cluster. But we know that when several tasks run in parallel on Hadoop, each task processes a chunk of the data set, not the full set. The document, however, doesn't mention anything about data partitioning, which suggests that the "Injector" returns the full input set to each task. Is that true?
For example, suppose I have a CSV file containing one line, two Map tasks running in parallel, and one reducer to count the frequency of each word.
"Hello World"

Since the "Injector" would return the same line to both tasks:
Map Task 1: key, value = (Hello, 1), (World, 1)
Map Task 2: key, value = (Hello, 1), (World, 1)

The reducer would then combine these results into (Hello, 2), (World, 2), which is obviously incorrect.

So my question is: how is the input data partitioned and read by each task?

jganoff
12-14-2010, 08:59 AM
I read the document "Using Hadoop with Pentaho Data Integration", which says to use the "Injector" step to pull data from the Hadoop cluster. But we know that when several tasks run in parallel on Hadoop, each task processes a chunk of the data set, not the full set. The document, however, doesn't mention anything about data partitioning, which suggests that the "Injector" returns the full input set to each task. Is that true?

The injector step is simply the entry point for the transformation. Transformations used as mappers and reducers are just that - mappers and reducers. We rely on Hadoop to partition the data as it would for any other job and we feed the input from Hadoop to the Injector step of the transformation. From there the transformation executes and any data received by the Output step is passed to Hadoop as the output of said mapper or reducer.
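As a rough illustration of that division of labor (a plain-Python sketch, not PDI's or Hadoop's actual API), Hadoop does the splitting, and each mapper transformation's Injector step only ever sees the records of its own split:

```python
# Sketch: Hadoop (not the Injector) partitions the input.
# Each mapper transformation receives only the records of its own split.

def hadoop_splits(lines, num_mappers):
    """Crude stand-in for Hadoop's input splitting: deal lines
    round-robin to mappers (real splits are byte-range based)."""
    splits = [[] for _ in range(num_mappers)]
    for i, line in enumerate(lines):
        splits[i % num_mappers].append(line)
    return splits

def mapper_transformation(records):
    """Stand-in for a PDI mapper transformation: the Injector step
    receives records from Hadoop, the Output step emits the pairs."""
    out = []
    for line in records:
        for word in line.split():
            out.append((word, 1))
    return out

lines = ["Hello World"]
splits = hadoop_splits(lines, num_mappers=2)
results = [mapper_transformation(s) for s in splits]
print(results)  # [[('Hello', 1), ('World', 1)], []]
```

Note that with a one-line file, the second mapper's split is simply empty; the Injector never hands the same record to two mappers.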



For example, suppose I have a CSV file containing one line, two Map tasks running in parallel, and one reducer to count the frequency of each word.
"Hello World"

Since the "Injector" would return the same line to both tasks:
Map Task 1: key, value = (Hello, 1), (World, 1)
Map Task 2: key, value = (Hello, 1), (World, 1)

The reducer would then combine these results into (Hello, 2), (World, 2), which is obviously incorrect.

So my question is: how is the input data partitioned and read by each task?

In your example, Hadoop will automatically partition the data from source to mapper, one line per record. Here is how your example will execute if your file contains the single line "Hello World", with 2 mappers and 1 reducer:


Mapper 1 receives "Hello World".
Mapper 2 is waiting for data.
Mapper 1 then emits two key-value pairs: {Hello, 1} and {World, 1}.
Mapper 2 is still waiting.
The reducer receives {Hello, 1} and {World, 1}, sums the values for each key, and emits {Hello, 1} and {World, 1}.
The emitted values from the reducer are written to disk in the file part-00000, where the number identifies the reducer that produced the output.
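The steps above can be simulated end to end (a plain-Python sketch of map, shuffle, and reduce for the word count, not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for each word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group all values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the values for each key.
    return {key: sum(values) for key, values in groups.items()}

# Mapper 1 gets "Hello World"; Mapper 2 gets no data.
emitted = map_phase("Hello World") + []
print(reduce_phase(shuffle(emitted)))  # {'Hello': 1, 'World': 1}
```

Because each line is delivered to exactly one mapper, every word is counted once; the double-counting in the earlier worry never happens.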


Hope this helps,

Jordan

afancy
12-15-2010, 10:42 AM
Thanks!
So, if my understanding is correct, the integration of Pentaho with Hadoop looks like this picture. Is that right?

(attachment 6592)

jganoff
12-15-2010, 10:45 AM
You got it! Some of the minor details (such as how the master node is actually the central distributor/coordinator of data) are missing, but that's the overall picture.