Working around Hadoop chunking



Cyril Shilnikov
05-01-2012, 09:15 AM
I have a fairly complex transformation, created with Spoon, that takes a CSV file and builds an XML file from it. I would like to run it in Hadoop, since the CSV file can get rather large.

The transform creates one XML element per column per row; these elements then need to be grouped according to the row the data was taken from. One of the things I'm doing in the transform is an XML Join to create a top-level list of elements and then join the child elements into it.
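To make that concrete, here is a rough Python sketch of the structure I'm trying to end up with (the column and element names are just placeholders; the real transform does this with Spoon steps, not code):

import csv
import io
import xml.etree.ElementTree as ET

# Toy stand-in for the real CSV; column names are made up.
CSV_DATA = """id,name,amount
1,alpha,10
2,beta,20
"""

def rows_to_xml(csv_text):
    parent = ET.Element("parentElement")
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        # One child per row, carrying the row number the XML Join
        # later matches on with an XPath like
        # /parentElement/childElement[rowNumber='N'].
        child = ET.SubElement(parent, "childElement")
        ET.SubElement(child, "rowNumber").text = str(row_number)
        # One element per column, grouped under that row's child.
        for column, value in row.items():
            ET.SubElement(child, column).text = value
    return parent

print(ET.tostring(rows_to_xml(CSV_DATA), encoding="unicode"))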

It works swimmingly until the file gets bigger than 1000 lines, at which point I get a Kettle XML exception with the message "XPath statement returned no result /parentElement/childElement[rowNumber='1000']". The transformation also works fine if I forgo the XML Join step and just output the flat list of child elements (which, even when sorted, is not what I'm looking for).

What confuses me is that if Hadoop is processing this transformation in chunks of 1000 lines, wouldn't it do so for both branches of the transform that lead up to the join? Is there a boundary problem here?
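For what it's worth, here is a toy Python illustration of the boundary problem I'm imagining, where the two branches end up seeing slightly different 1000-row chunks (the chunk size and the off-by-one are pure guesses on my part, not anything I've confirmed about how the splitting actually works):

ROWS = list(range(1, 1501))
CHUNK = 1000

# Branch A builds the top-level parent list from its chunk...
parent_rows = ROWS[:CHUNK]       # rows 1..1000

# ...while branch B's chunk ends one row earlier at the boundary.
child_rows = ROWS[:CHUNK - 1]    # rows 1..999

children = {r: "<childElement rowNumber='%d'/>" % r for r in child_rows}

for r in parent_rows:
    if r not in children:
        # Same symptom as the Kettle exception: the XPath
        # /parentElement/childElement[rowNumber='1000'] matches nothing.
        print("no child element for rowNumber=%d" % r)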

Thank you,
Cyril