Hadoop TFI - streaming or dumping/extracting from temp files



jtcornelius
07-01-2010, 02:49 PM
First question brought up in the Beta Kickoff Meeting
(Daniel, pardon me for butchering your question)

Does the Hadoop Text File Input (TFI) step stream data from HDFS or use the standard VFS process of writing data to text files and extracting from there?

Hopefully I got that close to correct :)

-Jake

cboyden
07-01-2010, 02:56 PM
The Hadoop Text File In/Output steps as well as the Hadoop Copy Files job entry use VFS to access files. We have submitted a patch to the Hadoop project for the HDFS / VFS driver.

Please watch it and vote for it!
https://issues.apache.org/jira/browse/HDFS-1213

-Curtis

jdixon
07-01-2010, 03:05 PM
I think the question is really about streaming vs temp files.

Does our VFS implementation first copy the HDFS file to a temp file and then read from there? I think this is Daniel's question.

James

mdamour
07-01-2010, 03:44 PM
The HDFS VFS driver should be streaming the data. The code I put in place to handle this in the driver is:

    protected InputStream doGetInputStream() throws Exception {
        // Open the HDFS file and hand back its stream directly; no local temp copy is made here.
        FSDataInputStream in = hdfs.open(new Path(getName().getPath()));
        return in;
    }

This is never invoked directly through a user-facing API, but it is eventually called when you do something like this:

file.getContent().getInputStream();

Hope this helps.
-Mike

DEinspanjer
07-20-2010, 01:03 PM
Cool. The streaming bit is important because some other VFS engines, such as gzip, tar and zip, extract the contents to the tmp dir and then read the contents from there.

That would be bad in the Hadoop world where you might be working on a file that is larger than your local disk space.
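
To make the difference concrete, here is a minimal sketch in plain Java. It does not use Hadoop or VFS at all; the class name StreamVsTempFile and both method names are hypothetical, and a ByteArrayInputStream stands in for whatever stream the driver would return. The first method reads rows straight off the stream, while the second copies everything to the local temp dir before reading, which is the pattern that breaks down once the file is larger than local disk.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamVsTempFile {

    // Streaming: lines are consumed directly from the source stream,
    // so nothing is written to local disk.
    public static long countLinesStreaming(InputStream source) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(source, StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    // Temp-file approach: the whole source is first copied into the local
    // temp dir and only then read back, so it needs as much free local
    // disk as the source is large.
    public static long countLinesViaTempFile(InputStream source) throws IOException {
        Path tmp = Files.createTempFile("extract-", ".txt");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (InputStream local = Files.newInputStream(tmp)) {
                return countLinesStreaming(local);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the real step the stream would come from the VFS driver.
        byte[] data = "a,1\nb,2\nc,3\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countLinesStreaming(new ByteArrayInputStream(data)));
        System.out.println(countLinesViaTempFile(new ByteArrayInputStream(data)));
    }
}
```

Both calls give the same answer for a small input; the only difference is whether the data touches local disk on the way through.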

kobeli
08-17-2010, 09:46 AM
Thanks, I know how to do it now!