S3 CSV Input - best way to read all files in an S3 folder (EMR files)

09-19-2012, 10:07 AM
I am using Amazon's EMR (Elastic MapReduce) for my Hadoop jobs. The output files from the Hadoop/EMR jobs are stored in S3 (since the nodes are released and the HDFS files are written to S3). The files created are named part-r-00000, part-r-00001, etc. (Currently I'm running 100 reduce tasks, so I have 100 files to process in Kettle).

I have not been able to get the "S3 CSV Input" step to read all the files in the folder by listing just the folder name (i.e. <bucketname>/myfolder/*). I have to specify each file (i.e. <bucketname>/myfolder/part-r-00000).
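Since the reducer outputs follow a fixed naming pattern, the per-file list the step needs could be generated rather than typed by hand. A minimal Python sketch, assuming 100 reduce tasks and the five-digit zero-padded suffix shown above; `my-bucket` and `part_file_keys` are placeholder names for illustration, not part of Kettle or EMR:

```python
def part_file_keys(bucket, folder, num_reducers):
    """Build one S3 path per reducer output file (part-r-00000, part-r-00001, ...)."""
    # Hypothetical helper: assumes every reducer wrote a file and none are missing.
    return [f"{bucket}/{folder}/part-r-{i:05d}" for i in range(num_reducers)]


# "my-bucket" is a placeholder for the real bucket name.
keys = part_file_keys("my-bucket", "myfolder", 100)
print(keys[0])
print(keys[-1])
```

The resulting list could then be fed to the step one file name at a time. This only works as long as the job really does write one file per reducer.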

I can run an "s3cmd" to pull down the files and use the "Text File Input" step to read all the files, but I would like to be able to read directly from S3.

I guess my questions are:

Does "S3 CSV Input" support folder-level processing?

Is there another option I should be looking at to process the S3 EMR files without using the "s3cmd get"?

02-01-2018, 12:00 PM
I believe you can work around this by managing the file list yourself.
I ran into the same problem and decided to change the flow a bit.

- I used a table in the database that stores the name of each file I send to S3 (file name, id and status (processing or not))

- I created a job that calls 2 transformations. The first reads that table to get the file names, and the job loops over them, one file per pass.

- The second transformation uses S3 CSV Input and receives the file name as a parameter. Every time a file is processed, I change its status to processed.

I do not know if this is the best method, but it solved my problem.
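The tracking-table loop described above can be sketched outside Kettle as follows. This is only an illustration in Python with an in-memory SQLite database; the table name `s3_files`, its columns, the status values and the `process_pending` function are all assumptions, and the callback stands in for the transformation that runs S3 CSV Input on one file:

```python
import sqlite3


def process_pending(conn, process_file):
    """Call process_file for each pending file, then mark that file as processed."""
    pending = conn.execute(
        "SELECT id, file_name FROM s3_files WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for file_id, file_name in pending:
        process_file(file_name)  # stands in for the S3 CSV Input transformation
        conn.execute(
            "UPDATE s3_files SET status = 'processed' WHERE id = ?", (file_id,)
        )
        conn.commit()  # commit per file so a failure keeps earlier progress
    return len(pending)


# In-memory database standing in for the real tracking table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE s3_files (id INTEGER PRIMARY KEY, file_name TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO s3_files (file_name, status) VALUES (?, 'pending')",
    [("myfolder/part-r-00000",), ("myfolder/part-r-00001",)],
)

handled = []
count = process_pending(conn, handled.append)
```

Updating the status after each file, rather than once at the end, means a rerun after a failure picks up only the files that were not yet processed.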