Hitachi Vantara Pentaho Community Forums

Thread: S3 CSV Input - best way to read all files in an S3 folder (EMR files)

  1. #1


    I am using Amazon's EMR (Elastic MapReduce) for my Hadoop jobs. The output files from the Hadoop/EMR jobs are stored in S3 (since the nodes are released and the HDFS files are written to S3). The files created are named part-r-00000, part-r-00001, etc. (Currently I'm running 100 reduce tasks, so I have 100 files to process in Kettle).

    I have not been able to get the "S3 CSV Input" step to read all the files in the folder by listing just the folder name (i.e. <bucketname>/myfolder/*). I have to specify each file (i.e. <bucketname>/myfolder/part-r-00000).
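    If each file does have to be named explicitly, the list of names can at least be generated rather than typed by hand. The following is only a sketch: it assumes the reducer outputs follow the part-r-NNNNN pattern described above, and the helper name part_file_paths is made up for illustration. It does not call S3 or Kettle; it only builds the path strings.

```python
def part_file_paths(bucket, folder, num_reducers):
    """Build one path per reducer output file: part-r-00000, part-r-00001, ...

    Assumption: file names are zero-padded to five digits, as in the post.
    """
    return [f"{bucket}/{folder}/part-r-{i:05d}" for i in range(num_reducers)]


# 100 reduce tasks, as in the post; the bucket name is left as a placeholder.
paths = part_file_paths("<bucketname>", "myfolder", 100)
print(paths[0])
print(paths[-1])
```

    The resulting list could then be fed, one name at a time, to whatever step reads a single S3 file.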

    I can run an "s3cmd" to pull down the files and use the "Text File Input" step to read all the files, but I would like to be able to read directly from S3.

    I guess my questions are:

    Does "S3 CSV Input" support folder-level processing?

    Is there another option I should be looking at to process the S3 EMR files without using the "s3cmd get"?

  2. #2

    I believe you can work around this by tracking the files yourself.
    I ran into the same problem and decided to change the flow a bit.

    - I used a database table that stores the name of each file I send to S3 (file name, id and status (processed or not)).

    - I created a job that calls two transformations. The first one reads that table to get the file name. (This runs as a loop.)

    - The second transformation uses S3 CSV Input and receives the file name as a parameter. Every time I process a file, I change its status to processed.

    I do not know if this is the best method, but it solved my problem.
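    The flow above can be sketched outside Kettle as follows. This is only an illustration of the control-table idea, not the poster's actual job: it uses an in-memory SQLite database, and the table name s3_files, the status values 'pending' and 'processed', and the process_one callback (standing in for the transformation that runs S3 CSV Input on one file) are all assumptions.

```python
import sqlite3


def create_control_table(conn):
    # Hypothetical schema: one row per file sent to S3.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS s3_files ("
        "id INTEGER PRIMARY KEY, "
        "file_name TEXT UNIQUE, "
        "status TEXT DEFAULT 'pending')"
    )


def register_file(conn, file_name):
    # Record a file name once; repeated registrations are ignored.
    conn.execute(
        "INSERT OR IGNORE INTO s3_files (file_name) VALUES (?)", (file_name,)
    )


def process_pending(conn, process_one):
    """Loop over unprocessed files, handle each, then mark it processed."""
    rows = conn.execute(
        "SELECT id, file_name FROM s3_files "
        "WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    done = []
    for file_id, file_name in rows:
        process_one(file_name)  # stand-in for the per-file transformation
        conn.execute(
            "UPDATE s3_files SET status = 'processed' WHERE id = ?", (file_id,)
        )
        conn.commit()
        done.append(file_name)
    return done


conn = sqlite3.connect(":memory:")
create_control_table(conn)
for i in range(3):
    register_file(conn, f"myfolder/part-r-{i:05d}")

handled = process_pending(conn, lambda name: print("processing", name))
```

    Because the status is updated after each file, a second run picks up only the files that have not been processed yet.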


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.