
Thread: Perform a foreach file in folder

  1. #1
    Join Date
    Jun 2012
    Posts
    11

    Question Perform a foreach file in folder

    Dear all,

    I have a job that for now does:
    (Start) -----> (load settings) -----> (Execute transformation X)

    X simply loads a CSV file into a table. Now I would like to implement something a little more flexible: I want to be able to load multiple files at once (file name pattern FILENAME_YYYYMMDDHMS.csv) by looping over the files found in the input folder.
    The problem is that Kettle does not seem to have a looping step (I am used to the loop container in SSIS ^^). I searched the internet and found mentions of the "Get File Names" step, but I have no idea how to use it.

    Could you please explain to me conceptually how I can implement what I am after?
    PS: I would like to loop at the job level. Would that be possible somehow?

    Guys thanks so much in advance for any input

    Miloud B.

  2. #2
    Join Date
    Jun 2012
    Posts
    1,475

    Default

    There is a sample job provided (jobs/process all tables) that shows you how to approach your task.
    It's about tables, but once you have understood the concept of passing results between transformations, you will be able to come up with a file-oriented solution by yourself.
    At least it should give you a start.


    Revisiting:

    You wrote "for now", but just in case there will not be much more, you could do with a transformation and three steps only:


    • Place steps "Get File Names", "CSV Input" and "Table Output" in an empty transformation
    • Add hops and disable the hop to "CSV Input"
    • Configure all three steps properly
    • Enable the hop to "CSV Input" and edit the "CSV Input" step
    • Spend a thought on what became of the previously configured filename
    • Choose "filename" as the filename field
    • Light a candle for the designer of this feature
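As a rough analogy only (this is Python, not Kettle), the three-step flow above can be sketched as a chain of generators: one emits a row per file with a "filename" field, the next reads whichever file each incoming row names, and the last writes every record to a table. The function names, the SQLite table `target`, and its columns `id` and `name` are assumptions made up for the illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def get_file_names(folder: Path, pattern: str):
    """Rough analogue of "Get File Names": one row per file, with a "filename" field."""
    for path in sorted(folder.glob(pattern)):
        yield {"filename": str(path)}


def csv_input(rows):
    """Rough analogue of "CSV Input" taking the file name from each incoming row."""
    for row in rows:
        with open(row["filename"], newline="") as handle:
            yield from csv.DictReader(handle)


def table_output(records, conn: sqlite3.Connection) -> int:
    """Rough analogue of "Table Output": insert every record, return the table size."""
    # Hypothetical table and columns, chosen only for this sketch.
    conn.execute("CREATE TABLE IF NOT EXISTS target (id TEXT, name TEXT)")
    conn.executemany("INSERT INTO target (id, name) VALUES (:id, :name)", records)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "FILENAME_20120617120000.csv").write_text("id,name\n1,a\n2,b\n")
        (folder / "FILENAME_20120618120000.csv").write_text("id,name\n3,c\n")
        conn = sqlite3.connect(":memory:")
        rows = get_file_names(folder, "FILENAME_*.csv")
        print(table_output(csv_input(rows), conn))
```

The point of the analogy is the same as the "filename field" bullet: once rows flow into the reader, the file to open comes from the row, not from a fixed setting in the reader itself.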
    Last edited by marabu; 06-17-2012 at 01:21 PM.
    pdi-ce-4.3.0-stable
    OpenJDK IcedTea 2.3.7 (7u21)
    ubuntu 12.04 LTS (x86_64)

  3. #3
    Join Date
    Feb 2011
    Posts
    545

    Default

    Wouldn't simply checking "Execute for every input row" in the Job/Transformation entry's properties do the job? It's on the Advanced tab.
    Twitter / Google+
    PDI 4.4.0-stable / PostgreSQL 8.2.6 / MS SQL 2000
    Windows XP

  4. #4
    Join Date
    Jun 2012
    Posts
    11

    Default

    Dear guys,

    I have carefully read what you both said, but I have to admit I do not get it at all. Maybe that is because I lack a grasp of Kettle and am basically an SSIS guy ^^.
    All I want is to give a transformation a bunch of files to load, one at a time, by looping in the main job before calling the transformation, so that things stay structured. In programming terms it would look like this:

    for each f in File.matches("xxxx.csv") {    // some pattern
        MyTransformation.LoadFile(f)
        MyTransformation.ArchiveFile(f)         // this is a simple move to a folder
    }

    I can't see how your explanations map onto this. In my "coding style" example, no matter what MyTransformation does, the parent context treats it as a black box that takes a CSV file as input. With such a design, if I need to do more with my files in the future, I just edit the steps inside MyTransformation.

    Am I going the wrong way?

    Thanks again for your patience and your efforts

    Miloud Bel

  5. #5
    Join Date
    Feb 2011
    Posts
    545

    Default

    I think what I said does fit what you want, I just failed to explain it well =) Here, a pic:


    The first transformation mainly lists the files in a folder and sends them to a "Copy rows to result" step. The job that follows, which has the advanced setting "Execute for every input row" enabled, does all kinds of things, like reading files, inserting into databases, zipping the file, etc.

    So every file listed by the transformation is a row, a record. For each row sent from the transformation, the job runs once. If there are 10 files in the folder watched by the transformation, the job will run 10 times. Get it now?
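To connect this to the for-each pseudocode earlier in the thread, here is a minimal Python sketch of the same idea, not anything Kettle actually runs. `list_files` stands in for the listing transformation with "Copy rows to result", and `run_job` stands in for the job marked "Execute for every input row". Both names, the `filename` key, and the archive-by-move behaviour are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def list_files(folder: Path, pattern: str) -> list[dict]:
    """Stand-in for the listing transformation: one result row per matching file."""
    return [{"filename": str(p)} for p in sorted(folder.glob(pattern))]


def run_job(row: dict, archive: Path) -> str:
    """Stand-in for the per-row job: 'load' the file, then move it to the archive."""
    src = Path(row["filename"])
    line_count = len(src.read_text().splitlines())  # placeholder for the real load
    shutil.move(str(src), str(archive / src.name))
    return f"{src.name}: {line_count} lines"


def process_folder(folder: Path, archive: Path, pattern: str = "*.csv") -> list[str]:
    # "Execute for every input row": the job is called once for each result row.
    return [run_job(row, archive) for row in list_files(folder, pattern)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        inbox, archive = Path(tmp, "in"), Path(tmp, "archive")
        inbox.mkdir()
        archive.mkdir()
        for day in ("17", "18", "19"):
            (inbox / f"FILENAME_201206{day}120000.csv").write_text("id,name\n1,a\n")
        for message in process_folder(inbox, archive):
            print(message)
```

The caller never looks inside `run_job`, which is the black-box property asked for above: changing what happens to each file means editing only the job.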

  6. #6
    Join Date
    Jun 2012
    Posts
    11

    Default

    Me loves pictures
    Thanks so much, I got it!

    Edit:

    Just one last question: how do you pass rows from the "list arquivos" transformation to the last job there?

    Found it: copy rows to result
    Thanks for everything, guys
    Last edited by Miloud; 06-20-2012 at 03:42 PM.

  7. #7

    Default

    Miloud,

    Kettle does support "looping through files" (I am an ex-SSISer, so I know the feature you are referring to). See the following screenshot for an example using the Text File Input step:
    Text_File_input.jpg
    Caution! In regex terms, ".*" does not mean "any extension"; it means "any character, repeated as many times as needed" (including zero times).
    So the regex pattern in the example would match all of the following:
    c:\somedirectory\filepattern001.txt
    c:\somedirectory\filepattern002.txt
    c:\somedirectory\filepatterns
    c:\somedirectory\filepattern
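The screenshot is not reproduced here, so as an assumption the sketch below uses `filepattern.*` as the wildcard and tests it against bare file names rather than full paths. It also assumes the whole name must match the pattern, which is what `re.fullmatch` does in Python. The stricter pattern is only an example of how to narrow the match to numbered .txt files.

```python
import re

# File names taken from the list above, plus one that should never match.
names = [
    "filepattern001.txt",
    "filepattern002.txt",
    "filepatterns",
    "filepattern",
    "otherfile.txt",
]


def matching(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates whose entire name matches the regex."""
    return [name for name in candidates if re.fullmatch(pattern, name)]


# ".*" accepts any run of characters, including an empty one.
loose = matching(r"filepattern.*", names)

# Require at least one digit and a literal ".txt" extension instead.
strict = matching(r"filepattern\d+\.txt", names)

print(loose)   # the first four names
print(strict)  # only the two numbered .txt files
```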

    Kettle will take all the files in the directory that match the regex and put all their records into one big dataset.
    For flat files: make sure your files all have the same structure.
    For XML files this restriction is not a problem: fields that are not in the definition are dropped, and fields that are defined but missing from a file are set to NULL (highly flexible stuff, that is).

    More on regex-patterns: http://www.regular-expressions.info/reference.html
    More on the Kettle "Text file input" step: http://wiki.pentaho.com/display/EAI/Text+File+Input

    Kind regards,
    Cedric

    Self proclaimed Data Ninja
    Last edited by cedricdevroey; 06-20-2012 at 04:19 PM. Reason: additions

  8. #8
    Join Date
    Feb 2011
    Posts
    545

    Default

    Cedric, upgrade your title to Data & Understanding Ninja =p I really didn't realize that was what Miloud was talking about! I thought his problem was AFTER listing the files, not WITH listing the files...
