Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: how to loop through a directory, executing a transform once for each file

  1. #1
    Join Date
    Nov 2007
    Posts
    11

    Default how to loop through a directory, executing a transform once for each file

    I am guessing this one should be easy, but I'm having a hard time approaching it.

    I need to run a pretty simple transformation that uses an XML source file. At the beginning of the Transform I reference the XML file 3 times to get different level information and then use a Merge Join to get these parent and child records together to write to a relational target.

    I have a directory that has 707 xml files that I need to run through.

    I tried referring to the directory instead of the individual files in the XML step and I get kind of a cartesian product because it uses one xml file as a parent with another XML file as a child. What I really want to do is run this transform with one file, then run it again with the next file, etc, etc, until it's done with all 707.

    I tried putting a GetFileNames / CopyRowsToResult transformation before my XML transformation but I can't get it to work right.

    First, am I approaching this correctly? and second, is there an example of a loop through a directory while running the transformation all the way through with the first file before running it again with the next file?

    Thanks so much for any help!!

    Ben.

  2. #2
    DEinspanjer Guest

    Default

    There are lots of examples of that here in the forums.
    You were on the right track. Get File names step, Copy Rows to Result, then send it to another transformation and make sure you checkmark the box on the transformation to execute once for each row.

  3. #3
    Join Date
    Nov 2007
    Posts
    11

    Default

    Ok, I am running but not writing any records, here's what I have:

    job which contains:

    start object

    transform 1 (nothing checked)

    this contains 2 steps
    1) get filenames (points to directory using wildcard)
    2) set files in result ('type of file' to general)
    this all seems to be working

    transform 2 ('copy previous results to args' and 'execute for every input row' are checked)

    this contains several steps
    1) get files from result
    2) 3 'get data from XML' objects that all have 'XML source defined in field' and 'XML source is filename' checked and filename selected from drop down for 'get XML source from a field'
    3) then join data and write to table

    The job runs but doesn't do do anything (no rows written). I'm not sending the rows to the second transform somehow.

    Help! What have I missed?

  4. #4
    DEinspanjer Guest

    Default

    You don't want the copy to args checkbox. That checkbox is for turning the rows into command line arguments instead of rows of input.

    You also don't want the "set files in result" and "get files from result" steps. Those steps are for writing and retrieving information from the log about what files have been read or written in a transformation so that you can do things like audit and e-mail status.

    You want the two steps under the Jobs category, copy rows to result and get rows from result. Then you should be set. (Don't forget to update the get data from XML when you swap out steps.)

    You should check the user guide in the docs directory of your Spoon installation. It has a section that goes through most of the steps and describes them in at least enough detail to avoid mistakes like these.

  5. #5
    Join Date
    Nov 2007
    Posts
    11

    Default

    Thanks for the quick response! I took a look again at the User Guide, it definitely helps by explaining what the objects do, but using them in context is still confusing me. . .

    I am now using the 'copy rows to result' and 'get rows from result' objects.

    Do I need to create a fieldname in the 'get rows from result' object that points to the filenames I am getting in the first transform? If so, how would I create this in that first transformation (since neither 'get file names' or 'copy rows to result' have the option to create a fieldname?

    I looked and looked for an example in the install samples directory and on the forum and couldn't find anything. Is there one you know of out there?

    Thanks for your help, DEinspanjer, it is much appreciated. . .

    Ben.

  6. #6
    DEinspanjer Guest

    Default

    In the first transformation, if you right click on the copy rows to result step, you can see what the field stream looks like by selecting the "show output fields". All of the fields listed here will be available to the next transformation's get fields from result step. Just enter the one field you need, "filename" in that step's dialog so that it can feed that field to the XML step.

    You're really close.


    When you are searching for samples and examples, also don't forget to search the wiki (wiki.pentaho.com) and these forums. Lots of examples get attached to threads and they can frequently give you ideas on how to get stuff done.

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    It's pretty much what the "run all" sample is all about: samples/jobs/run_all

    HTH,

    Matt

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.