Hitachi Vantara Pentaho Community Forums

Thread: Looping...Jobs...Transformations...General questions

  1. #1

    Default Looping...Jobs...Transformations...General questions

    New to Kettle, so please be gentle, basic questions:

    In my Kettle project I am using VFS to get a list of folders on an FTP site in one Transformation, then I want to loop through each folder to determine whether a file needs to be downloaded. I have a Get Properties Transformation that retrieves the FTP site and the folders to begin searching, along with the user/pwd and the lastDownloadDate.

    Question #1: Should I be using Get Variables / Set Variables, or Copy rows to result / Get rows from result, in the Transformations? Which is common practice?


    Question #2: When would I use "Execute for each row"? For each folder, or in the next Transformation in-line?

    I am going to try to attach the XML for the Job, please have a look and thanks for help, as I am sure there is a simple solution:

    <?xml version="1.0" encoding="UTF-8"?>
    <job-jobentries>
    <entry>
    <name>START</name>
    <description>Special entries</description>
    <type>SPECIAL</type>
    <start>Y</start>
    <dummy>N</dummy>
    <repeat>N</repeat>
    <schedulerType>0</schedulerType>
    <intervalSeconds>0</intervalSeconds>
    <intervalMinutes>60</intervalMinutes>
    <hour>12</hour>
    <minutes>0</minutes>
    <weekDay>1</weekDay>
    <DayOfMonth>1</DayOfMonth>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>13</xloc>
    <yloc>227</yloc>
    </entry>
    <entry>
    <name>Success</name>
    <description>Success</description>
    <type>SUCCESS</type>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>860</xloc>
    <yloc>259</yloc>
    </entry>
    <entry>
    <name>Get S&amp;P Properties</name>
    <description>Transformation</description>
    <type>TRANS</type>
    <filename/>
    <transname>trans_Get_Snp_Properties</transname>
    <directory>&#47;</directory>
    <arg_from_previous>Y</arg_from_previous>
    <exec_per_row>N</exec_per_row>
    <clear_rows>N</clear_rows>
    <clear_files>N</clear_files>
    <set_logfile>N</set_logfile>
    <logfile/>
    <logext/>
    <add_date>N</add_date>
    <add_time>N</add_time>
    <loglevel>Nothing</loglevel>
    <cluster>N</cluster>
    <slave_server_name/>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>372</xloc>
    <yloc>227</yloc>
    </entry>
    <entry>
    <name>IF SNP_PROPERTIES Exists</name>
    <description>Table exists</description>
    <type>TABLE_EXISTS</type>
    <tablename>SNP_PROPERTIES</tablename>
    <connection>myOracle</connection>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>174</xloc>
    <yloc>227</yloc>
    </entry>
    <entry>
    <name>Abort job 1</name>
    <description>Abort job</description>
    <type>ABORT</type>
    <message/>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>174</xloc>
    <yloc>510</yloc>
    </entry>
    <entry>
    <name>Show Error</name>
    <description>Display Msgbox Info</description>
    <type>MSGBOX_INFO</type>
    <bodymessage>Invalid Schema: mySNPConnection is missing a SNP_Properties table</bodymessage>
    <titremessage>SNP_PROPERTIES table missing</titremessage>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>174</xloc>
    <yloc>368</yloc>
    </entry>
    <entry>
    <name>Get Master Folders</name>
    <description>Transformation</description>
    <type>TRANS</type>
    <filename/>
    <transname>trans_Get_Master_Folders</transname>
    <directory>&#47;</directory>
    <arg_from_previous>Y</arg_from_previous>
    <exec_per_row>N</exec_per_row>
    <clear_rows>N</clear_rows>
    <clear_files>N</clear_files>
    <set_logfile>N</set_logfile>
    <logfile/>
    <logext/>
    <add_date>N</add_date>
    <add_time>N</add_time>
    <loglevel>Nothing</loglevel>
    <cluster>N</cluster>
    <slave_server_name/>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>585</xloc>
    <yloc>227</yloc>
    </entry>
    <entry>
    <name>trans_Get_Master_Files</name>
    <description>Transformation</description>
    <type>TRANS</type>
    <filename/>
    <transname>trans_Get_Master_Files</transname>
    <directory>&#47;</directory>
    <arg_from_previous>Y</arg_from_previous>
    <exec_per_row>Y</exec_per_row>
    <clear_rows>Y</clear_rows>
    <clear_files>Y</clear_files>
    <set_logfile>N</set_logfile>
    <logfile/>
    <logext/>
    <add_date>N</add_date>
    <add_time>N</add_time>
    <loglevel>Nothing</loglevel>
    <cluster>N</cluster>
    <slave_server_name/>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>589</xloc>
    <yloc>349</yloc>
    </entry>
    <entry>
    <name>Copy each file</name>
    <description>Copy Files</description>
    <type>COPY_FILES</type>
    <copy_empty_folders>Y</copy_empty_folders>
    <arg_from_previous>N</arg_from_previous>
    <overwrite_files>N</overwrite_files>
    <include_subfolders>N</include_subfolders>
    <remove_source_files>N</remove_source_files>
    <add_result_filesname>N</add_result_filesname>
    <destination_is_a_file>N</destination_is_a_file>
    <create_destination_folder>N</create_destination_folder>
    <fields>
    <field>
    <source_filefolder>${vfsFileName}</source_filefolder>
    <destination_filefolder>C:\</destination_filefolder>
    <wildcard>.zip</wildcard>
    </field>
    </fields>
    <parallel>N</parallel>
    <draw>Y</draw>
    <nr>0</nr>
    <xloc>589</xloc>
    <yloc>489</yloc>
    </entry>
    </job-jobentries>

  2. #2

    Default

    The basic trouble I am having is how to pass information between the jobs - do I use Copy rows to result or Set Variables? Neither seems to be working correctly. In one case I almost have it, but the datatype is wrong or the value is missing.

    Any general guidance would be appreciated.

    Also I am trying to use a Filter Rows to only work on the file(s) that I want to download. If there is an easier way, please tell me.


    Thanks in advance.

    Marc Pike

  3. #3

    Default

    Hi Marc,

    which version of PDI do you use?

    Thanks

    Rgds

    Samatar

  4. #4

    Default

    Using 3.x Release 2

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Hi Marc,

    Be a good man and reply with a real version number. Help/About or "pan /version".

    All the best,
    Matt

  6. #6

    Default

    My bad: 3.0.0 RC2

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    OK, now perhaps you can attach the job to this thread somewhere.
    The XML you posted above only contains the job entries, not the hops or anything.

    That being said, the general idea is that you copy rows to result in a transformation (one job entry).
    You can then tick a box in the next job entry to loop over these result rows. That will execute the job entry N times for N rows in the result.

    All the best,

    Matt

  8. #8

    Default

    Ok, still trying to figure this stuff out. Attached are the parent Job and the three Transformations that I have so far.

    My general problems are:

    1> When to use Get/Set Variables? I'm guessing nowhere that could be multi-threaded, correct?

    2> How does the "Execute for every input row" work? Am I calling this appropriately?

    3> Some basics with the Modified Java Script Value: I am using it to evaluate and build dates. I'm not sure whether I need to change the exit_status, or whether the last variable has to be a boolean. Should I be using this at all?

    4> Had a lot of trouble using a repository. I was trying to mix this with saving the files via VFS and had all sorts of problems, so I am no longer using a repository and am saving things on the local drive. I was very confused about why things were not visible after I saved them; I lost several changes due to this. Not sure if this is a 3.0.0 bug, but it was quite frustrating.

    5> Am I using the correct logic in using Copy Rows to Result in one transformation and then Get Rows from Result in the next Transformation?

    My basic case is this: I want to look in a folder on an FTP site, compare the date of each folder to my lastDownloadDate, which I have in my database, and filter out the folders that have been processed previously. Having done that, I then need to check the files in the same way and download each file that is newer than my lastDownloadDate.

    Sorry for the lengthy questions, but this will hopefully sell our team on ETL working better than rolling our own, and I have only a few days to make the sale...

    Thanks to all for the help!

    Marc Pike
    Attached Files

  9. #9

    Default

    My newest issue is that I think I have found a way to get the list of files to be copied using a few Transformations, and my last [Copy rows to result] is working; however, I do not know how to use these file names in the calling (parent) Job.

    I know that I could save the filename to a variable in the Get Files Transformation; however, with the multi-threading going on, is that even going to work?

    Using VFS in combination with the variable(s) is very powerful, but it seems to me that the file copy would have to be a Transformation rather than a Job entry.

    What am I missing here?

    Thanks,

    Marc Pike

  10. #10
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    There are different types of objects that can be passed from one job entry to another:
    - Rows of data (Result rows)
    - Filenames (Result files)
    - various metrics (lines in, out, read, written, errors, etc)

    I'm leaving for Madrid in the morning so I'll let Samatar (the author of the Copy Files entry) or anyone else step in here.

    Personally, I think that if you have a bunch of result files and you want to convert them to result rows, you can do this with a transformation that has 2 steps: "Get files from result" --> "Copy rows to result".
    From the doc I see that the "Copy Files" entry needs 3 values per row of data : source filename, destination folder/file and a wildcard.

    HTH,

    Matt

  11. #11

    Default

    Hi,

    Matt, good luck for tomorrow. They say Madrid is a beautiful city :-)

    Marc,
    If you need to fetch the folders on an FTP site and then, for each one, download the wanted files, it's
    quite simple with the Copy Files job entry.

    The job schema is like :

    Job
    -----

    Start --> Trans A --> Copy Files job entry
    with Copy Files: check "get result from previous entry"

    Trans A
    ---------

    - Get File Names step (point it at your folder thanks to VFS) and select "get only folders"
    - Add the destination folder and wildcard (with a Java Script step, ...)
    - Send the rows to result

    You must output 3 fields:

    1) source file (added by the Get File Names step)
    2) destination folder (added by transform steps: Java Script, ...)
    3) wildcard (added by transform steps: Java Script, ...)

    I have the same process working fine at work :-)
    I.e.: fetch the folders and extract the files thanks to the Get File Names step, and then
    pass the rows to the Copy Files job entry.
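    The Java Script part of Trans A only has to add two extra fields to each row coming out of Get File Names. As a rough sketch of that logic in plain JavaScript (the destination folder and wildcard below are made-up examples, not values from this thread):

```javascript
// Sketch: build the two extra fields the Copy Files entry expects per row.
// "shortFilename" stands in for the folder name emitted by Get File Names;
// the destination root and the wildcard are hypothetical examples.
function buildCopyFields(shortFilename) {
  return {
    destination_folder: "C:/download/" + shortFilename, // one target per source folder
    wildcard: ".*\\.zip$"                               // only copy zip files
  };
}

var fields = buildCopyFields("Folder1");
console.log(fields.destination_folder); // "C:/download/Folder1"
```

    Inside PDI you would put the equivalent assignments in a Modified Java Script Value step and add destination_folder and wildcard as output fields, so each row carries the 3 values (source, destination, wildcard) the Copy Files entry needs.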

    VFS allows powerful transformations; it all depends on your needs.

    If the folders to fetch are always the same, use the "Copy Files" job entry directly, because it can
    also fetch sub-folders, ...

    Hope that helps.

    Rgds

    Samatar
    Last edited by shassan2; 11-06-2007 at 05:39 PM.

  12. #12

    Default

    How about a quickie then Matt, oops that came out wrong...

    What is the regular expression to list all .zip file(s) in a root directory, getting all file(s) in the child directories?

    I think I can simplify everything if I could simply use a Get File names and grab everything in one fell swoop.

    Have a good trip, might be going to Rome soon myself and perhaps Brussels in the near future.

    Regards,

    Marc Pike

  13. #13

    Smile

    .*zip$ = all zip
    .*txt$ = all txt
    etc.

    For example in the "copy files" job entry, if you put :

    - source file/folder = c:\temp
    - destination file/folder = c:\temp2
    - wildcard = .*zip$

    PDI will copy all zip files from c:\temp to c:\temp2.
    If you select "include subfolders", zip files in sub-folders (for example c:\temp\subfolder1) will also be copied.
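    Since these PDI wildcards are regular expressions matched against the file name, it can be worth sanity-checking a pattern before putting it in the step. A small check in plain JavaScript (JS regex syntax is close enough to Java's for these simple patterns); note that the stricter .*\.zip$ requires a literal dot, while the looser .*zip$ would also match a name that merely ends in the letters "zip":

```javascript
// Sanity-check the wildcard pattern against some sample file names.
var zipPattern = /.*\.zip$/;   // stricter than .*zip$ : requires ".zip" at the end

console.log(zipPattern.test("file_1.zip"));  // true
console.log(zipPattern.test("file_1.txt"));  // false
console.log(zipPattern.test("my-zip"));      // false (the looser .*zip$ would match this)
```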

    Rgds

    Samatar

  14. #14

    Default

    That's close, but not quite there. I am working with VFS FTP files and I do not want to copy the files down until I check their lastModifiedTime to see if it is greater than my lastDownloadTime.

    Therefore I am using Get File Names and cannot rely on Copy Files.

    If there is a way to use a regular expression to list the files within the subfolders that would rock!

    Example:


    \SomeFolder
        \Folder1
            \file_1.zip
            \file_2.zip
        \Folder2
            \file_3.zip
        \Folder3
            \file_4.zip

    In this case I only know about \SomeFolder, because the child folders are dynamic. What I need is a way to find all the zip file(s) within each of the child folders WITHOUT copying, since most of the files are over 100mb.

    Any ideas?

    Thanks in advance!

    Marc Pike

  15. #15

    Default

    - until I check their lastModifiedTime to see if it is greater than my lastDownloadTime.
    --> Touché. In that case, you need the Get File Names step :-)

    If there is a way to use a regular expression to list the files within the subfolders that would rock!

    --> I have already posted a CRQ for that
    http://jira.pentaho.org/browse/PDI-236

    In this case I only know about the \SomeFolder because the child folders are dynamic, what I need is a way to find all the zip file(s) within each of the child folders,

    --> Tell me, I hope you have a limited number of folders (Folder1...FolderN).
    If yes, you have to list all the folders in the Get File Names step, with a wildcard like .*zip$.

    PS: If we suppose that an already-downloaded file is in your destination folder, you can copy
    all the files from the source each time and NOT OVERWRITE them if the destination file exists.
    This scenario will work fine if you don't have too many files in your source.
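    For the lastModifiedTime comparison, the Get File Names step emits a last-modified date per file, so the filter boils down to a simple date comparison. A sketch of that logic in plain JavaScript (the field and variable names are assumptions taken from this thread, not exact PDI names):

```javascript
// Keep only files modified after the last download.
// "lastModifiedTime" stands for the date field from Get File Names;
// "lastDownloadDate" would come from the properties table.
function needsDownload(lastModifiedTime, lastDownloadDate) {
  return lastModifiedTime.getTime() > lastDownloadDate.getTime();
}

var modified = new Date(2007, 10, 6);      // Nov 6, 2007
var lastDownload = new Date(2007, 10, 1);  // Nov 1, 2007
console.log(needsDownload(modified, lastDownload)); // true -> fetch the file
```

    In PDI this condition would typically live in a [Filter rows] step (or a Modified Java Script Value) placed between Get File Names and Copy rows to result.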

    Rgds

    Samatar

  16. #16

    Default

    Thanks Samatar, that would be a useful feature that at least I would benefit from.

    My project basically has a Get Folders transformation, then a Get Files transformation that runs for each Get Folders occurrence.

    Another issue is how to pass the lastDownloadDate between the different transformations.

    I have a Get Properties trans that gets the FTP site, user, pwd, and master folder name along with the lastDownloadDate.

    If I do a Set Variables with this (Date) value and try to do a Get Variables in subsequent transformations, it comes back as a String variable rather than a Date.

    So I went the other way and did a Copy rows to result, and it is now a field with the correct datatype, but then I cannot use these values as variables.

    So, my next challenge was to do one or all of the following; please pick the best one:

    1> Use a Select Values to re-map to the correct datatype; then I think I used a Modified Java Script Value to do a setVariable.

    2> Instead of using a Copy rows to result, I did a Get Variables and created a new variable using a Modified Java Script Value.

    3> Used a combination of all 3

    My head is spinning here in Houston; I have been working on this stuff for so long that I cannot see straight.

    So the basic question is: how do I pass things between the transformations so that the datatype is correct and there are no threading issues?

    Thanks again for your help, and please let me know what the _experts_ do in these situations.

    Good day!

    Marc Pike

  17. #17

    Default

    Keep in mind that if you use a variable in a transformation, you must have defined this variable in a previous job entry! (Variable creation and variable use must happen in different job entries, because
    transformation steps run in multiple threads while job entries are executed one after another.)

    You need to convert to the desired format, because Set Variables will output a String.
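    If I recall correctly, PDI's JavaScript step has a str2date() helper for exactly this conversion, something like str2date(getVariable("lastDownloadDate", ""), "yyyy-MM-dd") (the format string is just an example; use whatever format your property value actually has). Outside PDI, the same conversion in plain JavaScript would look roughly like:

```javascript
// Convert a "yyyy-MM-dd" string (as returned by getVariable) back into a Date.
// The field name and format are assumptions for illustration.
function parseLastDownloadDate(value) {
  var parts = value.split("-");          // e.g. ["2007", "11", "06"]
  return new Date(Number(parts[0]),      // year
                  Number(parts[1]) - 1,  // month (JS months are 0-based)
                  Number(parts[2]));     // day
}

var d = parseLastDownloadDate("2007-11-06");
console.log(d.getFullYear()); // 2007
```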

    I think the process is too COMPLEX...

    So let me ask you some questions:
    - Is there one destination folder, or several (dynamic) ones?
    - How many source files do you deal with?

    Thanks

    Samatar

  18. #18

    Default

    Thanks again.

    The folders are probably fewer than 12, with maybe at most 31 days of files in each folder, although there may be several large files to download/copy.

    About the variables, if multiple transformations are running in different threads, are the variables unique within their own address space?

    If I have several folders and files being processed at the same time, do I run a risk in using variables?

    Also, what would you recommend for turning these string values into something that can be used as a variable inside a [Get file names]? I have tried using Javascript to do the conversion, but I think you may have given me the answer: I was trying to use that variable in the same transformation...

    I have yet to play with [Get files from result]; does that refer to the filename(s) or the actual file(s)?

    I guess it comes down to this:

    - I have to have the folder and file names in variables in order to parameterize the Get File Names, so how do I do this if multiple transformations are running in parallel?

    - I have to figure out a way to convert these string values into Date values in order to compare them using a [Filter rows], or is it possible to do it using a [Modified Java Script Value]? Since I cannot use the variables within the same transformation, do I have to add several additional transformations to accomplish this?

    Additional javascript question(s):

    How is the exit_status used in a javascript object within a transformation: CONTINUE, SKIP and CANCEL?

    Within a job, can I read the result table to get the values and filter out the older folder(s)/file(s)?

    Man, I feel close to cracking this problem, and I am learning a lot from my mistakes, so your help is very much appreciated.


    Sorry for asking so many questions, but some answers will definitely get me over the hump.

    Regards,

    Marc Pike


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.