View Full Version : Feature enhancements expected

03-08-2006, 05:21 PM
Attachment: chef_sample.jpg (http://forums.pentaho.org/archived_att/kettle/chef_sample.jpg) Hi Matt

I've been trying to evaluate Kettle to see if it can solve some of my complex data loading/transformation requirements and ran into some show-stoppers! Following are some enhancements that I feel are going to add great value.

1. In the input steps (text, excel, xbase, ...), I noticed that the rownumber does not get reset when multiple files are being read. Is it possible to have an option to reset the rownumber to 1 when the source file changes?

2. XBASE input step: Is it possible to have wildcard input files (*.dbf) from specific directories, as you have for the other input steps?

3. Table Output step: Is it possible to have an option to COMMIT only when a complete source file has been successfully processed? (Assume I am reading *.csv in the input step; then commit on every change of source file.) This could be implemented as: if the value of input column "_" changes, then COMMIT.

4. In SPOON: There is definitely a requirement to read multiple files (*.csv) from a source directory and move each processed file to a "processed" or "error" directory at the end of the transformation. This is better than having to do complex workarounds in CHEF to create a shell script that knows which files were successfully processed and which files had errors.

5. In CHEF: Is it possible to have a directory reader step? Let's say it gets the *.csv filenames. For each file found by the step, it provides the filename to a transformation as a parameter. Once the transformation is completed, I can then have a shell script move the file to a processed_files or error_files directory.

6. In CHEF: Is it possible for the SMTP step to pick up the email addresses (to, cc, bcc) from a database query? If I've got 100 jobs set up and the email addresses keep changing, managing the jobs is a nightmare.


03-08-2006, 11:57 PM
Hi Biju,

For your tracking pleasure I added the following trackers:

Change Request - [# 1697] TextFileInput: reset rownumber when a new file is processed.
Change Request - [# 1698] XBASE Input : add support for wildcards for filenames
Change Request - [# 1699] Chef: Create new "For each" job entry
Change Request - [# 1700] SMTP job entry: allow parameterising

As for the other questions:

3) The database handling will change dramatically towards the end of the year.
In the meantime, just set the commit size to 99999999 (turn off batch processing).

4) Move processed files to another directory:
- add the filename to the output rows.
- split the stream into a Select values/Unique rows combination to get the filenames.
- send the filenames to a "Copy rows to result" step.
- In Chef, send the filenames to a script that moves the files ONLY if all went well.
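The mover script at the end can stay very small, since Chef only calls it after a clean run. A minimal POSIX shell sketch (the function name and directory layout are my own, not part of Kettle):

```shell
#!/bin/sh
# move_processed: move each given file into a destination directory.
# Usage: move_processed DEST_DIR file...
# Chef would invoke this with the filenames collected by the
# "Copy rows to result" step, only when the transformation succeeded.
move_processed() {
    dest=$1
    shift
    for f in "$@"; do
        if [ -f "$f" ]; then
            mv "$f" "$dest/"
        else
            echo "skipping $f: not a regular file" >&2
        fi
    done
}
```

Anything that is not a regular file is skipped with a warning instead of aborting, so one odd entry does not strand the rest of the batch.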

Hope this helps,

03-09-2006, 01:08 AM
Hi Matt,

Thanks for all the trackers. Will keep hoping that you get time soon to work on them.

I haven't really used "Copy rows to result", as it is not clear in the documentation, nor are there any examples.

Could you explain how I can get the row (let's say the filename) from a transformation and use it in a shell script, as mentioned in your suggestion?


03-09-2006, 01:42 AM
Attachment: chef_sample.jpg (http://forums.pentaho.org/archived_att/kettle/chef_sample.jpg) Hey Biju,

Why make it so difficult?
Just process the *.csv files in a directory and, if the transformation goes well, move *.csv to another directory.
The solution I mentioned (shown in the attachment) is only needed if you're working in an async environment: when files are continuously being put into the directory.

Take care,

03-09-2006, 02:23 AM
Hi Matt,

Thanks mate, your example has made life easier.

However, a new doubt has cropped up. Yes, files are added to the directory continuously. Let's say at the time of running the transformation there are 20 files in the input directory. Are you saying script.bat will be called with all the filenames as one parameter list, or will it be called once per file?

script.bat 1.csv 2.csv 3.csv .............. 20.csv

or is it

script.bat 1.csv
script.bat 2.csv
...
script.bat 20.csv

The reason I ask is that I remember from my golden days of DOS batch programming that there is a limit to the number of parameters a shell script can receive. In any case, your solution has opened up new approaches to try out...


03-09-2006, 02:37 AM
Hi Biju,

You need to loop, use shift, and always refer to %1 (or $1 on Unix).
Something like say...

@echo off

echo Files to be moved: > C:\Temp\script.log

:LOOP
IF "%1"=="" GOTO DONE
echo %1 >> C:\Temp\script.log
SHIFT
GOTO LOOP

:DONE


Found this on the web; however, I'm not sure which versions of DOS it will work on ;-(
YMMV, no guarantees, I'm not a DOS wizard...
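For anyone running the job on a Unix box instead, the same shift-and-test-$1 idea in POSIX shell might look like this (a sketch only; the function name is mine):

```shell
#!/bin/sh
# log_files: print a header, then every filename passed on the
# command line, one per line, using the shift-and-test-$1 pattern
# that mirrors the DOS batch version above.
log_files() {
    echo "Files to be moved:"
    while [ -n "$1" ]; do
        echo "$1"
        shift
    done
}
```

In a real job you would redirect the output to a log file, just as the batch version writes to C:\Temp\script.log.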

The ForEach job entry (under investigation/construction) will make life easier for the non-script kiddies among us.

As far as timing is concerned, things are very busy around here, but I try to do a couple of feature requests here and there. However, you should not worry; things are bound to improve dramatically pretty soon :-)


03-09-2006, 03:17 AM
Fantastic. Thanks again mate.....


03-09-2006, 03:24 AM
Hey Biju & All,

PLEASE be careful when dealing with async processes.
All kinds of race conditions can pop up!
I suggest copying (FTP, SFTP, copy, whatever) the text files into the directory with an extension like .temp, and then doing a rename right after to .csv. The rename should be atomic on most OSes.
That way, you don't risk Kettle processing partial files.

If you know about these things, then fine, otherwise please say: "Yes Matt, I understand".
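The copy-then-rename trick Matt describes can be sketched in shell like this (the function and directory names are illustrative; the key point is that the final mv within one filesystem is atomic):

```shell
#!/bin/sh
# deliver: place a file into a watched directory atomically.
# The slow copy happens under a .temp extension, which the
# polling job ignores; the final rename is a single atomic
# operation, so the job never sees a half-written .csv file.
deliver() {
    src=$1
    watchdir=$2
    base=$(basename "$src")
    cp "$src" "$watchdir/$base.temp"            # partial file is invisible
    mv "$watchdir/$base.temp" "$watchdir/$base" # atomic switch to .csv
}
```

The same pattern works for FTP uploads: upload under the temporary name, then issue the rename once the transfer completes.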



03-09-2006, 03:40 AM
If you want to process all files in a directory, create a batch file (.bat or .cmd) with the content:
FOR %%I IN (c:\*.csv) do call c:\kettle\pan.bat [your parameters] %%I
[Call is needed because pan is a batch file, too.]

So you don't have a problem with parameter-count limitations (I think 10 parameters is the maximum).

As Matt mentioned above, the rename process is the easiest solution for DOS async processing.


03-09-2006, 04:24 AM
Thanks guys for the warnings.

I am currently handling the "partial file problem" as follows:

For each file in the source folder :
1. Rename the file in the source directory to ???.dat (if the rename worked, then it is a whole file, as the OS has no locks on it).

2. All renamed files are then MOVED into a "work directory" and then used by the transformation process.

3. Processed files need to be then moved into the "Processed directory" and Error files need to be moved to the "Error directory".

4. Error file log is emailed to the administrator and concerned users are alerted regarding the availability of the processed data.

With the batch scripting logic you mentioned, I guess the full cycle is now achievable.
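The four-step cycle above can be sketched as a shell driver. This is only an outline under assumed names: process_one stands in for the real pan.bat/pan.sh invocation, and the directory arguments are placeholders:

```shell
#!/bin/sh
# run_cycle SRC WORK OK ERR
# 1. Claim each *.csv by renaming it to *.dat (fails if still locked).
# 2. Move the claimed file into the work directory.
# 3. Process it (process_one is a stand-in for the real pan call).
# 4. File it under OK or ERR depending on the exit status.
run_cycle() {
    src=$1 work=$2 ok=$3 err=$4
    for f in "$src"/*.csv; do
        [ -e "$f" ] || continue                # no matches: skip the literal glob
        dat="${f%.csv}.dat"
        mv "$f" "$dat" 2>/dev/null || continue # rename failed: file still busy
        mv "$dat" "$work/"
        name=$(basename "$dat")
        if process_one "$work/$name"; then
            mv "$work/$name" "$ok/"
        else
            mv "$work/$name" "$err/"
        fi
    done
}
```

The emailing of the error log (step 4 in the list above) would hang off the ERR directory contents after the loop finishes.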


03-09-2006, 04:49 AM
No Problem.

In a few days I'll put the ForEach method into the "Job" job entry.
Two options: "Execute for every result row?" and "Copy prev. results to args."