Hitachi Vantara Pentaho Community Forums

Thread: Parallel loop execution

  1. #1

    Parallel loop execution

    Is there a way to run an "execute for every row" job in parallel? The idea is to launch several copies of a job, similar to setting the number of step copies in a transformation.



  2. #2
    DEinspanjer Guest


    This was the only way I could find to do it.
    As soon as you throw an "execute for each input row" job entry into the mix, the parallel execution breaks down.

    I hope that maybe this will work for you and I hope that maybe Matt or someone else can take a look and see if they can find a better way.
    Last edited by DEinspanjer; 08-25-2008 at 10:13 AM. Reason: Updated the zip to include the missing transformation

  3. #3
    Join Date
    Aug 2008

    Question about parallel loop example

    I downloaded the example and tried to run it, but it doesn't work. The ZIP contains only two files (example_parallel_looped_streams.kjb and log_received_rows.ktr), although the job uses two different transformations: log_received_rows and create_channeled_stream. The latter is not in the ZIP file.

    Could you send me the create_channeled_stream.ktr file? It might help me understand the example.


  4. #4


    Hi Andrés,
    I do not have the file, but the example is understandable to me without it.
    The transformation you are missing produces tuples that can be divided into 3 (or more) separate data streams. In other words, take your normal tuple-producing transformation and add one last step. This step attaches a sequence to your data: start at 1, increment by 1, up to 3 (or more), then wrap around to 1 again. This keeps the sequence rolling!
    Hope this helps...
    BTW: This construction divides the data into 3 nearly equal-sized portions. But if exactly every 6th tuple takes longer to process, all of those tuples land in the third stream, so the third consuming transformation becomes the slowest one.
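
To make the sequence idea concrete outside of Kettle, here is a minimal shell sketch. It is not the missing create_channeled_stream transformation; `add_channel` is a hypothetical helper name, and awk simply stands in for the sequence step by tagging each input row with a channel number that cycles from 1 to n.

```shell
#!/bin/sh
# Tag each input line with a channel number cycling 1..n (round robin).
# 'add_channel' is a hypothetical helper, not part of Kettle.
add_channel() {
    # $1 = number of channels
    awk -v n="$1" '{ printf "%d\t%s\n", ((NR - 1) % n) + 1, $0 }'
}

# Seven rows spread over three channels.
printf '%s\n' a b c d e f g | add_channel 3
```

With three channels, rows 3, 6, 9, ... all receive channel 3, which is why a cost that recurs on every 6th row ends up entirely in the third stream.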


  5. #5
    DEinspanjer Guest


    Quote Originally Posted by goomer View Post
    create_channeled_stream -this file isn't in ZIP file
    Sorry bout that! I updated the zip. Just download it again.

  6. #6


    Thanks Deinspanjer,
    I think this will be very useful until the Kettle team adds a feature for parallel loops.


  7. #7

    Parallel using Cygwin

    I think this might be what you're talking about. We process 1033 loads hourly on our production system. Processing serially was not an option, so I developed the following script to run loaders in parallel. We can't run all our loaders at once, as that slows the system to a crawl. We run 7 wide and have raised the memory setting in pan.bat (which dramatically increased our throughput). The loader listing is preprocessed/populated in an earlier step from a database query against the r_transformation table.

    Good luck.


    # Command : sh -x <script> REPORTING REPORTING kettle_rep "mydatabase"
    # Initialize Variables
    command_name="pan.bat /rep:admin /user:admin /pass:admin /logging:error /trans:"
    max_commands=7           # "7 wide", as described above
    delay=5                  # seconds between checks; value not shown in the post (assumed)
    db_user_id=$1
    db_user_password=$2
    use_db_name=$3
    db_server=$4
    loader_listing_dir=.     # assignment not shown in the post; current directory assumed

    # Exit shell with an error message (ala perl)
    # (the function name was lost from the post; "die" is assumed)
    die() {
        echo "Usage: $0 db_user_id db_user_password use_db_name db_server"
        echo "Example: $0 REPORTING REPORTING REPORTING \"mydatabase\""
        echo "$errorMsg"
        exit 1
    }

    # go into a wait state if there are $max_commands number of processes running
    wait_max_commands() {
        while [ "$(ps -elfW | grep "pan.bat" | wc -l)" -ge "$max_commands" ]
        do
            echo "sleeezzzping"
            sleep "$delay"
        done
    }

    #Shell starts here

    if [ $# -lt 4 ]; then
        errorMsg="Four command line parameters required"
        die
    fi

    #check for existence of the directory passed from the command line
    if [ ! -d "$loader_listing_dir" ]; then
        errorMsg="$loader_listing_dir is not a valid directory"
        die
    elif [ -f "$loader_listing_dir/loader_listing.txt" ]; then
        rm "$loader_listing_dir/loader_listing.txt"
    fi

    # builds the where clause for the loader list
    # (the code that sets SQL_WHERE was not included in the post)

    # if the SQL_WHERE is not empty then process the files
    if [ "$SQL_WHERE" != "" ]; then

        sqlcmd.exe -U "$db_user_id" -P "$db_user_password" -d "$use_db_name" -S "$db_server" -h -1 -W << @@@ > "$loader_listing_dir/loader_listing.txt"
    Set NoCount on
    select name
    from [kettle_rep].[dbo].[r_transformation]
    order by 1
    @@@

        echo "IMPLEMENT_DELETES" >> "$loader_listing_dir/loader_listing.txt"

        dos2unix "$loader_listing_dir/loader_listing.txt"
        files=$(cat "$loader_listing_dir/loader_listing.txt")

        #Loop through the files and spawn each with the pan command in background
        #The wait_max_commands will sleep when max_commands are running
        for file in $files
        do
            wait_max_commands
            ./$command_name$file &
        done

        #Do not proceed until the last pan is completed
        while [ "$(ps -elfW | grep "pan.bat" | wc -l)" -ge 1 ]
        do
            echo "waiting for pan.bat to complete"
            sleep "$delay"
        done
    else
        #Return a failed status
        errorMsg="Four command line parameters required"
        die
    fi

    #Return a success status
    exit 0
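
As a possible alternative to polling ps at the end of the script above, the shell's own `wait` builtin can block on the background children directly and also report how many of them failed. This is only a sketch: `run_all` is a hypothetical helper, the commands are stand-ins for the pan.bat calls, and whether pan.bat's exit status propagates cleanly under Cygwin would need checking.

```shell
#!/bin/sh
# Sketch only: 'run_all' is a hypothetical helper; the commands passed in
# are stand-ins for the pan.bat invocations.
# Runs every argument as a background command, waits for all of them,
# then prints how many exited with a non-zero status.
run_all() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &          # spawn in the background
        pids="$pids $!"         # remember the child's PID
    done
    failed=0
    for pid in $pids; do
        # wait <pid> blocks until that child exits and returns its status
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}

run_all "true" "false" "exit 3"   # prints 2
```

Unlike counting ps lines, this only tracks processes started by the current shell, so an unrelated pan.bat run on the same machine would not hold it up.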

  8. #8


    Yeah, well, that is an option: replace the Kettle job with custom bash or Perl threading code. But for Kettle's glory, it would be good to be able to do the same in a PDI job.

    Thanks for the bash example.

  9. #9

    Wide-ness

    I did try to work within PDI, but I need to set how many parallel executions I want to have running. With the 3.1 release you can have a job run multiple transformations in parallel, but you can't easily control how many: 3 wide is a different job from 4 wide, and so on. I know the script is a bit big, but it does give the flexibility to control the jobs, transformations, and system load.
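
For readers who only need the adjustable width, a shorter sketch is possible where xargs supports the `-P` option (GNU and BSD xargs do; it is not required by POSIX). `run_wide` is a hypothetical name, and `echo` stands in for the pan.bat command line.

```shell
#!/bin/sh
# Sketch: the first argument is the width, the rest is the command to run.
# 'run_wide' is a hypothetical helper; echo stands in for the pan.bat call.
run_wide() {
    width=$1; shift
    # -n 1: one input name per command; -P: at most $width commands at once
    xargs -n 1 -P "$width" "$@"
}

# Four loader names, at most three running at the same time.
printf '%s\n' load_a load_b load_c load_d | run_wide 3 echo starting
```

Changing 3 to 4 changes the width without touching anything else. The order of the output lines is not guaranteed, since the commands run concurrently.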

  10. #10
    DEinspanjer Guest


    I completely agree that the only method we've found so far for doing parallel loops in PDI is not very flexible as far as changing the number of instances. For that, currently, you need a shell script like what you have so kindly demonstrated.

    I hope that in the future, there might be a simple configuration parameter to the Job and Transformation entries on a Job allowing you to specify the number of instances to run in parallel. That would deliver the best of both worlds I believe.

    We really need to check and see if there is a JIRA for this already and make one if there isn't.

  11. #11

    Seems you already reported it on Apr 24:


  12. #12
    DEinspanjer Guest


    yay me! Now everyone who cares about it go vote for it.


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.