Hitachi Vantara Pentaho Community Forums

Thread: Parallel loop execution

  1. #1

    Question Parallel loop execution

    Hi,
    Is there a way to execute an "execute for every input row" job in parallel? The idea is to launch several copies of a job, similar to the number of step copies in a transformation...

    Thanks,

    Andrés

  2. #2
    DEinspanjer Guest

    Default

    This was the only way I could find to do it.
    As soon as you throw an "execute for each input row" job entry into the mix, the parallel execution breaks down.

    I hope that maybe this will work for you and I hope that maybe Matt or someone else can take a look and see if they can find a better way.
    Attached Files
    Last edited by DEinspanjer; 08-25-2008 at 10:13 AM. Reason: Updated the zip to include the missing transformation

  3. #3
    Join Date
    Aug 2008
    Posts
    1

    Default Question about parallel loop example

    Hi,
    I downloaded the example and tried to execute it, but it doesn't work. The ZIP contains only two files (example_parallel_looped_streams.kjb and log_received_rows.ktr), although the job uses two different transformations (log_received_rows and create_channeled_stream; the latter isn't in the ZIP file).

    Could you send me the create_channeled_stream.ktr file? Perhaps it will help me understand the example.

    Thanks
    Goomer

  4. #4

    Default

    Hi Andrés,
    I do not have the file, but the example is understandable to me without it.
    The transformation you are missing produces rows that can be divided into 3 (or more) separate data streams. In other words: take your normal row-producing transformation and append one last step that adds a sequence to the data, starting at 1, incrementing by 1, with a maximum of 3 (or more) so the sequence rolls over (1, 2, 3, 1, 2, 3, ...). Each consuming stream then filters on its own sequence value.
    Hope this helps...
    BTW: this construction divides the data into 3 nearly equal-sized portions by row count, not by workload. If, say, exactly every 6th row takes longer to process, the third consuming transformation will always be the slowest one.
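    In plain sh, the same round-robin numbering can be sketched like this (a hypothetical illustration; `channel_rows` is my name, not part of the original example):

    ```shell
    # channel_rows N: read rows on stdin and prefix each with a channel
    # number 1..N assigned round robin, mimicking the "add sequence with
    # rollover" step described above.
    channel_rows()
    {
        channels=${1:-3}
        i=0
        while IFS= read -r row; do
            i=$(( i % channels + 1 ))
            printf '%s\t%s\n' "$i" "$row"
        done
    }
    ```

    Each of the N consuming streams would then keep only the rows whose first field matches its own channel number.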

    Christoph

  5. #5
    DEinspanjer Guest

    Default

    Quote Originally Posted by goomer View Post
    create_channeled_stream -this file isn't in ZIP file
    Sorry about that! I updated the zip; just download it again.

  6. #6

    Default

    Thanks DEinspanjer,
    I think this will be very useful until the Kettle team adds a feature for parallel loops.

    Andrés

  7. #7

    Default Parallel using Cygwin

    I think this might be what you're talking about. We process 1033 loads hourly on our production system. Processing in serial was not an option, so I developed the following script to run loaders in parallel. We can't run all our loaders at once, as that slows the system to a crawl, so we process 7 wide and have increased the memory settings in pan.bat (which dramatically improved our throughput). The loader listing is preprocessed/populated in an earlier step from a database query against the r_transformation table.

    Good luck.

    Matt

    #!/bin/sh
    # Command : sh -x run_pan_loader_bat.sh REPORTING REPORTING kettle_rep "mydatabase"
    # Initialize Variables
    delay=5
    command_name="pan.bat /rep:admin /user:admin /pass:admin /logging:error /trans:"
    max_commands=7
    REPORTING_HOME="D:/REPORTING_HOME"
    loader_listing_dir="$REPORTING_HOME"

    db_user_id=$1
    db_user_password=$2
    use_db_name=$3
    db_server=$4

    # Exit shell with an error message (ala perl)
    die()
    {
    echo "Usage: $0 db_user_id db_user_password use_db_name db_server"
    echo "Example: $0 REPORTING REPORTING kettle_rep \"mydatabase\""
    echo $errorMsg
    exit 1
    }


    # go into a wait state if $max_commands processes are already running
    wait_max_commands()
    {
    while [ `ps -elfW | grep "[p]an.bat" | wc -l` -ge $max_commands ]
    do
    echo "sleeezzzping"
    sleep $delay
    done
    }

    #Shell starts here

    if [ $# -lt 4 ]; then
    errorMsg="Four command line parameters required"
    die
    fi


    #check for existence of the directory passed from the command line
    if [ ! -d "$loader_listing_dir" ]; then
    errorMsg="$loader_listing_dir is not a valid directory"
    die
    elif [ -f $loader_listing_dir/loader_listing.txt ]; then
    rm $loader_listing_dir/loader_listing.txt
    fi

    # builds the where clause for the loader list
    SQL_WHERE=`./sql_where_clause.sh`

    # if the SQL_WHERE is not empty then process the files
    if [ "$SQL_WHERE" != "" ]; then

    cd "$REPORTING_HOME/PENTAHO/PDI/252"

    sqlcmd.exe -U "$db_user_id" -P "$db_user_password" -d "$use_db_name" -S "$db_server" -h -1 -W<< @@@ > $loader_listing_dir/loader_listing.txt
    Set NoCount on
    select name
    from [kettle_rep].[dbo].[r_transformation]
    $SQL_WHERE
    order by 1
    go
    exit
    @@@

    echo "IMPLEMENT_DELETES" >> $loader_listing_dir/loader_listing.txt

    dos2unix $loader_listing_dir/loader_listing.txt
    files=`cat $loader_listing_dir/loader_listing.txt`


    #Loop through the files and spawn each with the pan command in background
    #The wait_max_commands will sleep when max_commands are running
    for file in $files
    do
    ./$command_name$file &
    wait_max_commands
    done

    #Do not proceed until the last pan is completed
    while [ `ps -elfW | grep "[p]an.bat" | wc -l` -ge 1 ]
    do
    echo "waiting for pan.bat to complete"
    sleep $delay
    done

    else

    #Return a failed status
    errorMsg="No loaders found: SQL where clause is empty"
    die
    fi

    #Return a success status
    exit 0
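    The throttling pattern in the script (launch in the background, hold off while the cap is reached) can be distilled into a small self-contained sketch. This is a simplified batch-wise variant assuming plain POSIX sh; it does not reproduce the sliding window the ps-based loop gives you:

    ```shell
    # run_throttled MAX CMD...: run the given commands in the background,
    # at most MAX at a time, waiting for each batch to finish before
    # starting the next batch.
    run_throttled()
    {
        max=$1; shift
        count=0
        for cmd in "$@"; do
            $cmd &                      # launch in background (word-split on purpose)
            count=$((count + 1))
            if [ "$count" -ge "$max" ]; then
                wait                    # block until the whole batch completes
                count=0
            fi
        done
        wait                            # wait for any remaining jobs
    }
    ```

    The batch-wise `wait` is simpler but slightly less efficient than the original's ps polling: a fast job in a batch cannot be replaced until the whole batch drains.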

  8. #8

    Default

    Yeah, well, that is an option: replace the Kettle job with custom bash or Perl threading code. But for Kettle's glory it would be better if you could do the same in a PDI job.

    Thanks for the bash example

  9. #9

    Default Wide-ness

    I did try to work within PDI, but I need to set how many parallel executions I want running. With the 3.1 release you can have a job run multiple transformations in parallel, but you can't easily control that width: a 3-wide job is a different job from a 4-wide one, and so on. I know the script is a bit big, but it does give the flexibility to control the jobs/transformations/system load.

  10. #10
    DEinspanjer Guest

    Default

    I completely agree that the only method we've found so far for doing parallel loops in PDI is not very flexible as far as changing the number of instances. For that, currently, you need a shell script like what you have so kindly demonstrated.

    I hope that in the future, there might be a simple configuration parameter to the Job and Transformation entries on a Job allowing you to specify the number of instances to run in parallel. That would deliver the best of both worlds I believe.

    We really need to check and see if there is a JIRA for this already and make one if there isn't.

  11. #11

    Default

    Deinspanjer...it seems you already reported it on Apr 24 :
    http://jira.pentaho.com/browse/PDI-1077

    Andrés

  12. #12
    DEinspanjer Guest

    Default

    yay me! Now everyone who cares about it go vote for it.


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.