Hitachi Vantara Pentaho Community Forums

Thread: Batch processing in kettle

  1. #1
    Join Date
    Jun 2013
    Posts
    10

    Batch processing in kettle

    Hi,

    I would like to know what support Kettle has for batch processing. As far as I know, Kettle supports batching only for some steps; at the transformation/job level, data is streamed from source to target. Is it possible to divide the data into batches, so that step 1 processes batch 1 and then moves on to the next batch, while batch 1 is going through step 2, and so on? This would be useful because you would not have to reprocess the entire data set after a failure; you could restart from the batch where the transformation failed.

    Thanks.

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771


    You can filter the data and pass only a bunch of records at a time.

    One option would be to use the option to "process one row at a time" in a job.
    For each row in a data grid (or file) you define the record numbers (dates? ids?) that you want to process, and then you can run the same job for each of those bunches of records.
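
    For example (just an illustration, with made-up ranges), each row of that grid could simply hold the first and last record number of one batch:

        1,20
        21,50
        51,100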

    Hope my explanation is clear enough!
    Mick

  3. #3
    Join Date
    Jun 2013
    Posts
    10


    @Mick - Processing one row at a time is useful when your result depends on the result of transforming the previous row. That's not the case here.
    In data streaming, rows are transformed one after the other; with batch processing, the expectation is that a batch is a group of rows, and the batches are processed one by one.

  4. #4
    Join Date
    Apr 2008
    Posts
    1,771


    Sorry, but obviously I did not explain myself properly.
    Processing one row at a time is done only to DEFINE which "batch" of rows to process.
    Basically the first row states "rows 1 to 20", the second row states "rows 21 to 50", and so on.
    This information is passed to the job, which executes the necessary transformation for all of those rows at once.

    Basically, your job processes a bunch of rows, using each row from a data grid (or text file, or DB table) to define which set of rows to process.
    If your job fails during the process, you can see which "batch of rows" - i.e. which row in your initial data grid - caused the failure.
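
    As a rough sketch of the driving logic only - this is not Kettle's own API, the file names and the START_ID/END_ID parameter names are made up, and the job is assumed to accept those parameters and restrict its input to that range - a small script could run the job once per batch with Kitchen and stop at the first failing range:

        import csv
        import subprocess

        KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"  # adjust to your PDI install
        JOB_FILE = "batch_job.kjb"                             # hypothetical job taking START_ID/END_ID

        # batches.csv holds one range per line, e.g. "1,20" then "21,50" and so on
        with open("batches.csv", newline="") as f:
            batches = [row for row in csv.reader(f) if row]

        for start, end in batches:
            # Kitchen passes named parameters to the job with -param:NAME=VALUE
            result = subprocess.run([
                KITCHEN,
                "-file=" + JOB_FILE,
                "-param:START_ID=" + start,
                "-param:END_ID=" + end,
            ])
            if result.returncode != 0:
                # a non-zero exit code shows which batch failed, so you can fix the
                # problem and restart from this range instead of reprocessing everything
                print("Batch %s-%s failed, stopping" % (start, end))
                break
            print("Batch %s-%s done" % (start, end))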

    If it's still confusing I'll try to make up a quick example for you - but I'm quite busy :-(
    Mick
