Hitachi Vantara Pentaho Community Forums

Thread: Duplicating steps for performance?

  1. #1
    Join Date
    Jul 2007
    Posts
    1,013

    Duplicating steps for performance?

    This transformation (see attached picture) is the main process in my Kettle job. When the database is empty it takes 16 hours to finish (an empty database means collecting all call data since May 1st, at an average of 4,000 calls a day).

    The four "Get" steps are necessary because even though they all get call data from the same source (an Asterisk CDR table), the queries involve completely different filters.

    What I'm wondering is if I would benefit from duplicating any of the rest of the steps, and either grouping them at some later point, or keeping four different paths that would each end on an "Insert to f_cdr".

    I will be running some tests for performance, but I was hoping that someone with knowledge of Kettle's inner workings would have an easy time shedding some light on the benefits (or not) of duplicating steps.

    Cheers!
    Last edited by tdidomenico; 11-13-2008 at 02:28 PM.

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    This sort of transformation is actually not very "legal".
    You should put the steps you now run in parallel behind each other.

    There would be no need to copy the data in that case either.

    As for your actual question: yes, you can start multiple copies of a step (right click on a step to set this).
    For database lookups for example this can be beneficial to lower average latency.
    If you are consuming a lot of CPU in a step you can start multiple copies to make use of multiple CPUs.
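
    As a rough illustration of why several copies of a latency-bound step can help, here is a minimal Python sketch. It does not use Kettle's API: `lookup`, `run_copies`, the `customer` field and the simulated 10 ms delay are all assumptions made for the example, and each worker thread merely plays the role of one step copy.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def lookup(row):
    # Hypothetical stand-in for a database lookup; the sleep simulates latency.
    time.sleep(0.01)
    return {**row, "customer": f"customer-{row['id'] % 3}"}


def run_copies(rows, copies):
    # Each worker thread plays the role of one copy of the lookup step.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(lookup, rows))


rows = [{"id": i} for i in range(20)]
result = run_copies(rows, copies=4)
print(len(result), result[5])
```

    With four workers the waits overlap, so the batch finishes sooner than it would with one. The sketch returns rows in input order only because `Executor.map` does so; it says nothing about row order in a real transformation.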

    HTH,

    Matt

  3. #3
    Join Date
    Jul 2007
    Posts
    1,013

    I see. So in order to run the steps in parallel I should be using four different transformations?

    If that were so, and I decided to run the processes one after the other, I assume I would also have to remove three of the links from the "Get Max(calldate)..." step and leave only one, to the first of the four. But since all four steps use the output of the Max(calldate) step as a query parameter, that no longer seems to work. Is there a way around it?

    Thank you!

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    "I see. So in order to run the steps in parallel I should be using four different transformations?"
    I never said such a thing. The steps run in parallel no matter what.
    What I'm saying is that the layout of all rows entering a step should be the same. Your JavaScript step breaks that rule.
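
    The "same layout" rule can be sketched as a simple check. This is only an illustration in Python, not how Kettle validates hops; `layout`, `same_layout` and the sample fields (`duration`, `billsec`) are assumptions chosen for the example.

```python
def layout(row):
    # A row's layout: ordered field names paired with their value types.
    return tuple((name, type(value).__name__) for name, value in row.items())


def same_layout(streams):
    # True when every row of every incoming stream shares one layout.
    layouts = {layout(row) for stream in streams for row in stream}
    return len(layouts) <= 1


inbound = [{"calldate": "2008-05-01", "duration": 42}]
outbound = [{"calldate": "2008-05-02", "duration": 17}]
mixed = [{"calldate": "2008-05-03", "billsec": 9}]

print(same_layout([inbound, outbound]))  # streams agree on names and types
print(same_layout([inbound, mixed]))     # second stream has a different field
```

    Streams that agree on field names, order and types can safely feed the same step; a stream with a different field, like `mixed` here, is the kind of mismatch the rule forbids.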

    Note: to start multiple identical copies of a step, right-click the step and select "Change number of copies to start...".

    Matt

  5. #5
    Join Date
    Jul 2007
    Posts
    1,013

    1) Excellent! I was hoping that everything would be running in parallel anyway (and yes, I know you didn't say what I implied, but I just assumed it so that you could correct me if I was wrong... Seemed shorter than asking!)

    2) Got the "Multiple copies" tip. Thanks!

    3) All four steps generate exactly the same row schema, from field names to data types. Does that change the scenario in any way? In case it helps, I'm attaching an XML export of the transformation.

    Many thanks for your time, Matt!
    Last edited by tdidomenico; 11-13-2008 at 02:28 PM.

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

    If all 4 steps generate the same row, there is no technical problem.
    You get 4 times as many rows as on input, but that might be the idea ;-)
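
    The row count can be checked with a small Python sketch. It assumes each input row is copied to all four "Get" steps and that each step emits one row per input row; `copy_to_all`, `make_get` and the four filter names are hypothetical, not taken from the transformation.

```python
def copy_to_all(rows, steps):
    # Every step receives its own copy of every input row; outputs are appended.
    merged = []
    for step in steps:
        merged.extend(step(dict(row)) for row in rows)
    return merged


def make_get(kind):
    # Hypothetical stand-in for one "Get" step: tags the row with its filter.
    return lambda row: {**row, "call_type": kind}


steps = [make_get(k) for k in ("inbound", "outbound", "internal", "transfer")]
rows_in = [{"max_calldate": "2008-05-01"}]
rows_out = copy_to_all(rows_in, steps)
print(len(rows_in), len(rows_out))
```

    Because all four outputs share one layout, appending them into a single stream is safe, and the merged stream holds four rows for every input row.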

    Matt
