Hitachi Vantara Pentaho Community Forums

Thread: Throttling Step

  1. #1
    Join Date
    Aug 2009
    Posts
    2

    Default Throttling Step

    Is there a way to create a 'throttling' step that limits the number of rows being processed at any one time to a specified number? In other words, if you set the limit to 100, it would still pipeline 100 rows at a time, but once that limit is reached it would hold back further rows until some of the in-flight rows had finished. A generic throttling step seems like it could be generally useful for regulating processing in a variety of areas. In my case, I want to stop the system from making too many HTTP calls in parallel. I like the fact that I can make one HTTP call per row, because my service provider probably has multiple machines handling the processing, so that likely results in a speedup. However, if I start making thousands of calls in parallel, I am afraid that my local computer may overuse resources and slow down performance locally or on the remote machines.

    The Blocking step waits until all rows have arrived before moving on, which would not work. Delay row might work if you specify milliseconds, but it seems like it would be hard to control in practice; coming up with the right delay would require a lot of experimentation. I might be able to distribute the data into several duplicate steps and then set up blocking dependencies between those steps, but that would also be very difficult to control, if it worked at all. Maybe a JavaScript step could programmatically call some Java routines to throttle the processing?
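    Something along these lines is what I have in mind. It is only a sketch, not an existing Kettle step: HttpCallThrottle, callService and doHttpGet are made-up names, and the idea is simply a java.util.concurrent.Semaphore that caps how many HTTP calls are in flight at once.

        import java.util.concurrent.Semaphore;

        // Hypothetical helper: caps the number of HTTP calls in flight at once.
        // acquire() blocks a row until a permit is free; release() returns the
        // permit when the call completes, letting the next row proceed.
        public class HttpCallThrottle {
            private final Semaphore permits;

            public HttpCallThrottle(int maxConcurrentCalls) {
                this.permits = new Semaphore(maxConcurrentCalls);
            }

            public String callService(String url) throws InterruptedException {
                permits.acquire();          // wait here once the limit is reached
                try {
                    return doHttpGet(url);  // placeholder for the real HTTP request
                } finally {
                    permits.release();      // free the slot for the next row
                }
            }

            private String doHttpGet(String url) {
                // a real version would use java.net.HttpURLConnection or similar
                return "";
            }

            public static void main(String[] args) throws InterruptedException {
                HttpCallThrottle throttle = new HttpCallThrottle(100);
                throttle.callService("http://example.com/lookup");
            }
        }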

    Thanks for your thoughts!

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    There's one under construction, but don't hold your breath until then.

    Sven

  3. #3
    Join Date
    Aug 2009
    Posts
    2

    Default Workarounds

    Ok, thanks for the update.

    For now, I am limiting the amount of input data that feeds into the system at one time and then repeating the entire transformation multiple times until all the data is processed.

    That should prevent system overload, although it will probably not pipeline as effectively.
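    In rough pseudo-form, the batching workaround looks something like this (a plain Java sketch with made-up names, not an actual transformation): only batchSize rows are fed in per pass, and the same processing is repeated over each slice until the input is exhausted.

        import java.util.Collections;
        import java.util.List;

        // Sketch of the batching workaround (hypothetical names): feed the
        // pipeline fixed-size slices instead of everything at once, so only
        // batchSize rows are ever in flight per run.
        public class BatchDriver {
            public static void main(String[] args) {
                int batchSize = 100;
                List<String> allInput = loadInput();          // stand-in for the real source
                for (int from = 0; from < allInput.size(); from += batchSize) {
                    int to = Math.min(from + batchSize, allInput.size());
                    processBatch(allInput.subList(from, to)); // one full run per slice
                }
            }

            private static List<String> loadInput() {
                return Collections.emptyList();               // placeholder input
            }

            private static void processBatch(List<String> batch) {
                // placeholder for one complete transformation run over the slice
            }
        }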

  4. #4
    DEinspanjer Guest

    Default

    This is another use-case that my UserDefinedJavaClass plug-in step will serve well.
    I'm on vacation this week and am having a hard time getting enough computer time to finish it up, but if it isn't ready by the end of the week, it should be ready next week.

    When it is released, you'll be able to write just the small piece of code necessary to handle this kind of queue-size limit without having to write a whole dedicated plug-in. I'm thinking that a secondary blocking queue will fit the bill nicely. I'll try to post back on this thread with a proof of concept.
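    In the meantime, here is the rough shape of what I mean by a secondary blocking queue, in stand-alone Java rather than actual UserDefinedJavaClass code (the class and field names are only for illustration): a bounded queue holds the rows that are in flight, so the upstream side blocks automatically once the limit is reached.

        import java.util.concurrent.ArrayBlockingQueue;
        import java.util.concurrent.BlockingQueue;

        // Plain-Java sketch of the "secondary blocking queue" idea (not UDJC code).
        // The queue holds at most MAX_IN_FLIGHT rows; put() blocks the producer
        // once the limit is reached, and take() frees a slot when a row is done.
        public class ThrottleQueueDemo {
            private static final int MAX_IN_FLIGHT = 100;
            private static final BlockingQueue<Object[]> inFlight =
                    new ArrayBlockingQueue<Object[]>(MAX_IN_FLIGHT);

            public static void main(String[] args) throws InterruptedException {
                // Consumer: stands in for the slow HTTP call made per row.
                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                Object[] row = inFlight.take(); // frees a slot
                                Thread.sleep(10);               // pretend HTTP call
                            }
                        } catch (InterruptedException ignored) {
                        }
                    }
                });
                worker.setDaemon(true);
                worker.start();

                // Producer: blocks automatically once 100 rows are waiting.
                for (int i = 0; i < 500; i++) {
                    inFlight.put(new Object[] { Integer.valueOf(i) });
                }
            }
        }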

  5. #5

    Default

    I agree that there is a need for a throttling step. I get an OutOfMemoryError exception after reading about 140k rows from 300 files. An ETL tool needs to manage its memory better if it is expected to process several hundred thousand rows (or even more!).

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I can assure you that Kettle is perfectly capable of handling billions of rows at once without breaking a sweat.
    In fact, you can only run out of memory if you explicitly load too much data into memory, for example with caching, Stream Lookup, sorting, and so on.

    And if you want to throttle throughput, just put a delay in the data stream: 10 ms per row means at most 100 rows/s, and so on.
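    To make the arithmetic concrete, here is a tiny stand-alone illustration (not Kettle code) of what a per-row delay does and the throughput cap that follows from it:

        // Illustration of the delay-per-row math (not Kettle code): sleeping
        // delayMs per row caps throughput at 1000 / delayMs rows per second.
        public class DelayThrottleDemo {
            public static void main(String[] args) throws InterruptedException {
                long delayMs = 10;                           // 10 ms per row
                long maxRowsPerSecond = 1000 / delayMs;      // = 100 rows/s
                System.out.println("Max throughput: " + maxRowsPerSecond + " rows/s");

                long start = System.currentTimeMillis();
                for (int row = 0; row < 200; row++) {
                    Thread.sleep(delayMs);                   // what the delay step does per row
                }
                long elapsed = System.currentTimeMillis() - start;
                System.out.println("200 rows took ~" + elapsed + " ms"); // roughly 2000 ms
            }
        }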
