Thread: records per sec slowly decreasing as the transformation progresses

  1. #1

    records per sec slowly decreasing as the transformation progresses

    Hi, I know other people have commented on this issue but I don't know if it was ever resolved for them.

    I have a large transformation (175 icons) that ran for 2.5 hours and converted 4 million rows of data.
    After making a few more changes, I cannot get it to run to completion.

    I notice the records per second start at around 550 r/s and slowly decrease to about 300 r/s after approximately 600,000 rows.
    Is this to be expected?
    At this point the Spoon application seems to freeze and I cannot get it to respond.
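
    The figures quoted above can be put side by side with a little arithmetic. The sketch below is only a back-of-the-envelope calculation in Java; the class and method names are made up for illustration, and it assumes the earlier successful run (4 million rows in 2.5 hours) is comparable to the current one. That run works out to roughly 444 r/s on average, which is already below the 550 r/s starting rate, so some gradual decline would be consistent with the run that completed.

```java
// Back-of-the-envelope throughput arithmetic for the figures quoted in the post.
// ThroughputEstimate is a hypothetical helper, not part of Kettle.
final class ThroughputEstimate {
    /** Average rows per second for a run that processed `rows` rows in `hours` hours. */
    static double rowsPerSecond(long rows, double hours) {
        return rows / (hours * 3600.0);
    }

    /** Hours needed to process `rows` rows at a constant rate of `rowsPerSec`. */
    static double hoursFor(long rows, double rowsPerSec) {
        return rows / rowsPerSec / 3600.0;
    }

    public static void main(String[] args) {
        // Average rate of the run that completed: 4 million rows in 2.5 hours.
        System.out.printf("average of successful run: %.0f r/s%n", rowsPerSecond(4_000_000L, 2.5));
        // Projected duration if the rate stayed at the observed start and end values.
        System.out.printf("4M rows at 550 r/s: %.2f h%n", hoursFor(4_000_000L, 550));
        System.out.printf("4M rows at 300 r/s: %.2f h%n", hoursFor(4_000_000L, 300));
    }
}
```

    At a constant 300 r/s the same 4 million rows would take about 3.7 hours, so the slowdown alone lengthens the run; the freeze is a separate symptom.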

    I am running 4.10 stable

    The input database table and the output database table(s) are on the same box, with a lot of transforming steps between them.
    Data movement is set to "Copy data to next step" for EVERY step.
    If I am running out of memory I must be very close, because I have run earlier versions successfully.

    I am just wondering if there are any tips or tricks to reduce the amount of memory used?
    Thanks, Rod

  2. #2
    Join Date: Nov 2008, Posts: 199

    Hi rdmclure,

    In a 175-step transformation, there may well be a non-trivial number of bottlenecks. To name just a few:

    - reading/writing files (I/O slows the transformation down)
    - updating tables with indexes enforced (the indexes have to be maintained continuously)
    - sorting steps (they halt the usual stream flow in order to do the sort)
    - JavaScript steps (the code is interpreted during the transformation instead of being compiled just once, during initialization)
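
    To illustrate why a sorting step interrupts the stream, here is a minimal Java sketch. It is not Kettle's implementation: the class, the `mapStep`/`sortStep` helpers, and the integer "rows" are all hypothetical, and everything is kept in memory for simplicity. It counts how many input rows a step has to pull before it can emit its first output row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of a streaming step versus a blocking (sorting) step.
final class BlockingVsStreaming {
    /** Wraps a list and counts how many rows have been pulled from it. */
    static final class CountingSource implements Iterator<Integer> {
        private final Iterator<Integer> rows;
        int pulled = 0;
        CountingSource(List<Integer> data) { this.rows = data.iterator(); }
        public boolean hasNext() { return rows.hasNext(); }
        public Integer next() { pulled++; return rows.next(); }
    }

    /** Streaming step: transforms one row at a time and buffers nothing. */
    static Iterator<Integer> mapStep(Iterator<Integer> in, Function<Integer, Integer> f) {
        return new Iterator<Integer>() {
            public boolean hasNext() { return in.hasNext(); }
            public Integer next() { return f.apply(in.next()); }
        };
    }

    /** Blocking step: reads and buffers every input row before emitting the first one. */
    static Iterator<Integer> sortStep(Iterator<Integer> in) {
        List<Integer> buffer = new ArrayList<>();
        while (in.hasNext()) {
            buffer.add(in.next());
        }
        Collections.sort(buffer);
        return buffer.iterator();
    }

    /** Rows pulled from the source by the time the step has emitted its first row. */
    static int pulledBeforeFirstOutput(boolean sort, List<Integer> data) {
        CountingSource src = new CountingSource(data);
        Iterator<Integer> out = sort ? sortStep(src) : mapStep(src, x -> x * 2);
        out.next();
        return src.pulled;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(5, 3, 9, 1, 7);
        System.out.println("streaming: " + pulledBeforeFirstOutput(false, data)); // prints "streaming: 1"
        System.out.println("sorting:   " + pulledBeforeFirstOutput(true, data));  // prints "sorting:   5"
    }
}
```

    The streaming step needs one input row to produce one output row, while the sort needs all of them, so everything downstream of a sort waits until the last upstream row has arrived.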

    Chapter 15 of Pentaho Kettle Solutions is an excellent survey of performance tuning tricks in Kettle.


    Besides, I don't know whether we can speak of best practice here, but 175 steps sounds to me like a poorly decomposed ETL process. I wonder whether you could make better use of hardware resources (memory and CPU) by breaking it into a few transformations/jobs wrapped in a main job: rows can still travel along the result without touching the disk; debugging and logging will be easier; heap/GC limits are less likely to be hit; and so on.
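
    On the heap limit: Spoon runs on a JVM, and any JVM reports its own heap counters through the standard `Runtime` API. The sketch below is a standalone Java program, not something taken from Kettle; the `HeapHeadroom` class and `usedFraction` helper are hypothetical names. It only shows how "how close am I to the limit" can be computed from those counters.

```java
// Hypothetical standalone helper; it uses only the standard java.lang.Runtime counters.
final class HeapHeadroom {
    /** Fraction of the maximum heap currently in use (0.0 = empty, 1.0 = at the limit). */
    static double usedFraction(long maxBytes, long totalBytes, long freeBytes) {
        long used = totalBytes - freeBytes; // bytes currently occupied within the allocated heap
        return (double) used / maxBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        double frac = usedFraction(rt.maxMemory(), rt.totalMemory(), rt.freeMemory());
        System.out.printf("heap in use: %.1f%% of %d MB max%n",
                frac * 100, rt.maxMemory() / (1024 * 1024));
    }
}
```

    The maximum reported here is the one set by the JVM's standard `-Xmx` option. How a given Spoon installation passes that option is not covered in this thread.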

    HTH
    Andrea Torre
    twitter: @andtorg

    join the community on ##pentaho - a freenode irc channel
