Hitachi Vantara Pentaho Community Forums

Thread: 118 rows per second = 32 min for a small text file... why?

  1. #1
    Join Date
    Nov 2013
    Posts
    13

    Default 118 rows per second = 32 min for a small text file... why?

    Hi, what's the best way to find out why Kettle is taking 32 min to do what bash grep does in about 3 seconds:

    open a file, look for any line containing the word "Canada", and write it to a new file. I'm using a Text file input step, applying a filter, and outputting to a CSV file; that's about it. I have a brand new $3000 six-core desktop with 32 GB of RAM.
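
    For reference, the whole job amounts to something like the sketch below (plain Java, made-up file names, just to show how little work is actually involved):

        import java.io.BufferedReader;
        import java.io.BufferedWriter;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Paths;

        public class GrepCanada {
            public static void main(String[] args) throws IOException {
                // Hypothetical file names; substitute your own paths.
                try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);
                     BufferedWriter out = Files.newBufferedWriter(Paths.get("canada.txt"), StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (line.contains("Canada")) {   // same substring test grep does, minus regex
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
            }
        }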

    I know I must be doing something wrong, I just don't know where to get meaningful feedback from Kettle about what it's actually doing in each step.

  2. #2
    Join Date
    Nov 2008
    Posts
    271

    Default

    Hi Pondus,
    first of all, where are the text files (both the input and the output)? Are they on the same machine where Kettle runs?

    Second, care to send a sample of the input file and the transformation/job, so that we can take a closer look?

    BR
    Andrea Torre
    twitter: @andtorg

    join the community on ##pentaho - a freenode irc channel

  3. #3
    Join Date
    Nov 2013
    Posts
    13

    Default

    I appreciate your kind offer of assistance, but I should have emphasized that my question is really more "when you run into a problem, what are the best tools for debugging it?".

    I see now that the way I phrased my question makes it too much about this particular problem rather than about a general methodology for debugging "per row" performance problems, which is really what I wanted to learn about.

  4. #4
    Join Date
    Nov 2008
    Posts
    271

    Default

    Well,
    in order to find bottlenecks, the first thing to do is isolate each step. Starting from the last one (normally some kind of output step), replace it with a Dummy step and see what happens to performance. Performance usually drops because of I/O operations, above all when the stream goes over a network.

    Besides, look at the input/output columns in the Execution Results pane: they tell you how the buffers between steps behave. For instance, if your Text file input step keeps running at its maximum output buffer capacity (the row set size, 10,000 rows by default), then your filter is too slow at consuming the buffer.
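
    If you prefer to read those counters outside Spoon, something like the following sketch works with the Kettle Java API (assuming the kettle-core/kettle-engine jars on the classpath and a hypothetical transformation file filter_canada.ktr):

        import org.pentaho.di.core.KettleEnvironment;
        import org.pentaho.di.trans.Trans;
        import org.pentaho.di.trans.TransMeta;
        import org.pentaho.di.trans.step.StepMetaDataCombi;

        public class StepStats {
            public static void main(String[] args) throws Exception {
                KettleEnvironment.init();                              // bootstrap the Kettle engine
                TransMeta meta = new TransMeta("filter_canada.ktr");   // hypothetical .ktr path
                Trans trans = new Trans(meta);

                long start = System.currentTimeMillis();
                trans.execute(null);                                   // start all step threads
                trans.waitUntilFinished();                             // block until the transformation ends
                long elapsedMs = System.currentTimeMillis() - start;

                // Per-step row counters: the same numbers the Execution Results pane shows.
                for (StepMetaDataCombi combi : trans.getSteps()) {
                    System.out.printf("%-25s read=%d written=%d errors=%d%n",
                            combi.stepname,
                            combi.step.getLinesRead(),
                            combi.step.getLinesWritten(),
                            combi.step.getErrors());
                }
                System.out.println("Elapsed ms: " + elapsedMs);
            }
        }

    Dividing each step's row count by the elapsed time gives you a rows-per-second figure per step, so you can see immediately which one is holding the others back.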

    Another way is proper monitoring: enable it in the transformation properties dialog -> Monitoring tab, then look at the performance graph generated in the Execution Results pane.
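
    If you drive the transformation from code instead of Spoon, the same switch can be flipped programmatically. This is only a sketch: the setter names and the snapshot accessor are taken from the TransMeta/Trans API as I remember it, so check them against your PDI version:

        import org.pentaho.di.core.KettleEnvironment;
        import org.pentaho.di.trans.Trans;
        import org.pentaho.di.trans.TransMeta;

        public class MonitoredRun {
            public static void main(String[] args) throws Exception {
                KettleEnvironment.init();
                TransMeta meta = new TransMeta("filter_canada.ktr");   // hypothetical file name

                // Equivalent of ticking "Enable step performance monitoring" in the
                // transformation properties -> Monitoring tab (assumed setters).
                meta.setCapturingStepPerformanceSnapShots(true);
                meta.setStepPerformanceCapturingDelay(1000L);          // one snapshot per second

                Trans trans = new Trans(meta);
                trans.execute(null);
                trans.waitUntilFinished();

                // Spoon draws its performance graph from these snapshots; here we just
                // confirm they were collected for each step.
                trans.getStepPerformanceSnapShots().forEach((stepName, snaps) ->
                        System.out.println(stepName + ": " + snaps.size() + " snapshots captured"));
            }
        }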

    Finally, a good resource for better understanding and dealing with performance issues is this book. It features a whole chapter on tuning.

    Regards
    Last edited by Ato; 01-07-2014 at 02:56 PM. Reason: fixed typo
    Andrea Torre
    twitter: @andtorg

    join the community on ##pentaho - a freenode irc channel
