Hitachi Vantara Pentaho Community Forums

Thread: Out of heap space...

  1. #1
    Join Date
    Apr 2007
    Posts
    2,009

    Default Out of heap space...

    Hi,

    I have an out-of-heap-space problem, and I don't think adding more memory will fix it.

    We have the default 512 MB at the moment.
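
    For reference, the heap ceiling can be confirmed with a quick standalone check (standard JDK, nothing Kettle-specific):

    Code:
    // Prints the heap ceiling the JVM was started with (-Xmx), in MB.
    public class HeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
        }
    }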

    The transform is selecting approximately 8,000 rows from the database. It then goes through 3 separate steps, all of which call off to a Java class, do some work, and then return an array of results.

    The third step is very slow, so it's a natural bottleneck, but that's OK - we're not too fussed about performance in this transform.

    The problem is that the transform only handles about 80 rows before blowing up, so doubling the memory (or more) isn't going to get me to the 8,000 that are required.

    Is there any way I can identify, say, row sizes within Kettle at given points between steps, or work out where the memory is being used? At least then I might be able to tune down the usage in my transform.

    I do have other transforms which do similar things, and they handle 12k+ records within minutes.

    Thanks,
    Dan

  2. #2
    Join Date
    Apr 2007
    Posts
    2,009

    Default

    I've been nosing around in JMeter and have noticed that the vast majority of the memory is used in java.util.concurrent.locks.

    I shall dig some more...

    So it's in java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, where I have 14 million objects!! Craziness.

    That's ~1,500 per row. Strange...

  3. #3
    DEinspanjer Guest

    Default

    By any chance have you turned off the thread priority management in the transformation options? That could cause an unnecessary amount of locking.

    I haven't used JMeter, but if I were in this situation, I'd take a look at running with JDK 6 and using VisualVM to profile the memory allocation stacks to find out where all those locks are coming from. That kind of profiling is very expensive, so try to trim down your transformation to work on a small subset of data or you'll be sitting around all day waiting for it to finish.
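
    If even that is too expensive, another option is to capture a heap dump at the moment memory spikes and open it in VisualVM afterwards. A sketch using the JDK 6 HotSpot diagnostic MXBean (HotSpot-specific, and it has to run inside the same JVM as the transformation; jmap can take the same dump from outside):

    Code:
    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    // Dumps all live heap objects to a file that VisualVM or JProfiler can open.
    public class DumpHeap {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // live = true forces a GC first, so the dump holds only reachable objects.
            bean.dumpHeap("kettle-heap.hprof", true);
        }
    }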

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I guess it depends on what kind of Java classes you call :-)

    We only use the concurrent classes to pass rows from one step to another. The maximum size of the buffer is controlled by the "Rowset size" parameter in the Misc Settings of your transformation.

    Barring any caching done in the steps, nothing else is kept in memory in a transformation.
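
    Conceptually, each hop behaves like a bounded producer/consumer buffer. A minimal sketch of the pattern in plain java.util.concurrent terms (my illustration of the idea, not Kettle's actual RowSet class):

    Code:
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // One hop between two step copies, modelled as a bounded buffer: the
    // upstream step puts rows in, the downstream step takes them out.
    public class HopSketch {
        public static void main(String[] args) throws InterruptedException {
            // The capacity plays the role of the "Rowset size" setting.
            final BlockingQueue<Object[]> rowSet = new ArrayBlockingQueue<Object[]>(10000);

            Thread producer = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 8000; i++) {
                            // Blocks when the buffer is full, so at most 10,000
                            // rows are ever held in memory for this hop.
                            rowSet.put(new Object[] { Integer.valueOf(i) });
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            producer.start();

            for (int i = 0; i < 8000; i++) {
                rowSet.take(); // blocks while the buffer is empty
            }
            producer.join();
        }
    }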

  5. #5
    Join Date
    Apr 2007
    Posts
    2,009

    Default

    Doh! I meant JProfiler, not JMeter.

    The rowset size is 10,000, which I guess is right.

    Step performance monitoring is on too; I guess that's on by default? I've never looked at this, so maybe I need to take a look now.

    The "manage thread priorities" option is enabled too.

    The Java classes don't do anything fancy other than call through to Salesforce. They certainly don't have anything synchronised, and don't use any static variables.

    I couldn't get the heap walker working in JProfiler, as I think I left it too late - like you say, it becomes very expensive! So I'll try that again tomorrow and see what I can find - I may try VisualVM too.

  6. #6
    Join Date
    Apr 2007
    Posts
    2,009

    Default

    What happens if the rowset size is reduced? Where are things buffered? Does it crash?

  7. #7
    Join Date
    May 2006
    Posts
    4,882

    Default

    Once running, it's not supposed to change... the rowsets are allocated at start time.

    Regards,
    Sven

  8. #8
    DEinspanjer Guest

    Default

    The short answer to your question is: if you reduce the rowset size, you'll have fewer rows of data in memory at any one time, and if you have slower steps in your transformation, the rowsets will eventually fill up and the Input steps that are generating rows will begin sleeping, waiting for the slower steps to catch up.

    If it is useful, a more technical and lengthy answer is below:

    For every connection of one copy of a step to one copy of a different step, there is a "bucket" or rowset. When the first step finishes processing a row that it received or generated, it takes that row and puts it into the output bucket. The step that it is connected to will poll that rowset to retrieve new rows upon which to perform its processing.
    If you have one Text File Input step connected with a distribute hop to an Insert step that is running in 3 copies, you will have three rowsets. The TFI will distribute the rows it generates across the three rowsets. If the Insert step is significantly slower to poll rows out of a rowset than the TFI is to push them in, then eventually each rowset will reach the size limit you have configured. When that happens for all three rowsets in this example, the TFI thread will sleep, waiting for one of the rowsets to drain so it has room to put new rows in.
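
    In queue terms, that example looks roughly like the sketch below (just the pattern, with made-up sizes, not the real step code):

    Code:
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // One Text File Input distributing round-robin to three Insert copies,
    // with one bounded rowset per copy. put() blocks whenever the chosen
    // rowset is full, which is what puts the TFI thread to sleep.
    public class DistributeSketch {
        public static void main(String[] args) throws InterruptedException {
            int copies = 3;
            List<BlockingQueue<Object[]>> rowSets = new ArrayList<BlockingQueue<Object[]>>();
            for (int c = 0; c < copies; c++) {
                rowSets.add(new ArrayBlockingQueue<Object[]>(100));

                // A slow "Insert" copy draining its own rowset.
                final BlockingQueue<Object[]> mine = rowSets.get(c);
                Thread insert = new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                mine.take();
                                Thread.sleep(5); // slow database insert
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
                insert.setDaemon(true); // let the toy exit once the TFI is done
                insert.start();
            }

            // The TFI: distribute rows round-robin, sleeping inside put()
            // whenever the target rowset is still full.
            for (int i = 0; i < 2000; i++) {
                rowSets.get(i % copies).put(new Object[] { Integer.valueOf(i) });
            }
            System.out.println("TFI finished generating");
        }
    }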

    The opposite of this scenario is one in which the first step, TFI again, is much slower than the next step, say a Calculator step running in just one copy. In this case, the TFI is putting rows into the rowset as fast as it can, but the Calculator is always emptying the rowset faster than they are going in. When this happens, if you have thread priority management turned on, the Calculator step will slow down and it will start waiting for a little while so that the TFI has enough time to put a reasonable amount of rows into the rowset. Then the Calculator will wake up and consume all those rows before sleeping again. Without this sleep, the Calculator is lock thrashing... constantly tapping the TFI on the shoulder to say, "Do you have more work for me?".
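
    The thrash is easy to picture with plain blocking queues; a toy sketch of the wait-then-drain idea (my own illustration, not Kettle's actual scheduling code):

    Code:
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // A fast consumer on a slow producer: take()-ing rows one at a time makes
    // the consumer block (enqueueing a lock waiter node) for nearly every row.
    // Waiting once and then draining whatever piled up wakes it far less often.
    public class ThrashSketch {
        public static void main(String[] args) throws InterruptedException {
            final BlockingQueue<Object[]> rowSet = new ArrayBlockingQueue<Object[]>(1000);

            Thread producer = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 2000; i++) {
                            Thread.sleep(1); // slow producer, e.g. Text File Input
                            rowSet.put(new Object[] { Integer.valueOf(i) });
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            producer.start();

            List<Object[]> batch = new ArrayList<Object[]>();
            int consumed = 0;
            while (consumed < 2000) {
                // Wait once for the first row, then drain the backlog in one go,
                // instead of blocking again after every single row.
                Object[] first = rowSet.poll(50, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.clear();
                batch.add(first);
                rowSet.drainTo(batch);
                consumed += batch.size();
            }
            producer.join();
        }
    }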

    When thinking about the right rowset size to use, you need to think about how big your rows are, i.e. how much memory they will consume. If your rowset size is huge, then you'll have a large number of those rows in memory at any given time increasing your memory requirements. If it is too small, then your steps will frequently exhaust the rowset buckets and be waiting for more rows, which can be inefficient or possibly even lead to thread starvation.
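
    To put rough numbers on that, a back-of-the-envelope bound (the per-row byte figure here is a pure assumption; a profiler gives the real one):

    Code:
    // Worst case, every hop holds up to "Rowset size" rows at once, so the
    // buffered-row memory is roughly hops * rowsetSize * avgRowBytes.
    public class RowsetBudget {
        public static void main(String[] args) {
            int hops = 3;            // e.g. input -> step 1 -> step 2 -> step 3
            int rowsetSize = 10000;  // the "Rowset size" transformation setting
            long avgRowBytes = 500;  // assumed average row footprint

            long bytes = (long) hops * rowsetSize * avgRowBytes;
            System.out.println("Worst-case buffered rows: ~"
                    + (bytes / (1024 * 1024)) + " MB");
            // 3 * 10,000 * 500 bytes is only ~14 MB, so if the heap still blows
            // up, the rows themselves (or per-step caching) are the suspects.
        }
    }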

    Matt used to have some guidelines in older versions of Kettle, but with the thread priority management, I'm not sure how applicable they are now, so I'll let him comment if he has anything to add there.

  9. #9
    Join Date
    Apr 2007
    Posts
    2,009

    Default

    OK, this is excellent - we definitely have a very slow step stuck at the end of the chain causing issues. I'll play with the rowset size and see if I can improve things.

    Thanks,
    Dan

  10. #10
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I haven't checked, Dan, but there should be an article in the PDI knowledge base on this subject. (Or arriving soon - I can't check right now.)
