Hitachi Vantara Pentaho Community Forums

Thread: Memory leak with Blocking Step or Table Output?

  1. #1
    Join Date
    Jun 2009
    Posts
    13

    Default Memory leak with Blocking Step or Table Output?

    I have a very simple transformation which reads data from the source database table (approx. 3.5M rows), spools it to a file, and then writes it to a staging table on the warehouse database. The reason for the spool step is that without it I start getting some weird database errors (I don't remember the details, but I can try removing the step if needed).

    The Cache Size on the Blocking Step is 10,000 and the Commit Size on the Table Output is 1,000.

    In any case, I watch the memory usage using JConsole as well as the console output of the job. The Table Input step takes about a minute to read the 3.5M rows, and at the end about 200MB of heap is consumed. At that point I start seeing entries for the Blocking Step and Table Output in the console - five Blocking Step entries for each Table Output entry. At the same time, memory usage starts spiking and goes all the way up to 1.5GB. By the time the Table Output step finishes, which takes about a minute, memory usage is still at 1.5GB, and the next transformation is started.

    My assumption was that for every 1,000 rows committed, the memory used for those rows would be released, so I expected to see a sawtooth chart for memory usage. That is obviously not the case, and I'm wondering if there is a bug in the Table Output step where memory is not being released.

    One of the columns I am copying is a LONGBLOB (MySQL 5.0 database), and it seems like that is the column whose memory is not being released. When I run jmap -histo on my process, I see that the top memory consumer is:

    num #instances #bytes class name
    ----------------------------------------------
    1: 35337022 850736200 [B

    I'm assuming that "[B" represents my byte array blobs; the number of instances roughly matches the number of rows, and the size of about 850MB also seems about right.
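
    As a sanity check, "[B" is just the JVM's internal name for a byte[] class - a throwaway one-liner confirms it (the class below is only an illustration, not part of my transformation):

    public class ArrayNameCheck {
        public static void main(String[] args) {
            // jmap -histo prints JVM-internal type names; a byte array shows up as "[B".
            System.out.println(byte[].class.getName()); // prints: [B
        }
    }

    So the histogram really is saying the heap is full of byte arrays, which fits the BLOB-column theory.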

    My transformation is:

    Table Input -> Blocking Step -> Table Output

    This transformation is the first one called from a job which has two other transformations that are called after this one.

    Is there something I am doing wrong, or is there a bug in Table Output when there are binary columns involved?

    Thanks,
    Kaushal

  2. #2
    Join Date
    Jun 2009
    Posts
    13

    Default

    I did some further testing, and have confirmed that the leak is definitely in the Table Output step. Additionally, the leak only seems to happen if I turn on "Use batch update for inserts".

    I removed the Blocking Step because I think that was a red herring - my original problem was that my commit size on Table Output was too large, and along with the memory leak, it caused me to run out of memory. Once I turned off the batch update flag I was able to successfully copy the 3.5M rows without the Blocking Step.

    In any case, while I can turn off batch updates to avoid the memory leak, copying the 3.5M rows then takes more than 5x as long as it did with batch updates turned on. It would be great to get a fix for the memory leak when using batch updates; my guess is that it should be fairly easy to find with the information I have provided. It's possible the leak only occurs when binary columns are copied with batch updates turned on? I have not noticed any leaks in other places where I use Table Output.

    Matt, you seem to be the superuser on this forum with respect to fixing bugs. Any chance you could look into this and let me know if this is easy to fix?

  3. #3
    Join Date
    Jun 2009
    Posts
    13

    Default

    Has anyone else experienced this?

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Sorry, if there is a leak, it's in the Caché driver. (that would not surprise me one bit)

    Please note that it is natural for the JDBC driver to keep all rows of the batch in memory until there are enough rows to send over the wire. Lowering the 10,000 to a few hundred would show a difference if that is your problem.
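
    In plain JDBC terms, the batch-update path boils down to something like the sketch below (this is not the actual Table Output code - the table, columns, connection details and the fetchNextBlob() helper are all made up for illustration). Every row you addBatch() stays referenced inside the driver until executeBatch() runs:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchInsertSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details - substitute your own warehouse URL/credentials.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://warehouse/stage", "user", "password");
            conn.setAutoCommit(false);

            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO staging_table (id, payload) VALUES (?, ?)");

            int commitSize = 1000;
            for (int i = 0; i < 3500000; i++) {
                byte[] blob = fetchNextBlob();  // hypothetical source of the LONGBLOB bytes
                ps.setLong(1, i);
                ps.setBytes(2, blob);           // the byte[] is now referenced by the batch
                ps.addBatch();                  // rows accumulate inside the driver here

                if ((i + 1) % commitSize == 0) {
                    ps.executeBatch();          // the batch goes over the wire; a well-behaved
                    conn.commit();              // driver drops its row references at this point
                }
            }
            ps.executeBatch();                  // flush any remaining rows
            conn.commit();
            ps.close();
            conn.close();
        }

        private static byte[] fetchNextBlob() {
            return new byte[0];                 // stand-in so the sketch compiles
        }
    }

    If the driver never lets go of those buffered byte[] values after executeBatch(), you get exactly the flat plateau you are describing instead of a sawtooth.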

  5. #5
    Join Date
    Jun 2009
    Posts
    13

    Default

    I actually removed the Blocking Step completely, so its cache size is irrelevant. The Commit Size on my Table Output step is 1,000 rows, which seems fairly reasonable (I would actually prefer it to be higher). My transformation at this point is very simple:

    Table Input -> Table Output

    The Input reads a set of tables from the source database, and the Output writes the data to a staging table on the warehouse.

    When I turn on the "Use batch updates for inserts" checkbox for the Table Output step, I see the memory spike as it starts writing the rows, and it never gets released.

    When I turn this checkbox off, I see the expected sawtooth graph for memory usage - it seems to correctly release the memory after each 1,000 rows is committed.

    I was using JConsole to monitor the heap memory.

    This definitely points to some sort of leak with batch updates in the Table Output step.

    I'm fairly new to Pentaho and this forum, so I apologize if this is not the correct place to post this. If there is a bug database to which I can submit this information, please point me to the right place and I will do so.

    Thanks,
    Kaushal

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    People load massive amounts of data all the time.
    So I'm fairly sure it's a problem in the Caché JDBC driver.
    I would suggest you contact them for an update to the driver.

  7. #7
    Join Date
    Jun 2009
    Posts
    13

    Default

    We're not using the Caché JDBC driver, unless you're saying that Pentaho ships with it by default. I looked in the $PDI_HOME/libext/JDBC directory and saw that the MySQL JDBC driver is mysql-connector-java-3.1.14-bin.jar, which seems pretty old. We are currently using mysql-connector-java-5.1.7-bin.jar in the rest of our app, so I'll try replacing the JDBC driver with that newer version.

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I thought you said you were loading data into Caché.

    Oh well.

  9. #9
    Join Date
    Jun 2009
    Posts
    13

    Default

    So I switched the JDBC driver to version 5.1.7 (from 3.1.14), and it seems like the memory leak has been addressed. However, this driver is even slower than 3.1.14 with batch updates turned off!

    All I did was remove the 3.1.14 jar file from $PDI_HOME/libext/JDBC and added the 5.1.7 jar in the same location. I kicked off kitchen.sh after that to test the performance.

    Am I missing something here? Is there something else I need to do as well? I can't believe that the newer driver would be SLOWER than the older one.
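
    To rule out a stale jar on the classpath, a small JDBC check like the one below (URL and credentials are placeholders, not my real settings) will report which Connector/J version actually gets loaded - I'd run it with the jars under $PDI_HOME/libext/JDBC on the classpath:

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;

    public class DriverVersionCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");   // Connector/J driver class (3.x and 5.x)
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://warehouse/stage", "user", "password");

            DatabaseMetaData md = conn.getMetaData();
            System.out.println("Driver:  " + md.getDriverName());
            System.out.println("Version: " + md.getDriverVersion()); // should report 5.1.7 if the new jar is picked up
            System.out.println("Server:  " + md.getDatabaseProductVersion());

            conn.close();
        }
    }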

  10. #10

    Default

    Last time I heard, PDI shipped with the 3.x driver because of 'issues' with 5.x.

  11. #11
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    And, I'm very sorry to say, the 5.x driver is not backward compatible.

  12. #12
    Join Date
    Jun 2009
    Posts
    13

    Default

    Ugh! I guess I'll stick with the 3.x driver with batch updates turned off.
