Hitachi Vantara Pentaho Community Forums

Thread: Processing source tables with million+ rows (memory issues)

  1. #1
    Join Date
    Jul 2010
    Posts
    10

    Default Processing source tables with million+ rows (memory issues)

    I am trying to figure out a way to process a source table (6 million+ rows) into smaller batches/groups so that we can get around memory and crashing issues within Kettle.

    We have an action/event table that represents different customer events happening across the lifecycle of a customer. The 6 million rows in that action table span across 0.5 million+ unique customers.
    We have 15 or so different transformations that consume this data about customer actions/events. If we run these transformations (from within Spoon) for the entire data set it crashes. So I wanted to request some help/guidance on how to best address this situation:

    Here is one approach we are considering:
    1. Create a customersToProcess table with two columns: customer_key and group_no/batch_no. Divide the 0.5 million customers into groups of roughly 15,000 each, run each transformation for one batch of 15,000 customers, and then repeat the run for all 36-37 batches. Why 15,000? Because in our manual testing we have consistently succeeded in loading a batch of this size. We need to process all of a customer's events together, hence batching by customers and not by events.
    Some helpful hints I found on the web are about passing parameters and looping (a rough sketch of such a driver loop follows the links below):
    http://type-exit.org/adventures-with...-for-each-row/
    http://type-exit.org/adventures-with...n-kettle-jobs/
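
    For illustration, a minimal sketch of such a driver loop as a Windows batch file. The job name, paths, and the BATCH_NO named parameter are hypothetical; the idea is that the master job reads customersToProcess and only processes customers whose batch_no matches the parameter.
    Code:
    REM driver.bat - run the master job once per customer batch (batch_no 1..37)
    cd /d "C:\pentaho\data-integration"
    FOR /L %%B IN (1,1,37) DO (
        CALL kitchen.bat /file:"C:\etl\process_customer_actions.kjb" /level:Basic "-param:BATCH_NO=%%B"
    )
    Inside the job, each transformation would then filter its input on customersToProcess.batch_no = ${BATCH_NO}.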

    Is this the best way to approach this, or do I need to be looking at it differently? If I can get some pointers I would really appreciate it.

    Another option we are considering is upgrading from a 32-bit to a 64-bit Kettle server and potentially beefing up the memory, but it might be a while until we can requisition that and get there. Meanwhile, I wanted to ask for any other possible ideas. Thanks

  2. #2

    Default

    First of all, I would suggest running the whole job with Kitchen and not with Spoon. Why do you want to run such a long-running job in the GUI?

    Spoon's purpose is to develop jobs/transformations. The GUI is IMHO not suited to running long-running jobs in a stable manner.
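
    For illustration, a rough sketch of launching a job with Kitchen from the Windows command line and keeping the log in a file (the job path and log file name are placeholders):
    Code:
    cd /d "C:\pentaho\data-integration"
    CALL kitchen.bat /file:"C:\etl\historical_load.kjb" /level:Basic > C:\etl\historical_load.log 2>&1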

    As far as I can remember there are options for the log buffers which may have a big impact on Spoon/Kitchen memory behaviour. I can't remember the exact names of those options, but I'm sure Google is your friend. Try to set a limit on the buffers; maybe this will help.

    kind regards Max

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    There was a memory leak that crept into 4.1.0 actually. It's been fixed in 4.1.1.

    Like maxh said, you can configure the logging back-end (reduce retention, time-out rows and so on) using the KETTLE_ variables (see the kettle properties editor for more information).

  4. #4
    Join Date
    Jul 2010
    Posts
    10

    Default

    Maxh and Matt:
    Thank you so much for your responses.

    I have started running transformations using Kitchen and Pan and it is definitely much faster than Spoon, so thanks for the tip (not sure why I did not use them earlier). For my test load of 15,000 random customers and 200K action records (read and transformed via 15-16 transformations), the process took approximately 25-30 minutes. So for the total population I am looking at 15-20 hours or so. While that is not a problem since this is a historical load, it would be nice to bring the number down to a few hours.

    MaxH - Would you please elaborate on what you mean by limiting the buffers?

    Matt - We are going to request a 64-bit server and try to move to 4.1.1, but meanwhile could you please elaborate on your suggestions as well?

    Meanwhile, I will try to search for performance/memory issues on the forums and google. Thanks again.

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Look among the KETTLE_ options in the kettle.properties dialog in Spoon.
    There are ways to limit the maximum number of rows stored for logging and/or specify a time-out (maximum age in minutes) for a log-row.
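
    For reference, a hedged sketch of what that could look like in kettle.properties (in the .kettle directory of the user that runs Spoon or Kitchen); the values below are purely illustrative, not recommendations:
    Code:
    # keep at most this many log lines in the central in-memory log buffer
    KETTLE_MAX_LOG_SIZE_IN_LINES=5000
    # discard buffered log lines older than this many minutes
    KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES=60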

  6. #6
    Join Date
    Jul 2010
    Posts
    10

    Default

    Matt, Thanks again.

    1. Your book indicates that these options (KETTLE_LOG_SIZE_LIMIT, KETTLE_MAX_LOG_SIZE_IN_LINES) impact Spoon execution only, and it talks about modifying the slave server's configuration file.

    I can set the parameters in Spoon for future gains when executing in Spoon, but since I want to speed up execution when using Kitchen, will it help to modify the same variables in the configuration file? If so, is the location of the file as below, or somewhere else?

    C:\Program Files\pentaho\server\data-integration-server\pentaho-solutions\system\kettle\slave-server-config.xml

    2. The historical job that calls all the transformations only uses Basic logging right now, so I am curious to see how much we will gain by changing the logging parameters.

  7. #7
    Join Date
    Jul 2010
    Posts
    10

    Default Additional stress test

    Further update to this post:

    Yesterday I increased the stress test and processed ~10% (60K customers) of our total customer data (~550K) from Kitchen (command line) on the ETL server, and ran into a memory issue.

    So I am thinking I really need to create a prioritized list of applicable items (for our ETL jobs/transformations) from Kimball's book for increasing ETL throughput. We certainly need to upgrade the server to 64-bit with more memory and optimize the source extract queries, but meanwhile, if anybody has more recommendations, I would really appreciate it. Thanks.

  8. #8
    Join Date
    Nov 2008
    Posts
    143

    Default

    Have you tried this thread?
    Sounds like it has some hints for you.

  9. #9
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    The options I mentioned will work in Spoon, Kitchen, Pan and Carte. You can indeed also configure Carte through its config XML file.
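
    For reference, a rough sketch of the corresponding section of a Carte slave server configuration file, with illustrative values only; verify the element names against the documentation for your PDI version:
    Code:
    <slave_config>
      <!-- cap the number of log lines Carte keeps in memory -->
      <max_log_lines>5000</max_log_lines>
      <!-- drop buffered log lines older than this many minutes -->
      <max_log_timeout_minutes>60</max_log_timeout_minutes>
    </slave_config>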

    Take note that these options are only a small part of controlling memory consumption. You incorrectly assume that the readers of this forum are clairvoyant, so post more information if you want to get better answers.

  10. #10

    Default

    What is your source data? File or database? If database, which type? MS SQL?

    Also, can you post the output of your log file with the error?

    We process tables with 6m + rows daily in an hour or so.

  11. #11
    Join Date
    Nov 2008
    Posts
    143

    Default

    Quote Originally Posted by benjaminleroux View Post
    What is your source data? File or database? If database, which type? MS SQL?

    Also, can you post the output of your log file with the error?

    We process tables with 6m + rows daily in an hour or so.
    Yeah, but you have quite a machine on your desk.

  12. #12
    Join Date
    Dec 2010
    Posts
    3

    Default a quick solution...

    A quick solution is to increase the amount of memory allocated to the JVM instantiated by Spoon at launch.

    Open ..\pentaho\design-tools\data-integration\Spoon.bat and modify both values: "-Xmx512m" and "-XX:MaxPermSize=512m".

    For example:

    before
    Code:
    set OPT="-Xmx512m" "-XX:MaxPermSize=512m" "-Djava.library.path=%LIBSPATH%" "-DKETTLE_HOME=%KETTLE_HOME%" "-DKETTLE_REPOSITORY=%KETTLE_REPOSITORY%" "-DKETTLE_USER=%KETTLE_USER%" "-DKETTLE_PASSWORD=%KETTLE_PASSWORD%" "-DKETTLE_PLUGIN_PACKAGES=%KETTLE_PLUGIN_PACKAGES%" "-DKETTLE_LOG_SIZE_LIMIT=%KETTLE_LOG_SIZE_LIMIT%"
    after
    Code:
    set OPT="-Xmx1024m" "-XX:MaxPermSize=1024m" "-Djava.library.path=%LIBSPATH%" "-DKETTLE_HOME=%KETTLE_HOME%" "-DKETTLE_REPOSITORY=%KETTLE_REPOSITORY%" "-DKETTLE_USER=%KETTLE_USER%" "-DKETTLE_PASSWORD=%KETTLE_PASSWORD%" "-DKETTLE_PLUGIN_PACKAGES=%KETTLE_PLUGIN_PACKAGES%" "-DKETTLE_LOG_SIZE_LIMIT=%KETTLE_LOG_SIZE_LIMIT%"
    I have only tested this in a test environment.
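
    One related note for command-line runs: Spoon.bat only sets the heap for the Spoon GUI. Kitchen.bat and Pan.bat define their own JVM options, so when the load is launched with Kitchen (as earlier in this thread) the -Xmx value would presumably need to be raised in those scripts as well.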

  13. #13
    Join Date
    Mar 2010
    Posts
    2

    Default

    I've actually been encountering a similar-sounding memory issue. In my case, I'm running lots of (fairly small) batches through a somewhat large transform using the "Execute for every input row" option within the job. This seems to cause me to run out of memory even with a few dozen small batches.

    I've spent some time troubleshooting, and discovered that even if I run a simple scenario with a job that just 1) creates some dummy batches, then 2) runs a transform for each batch (the transform doesn't do anything but log the row # to the console. No DBs involved), I'll still run out of memory if I run a lot of batches through. I almost get the sense that the various instances of prior transforms are not being "freed", but that's just my speculation. I encountered this originally in PDI 4.1.0, and also saw it in 4.1.2. Updating to the latest Java version didn't seem to help (1.6.0_24). PDI 3.2.5 runs my test job fine without exceeding ~140MB of heap, whereas PDI 4.1.2 was running out of memory even with 1 GB of heap.

    Has anyone else been seeing this problem when running a transform using "Execute for every input row"? It sounds like the original poster here wasn't able to run the large dataset even when it was broken up into smaller batches. I can't tell if there's something off with my environment, or if this is a PDI bug.

  14. #14

    Default

    just an idea....

    I noticed that you said "We have 15 or so different transformations that consume this data". I'm not sure if your transforms are chained together in a job, but I will assume that they are.

    When I have large amounts of data to pass from one transform to another, I use the "Serialize to File" and "De-serialize from file" steps to pass the data. The beauty of these steps is that they are extremely easy to use, and the "De-serialize" step includes the stream metadata, so it is easy to develop with. This is probably my #1 workaround to prevent memory errors.
