Hitachi Vantara Pentaho Community Forums

Thread: Java Heap Space OutOfMemoryError in a Sort step

  1. #1
    Join Date
    Mar 2007
    Posts
    216

    Java Heap Space OutOfMemoryError in a Sort step

    Hi,

    I have a sort step that stops after reading 97 095 675 lines.
    The error message is:
    Code:
    2007/11/26 15:03:51 - Sort rows.0 - ERROR (version 3.0.0, build 500 from 2007/11/14 14:59:11) :     at java.io.BufferedInputStream.<init>(Unknown Source)
    (...) at org.pentaho.di.trans.steps.sort.SortRows.getBuffer(SortRows.java:206)
    (...) at org.pentaho.di.trans.steps.sort.SortRows.processRow(SortRows.java:370)
    (...) at org.pentaho.di.trans.steps.sort.SortRows.run(SortRows.java:503)
    I changed the -Xmx parameter from 256 to 512M, but it stops after reading the same number of lines. The sort step has "Only pass unique rows" enabled and "Compress TMP Files" disabled. Should I change that?

    a+, =)
    -=Clément=-

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729


    Either give it even more memory or lower the amount of rows to sort at once. (in the sort step)

    Matt

  3. #3
    Join Date
    May 2006
    Posts
    4,882


    97.095.675 lines <sigh> ... although there's no hardcoded size limitation in the sort step, I think you hit your memory constraint. You could try playing with the sort size, but I would try to do the sorting via the database if possible.

    Regards,
    Sven

  4. #4
    Join Date
    Mar 2007
    Posts
    216


    Hi,

    I have changed 512m to 1024m. I also changed the java.io temp directory to a location where it can have its required 10GB. I changed pagefile.sys (the Windows swap file) to be 3.4GB. I still have the error. I will now try with 5 000 instead of 10 000 'rows in memory' in the Sort step. I will see the result tomorrow, as it takes 2 hours of CPU time. I would not use the database sorter, as my goal is to pass only unique rows before insert.

    a+, =)
    -=Clément=-

  5. #5
    Join Date
    Jul 2007
    Posts
    2,498


    Quote Originally Posted by clement View Post
    I would not use the database sorter as my goal is to pass only unique rows before insert.
    Isn't select distinct an option?
    Pedro Alves
    Meet us on ##pentaho, a FreeNode irc channel

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729


    http://jira.pentaho.org/browse/PDI-516

    You should see a more "sustainable" solution in version 3.0.1.
    I'm running it with great success on my machine. I'll be able to commit the code in a few days.

    Matt

  7. #7
    Join Date
    Mar 2007
    Posts
    216


    Hi,

    Please see the attached file: this transformation generates 5001 rows and sends them to a "Sort rows" step with 5000 rows in memory and "pass only unique rows" checked. In 1 second, you should be able to get a Java Heap Space OutOfMemoryError.
    There is something there that I do not understand. Can someone enlighten me?

    @Matt : It seems to be a good idea to rethink the "Sort rows" step, thanks again for what you and your team are doing with PDI.

    @pmalves: You're right, it's an option. I was thinking until now that selecting the "pass only unique rows" option would allow me not to use a "Unique Rows" step behind my "Sort" step. After reading the "balloon" (tooltip) that appears when the mouse cursor hovers over the option, I no longer know if "pass only unique rows" has the same behaviour as chaining a "Sort" step and a "Unique Rows" step.


    a+,=)
    -=Clément=-

    EDIT: Without understanding why, the attached sample is now working well.
    Please standby for further test results.
    Attached Files
    Last edited by clement; 11-28-2007 at 10:38 AM. Reason: forgot the attached file

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729


    Using your sample...
    My Spoon instance with 2048M allocated can easily sort 10M rows in memory.
    Since I have the memory statistics here with me, it only used around 30% of the 2048M = 620M.

    Suffice it to say I didn't see an Out of memory.

    Matt

  9. #9
    Join Date
    Jan 2008
    Posts
    16


    I am running into a similar issue with heap space and would like to know how to set the -Xmx parameter from within Spoon.

    I don't see any mention of 'heap' or 'xmx' in the 'Spoon User Guide'... Also the feature added as result of PDI-516 is not documented in the Spoon user guide.

    Thanks,
    Shane

  10. #10
    Join Date
    Nov 1999
    Posts
    9,729

    How to set the maximum memory size

    It depends on the situation:

    1) if you run on the Pentaho platform, you need to give JBoss or Tomcat or the container you are running on enough memory
    2) if you are running spoon/kitchen/pan using the provided shell scripts (*.sh/*.bat), change the -Xmx parameter in those shell scripts
    3) if you are running spoon 3.0.1 or later on Windows using the kettle.exe starter (default installer), you can change the -Xmx parameter in file Kettle.l4j.ini. If the file is not present, create it in the same directory as kettle.exe. In versions 3.0.2 or above, this is the content:

    -----------------------snip------------------------------------
    # JVM command line options, one per line.
    # To increase the max memory limit, change the -Xmx parameter.
    #
    # Don't forget : M is for mega-bytes, k is for kilobytes.
    # If you don't specify either it's bytes!!
    #
    -Xmx256M
    -----------------------snap------------------------------------

    4) On OSX I really have no idea yet if you use the Spoon launcher; otherwise, see 2)

    As you can see, the value after the -Xmx parameter is the maximum heap size in memory.

    HTH,

    Matt
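    A quick way to verify that an -Xmx change actually took effect is to ask the JVM itself. This throwaway class (illustrative, not part of Kettle) prints the maximum heap the JVM will use:

    ```java
    public class MaxHeapCheck {
        public static void main(String[] args) {
            // maxMemory() reports (approximately) the -Xmx limit, in bytes
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + "M");
        }
    }
    ```

    Running it with `java -Xmx512M MaxHeapCheck` should print a value close to 512M; if it still prints the old value, the script or ini file you edited is not the one actually being used.
    
    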

  11. #11
    Join Date
    Jan 2008
    Posts
    16


    Matt,
    Thanks for the prompt reply and helpful information. Is this documented somewhere and I just missed it? I have the spoon user guide, I just want to make sure that I am not overlooking a helpful document.

    Thanks,
    Shane

  12. #12


    Try to avoid the sort if you can, i.e. be smart about purposefully choosing to use that step.

    If you are getting it from a file, make sure you REALLY want to sort during a transformation, and are not just sorting to make it look 'pretty'. Have a good reason.

    If you are getting data from a database, use an ORDER BY in obtaining the data, instead of pushing the sorting to a transformation.

    If you are already utilizing TEMP style tables, depending on your data flow, you may want to dump the data to a TEMP table, then query with ORDER BY and start re-streaming your data in an ordered fashion for the rest of your transformation.

    Having said all that, there are definitely scenarios where you need to use the PDI/Kettle Sort and/or group step, but make sure you purposefully choose it, not just because it is convenient.

    Just from my own hard-earned experience :-)

    p.s. Very cool change on the sort step to size by memory percentage, thanks Matt, I'll check that out soon!

  13. #13
    Join Date
    Nov 1999
    Posts
    9,729


    Sort size by % is in 3.0.2 and works very well for my purposes. I can't seem to trick it anymore, although there will probably be people out there who will do a better job in that respect than I ever can.

    If I can add a pointer...

    If you really have to sort, consider doing it in parallel, maybe on a cluster or maybe using different step copies if you have multiple CPUs available.
    A sort is very easy to do in parallel and only requires a sorted merge to keep the data sorted on re-merging the streams: a cheap operation.

    Matt
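    The "sorted merge" Matt describes really is cheap: with several already-sorted streams, a priority queue of one head element per stream yields the globally sorted output in O(n log k). A minimal Java sketch of the idea (illustrative only, not Kettle's actual Sorted Merge step):

    ```java
    import java.util.*;

    public class SortedMerge {
        // Merge several already-sorted streams into one sorted list.
        // Each iterator must produce rows in ascending order; the heap
        // always yields the smallest current head among all streams.
        public static <T extends Comparable<T>> List<T> merge(List<Iterator<T>> streams) {
            // Heap entries pair a value with the index of the stream it came from
            PriorityQueue<Map.Entry<T, Integer>> heap =
                    new PriorityQueue<>((a, b) -> a.getKey().compareTo(b.getKey()));
            for (int i = 0; i < streams.size(); i++) {
                if (streams.get(i).hasNext()) {
                    heap.add(new AbstractMap.SimpleEntry<>(streams.get(i).next(), i));
                }
            }
            List<T> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                Map.Entry<T, Integer> e = heap.poll();
                out.add(e.getKey());
                Iterator<T> src = streams.get(e.getValue());
                if (src.hasNext()) {
                    // Refill the heap from the stream we just consumed from
                    heap.add(new AbstractMap.SimpleEntry<>(src.next(), e.getValue()));
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<Iterator<Integer>> streams = Arrays.asList(
                    Arrays.asList(1, 4, 7).iterator(),
                    Arrays.asList(2, 5, 8).iterator(),
                    Arrays.asList(3, 6, 9).iterator());
            System.out.println(merge(streams)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
        }
    }
    ```

    The heap never holds more than one row per stream, which is why re-merging parallel sort outputs costs almost no memory.
    
    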

  14. #14
    Join Date
    Feb 2008
    Posts
    216

    Sort Rows in memory

    I've read through the posts in this thread and I'm running into a similar situation. First off, I can't use the database to sort this data, unfortunately, because of limitations there where I run into an error. My source is a Progress database (using ODBC to access it).

    My goal was to make a really simple transformation to move the data from my Progress source, sort it and pass unique rows out against one field of data.

    I modified the -Xmx parameter to 2048. I reduced the rows in the Sort step to 5000. What I don't understand is why my transformation is still building temp files if 5000 is the limit of rows it can sort in memory (as per the Spoon guide).

    I just tried running it with these settings and I received a fatal error (not just a failure) and the program stopped responding.

    I'm really in total beta mode right now, so this is running on my workstation.

    Does anyone have any recommendations on resolving this? Do I need to just point the temp files off to a server somewhere and this will resolve the error? I'd really like to know why it isn't just sorting in memory though.

    Any ideas?



    This might be a dumb Java question, but in the Sort step there is a field for the sort directory that is set to: %%java.io.tmpdir%%

    Where does one change the value of this variable if you want to set it globally? I'm a total newbie to JavaScript/Java, so I'm learning as I go here.

    Thanks!
    Last edited by DebbieKat; 02-07-2008 at 02:56 PM.

  15. #15
    Join Date
    Feb 2008
    Posts
    216

    More information on Fatal Error

    OK, it looks like my transformation is failing AFTER it is done reading all the data from the source and building the sort temp files. It dies at the point where it attempts to open all of the temp files it has built.

    I'm now getting a Failed to execute runnable (java.lang.OutOfMemoryError: Java Heap Space).

    I'm attaching a screenshot of what I am seeing.

    This error is killing the application altogether and I wind up having to restart it.

    At this point, my transformation settings are as follows:

    compress tmp files checked
    sort rows: 4000

    I haven't changed the default temp directory yet.
    Last edited by DebbieKat; 06-02-2008 at 12:52 PM.

  16. #16
    Join Date
    Nov 1999
    Posts
    9,729


    Hi Debbie,

    "5000 rows" means that the sort step will sort 5000 rows at a time and spool those to disk if there are more input rows.
    It does so in the standard temporary files directory called "%%java.io.tmpdir%%". However you can change that with another value if you like (just enter the directory). If you want to create a new global variable, you can do so in file $HOME/.kettle/kettle.properties.

    It would have to be really big rows if you would have a problem with 2GB of RAM though.

    I'm almost certain that this is not the problem that you're having. If the software crashes (core-dumps), there is usually a crash log file that explains what happened. And in 99.9% of the cases that get reported on this forum, it's the ODBC driver doing something weird. You solve it typically by upgrading the ODBC driver or by using a JDBC alternative (much safer) if there is one.

    To verify that this is the problem, replace the Sort step with a dummy and just try to read all the rows from the source database. Then see what happens.

    All the best,

    Matt
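    The sort-and-spool behaviour Matt describes can be sketched as follows. This is a minimal illustration of the external-merge-sort technique, not Kettle's actual SortRows code; all names are made up. Note that the merge phase opens every chunk file at once, which is exactly the phase that was blowing up in this thread:

    ```java
    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    public class ExternalSort {
        // Sort rows using a fixed-size in-memory buffer: each full buffer is
        // sorted and spooled to a temp file, then the sorted chunks are merged.
        public static List<String> sort(List<String> rows, int rowsInMemory) {
            try {
                List<Path> chunks = new ArrayList<>();
                List<String> buffer = new ArrayList<>();
                for (String row : rows) {
                    buffer.add(row);
                    if (buffer.size() >= rowsInMemory) {
                        chunks.add(spool(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) chunks.add(spool(buffer));

                // Merge phase: open all chunk files, repeatedly take the smallest head
                List<BufferedReader> readers = new ArrayList<>();
                PriorityQueue<String[]> heap =
                        new PriorityQueue<>((a, b) -> a[0].compareTo(b[0]));
                for (Path p : chunks) {
                    BufferedReader r = Files.newBufferedReader(p);
                    readers.add(r);
                    String line = r.readLine();
                    if (line != null) heap.add(new String[]{line, String.valueOf(readers.size() - 1)});
                }
                List<String> out = new ArrayList<>();
                while (!heap.isEmpty()) {
                    String[] head = heap.poll();
                    out.add(head[0]);
                    String next = readers.get(Integer.parseInt(head[1])).readLine();
                    if (next != null) heap.add(new String[]{next, head[1]});
                }
                for (BufferedReader r : readers) r.close();
                for (Path p : chunks) Files.deleteIfExists(p);
                return out;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        private static Path spool(List<String> buffer) throws IOException {
            Collections.sort(buffer); // in-memory sort of one chunk
            Path tmp = Files.createTempFile("sort_chunk_", ".tmp");
            Files.write(tmp, buffer);
            return tmp;
        }

        public static void main(String[] args) {
            System.out.println(sort(Arrays.asList("d", "b", "e", "a", "c"), 2));
            // [a, b, c, d, e]
        }
    }
    ```

    With a small "rows in memory" setting and a huge input, the chunk count (and the number of files open at merge time) grows large, which is one reason a tiny sort size is not automatically safer.
    
    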

  17. #17
    Join Date
    Feb 2008
    Posts
    216

    Crash Information

    Hi Matt -
    Here's the end of the crash file output. Does this help at all?

    2008/02/07 11:14:28 - sortRowsLedgerCurrencySource.0 - Opening tmp-file: [C:\DOCUME~1\dshapiro\LOCALS~1\Temp\out_0543507e-d5b0-11dc-98c4-255c3066a3e1.tmp]
    2008/02/07 11:14:29 - sortRowsLedgerCurrencySource.0 - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Unexpected error :
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : An unexpected error occurred in Spoon:
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Failed to execute runnable (java.lang.OutOfMemoryError: Java heap space)
    2008/02/07 11:14:34 - sortRowsLedgerCurrencySource.0 - Finished processing (I=0, O=0, R=18097869, W=0, U=0, E=0)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : org.eclipse.swt.SWTException: Failed to execute runnable (java.lang.OutOfMemoryError: Java heap space)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.eclipse.swt.SWT.error(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.eclipse.swt.SWT.error(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.eclipse.swt.widgets.Display.runAsyncMessages(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.eclipse.swt.widgets.Display.readAndDispatch(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.pentaho.di.ui.spoon.Spoon.readAndDispatch(Spoon.java:841)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.pentaho.di.ui.spoon.Spoon.start(Spoon.java:5589)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.pentaho.di.ui.spoon.Spoon.run(Spoon.java:5685)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.pentaho.di.ui.spoon.Spoon.main(Spoon.java:371)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at java.lang.reflect.Method.invoke(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.pentaho.commons.launcher.Launcher.main(Launcher.java:116)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Caused by: java.lang.OutOfMemoryError: Java heap space
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at java.util.Arrays.copyOfRange(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at java.lang.String.<init>(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at java.lang.StringBuffer.toString(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.pentaho.di.ui.spoon.trans.TransLog$2$1.run(TransLog.java:321)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : at org.eclipse.swt.widgets.RunnableLock.run(Unknown Source)
    2008/02/07 11:14:34 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : ... 12 more
    2008/02/07 11:14:35 - Spoon - ERROR (version 3.0.1, build 534 from 2007/12/12 12:28:23) : Fatal error : org.eclipse.swt.SWTException: Failed to execute runnable (java.lang.OutOfMemoryError: Java heap space)


    I'll report back on using the dummy transformation in a bit.


    THANKS!!!

  18. #18
    Join Date
    Nov 1999
    Posts
    9,729


    No help at all, but at least we can rule out an ODBC crash ;-)

  19. #19
    Join Date
    Feb 2008
    Posts
    216


    Quote Originally Posted by MattCasters View Post
    Hi Debbie,

    "5000 rows" means that the sort step will sort 5000 rows at a time and spool those to disk if there are more input rows.
    It does so in the standard temporary files directory called "%%java.io.tmpdir%%". However you can change that with another value if you like (just enter the directory). If you want to create a new global variable, you can do so in file $HOME/.kettle/kettle.properties.

    It would have to be really big rows if you would have a problem with 2GB of RAM though.

    I'm almost certain that this is not the problem that you're having. If the software crashes (core-dumps), there is usually a crash log file that explains what happened. And in 99.9% of the cases that get reported on this forum, it's the ODBC driver doing something weird. You solve it typically by upgrading the ODBC driver or by using a JDBC alternative (much safer) if there is one.

    To verify that this is the problem, replace the Sort step with a dummy and just try to read all the rows from the source database. Then see what happens.

    All the best,

    Matt
    Hi Matt -
    I tried running the transformation with just the table input and dummy transform and it worked fine.

    Here's the last part of the output log:
    2008/02/07 13:02:07 - tblInputLedgerCurrencySource.0 - linenr 18050000
    2008/02/07 13:02:07 - Dummy (do nothing).0 - Linenr 18050000
    2008/02/07 13:02:21 - tblInputLedgerCurrencySource.0 - Finished reading query, closing connection.
    2008/02/07 13:02:21 - Dummy (do nothing).0 - Finished processing (I=0, O=0, R=18097869, W=18097869, U=0, E=0)
    2008/02/07 13:02:21 - syteline_training - Connection to database closed!
    2008/02/07 13:02:21 - tblInputLedgerCurrencySource.0 - Finished processing (I=18097869, O=0, R=0, W=18097869, U=0, E=0)
    2008/02/07 13:02:21 - Spoon - The transformation has finished!!

    So, I think the source is working as expected. This is a good thing, because I haven't heard of any JDBC drivers available for Progress.

    I'm only sorting on a single field that is varchar(6), so the size of the rows shouldn't be giving me any grief.

    I was able to sort the data at the database level, but I know I won't be able to use distinct. So it looks like I'm not having much luck with the sort transformation.

  20. #20
    Join Date
    Nov 1999
    Posts
    9,729


    It's pretty simple then, sort on the database and put a "Unique" step to work.
    That gives you a unique set of rows to work with AND you don't have to spend time sorting data.

    How does that sound?

    Matt
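    The reason this combination is cheap: once the database has done the ORDER BY, duplicates arrive back-to-back, so a "Unique rows"-style step only needs to remember the previous row, i.e. constant memory regardless of row count. A minimal sketch of that idea (illustrative, not Kettle's implementation):

    ```java
    import java.util.*;

    public class UniqueRows {
        // On a SORTED stream, duplicates are adjacent, so uniqueness only
        // needs to compare each row with the previous one.
        public static <T> List<T> unique(List<T> sortedRows) {
            List<T> out = new ArrayList<>();
            T previous = null;
            for (T row : sortedRows) {
                if (previous == null || !row.equals(previous)) {
                    out.add(row);
                }
                previous = row;
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(unique(Arrays.asList("AAA", "AAA", "BBB", "CCC", "CCC")));
            // [AAA, BBB, CCC]
        }
    }
    ```

    If the input were not sorted, this single-predecessor comparison would miss duplicates, which is why the step only works downstream of a sort (in the database or in PDI).
    
    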

  21. #21
    Join Date
    Feb 2008
    Posts
    216


    Quote Originally Posted by MattCasters View Post
    It's pretty simple then, sort on the database and put a "Unique" step to work.
    That gives you a unique set of rows to work with AND you don't have to spend time sorting data.

    How does that sound?

    Matt
    This is working for the time being, but our source database can be flaky at times, which is why I was hoping to avoid sorting directly against it (we get memory block errors and such). Honestly, it took just as much time against the database as it did with the sorter (until it blew up trying to reopen the temp files, that is). It's not the beefiest of database servers. Thankfully, we're upgrading off of Progress within the next couple of months!

  22. #22
    Join Date
    Nov 1999
    Posts
    9,729


    Try version 3.0.2 of Kettle tomorrow.(**) It has a max % of memory option in the "Sort Step".
    That should prevent you from running out of memory.

    All the best,

    Matt

    (**) or try this build.
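    The "max % of memory" option can be illustrated with a small sketch. This is not Kettle's actual implementation, just the general guard the option suggests: keep buffering rows only while used heap stays under a configured fraction of the maximum, and spool to disk once it is exceeded:

    ```java
    public class MemoryThreshold {
        // Returns true while used heap is below maxFraction of the max heap.
        // A sort step could buffer rows while this holds and spool when it fails.
        public static boolean underThreshold(double maxFraction) {
            Runtime rt = Runtime.getRuntime();
            long used = rt.totalMemory() - rt.freeMemory();
            return used < (long) (rt.maxMemory() * maxFraction);
        }

        public static void main(String[] args) {
            if (!underThreshold(0.25)) {
                System.out.println("Time to spool the sort buffer to disk");
            }
        }
    }
    ```

    One caveat with this style of check: the figures from Runtime are approximate and move with garbage collection, so real implementations typically add some slack rather than running right up to the limit.
    
    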

  23. #23

    Memory

    I have got the same problem.

    I edited a .sh file and increased the memory from 256 MB to 512 MB.

    I also changed the sort % from 25% to 75%.

    Now the process runs. The other processes are running using less memory, and even the CPU usage drops 2 or 3 points below 100%, allowing me to do something else.

    thank you,

    Paulo Santos
