Hitachi Vantara Pentaho Community Forums

Thread: Reading 4Mb Excel xlsx file -- Heap out of memory

  1. #1
    Join Date
    Sep 2013
    Posts
    1

    Reading 4Mb Excel xlsx file -- Heap out of memory

    Hi,

    I'm trying to read an xlsx file in Pentaho. To do this, I created a transformation with only a "Microsoft Excel Input" step, with the "engine" set to "Excel 2007 XLSX (Apache POI)". [The transformation file is attached -- unfortunately, I can't upload the xlsx file.]

    The first time I read it, everything goes well: the preview shows all 1000 lines I normally use.
    If I try to read it a second time, I get the following error:

    2014/04/28 15:26:46 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : UnexpectedError:
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : java.lang.OutOfMemoryError: GC overhead limit exceeded
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(Unknown Source)
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at org.pentaho.di.core.BlockingRowSet.getRowWait(BlockingRowSet.java:114)
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at org.pentaho.di.trans.step.BaseStep.getRow(BaseStep.java:1544)
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at org.pentaho.di.trans.steps.dummytrans.DummyTrans.processRow(DummyTrans.java:52)
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at org.pentaho.di.trans.step.RunThread.run(RunThread.java:50)
    2014/04/28 15:26:47 - dummy.0 - ERROR (version 4.3.0-stable, build 16786 from 2012-04-24 14.11.32 by buildguy) : at java.lang.Thread.run(Unknown Source)


    To read it again, I have to restart Pentaho Data Integration.

    I also tried previewing 6500 lines right after starting Pentaho (which I expected to work), but I got the following exception:

    Unable to open dialog for this step
    java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Unknown Source)
    at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
    ...

    From what I found by googling, the library Pentaho uses to read xlsx files (Apache POI) seems to consume too much memory. I tried both PDI 4.3.0 and PDI 5.0.1. I'm using Windows 7 Professional. The Excel file was built with Office 2010 (but I may switch to Office 2013 in the near future).

    Is there any workaround for this? I thought I could convert the xlsx file to a CSV, but I wasn't sure whether there was a better solution (I'd prefer to keep using xlsx if I could).

    Thanks =)
    Attached Files
    Last edited by Gamboa; 04-28-2014 at 02:47 PM. Reason: Adding info about my environment

  2. #2
    Join Date
    Nov 2008
    Posts
    271

    Default

    Did you try allocating additional memory to the JVM? I assume you're launching Spoon on a Windows platform, so edit your Spoon.bat file and locate the snippet
    Code:
    if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xmx512m" "-XX:MaxPermSize=256m"
    Change 512m to 1024m (or whatever size you like and your system is willing to give you): this assigns a larger portion of your memory to the JVM instance used by the Spoon process.
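For example, to raise the heap to 2 GB (which generally requires a 64-bit JVM; the exact value is just an illustration), the edited line in Spoon.bat would look like this:

```bat
rem Sketch of the edited Spoon.bat line -- only the -Xmx value changes.
rem -Xmx sets the maximum heap; -XX:MaxPermSize sets the permanent
rem generation size (relevant on Java 7 and earlier, which PDI 4.x/5.x use).
if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xmx2048m" "-XX:MaxPermSize=256m"
```

Note that a 32-bit JVM on Windows typically cannot reserve a 2 GB heap, so if the larger value is rejected, check that Spoon is running on a 64-bit Java installation.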
    Andrea Torre
    twitter: @andtorg

    join the community on ##pentaho - a freenode irc channel

  3. #3
    Join Date
    Jun 2009
    Posts
    28

    Default

    I am also experiencing this error. I changed 512m to 1024m, but it did not help.
    I tried going to 2g, but it would not take it.

  4. #4
    Join Date
    Jun 2009
    Posts
    28

    Default

    For me, the real solution is to use the CSV file input step instead of Microsoft Excel Input.
    The Excel input file was 6 MB; the CSV equivalent was 8 MB.
    However, the CSV file input step handled it with no real hit to resources.
    Kimball is right - Excel input files just suck overall. Go with CSV input.
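    To illustrate why the CSV route is so much lighter, here is a minimal Python sketch (the sample data and column names are made up for illustration; in practice you would export the sheet to CSV via Excel's Save As and open that file). A CSV parser streams one row at a time, whereas a POI-based xlsx reader has to hold large parts of the workbook in memory:

```python
import csv
import io

# Hypothetical sample standing in for the exported sheet; in practice
# this would be open("sheet.csv", newline="") on the Save-As output.
sample = "id,name,amount\n1,alice,10.5\n2,bob,7.25\n"

rows = []
# csv.reader yields one parsed row at a time instead of loading the
# whole file, which keeps memory usage flat regardless of file size.
for row in csv.reader(io.StringIO(sample)):
    rows.append(row)

header, data = rows[0], rows[1:]
print(header)     # ['id', 'name', 'amount']
print(len(data))  # 2
```

    This mirrors what the CSV file input step does internally: row-by-row streaming, which is why an 8 MB CSV goes through with no real hit to resources while a 6 MB xlsx exhausts the heap.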
