Hitachi Vantara Pentaho Community Forums
Results 1 to 6 of 6

Thread: Trouble reading xlsx produced by PDI

  1. #1

    Default Trouble reading xlsx produced by PDI

    I created an xlsx Excel file using PDI/Spoon using Microsoft Excel Writer, where Extension = "xlsx [Excel 2007 and above]"

    The opens fine in Excel, but getting an error when trying to read it with Microsoft Excel Input, where Spread sheet type (engine) = "Excel 2007 XLSX (Apache POI Streaming)".

    The error is: "I was unable to find any fields in the Excel file(s)." This happens when I try Get fields from header row.

    If I try Spread sheet type (engine) = "Excel 2007 XLSX (Apache POI)", I get Unable to open dialog for this step java.lang.OutOfMemoryError: GC overhead limit exceeded.

    I found on a smaller file where I don't hit the out of memory error, using "Excel 2007 XLSX (Apache POI)" allows the fields to be read.

    Another interesting thing. If I open the file in Excel 2013 and save the file, then I can read the fields no problem using "Excel 2007 XLSX (Apache POI Streaming)". So somehow PDI does not like the xlsx file created using PDI to be read using "Excel 2007 XLSX (Apache POI Streaming)". I will see if I can increase my java memory settings, but curious if there is anything I can do to get "Excel 2007 XLSX (Apache POI Streaming)" working? Don't want to have to use Excel to get the file in a readable state.

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Without having that Excel file is difficult to understand what is causing the issue.
    If you think that it could be a bug, you should try to get a newer version of Apache POI.
    Or have a look at their website and see if soemone has the same issue.
    -- Mick --

  3. #3

    Default

    Here is a sample file where I see the problem happening. Regarding a newer version of Apache POI...not sure where to look what version I have that came with PDI (6.0.1.0-386) and or how to independently install a newer version.

    https://www.dropbox.com/s/6500zantx2...file.xlsx?dl=0

  4. #4
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Hi.
    I have downloaded your file and could open it using Spread sheet type (engine) = "Excel 2007 XLSX (Apache POI)".
    Note that I have a PC with Java 64bit (Win 7 64bit) and have allowed Java to use 4GB of memory.
    Maybe you need to increase the memory allocated to Java?
    -- Mick --

  5. #5
    Join Date
    Dec 2009
    Posts
    332

    Default

    The error message you listed may be due to a lack of memory allocated to your process. This is adjusted by altering the Xmx and Xms values in the batch or shell script file
    I run with a lower amount than Mick_data mentioned normally just 1 Gig:
    -Xmx1g -Xms1g
    Except for instances where I know there will be large files involved in which case I use 16 Gig:
    -Xmx16g -Xms16g

    Try altering these in your spoon.sh on linux or spoon.bat on Windows and see if that resolves your issue.

  6. #6

    Default

    Thanks for the replies. Yes, one workaround is to increase the memory allocation until Excel 2007 XLSX (Apache POI) is able to open the file. The real xlsx file I'm working with is 47MB and was able to open it allocating 8GB RAM. However, it seems that Excel 2007 XLSX (Apache POI Streaming) is much more efficient in how you can work with large files without allocating huge amounts of memory which is why I wish that one would work. I still think there is something odd going on with the Excel Writer, because as mentioned opening the file in Excel and saving it allows Excel 2007 XLSX (Apache POI Streaming)to work. One other workaround I found is to instead have PDI write to a text file. The text file is much larger than the xlsx file, but PDI is able to read it fine without large amounts of memory being allocated.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.