Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Excel input Step- Spread type engine

  1. #1

    Default Excel input Step- Spread sheet type engine

    Hi All,

    I have 3 excel files with same format. I am doing some transformations on it. I execute it in loop.
    I have selected Spread sheet Type engine as "Excel 2007 Apache POI streaming" in excel input step. It works fine for 2 excel files. But for 1 excel file it does not return any rows.
    For that 1 excel file it works with "Excel 2007 Apache POI" Spread sheet Type engine. why so?
    What is the diffrence between "Excel 2007 Apache POI streaming" and "Excel 2007 Apache POI" Spread sheet Type engines?
    I noticed "Excel 2007 Apache POI streaming" works faster.
    So How can I make Spread sheet Type engine as "Excel 2007 Apache POI streaming" work for all 3 excels?
    I use penhato 7.1. Windows 7. JRE 1.8

    please help
    Thanks
    Last edited by Ajinkya; 08-23-2018 at 07:57 AM.

  2. #2

    Default

    Anyone out there who can help me on this?

  3. #3
    Join Date
    Nov 2009
    Posts
    688

    Default

    What is the difference between the Excel files? Can you share those files?

  4. #4

    Default

    Due to security issues I cannot share the files with you. But all files are xlsx files. All have same columns with different data in it.

  5. #5

    Default

    Old "XLS" format files used a binary storage format. New "XLSX" format files are a series of XML files, compressed into a ZIP file format.

    Because the newer formats are XML based, the Apache POI library has two options for reading these spreadsheets.

    1. Load the XML document into memory and then parse and extract the data for PDI (similar to how web browsers render HTML)
    2. Read the XML document in "streaming" mode (it doesn't store the full document in memory, just the minimum needed to read the next part)

    When dealing with large files, the first option may lead to the application to freeze/crash, as the in-memory representation of XML is much larger than the file size that is used to save it on your hard drive. Because of this, most people will end up using the streaming option for large files, as it will use much less memory to process an Excel file.

    In your case, since two of the three are working in streaming mode, it likely comes down to some specific differences with the 3rd file why it won't read the data in it. I'd recommend reviewing the configuration settings in the step against what is in the Excel file (sheet names, row/column start points, hidden columns in the file etc.). If that doesn't solve it, you may want to reach out to Pentaho Support if you have an Enterprise license, so they can provide more help.

  6. #6

    Default

    Is it mandatory to use Apache streaming mode only if xlsx file is large?What if xlsx file is not too large?
    May be it contains say 500 rows only. In that case Apache Streaming mode will work?

  7. #7

    Default

    The streaming mode works equally well with any size file, and may be the only option for larger files.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.