Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Excel Input step. "Current state not START_ELEMENT" exception

  1. #1
    Join Date
    Sep 2015
    Posts
    3

    Default Excel Input step. "Current state not START_ELEMENT" exception

    Hello,

    When I tried to load a large Excel file with Microsoft Excel Input step I got an exception "Current state not START_ELEMENT".
    Detailed exception info:

    Code:
    ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : Error processing row from Excel file [...] : java.lang.RuntimeException: java.lang.IllegalStateException: Current state not START_ELEMENT
    ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : java.lang.RuntimeException: java.lang.IllegalStateException: Current state not START_ELEMENT
        at org.pentaho.di.trans.steps.excelinput.staxpoi.StaxPoiSheet.getRow(StaxPoiSheet.java:167)
        at org.pentaho.di.trans.steps.excelinput.ExcelInput.getRowFromWorkbooks(ExcelInput.java:607)
        at org.pentaho.di.trans.steps.excelinput.ExcelInput.processRow(ExcelInput.java:441)
        at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
        at java.lang.Thread.run(Unknown Source)
    Caused by: java.lang.IllegalStateException: Current state not START_ELEMENT
        at com.ctc.wstx.sr.BasicStreamReader.getAttributeValue(BasicStreamReader.java:641)
        at org.pentaho.di.trans.steps.excelinput.staxpoi.StaxPoiSheet.getRow(StaxPoiSheet.java:137)
    The file has approximately 200mb and 600000 rows in it. The error occurs in the middle of the file.
    File is good, and can be opened with Excel.
    Spread sheet type (engine) property of the Excel Input Step is "Excel 2007 XLSX (Apache POI Streaming)".

    Could you please help me with this issue?

    Thanks,
    Ivan.

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    You see Kettle reporting a parser error, that Excel seemingly doesn't care about.
    So apparently it's Excel ignoring an XML error or staxpoi having a bug.
    You can try to find out using a different parser, e.g. your XML capable browser.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Sep 2010
    Posts
    18

    Default

    It happens to me also. I've been trying to work around it but:
    XLS ==> Row limit 65535
    XLSX ==> It opens the whole Excel which produces an GC overhead
    XLSX (streaming) ==> Great but it doesnt get some sheets whatever you name it or "save it as" many times you want

    Nowadays working with an Excel in PDI is a pain in the neck but "Ohh god lets add big data connectors" instead of fixing this type of problems"

    And this post will never be read, im just bored

  4. #4
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Well, Excel is not a DATA application and should not be used that way.
    Having 600000 records in Excel is crazy (IMHO).
    With that many records, something more appropriate should be used, such as text file or a DB (even sql lite if not a full DB server).
    Unfortunately the ease of use of Excel turned a spreadsheet into a "DB"!

    But in your shoes I would simply convert the Excel file into a text file using Excel itself (Save as) or some third party software.
    -- Mick --

  5. #5
    Join Date
    Sep 2010
    Posts
    18

    Default

    Quote Originally Posted by Mick_data View Post
    Well, Excel is not a DATA application and should not be used that way.
    Having 600000 records in Excel is crazy (IMHO).
    With that many records, something more appropriate should be used, such as text file or a DB (even sql lite if not a full DB server).
    Unfortunately the ease of use of Excel turned a spreadsheet into a "DB"!

    But in your shoes I would simply convert the Excel file into a text file using Excel itself (Save as) or some third party software.
    Firstly , Apache streaming API fails on any number of rows, if PDI "doesn't like" sheet name, so the number of rows is irelevant.
    Secondly, many projects and more precisely finaltial ones require large Excels with many sheets, asking a client to manually having to "Save As" as text every day is simply unaceptable.

    I just wanted to mean that If I were in PDI team I would not release an step with such limitations. Moreover I wouldn't dedicate resources for connectors like big data ones when an Excel basic one is not working properly.

    Greetings

  6. #6
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Secondly, many projects and more precisely finaltial ones require large Excels with many sheets, asking a client to manually having to "Save As" as text every day is simply unaceptable.
    There are third party software that can convert Excel files into text files.

    Moreover I wouldn't dedicate resources for connectors like big data ones when an Excel basic one is not working properly.
    I understand your opinion and I would prefer to have a better Excel Input step, but if devs have decided to dedicate more resources to connectors for big data maybe it's because their customers have made more noise (requests) for those features.

    Finally, as far as I know, PDI uses the Apache API, so it could be worth checking their website for a newer version (or even a beta) and do some testing and bugs reporting directly to them.
    -- Mick --

  7. #7
    Join Date
    Sep 2010
    Posts
    18

    Default

    Quote Originally Posted by Mick_data View Post
    There are third party software that can convert Excel files into text files.
    We cannot sell a client a tool for integration and then require another because of bug. Its really hard to justify to the client.
    Anyways thank you for your answer. Can you give a name of one of those tools (open source I guess)?

    Quote Originally Posted by Mick_data View Post
    I understand your opinion and I would prefer to have a better Excel Input step, but if devs have decided to dedicate more resources to connectors for big data maybe it's because their customers have made more noise (requests) for those features.

    Finally, as far as I know, PDI uses the Apache API, so it could be worth checking their website for a newer version (or even a beta) and do some testing and bugs reporting directly to them.
    We did replace the Apache POI streamming but we realised that its pentaho fault because library works ok "outside" of pentaho.

    Greetings

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.