Hitachi Vantara Pentaho Community Forums

Thread: Pentaho Text file output limitation

  1. #1
    Join Date
    Oct 2014
    Posts
    18

    Pentaho Text file output limitation

    Hi,
    I have an input .xlsx file with 1 million records. I am writing its rows to CSV files using the "Text file output" step, split at 50K rows per file.

    When I run PDI, it writes only 10 files, i.e. 0.5 million records, and the transformation exits successfully.
    PDI used 3.4 GB of memory (6 GB configured in spoon.bat).

    Please tell me:
    - why PDI stopped after 0.5 million records,
    - how to check for an error condition,
    - how to resolve it.

    Thanks in advance.

    Yuvam
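
One way to confirm the shortfall outside of PDI is to count the rows that actually landed in the split files. The following is a minimal Python sketch, not anything PDI provides: the function names, the `output` directory and the `part_*.csv` naming pattern are assumptions for illustration, and it assumes each part file may carry one header row.

```python
import csv
import math
from pathlib import Path


def expected_file_count(total_rows: int, rows_per_file: int) -> int:
    """Number of split files needed to hold total_rows."""
    return math.ceil(total_rows / rows_per_file)


def count_data_rows(csv_paths, has_header: bool = True) -> int:
    """Sum the data rows over all CSV parts, skipping one header row per file if present."""
    total = 0
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            rows = sum(1 for _ in csv.reader(handle))
        total += max(rows - 1, 0) if has_header else rows
    return total


if __name__ == "__main__":
    # Figures from the post: 1 million input rows, split every 50K rows.
    print(expected_file_count(1_000_000, 50_000))  # 20
    # Hypothetical output location and naming pattern.
    parts = sorted(Path("output").glob("part_*.csv"))
    print(len(parts), "files,", count_data_rows(parts), "data rows")
```

With the numbers above, 20 files would be expected, so 10 files of 50K rows each means half of the input never reached the output step.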

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Any error message?
    Reading a big Excel file can use plenty of memory, so maybe it was a memory issue?
    -- Mick --

  3. #3
    Join Date
    Oct 2014
    Posts
    18

    Hi

    There is no error message in the PDI UI. The transformation finishes successfully, reporting "files read finished" and that 0.5 million
    records were written to the files.

    IMPORTANT: It writes exactly half of the total input records.
    On analyzing the output files, I found that they contain alternate rows from the input file.


    Thanks
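
The "alternate rows" observation can be checked mechanically when the data has a unique key column. This is a rough Python sketch under that assumption; `missing_positions` and `looks_alternating` are hypothetical helper names, and the keys would have to be read from the input and output files first.

```python
def missing_positions(input_keys, output_keys):
    """Return 0-based positions of input rows whose key never appears in the output."""
    seen = set(output_keys)
    return [i for i, key in enumerate(input_keys) if key not in seen]


def looks_alternating(positions):
    """True if the missing positions are spaced exactly two apart (every other row)."""
    return len(positions) > 1 and all(
        later - earlier == 2 for earlier, later in zip(positions, positions[1:])
    )


if __name__ == "__main__":
    # Toy data: ten input keys, of which only every other one reached the output.
    input_keys = list(range(1, 11))
    output_keys = [1, 3, 5, 7, 9]
    gaps = missing_positions(input_keys, output_keys)
    print(gaps, looks_alternating(gaps))
```

A regular every-other-row gap points to something systematic (row distribution or the reader itself) rather than to a crash or an out-of-memory condition partway through the file.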

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Show your design.
    Better yet, attach it.
    So long, and thanks for all the fish.

  5. #5
    Join Date
    Apr 2008
    Posts
    4,696

    Originally posted by yuvam:
    IMPORTANT: It writes exactly half of the total input records.
    On analyzing the output files, I found that they contain alternate rows from the input file.

    I bet you have multiple hops somewhere set to "Distribute" rather than "Copy".

    But as marabu said, post your transform, the community can help you sort out what's going on with it.
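
To illustrate why a "Distribute" hop would produce exactly this symptom: with two target steps, round-robin distribution hands each target every other row, while "Copy" hands every target all rows. The following is a simplified Python model of that behaviour, not PDI code; the function names are made up for the example.

```python
from itertools import cycle


def distribute(rows, n_targets):
    """Round-robin model: each row goes to exactly one target, in turn."""
    targets = [[] for _ in range(n_targets)]
    for row, target in zip(rows, cycle(targets)):
        target.append(row)
    return targets


def copy(rows, n_targets):
    """Copy model: every target receives its own full copy of the rows."""
    rows = list(rows)
    return [list(rows) for _ in range(n_targets)]


if __name__ == "__main__":
    rows = range(1, 11)
    print(distribute(rows, 2))  # [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
    print(copy(rows, 2))
```

If the "Text file output" step were one of two targets on a distributing hop, it would see only rows 1, 3, 5, ... which matches "exactly half" and "alternate rows".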

  6. #6
    Join Date
    Oct 2014
    Posts
    18

    I found the cause of the problem.

    In my Excel input step, I was using the "Excel 2007 XLSX (Apache POI Streaming)" engine. With it, PDI gave me alternate rows and merged values from different rows into each other.
    The preview also showed wrong data and missing rows.

    When I chose "Excel 2007 XLSX (Apache POI)", I got exactly the same row count as in the input. So there may be an issue in the "Apache POI Streaming" engine
    that causes two rows to be merged and alternate rows to be skipped.

    IMPORTANT: With the new engine, PDI uses much more memory. Earlier it ran with 3 GB, while with the new engine it takes 7 GB (7 GB configured in spoon.bat).

    I would like to suggest that the Pentaho developers test this case with an Excel input of 1 million records and 60 columns.
    I am using PDI version 5.2.0.0.

    Thanks
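
For an independent row count that does not depend on either PDI engine, the worksheet XML inside the .xlsx archive can be streamed directly. This is a minimal Python sketch using only the standard library; `count_sheet_rows` is a hypothetical helper, and the member path `xl/worksheets/sheet1.xml` is an assumption (it is the usual name of the first sheet, but a given workbook may differ). It counts every `<row>` element present in the sheet, including a header row.

```python
import xml.etree.ElementTree as ET
import zipfile

# SpreadsheetML main namespace used by worksheet XML inside .xlsx files.
SHEET_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def count_sheet_rows(xlsx_path, sheet_member="xl/worksheets/sheet1.xml"):
    """Count <row> elements in one worksheet by streaming its XML."""
    row_tag = f"{{{SHEET_NS}}}row"
    count = 0
    with zipfile.ZipFile(xlsx_path) as archive:
        with archive.open(sheet_member) as sheet_xml:
            for _event, element in ET.iterparse(sheet_xml, events=("end",)):
                if element.tag == row_tag:
                    count += 1
                    element.clear()  # discard the row's cells to keep memory use low
    return count
```

Calling `count_sheet_rows("input.xlsx")` on the source workbook (file name assumed) gives a figure to compare against what each Excel input engine reports.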

  7. #7
    Join Date
    Apr 2008
    Posts
    1,771

    Quote:
    I would like to suggest that the Pentaho developers test this case with an Excel input of 1 million records and 60 columns.

    There are JIRA tickets open about this:
    1. http://jira.pentaho.com/browse/PDI-13086
    2. http://jira.pentaho.com/browse/PDI-5269

    As far as I know, the Pentaho devs have been working on it for a while.
    But I *think* they are using an Apache component, so they are probably waiting for that project to fix the performance/memory issues.
    -- Mick --


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.