Hitachi Vantara Pentaho Community Forums

Thread: S3 file output limit

  1. #1
    Join Date
    Mar 2015
    Posts
    1

    Default S3 file output limit

    Hey Everyone,

    I've been working over the past few days to get transformations set up to move data from our Postgres server to Redshift, using a free-tier S3 bucket as an intermediary and PDI 4.4 for the transfer. Currently I am using a Table Input step to pull from Postgres and an S3 File Output step to produce a file that I can then reference in a COPY statement in an Execute SQL step to pull the data into Redshift. Everything has been working as expected until I started on some of the bigger tables: I cannot output a file to S3 bigger than 128MB. If I limit the rows in the input it works, until I hit a row count that produces 128MB of output on S3, at which point the transformation errors out and fails. The error log is below. Has anyone run across an issue like this before?
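
    For context, the Redshift side of the load is just a COPY from the S3 file. A rough sketch of the statement is below (the bucket, table, delimiter and credentials are placeholders; in the transformation the SQL goes into the Execute SQL step, though it can also be run through psql against the Redshift endpoint, since Redshift speaks the Postgres protocol):

    # Hypothetical example only: the cluster endpoint, database, user, table,
    # bucket and delimiter all need to be replaced with your own values.
    # IGNOREHEADER 1 assumes the output file includes a header row.
    psql -h my-cluster.abc123.us-east-1.redshift.amazonaws.com -p 5439 -U my_user -d my_db -c "
      COPY my_table
      FROM 's3://my-bucket/ingest/a_a_a.txt'
      CREDENTIALS 'aws_access_key_id=<ACCESS_KEY_ID>;aws_secret_access_key=<SECRET_ACCESS_KEY>'
      DELIMITER ';'
      IGNOREHEADER 1;
    "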

    Thanks.

    Tim

    2015/03/16 16:03:55 - S3 File Output.0 - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : Unexpected error
    2015/03/16 16:03:55 - S3 File Output.0 - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : org.pentaho.di.core.exception.KettleStepException:
    2015/03/16 16:03:55 - S3 File Output.0 - Error writing line
    2015/03/16 16:03:55 - S3 File Output.0 -
    2015/03/16 16:03:55 - S3 File Output.0 - Error writing field content to file
    2015/03/16 16:03:55 - S3 File Output.0 - Read end dead
    2015/03/16 16:03:55 - S3 File Output.0 -
    2015/03/16 16:03:55 - S3 File Output.0 -
    2015/03/16 16:03:55 - S3 File Output.0 - at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeRowToFile(TextFileOutput.java:275)
    2015/03/16 16:03:55 - S3 File Output.0 - at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.processRow(TextFileOutput.java:197)
    2015/03/16 16:03:55 - S3 File Output.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.lang.Thread.run(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - Caused by: org.pentaho.di.core.exception.KettleStepException:
    2015/03/16 16:03:55 - S3 File Output.0 - Error writing field content to file
    2015/03/16 16:03:55 - S3 File Output.0 - Read end dead
    2015/03/16 16:03:55 - S3 File Output.0 -
    2015/03/16 16:03:55 - S3 File Output.0 - at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeField(TextFileOutput.java:437)
    2015/03/16 16:03:55 - S3 File Output.0 - at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeRowToFile(TextFileOutput.java:251)
    2015/03/16 16:03:55 - S3 File Output.0 - ... 3 more
    2015/03/16 16:03:55 - S3 File Output.0 - Caused by: java.io.IOException: Read end dead
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.PipedInputStream.checkStateForReceive(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.PipedInputStream.awaitSpace(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.PipedInputStream.receive(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.PipedOutputStream.write(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.BufferedOutputStream.write(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at org.apache.commons.vfs.util.MonitorOutputStream.write(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at org.pentaho.di.core.compress.CompressionOutputStream.write(CompressionOutputStream.java:36)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.OutputStream.write(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.BufferedOutputStream.write(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at java.io.FilterOutputStream.write(Unknown Source)
    2015/03/16 16:03:55 - S3 File Output.0 - at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeField(TextFileOutput.java:417)
    2015/03/16 16:03:55 - S3 File Output.0 - ... 4 more
    2015/03/16 16:03:55 - S3 File Output.0 - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : Exception trying to close file: java.io.IOException: Read end dead
    2015/03/16 16:04:04 - S3 File Output.0 - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : Unexpected error closing file
    2015/03/16 16:04:04 - S3 File Output.0 - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : org.apache.commons.vfs.FileSystemException: Could not close the output stream for file "s3://s3/tletson-tracking/ingest/a_a_a.txt".
    2015/03/16 16:04:04 - S3 File Output.0 - at org.apache.commons.vfs.provider.DefaultFileContent$FileContentOutputStream.close(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.dispose(TextFileOutput.java:845)
    2015/03/16 16:04:04 - S3 File Output.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:96)
    2015/03/16 16:04:04 - S3 File Output.0 - at java.lang.Thread.run(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - Caused by: java.io.IOException: Read end dead
    2015/03/16 16:04:04 - S3 File Output.0 - at java.io.PipedInputStream.checkStateForReceive(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - at java.io.PipedInputStream.receive(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - at java.io.PipedOutputStream.write(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - at java.io.BufferedOutputStream.flush(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - at org.apache.commons.vfs.util.MonitorOutputStream.flush(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - at java.io.FilterOutputStream.close(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - at org.apache.commons.vfs.util.MonitorOutputStream.close(Unknown Source)
    2015/03/16 16:04:04 - S3 File Output.0 - ... 4 more
    2015/03/16 16:04:04 - S3 File Output.0 - Finished processing (I=0, O=975092, R=975093, W=975092, U=0, E=1)
    2015/03/16 16:04:04 - a_a_a - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : Errors detected!
    2015/03/16 16:04:04 - a_a_a - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : Errors detected!
    2015/03/16 16:04:04 - a_a_a - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : Errors detected!
    2015/03/16 16:04:04 - a_a_a - Transformation detected one or more steps with errors.
    2015/03/16 16:04:04 - a_a_a - Transformation is killing the other steps!

  2. #2
    Join Date
    Apr 2014
    Posts
    18

    Default

    I am having the exact same error, except that I am using PDI 5.3.0. It is really hard to believe that there is a limit on the extract size. Is there a configuration setting in Kettle to allow bigger extracts (in the 100MM range)?

  3. #3

    Default

    Hi Tim,

    In researching the same problem, I found that the issue is that S3 requires the expected file size to be known before the upload can begin, which causes trouble when trying to stream data like this. See PDI-13628 for more details.

    For small datasets it's fine to use the S3 File Output step, as the jetS3t library can briefly hold the data in memory and then calculate the file size.

    For larger datasets, your best bet is to use a Text File Output step (or similar) and then use the AWS CLI utilities, or another tool, to perform the S3 upload after the file is written.
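
    For example, a minimal sketch of that approach (assuming the AWS CLI is installed and configured; the path, bucket and region below are placeholders):

    # Write the extract locally with a Text File Output step, then push it to S3.
    # aws s3 cp switches to multipart upload automatically for large files, so the
    # size limit seen with the S3 File Output step does not come into play here.
    aws s3 cp /tmp/extract_for_redshift.txt s3://my-bucket/ingest/extract_for_redshift.txt --region eu-west-1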

    Regards,

  4. #4
    Join Date
    Sep 2015
    Posts
    2

    Default

    We also had this problem and solved it in two steps:
    - Include the access credentials in the URL, as shown below: s3://access_key_id:secret_key@s3/bucket_name/folder_name/file_name
    - Try moving your bucket (for us the Frankfurt region didn't work, apparently because of its stricter request-signing requirements, so we moved to Ireland)
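
    For example, the filename field of the S3 File Output step can be pointed at something like this (the variable names are just an illustration; any variables defined in kettle.properties that hold the access key ID and secret key will do, and a secret key containing "/" may need to be URL-encoded):

    s3://${S3_ACCESS_KEY}:${S3_SECRET_KEY}@s3/my-bucket/ingest/a_a_a.txt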

  5. #5
    Join Date
    Dec 2015
    Posts
    1

    Default

    I am also having the same issue. Can you help me?


    2015/12/15 12:40:34 - Table output.0 - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : Because of an error, this step can't continue:
    2015/12/15 12:40:34 - Table output.0 - ERROR (version 5.3.0.0-213, build 1 from 2015-02-02_12-17-08 by buildguy) : org.pentaho.di.core.exception.KettleException:
    2015/12/15 12:40:34 - Table output.0 - Error batch inserting rows into table [tbl_fact_activity].
    2015/12/15 12:40:34 - Table output.0 - Errors encountered (first 10):
    2015/12/15 12:40:34 - Table output.0 - ERROR: Out of memory (seg1 dwhhadoop3:40000 pid=12950)
    2015/12/15 12:40:34 - Table output.0 - Detail: VM Protect failed to allocate 16384 bytes, 0 MB available
    2015/12/15 12:40:34 - Table output.0 -
    2015/12/15 12:40:34 - Table output.0 -
    2015/12/15 12:40:34 - Table output.0 - Error updating batch
    2015/12/15 12:40:34 - Table output.0 - Batch entry 175 INSERT INTO tbl_fact_activity (MatriId, DomainId, TimeId, ActivityId) VALUES ( 'M3221843', 1, '2014-01-24 08:11:00', 2) was aborted. Call getNextException to see the cause.
    2015/12/15 12:40:34 - Table output.0 -
    2015/12/15 12:40:34 - Table output.0 -
    2015/12/15 12:40:34 - Table output.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.writeToTable(TableOutput.java:342)
    2015/12/15 12:40:34 - Table output.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.processRow(TableOutput.java:118)
    2015/12/15 12:40:34 - Table output.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
    2015/12/15 12:40:34 - Table output.0 - at java.lang.Thread.run(Thread.java:745)
    2015/12/15 12:40:34 - Table output.0 - Caused by: org.pentaho.di.core.exception.KettleDatabaseBatchException:
    2015/12/15 12:40:34 - Table output.0 - Error updating batch
    2015/12/15 12:40:34 - Table output.0 - Batch entry 175 INSERT INTO tbl_fact_activity (MatriId, DomainId, TimeId, ActivityId) VALUES ( 'M3221843', 1, '2014-01-24 08:11:00', 2) was aborted. Call getNextException to see the cause.
    2015/12/15 12:40:34 - Table output.0 -
    2015/12/15 12:40:34 - Table output.0 - at org.pentaho.di.core.database.Database.createKettleDatabaseBatchException(Database.java:1351)
    2015/12/15 12:40:34 - Table output.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.writeToTable(TableOutput.java:289)
    2015/12/15 12:40:34 - Table output.0 - ... 3 more
    2015/12/15 12:40:34 - Table output.0 - Caused by: java.sql.BatchUpdateException: Batch entry 175 INSERT INTO tbl_fact_activity (MatriId, DomainId, TimeId, ActivityId) VALUES ( 'M3221843', 1, '2014-01-24 08:11:00', 2) was aborted. Call getNextException to see the cause.
    2015/12/15 12:40:34 - Table output.0 - at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2778)
    2015/12/15 12:40:34 - Table output.0 - at org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:395)
    2015/12/15 12:40:34 - Table output.0 - at org.postgresql.core.v3.QueryExecutorImpl$ErrorTrackingResultHandler.handleError(QueryExecutorImpl.java:280)
    2015/12/15 12:40:34 - Table output.0 - at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1912)
    2015/12/15 12:40:34 - Table output.0 - at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(QueryExecutorImpl.java:1105)
    2015/12/15 12:40:34 - Table output.0 - at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1126)
    2015/12/15 12:40:34 - Table output.0 - at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:329)
    2015/12/15 12:40:34 - Table output.0 - at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2959)
    2015/12/15 12:40:34 - Table output.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.writeToTable(TableOutput.java:285)
    2015/12/15 12:40:34 - Table output.0 - ... 3 more
    2015/12/15 12:40:34 - Text file input 2.0 - Finished processing (I=1954934, O=0, R=15, W=1954934, U=0, E=0)
    2015/12/15 12:40:34 - Dummy (do nothing).0 - Finished processing (I=0, O=0, R=1934000, W=1934000, U=0, E=0)
    2015/12/15 12:40:34 - Table output.0 - Finished processing (I=0, O=1934999, R=1935000, W=1934000, U=0, E=1)
    2015/12/15 12:40:34 - Greenplum_LOAD_CAR_DATA - Transformation detected one or more steps with errors.
    2015/12/15 12:40:34 - Greenplum_LOAD_CAR_DATA - Transformation is killing the other steps!
    2015/12/15 12:40:34 - Select values.0 - Finished processing (I=0, O=0, R=1945001, W=1945001, U=0, E=0)
    2015/12/15 12:40:34 - Pan - Finished!
    2015/12/15 12:40:34 - Pan - Start=2015/12/15 11:31:58.160, Stop=2015/12/15 12:40:34.153
    2015/12/15 12:40:34 - Pan - Processing ended after 1 hours, 8 minutes and 35 seconds (4115 seconds total).

  6. #6
    Join Date
    Sep 2016
    Posts
    1

    Default

    Hi All,

    I know this thread is a bit old and you've all likely figured out a way around this, but I wanted to let everyone know how I got around the issue, with a few more details. I could have used this myself when I first hit the problem, so I hope it's useful to anyone who is new to it:

    This error message, as outlined in the bug Pentaho has on record (PDI-13628), appears in the Java logging when attempting to run a Transformation that includes a step that gets data from a source and places it in S3.

    While I've not seen a direct fix from Pentaho, the following details describe a workaround that should be equally fast to execute (maybe faster) and should also be less demanding on memory and system resources, since it pushes a good chunk of the work onto the OS.

    To complete this process, you must use a local file and have enough room for it!
    This process asks you to dump the data to your local file system (or an accessible mount) first, before moving it to S3. This means you will need enough room in the "dump" space to store the file before moving it up to Amazon. Of course, you can easily delete the file after it has been uploaded, as part of your Pentaho process.

    Software Installation is needed for this to work
    In order for this to work, your Pentaho system must have the appropriate AWS S3 software (the AWS CLI) installed, and you must have AWS credentials. You must also have all the necessary database clients installed, with drivers connected and ready to go.
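
    A quick sanity check before building the Job might look something like this (a sketch only; swap in whichever database client you actually use):

    # Confirm the database client and the AWS CLI are available on this machine.
    which mysql
    aws --version
    # Confirm the locally configured AWS credentials actually work.
    aws sts get-caller-identity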

    To work around the "Read End Dead" issue:

    • Create a new Job (NOT a Transformation) and save it in the appropriate place in the Repository.
    • Add a Start Task.
    • In the Job, create a Shell Script Task and add a hop from the Start Task to this new Shell Script Task.
      1. The task will have one or more commands that will be executed from the command line. The commands take data from the source DB and dump it to a local file, by executing SQL through the locally installed client that can reach the source DB.
      2. Each source is of course different; however, the following is an example of a command that can be used (note that the password shown is a placeholder, and that multiple commands can be placed into one script). This command pulls data from a MySQL source through a query and dumps it to a local file called "test_mysql_dump.txt". (A consolidated sketch of both Shell Script tasks as a single script also appears after this list.)

    /usr/bin/mysql -e "SELECT * FROM TABLENAME" DATABASENAME -h HOSTNAMEORIP -P PORT -u USERNAME -pTHISISAMYSQLPASSWORD > /data/test_mysql_dump.txt


    • In the same Job, create another Shell Script Task, tie it to the first task through a hop.
      1. This task will take the data that has been downloaded to the "test_mysql_dump.txt" file and upload it to S3. Details of the script command:

    aws s3 cp /data/test_mysql_dump.txt s3://your_repo_name/your_folder_path/test_from_mysql.txt


    • Optionally, add a task that removes the file /data/test_mysql_dump.txt, or any other task you would like. Tie it (or them) to the second Shell Script Task with a hop as necessary.
    • Add a Success task and tie it to the last functional task with a hop.
    • Save the Job.
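
    Put together, the two Shell Script tasks amount to something like the following (a sketch only; the host, port, credentials, paths and bucket are the same placeholders used above):

    #!/bin/sh
    set -e  # stop at the first failing command so the Job reports the error

    # Task 1: dump the query result from the source database to a local file.
    /usr/bin/mysql -e "SELECT * FROM TABLENAME" DATABASENAME \
        -h HOSTNAMEORIP -P PORT -u USERNAME -pTHISISAMYSQLPASSWORD \
        > /data/test_mysql_dump.txt

    # Task 2: upload the dump to S3 (the AWS CLI handles multipart upload itself).
    aws s3 cp /data/test_mysql_dump.txt s3://your_repo_name/your_folder_path/test_from_mysql.txt

    # Optional cleanup: remove the local dump once the upload has succeeded.
    rm -f /data/test_mysql_dump.txt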


    We found this works for us - hope it does for you too.
