Hitachi Vantara Pentaho Community Forums

Thread: BLOBs and OutOfMemory problem

  1. #1
    Join Date
    Oct 2011
    Posts
    14

    Default BLOBs and OutOfMemory problem

    Hi guys,
    I've been trying to solve this problem for the last 2-3 days, but with no success.

    I have an MSSQL table with 140k records and the following fields: full_row_md5 (nvarchar(4000)), id (bigint), ADDRESS (CLOB), OVERVIEW (CLOB), NAME (CLOB), FN_PICTURE (nvarchar(4000)), PICTURE (BLOB) - each row has a picture (JPEG) in the PICTURE field.

    The main goal is to export this data to an Excel file and write the BLOBs (pictures) to a sub-directory called file_PICTURE.

    If I don't select the PICTURE field in the SQL query, everything is OK and the 140k records are written into 3 XLS files in about 40 seconds (which is great).
    But when I also want to export the images, the transformation runs for about 1 minute and then throws an OutOfMemoryError.

    I'm attaching the transformation - I run it with Carte and use "c:\temp\bt" as the value of the "file" parameter.

    Are there any improvements I can make to the transformation to fix this issue? What is the main strategy for handling OutOfMemory issues? I tried playing with the GC parameters (I even tried several GC algorithms) but had no luck. -Xmx is set to 1024 MB, and I don't think increasing it is an option, because tomorrow I may have 500k records.

    Any help would be appreciated. Thanks.



    Here is a log of the Carte.bat execution:


    C:\Users\iyotov\Desktop\pdi-ce-4.3.0-stable\data-integration>Carte.bat 127.0.0.1 8081
    DEBUG: Using JAVA_HOME
    DEBUG: _PENTAHO_JAVA_HOME=D:\Apps\Java\jdk1.6.0_21
    DEBUG: _PENTAHO_JAVA=D:\Apps\Java\jdk1.6.0_21\bin\java.exe
    "-Xmx1024m" "-Dorg.mortbay.util.URI.charset=UTF-8" -Xms512m -XX:PermSize=128m -X
    X:MaxPermSize=256m "-Djava.library.path=C:\Users\iyotov\Desktop\pdi-ce-4.3.0-sta
    ble\data-integration\libswt\win64" "-DKETTLE_HOME=" "-DKETTLE_REPOSITORY=" "-DKE
    TTLE_USER=" "-DKETTLE_PASSWORD=" "-DKETTLE_PLUGIN_PACKAGES=" "-DKETTLE_LOG_SIZE_
    LIMIT=100" "-DKETTLE_MAX_JOB_ENTRIES_LOGGED=100" "-DKETTLE_MAX_JOB_TRACKER_SIZE=
    100" "-DKETTLE_MAX_LOGGING_REGISTRY_SIZE=100" "-DKETTLE_MAX_LOG_SIZE_IN_LINES=10
    0" "-DKETTLE_CARTE_OBJECT_TIMEOUT_MINUTES=3" "-DKETTLE_MAX_LOG_TIMEOUT_IN_MINUTE
    S=3"
    INFO 14-09 13:31:46,176 - Carte - Installing timer to purge stale objects after
    3 minutes.
    INFO 14-09 13:31:46,237 - Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mort
    bay.log) via org.mortbay.log.Slf4jLog
    INFO 14-09 13:31:46,324 - Carte - Created listener for webserver @ address : 12
    7.0.0.1:8081
    INFO 14-09 13:31:46,324 - jetty-6.1.21
    INFO 14-09 13:31:46,412 - Started SocketConnector@127.0.0.1:8081
    INFO 14-09 13:31:48,776 - blob_transformation - Dispatching started for transfo
    rmation [blob_transformation]
    INFO 14-09 13:31:49,919 - generate value - Optimization level set to 0.
    INFO 14-09 13:31:50,019 - Add temp fields - Optimization level set to 9.
    INFO 14-09 13:31:50,052 - write blob to file - Optimization level set to 9.
    ERROR 14-09 13:33:12,932 - Table input - UnexpectedError:
    ERROR 14-09 13:33:17,722 - Table input - java.lang.OutOfMemoryError: Java heap s
    pace
    at net.sourceforge.jtds.util.BlobBuffer.getBytes(BlobBuffer.java:959)
    at net.sourceforge.jtds.jdbc.BlobImpl.getBytes(BlobImpl.java:77)
    at org.pentaho.di.core.database.BaseDatabaseMeta.getValueFromResultSet(B
    aseDatabaseMeta.java:1877)
    at org.pentaho.di.core.database.DatabaseMeta.getValueFromResultSet(Datab
    aseMeta.java:2613)
    at org.pentaho.di.core.database.Database.getRow(Database.java:2816)
    at org.pentaho.di.core.database.Database.getRow(Database.java:2796)
    at org.pentaho.di.trans.steps.tableinput.TableInput.processRow(TableInpu
    t.java:153)
    at org.pentaho.di.trans.step.RunThread.run(RunThread.java:50)
    at java.lang.Thread.run(Thread.java:619)

    INFO 14-09 13:33:23,238 - Table input - Finished reading query, closing connect
    ion.
    INFO 14-09 13:33:49,961 - Excel Output Remove - Finished processing (I=0, O=0,
    R=19515, W=19515, U=0, E=0)
    INFO 14-09 13:33:50,755 - Remove fields - Finished processing (I=0, O=0, R=2951
    7, W=59034, U=0, E=0)
    INFO 14-09 13:33:50,755 - Regex filter - Finished processing (I=0, O=0, R=39518
    , W=39518, U=0, E=0)
    INFO 14-09 13:33:50,755 - Block until finished - Finished processing (I=0, O=0,
    R=19515, W=0, U=0, E=0)
    INFO 14-09 13:33:53,766 - Switch / Case - Finished processing (I=0, O=0, R=1951
    5, W=19515, U=0, E=0)
    Attached Files

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Your JS code in step "write blob to file" seems to be a memory hog.
    The better part of that code should be converted to standard Kettle steps.
    Then try to cut back the remaining lines, shamelessly borrowing from here:

    http://type-exit.org/adventures-with-open-source-bi/2010/06/quick-kettle-hint-writing-blobs-to-files/
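
    For reference, the core of that blog post boils down to a few lines of Rhino JavaScript in a Modified Java Script Value step. Here is a minimal sketch (untested), assuming the PICTURE and FN_PICTURE fields from the first post and the "file" parameter pointing at the output directory; field access details differ between compatibility modes, so treat it as a starting point rather than verified code:

        // resolve the target directory from the "file" parameter
        var dir = getVariable("file", "") + "/file_PICTURE";
        // write the binary PICTURE field to a file named after FN_PICTURE
        var out = new java.io.FileOutputStream(dir + "/" + FN_PICTURE);
        out.write(PICTURE);
        out.close();

    Everything beyond that (building file names, creating directories, filtering rows) is better done with regular steps before the script.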
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Oct 2011
    Posts
    14

    Default

    Hi marabu,
    Thanks for the help, I will try to rewrite the "write blob to file" step as it is done in Slawo's blog.
    What do you mean by "The better part of that code should be converted to standard Kettle steps"?

    Thanks

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    That means roughly 20 percent of your lines of code actually deal with writing the blob file. The remaining lines seem to do string manipulation and file management - operations that are well supported by dedicated PDI building blocks.

    From a maintenance point of view, I always try to stick to the visual programming paradigm as closely as possible without bleeding.

    Since I felt you would have to rebuild your code from scratch, I thought you might be open to rethinking the overall design of that code.
    Every line of code not written is a line without errors, after all.
    So long, and thanks for all the fish.

  5. #5
    Join Date
    Oct 2011
    Posts
    14

    Default

    Hi marabu,
    I changed the code in the "write blob to file" step to match Slawo's blog and added the "PICTURE" field to the ones removed in the "Remove fields 2" step. Now I'm getting java.lang.OutOfMemoryError: GC overhead limit exceeded.

    The modified transformation is attached. Any other ideas?


    By the way, I'm running Carte with the following parameters: "-Xmx1024m" "-Dorg.mortbay.util.URI.charset=UTF-8" -Xms512m -XX:PermSize=128m -XX:MaxPermSize=256m "-Djava.library.path=C:\Users\iyotov\Desktop\pdi-ce-4.3.0-stable\data-integration\libswt\win64" "-DKETTLE_HOME=" "-DKETTLE_REPOSITORY=" "-DKETTLE_USER=" "-DKETTLE_PASSWORD=" "-DKETTLE_PLUGIN_PACKAGES=" "-DKETTLE_LOG_SIZE_LIMIT=100" "-DKETTLE_MAX_JOB_ENTRIES_LOGGED=100" "-DKETTLE_MAX_JOB_TRACKER_SIZE=100" "-DKETTLE_MAX_LOGGING_REGISTRY_SIZE=100" "-DKETTLE_MAX_LOG_SIZE_IN_LINES=100" "-DKETTLE_CARTE_OBJECT_TIMEOUT_MINUTES=3" "-DKETTLE_MAX_LOG_TIMEOUT_IN_MINUTES=3"
    Attached Files

  6. #6
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Some code should be executed only once.
    Separate it from the row processing code.
    Use the context menu of the tabs and the tree nodes on the left side to arrange and classify your code.

    [Attached image: onStart.jpg]

    [Attached image: onTransform.jpg]

    When memory allocations are still running ahead of the garbage collector, setting variables to null at the end of row processing might help.
    You could even try to slow down row processing by introducing a Delay Row step, thus giving the GC more time to reclaim allocated memory.
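
    A rough sketch of how that split might look for the "write blob to file" step (same assumptions as before: PICTURE/FN_PICTURE fields and a "file" parameter, and assuming variables defined in the start script stay visible in the transform script - please verify against your PDI version):

        // start script (runs once, before the first row):
        // resolve the output directory and create it up front
        var baseDir = getVariable("file", "") + "/file_PICTURE";
        new java.io.File(baseDir).mkdirs();

        // transform script (runs once per row):
        // write the blob, then drop the reference so the GC
        // can reclaim the buffer between rows
        var out = new java.io.FileOutputStream(baseDir + "/" + FN_PICTURE);
        out.write(PICTURE);
        out.close();
        out = null;

    Combined with a Delay Row step, the idea is to keep per-row allocations small and give the collector a chance to keep up.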
    So long, and thanks for all the fish.

  7. #7
    Join Date
    Oct 2011
    Posts
    14

    Default

    Hi marabu,
    Adding a delay of 3 ms after the "Table input" step and splitting the "write blob to file" code into onStart and onTransform did the job - the transformation now runs successfully in about 11 minutes without exceptions. (The strange thing is that if I start Carte.bat with the parameters from my previous post, it still throws OutOfMemory, but with the default Carte.bat everything is OK.) The problem now is that if I remove the "PICTURE" field from the SQL and comment out the code in the "write blob to file" step, the transformation (now without any BLOBs but with the added delay step) is really slow - it runs for 27 minutes, whereas without the delay step it finishes in less than a minute.

    Are there any other optimizations/performance tunings that come to your mind?

    Also, I wanted to ask whether the "data movement" setting out of the "Table Input" step is correct - it is currently set to "Copy data" instead of just "Distribute". Which one is correct? The same question applies to the "Remove fields 2" step.

    Thanks again for your help, I really appreciate it.


    Ivo
    Attached Files

  8. #8
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Sorry for butchering your transformation, but I felt very uncomfortable with your version 3.
    You must replace my Datagrid with your Table Input step.
    Since I didn't test it at all, it's your job to make it fly, will you?

    If the minimal transformation runs without memory problems, we have a good starting point.
    I feel so much better with this bottom-up approach.

    BTW: What size are the pictures?

    OFF-TOPIC: Have you ever tried to set Grid size = 16 (Spoon - Tools - Options - Look & Feel)?
    Attached Files
    So long, and thanks for all the fish.
