Hitachi Vantara Pentaho Community Forums

Thread: Jobs and large result row sets

  1. #1


    Hey all,
    Simplifying the problem I'm running into:

    I have a transformation that reads a file (100,000 rows of 30 fields each) and then uses Copy Rows to Result to pass the rows up to the parent job.

    The next step in the parent job does NOT clear the list of result rows, since I want to read those results and, in this example case, write them out to a text file.

    The system crashes with heap/memory issues when going from the first transformation to the second. The first transformation completes fine; the problem appears to occur in the --transition-- up to the job level and then into the next transformation.

    This is an EXAMPLE of the problem that I was able to create - my real-world case is a bit more complicated (a level 1 validation transformation feeding a level 2 validation, a modularized approach) - but the core issue is the handoff of a large result set from one transformation, up to the job level, into the next transformation.

    Any pointers on how to resolve this problem?

    *Changing 'Nr of rows in rowset' in the transformation settings has no impact (I didn't think it would, but it has already been tested in case someone recommends it).
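
    For scale, here's a rough back-of-envelope of what buffering those rows at the job level costs on the heap (the per-object byte counts below are assumptions for illustration, not measured Kettle internals):

        // Rough heap estimate for holding 100,000 result rows of 30 fields
        // in memory at once; the byte counts are assumptions, not measurements.
        public class ResultRowEstimate {
            public static void main(String[] args) {
                long rows = 100000L;
                long fieldsPerRow = 30L;
                long bytesPerField = 64L; // assumed: boxed value + object header + padding
                long rowOverhead = 48L;   // assumed: row object + backing array
                long totalBytes = rows * (fieldsPerRow * bytesPerField + rowOverhead);
                System.out.println("~" + (totalBytes / (1024L * 1024L)) + " MB just for the rows");
            }
        }

    Even with these conservative numbers that lands near 200 MB, which is easily past the -Xmx ceiling if the JVM is running with a typical couple-hundred-MB default.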


    edit: Kettle 2.5.0

    thanks,
    -D
    Attached Files
    Last edited by dhartford; 09-17-2007 at 08:47 AM.

  2. #2


    Write the rows to a file (serialized, e.g.); there's a special step for it. Then use that file as input in the other transformation.
    Or go the more traditional way: land the data to files and read it back in from files.
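
    In plain Java, the land-to-file pattern looks roughly like this - a sketch of the idea only, not Kettle's internal row format from the Serialize to file step:

        import java.io.*;

        // Land-to-file pattern: the first pass streams rows to disk, the
        // second pass streams them back, so the full row set never has to
        // sit on the heap at once. Illustrative only.
        public class LandToFile {
            static void writeRows(File f, Iterable<String[]> rows) throws IOException {
                DataOutputStream out = new DataOutputStream(
                        new BufferedOutputStream(new FileOutputStream(f)));
                try {
                    for (String[] row : rows) {
                        out.writeInt(row.length);
                        for (String field : row) out.writeUTF(field);
                    }
                    out.writeInt(-1); // end-of-data marker
                } finally {
                    out.close();
                }
            }

            static void readRows(File f) throws IOException {
                DataInputStream in = new DataInputStream(
                        new BufferedInputStream(new FileInputStream(f)));
                try {
                    for (int n = in.readInt(); n != -1; n = in.readInt()) {
                        String[] row = new String[n];
                        for (int i = 0; i < n; i++) row[i] = in.readUTF();
                        // hand 'row' to the next step here
                    }
                } finally {
                    in.close();
                }
            }
        }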

    Regards,
    Sven

  3. #3


    Hmm... is there some caveat with the De-serialize step when running in a job?

    log
    =====================
    2007/09/17 09:46:19 - Text file input.0 - Signaling 'output done' to 1 output rowsets.
    2007/09/17 09:46:19 - Text file input.0 - Finished processing (I=492568, O=0, R=0, W=492568, U=0, E=0)
    2007/09/17 09:46:19 - Serialize to file.0 - Signaling 'output done' to 0 output rowsets.
    2007/09/17 09:46:19 - Serialize to file.0 - Finished processing (I=0, O=492568, R=492568, W=0, U=0, E=0)
    2007/09/17 09:46:19 - largefile - Starting entry [outputlargefile]
    2007/09/17 09:46:19 - Thread[largefile (largefile (Thread-18)),5,main] - exec(2, 0, outputlargefile.0)
    2007/09/17 09:46:19 - outputlargefile - Opening filename : [C:\projects\kettle\outputlargefile.ktr]
    2007/09/17 09:46:19 - outputlargefile - Opening transformation: [C:\projects\kettle\outputlargefile.ktr]
    2007/09/17 09:46:19 - outputlargefile - Loading transformation from XML file [C:\projects\kettle\outputlargefile.ktr]
    2007/09/17 09:46:19 - SharedObjects - Reading the shared objects file [file:///C:/Documents and Settings/dhartford/.kettle/shared.xml]
    2007/09/17 09:46:19 - be.ibridge.kettle.trans.TransMeta - We have 1 connections...
    2007/09/17 09:46:19 - be.ibridge.kettle.trans.TransMeta - Looking at connection #0
    2007/09/17 09:46:19 - be.ibridge.kettle.trans.TransMeta - Reading 2 steps...
    2007/09/17 09:46:19 - be.ibridge.kettle.trans.TransMeta - Looking at step #0
    2007/09/17 09:46:19 - StepMeta() - looking for the right step node (Text file output)
    2007/09/17 09:46:19 - StepMeta() - specifics loaded for Text file output
    2007/09/17 09:46:19 - StepMeta() - end of readXML()
    2007/09/17 09:46:19 - be.ibridge.kettle.trans.TransMeta - Looking at step #1
    2007/09/17 09:46:19 - StepMeta() - looking for the right step node (De-serialize from file)
    2007/09/17 09:46:19 - StepMeta() - specifics loaded for De-serialize from file
    2007/09/17 09:46:19 - StepMeta() - end of readXML()
    2007/09/17 09:46:19 - be.ibridge.kettle.trans.TransMeta - We have 1 hops...
    2007/09/17 09:46:19 - be.ibridge.kettle.trans.TransMeta - Looking at hop #0
    2007/09/17 09:46:19 - outputlargefile - nr of steps read : 2
    2007/09/17 09:46:19 - outputlargefile - nr of hops read : 1
    2007/09/17 09:46:19 - largefile - Finished jobentry [outputlargefile] (result=true)
    2007/09/17 09:46:19 - largefile - Finished jobentry [loadlargefile] (result=true)
    =============


    It looks like the serialized file is being created (in the directory the job was executed from), but the second transformation in the job does not seem to pick it up.

    However, running the 2nd transformation --outside-- the job (with the serialized file still there) DOES work fine.

    edit: confirmed on Kettle 2.5.1 as well

    edit: added attachments with the change to serialize/de-serialize
    Attached Files
    Last edited by dhartford; 09-17-2007 at 11:16 AM.

  4. #4


    Screwy garbage collection, probably. I'll file a tracker later today.

    Regards,
    Sven

  5. #5


    Well, once it is working, is there some way to pass a unique ID to the two different transformations running in the same job - something like the Job ID/Batch ID - so that you can:

    1) give both transformations the same serialize filename,

    2) run the same job concurrently (e.g. on the BI Platform) without the runs stepping on each other, and

    3) use the temp directory (if feasible)?
    Last edited by dhartford; 09-17-2007 at 11:22 AM.

  6. #6


    Currently I would set a variable at the start of the job (using the time, e.g.) and then use variable substitution to get a unique name.

    Regards,
    Sven

  7. #7

  8. #8


    Following your recommendation, I created a variable to uniquely identify the serialized file.

    The attached transformation creates the variable from the JOB_BATCH_ID, but if you do not use logging, or are not running inside a job, it defaults to the time in milliseconds.

    Make sure to use the Delete File step at the end of the job to clean up, and adjust what is placed in the Serialize/De-serialize steps if you have multiple serializations in one job.
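
    The naming logic boils down to something like this (a plain-Java sketch of the same idea; the names here are illustrative - the attached .ktr files are the real mechanism):

        import java.io.File;

        // Unique per-run filename: prefer the job batch id when logging
        // provides one, otherwise fall back to the current time in millis.
        // Illustrative only; mirrors the variable set up in the attached job.
        public class UniqueSerializeName {
            static File serializeFile(String jobBatchId) {
                String unique = (jobBatchId == null || jobBatchId.length() == 0)
                        ? Long.toString(System.currentTimeMillis())
                        : jobBatchId;
                File tmpDir = new File(System.getProperty("java.io.tmpdir"));
                return new File(tmpDir, "rowset_" + unique + ".bin");
            }

            public static void main(String[] args) {
                System.out.println(serializeFile("42")); // batch-id based name
                System.out.println(serializeFile(null)); // time-based fallback
                // The job's Delete File step plays the cleanup role at the end.
            }
        }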

    Enhancement idea: give the De-serialize step a checkbox to 'delete serialized file' upon completion.

    edit: confirmed the 2.5.2-snapshot does not require a separate System.gc() call. However, I still have concerns about calling System.gc() when running on the BI Platform.
    Attached Files
