Hitachi Vantara Pentaho Community Forums

Thread: How to loop over a transformation until a table is empty?

  1. #1

    Question How to loop over a transformation until a table is empty?

    I have a simple ETL that's supposed to move data from one table to another.

    Records are marked with a special status on the source table after they land on the destination (the status is actually on a joined table, but the concept is the same).

    I'm running into a problem: an Out Of Memory (Heap) error when there are too many unprocessed records in the source table.

    So I added a LIMIT of 10,000 records to the query to process the data one chunk at a time.

    Now, how can I set up the transformation to loop automatically until the source table has no unprocessed records left?

    Thanks

  2. #2
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    How many steps?
    How much memory is allocated to PDI (-Xmx from the Kitchen or Spoon file that you are using to launch the transform)?
    Can you post your transforms?
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  3. #3
    Join Date
    Nov 2008
    Posts
    777

    Default

    Quote Originally Posted by archi
    I have a simple ETL that's supposed to move data from one table to another.
    How simple? Is it as simple as Table Input -> Table Output -> Update? Can you attach it to a post?

    Quote Originally Posted by archi
    ...an Out Of Memory (Heap) error when there are too many unprocessed records in the source table.
    What makes you think that too many unprocessed records in the source table is causing the memory issue?

    Quote Originally Posted by archi
    Records are marked with a special status on the source table after they land on the destination (the status is actually on a joined table, but the concept is the same).
    It's probably not the same but can you explain your "the status is actually on a joined table" side note a little further?
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  4. #4

    Default reply

    Quote Originally Posted by darrell.nelson
    How simple? Is it as simple as Table Input -> Table Output -> Update?
    There are also some Look-ups, Select Values, and Modified Javascript steps.

    Can you attach it to a post?
    Yes, maybe a little later.

    What makes you think that too many unprocessed records in the source table is causing the memory issue?
    Very simple: I just add a "limit 10000" to the query that reads unprocessed records and the problem goes away. If I remove the limit and try to process all 1,000,000 records, the transformation runs out of memory.

    It's probably not the same but can you explain your "the status is actually on a joined table" side note a little further?
    The concept is that the records can be in one of two states: processed or unprocessed. But instead of having an actual 'process status' column on the table, I just process records sequentially. This was originally a subquery to another table, but for testing purposes I'm now just passing in a parameter:
    select * from QUEUED_RECORDS where RECORD_ID > ${last_rec_parameter}
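
    With the limit added, it looks roughly like this (the ORDER BY is added here for illustration - paging by RECORD_ID needs a deterministic order):

        select *
        from QUEUED_RECORDS
        where RECORD_ID > ${last_rec_parameter}
        order by RECORD_ID   -- illustrative: keeps each chunk deterministic
        limit 10000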

  5. #5
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Repeating initial questions...

    Quote Originally Posted by gutlez
    How many steps?
    How much memory is allocated to PDI (-Xmx from the Kitchen or Spoon file that you are using to launch the transform)?
    Can you post your transforms?
    Additional questions:
    What is the "No of rows in rowset" value of the transformations?
    What is the DB type of the source DB?

  6. #6
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Quote Originally Posted by archi
    I have a simple ETL that's supposed to move data from one table to another.
    So your transformation consists of a Table Input and a Table Output step?
    So long, and thanks for all the fish.

  7. #7

    Default

    Quote Originally Posted by gutlez
    Repeating initial questions...
    Ooops, sorry about that, here we go:

    How many steps?
    15 steps right now, down from 20, as I've been disabling some steps to troubleshoot

    How much memory is allocated to PDI (-Xmx from the Kitchen or Spoon file that you are using to launch the transform)?
    I was running with 1GB in PDI 4.3. Right now I'm on 5.0 and it's still on the default, because I forgot to set it to 1GB.

    Can you post your transforms?
    See the screenshot below - I hope this helps. I'm not sure if I can post the source; I'll have to check whether there's any sensitive info.

    Additional questions:
    What is the "No of rows in rowset" value of the transformations?
    It was set to the default of 10,000; I just changed it to 100 and it made no difference.

    What is the DB type of the source DB?
    MySQL 5.6. I develop on 64-bit Windows 7 and deploy to 64-bit Ubuntu 12.04 LTS. The problem is reproducible on both.

    Here's the transformation
    [Attachment: pdiOOM.jpg - screenshot of the transformation]

  8. #8
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Stream Lookup can be a real memory hog.
    You'll be better off with Database Lookup.
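
    Conceptually, a Database Lookup fires one small query per incoming row instead of holding the whole lookup set in RAM - roughly this, with illustrative names:

        -- what a Database Lookup step runs for each row (illustrative names)
        select code_value
        from code_table
        where code_key = ?   -- bound from the incoming stream field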
    [Attached image]

  9. #9

    Default

    Quote Originally Posted by marabu
    Stream Lookup can be a real memory hog.
    You'll be better off with Database Lookup.
    Thanks, you were right on: the Stream Lookups were the problem.

    Database Lookup is too awkward to use because I need an additional key (all the code-value definitions are in a single metadata table).
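
    All the definitions live in one table, so each lookup needs two keys - something like this (hypothetical names):

        -- hypothetical layout: one table for every code-value definition
        select code_value
        from metadata_codes
        where code_type = ?   -- the additional key
          and code_key = ?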

    So now I'm thinking of dumping all the values into a JSON-encoded file and using a Modified Java Script step to do the lookups.

  10. #10
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    You can supply the same key to Database Lookup as you supplied to Stream Lookup.
    With a JavaScript step doing the lookup you might be trading time for memory - and not necessarily even that, since the JSON file would have to be loaded into memory anyway.

  11. #11
    Join Date
    Nov 2008
    Posts
    777

    Default

    Quote Originally Posted by archi
    Database Lookup is too awkward to use because I need an additional key (all the code-value definitions are in a single metadata table).
    Can't you simply read the code values from the single metadata table as your first Database Lookup and then use them in subsequent lookups? Manually parsing a giant JSON file with JavaScript seems like a monster to me.
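
    For instance, each Database Lookup in the chain could pin the code type to a constant and look up by key (hypothetical names again):

        -- lookup #1: status codes
        select code_value from metadata_codes
        where code_type = 'STATUS' and code_key = ?

        -- lookup #2: region codes
        select code_value from metadata_codes
        where code_type = 'REGION' and code_key = ?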
