Hitachi Vantara Pentaho Community Forums

Thread: is it possible to iterate the JOB using PDI

  1. #1
    Join Date
    Mar 2015
    Posts
    190

    Default is it possible to iterate the JOB using PDI

    Hi ,

    I have 80 million records that I am trying to move from an OLTP database into a warehouse. Generally I load the data with a Table input step feeding an Insert/Update step, but that step is very slow, so I switched to the Synchronize after merge step, which is somewhat better than Insert/Update.

    A few times I have run into Java heap space issues. To avoid this I am trying to process the data in chunks, and for the iteration I am currently using a JavaScript step. Is it possible to iterate the job without any JavaScript? Please help me.

    first iteration : select * from abc limit 0,1000
    next iteration : select * from abc limit 1000,1000
    next iteration : select * from abc limit 2000,1000

    Java 1.7
    MySQL
    Pdi-ce-5.3

    Thank you

  2. #2
    Join Date
    Oct 2013
    Posts
    216

    Default

    You can use the MySQL Bulk Loader step, which is very fast for transferring data, or you can use Table Output with error handling feeding an Update step.
    -- NITIN --

  3. #3
    Join Date
    Mar 2015
    Posts
    190

    Default

    Thanks for your input.

    Sorry for not mentioning my OS details earlier; I am using Windows. The MySQL Bulk Loader step you suggested is not available on MS Windows. Please see the comment in the Pentaho wiki.

    Also, you will usually get a Java heap space issue whenever you read a huge amount of data (in my case 80 million records) using Table input -> Table output with error handling into an Update step.

    I increased the Java heap space (PENTAHO_DI_JAVA_OPTIONS="-Xmx4096m -XX:MaxPermSize=2048m") in spoon.bat, and it is okay to increase it this much because my machine has 15 GB of RAM.


    Last edited by ranala; 11-04-2015 at 03:06 AM.

  4. #4
    Join Date
    Oct 2013
    Posts
    216

    Default

    You can also use Carte.bat to run transformations or jobs, as it is much faster than running them in Spoon.
    -- NITIN --

  5. #5
    Join Date
    Jul 2009
    Posts
    476

    Default

    Here's the method that works for me. Right-click the Start step in your job and check the Repeat checkbox. In the Type dropdown list, choose Interval, and specify the rest interval between job runs in the "Interval in seconds" or "Interval in minutes" boxes.

  6. #6
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Another option, which is more complex to set up, is to create a job and feed it multiple rows of data.

    Each row should have two values (0,1000; 1000,2000; ...), and in the job entry options select "Execute for every input row" (or something similar).
    The two values should then substitute two parameters in the SQL query of your Table input step.
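
    For example (just a sketch -- CHUNK_OFFSET and CHUNK_SIZE are made-up parameter names here, and note that MySQL's LIMIT takes a starting offset followed by a row count), the query in the Table input step could look like:

    SELECT *
    FROM abc
    LIMIT ${CHUNK_OFFSET}, ${CHUNK_SIZE}

    With "Replace variables in script?" ticked in the Table input step and the two incoming values mapped to those parameters (e.g. via the job entry's Parameters tab), the sub-job runs once per row, i.e. once per chunk.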
    -- Mick --

  7. #7
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by ranala View Post
    you will usually get a Java heap space issue when reading a huge amount of data (in my case 80 million records)
    32-bit Java on Windows has a limit on the maximum heap space in the area of 1500m.
    You should be able to process the number of records you are talking about in a single transformation -- you might have to decrease your rowset size if you have really wide rows (if each row is 2 GB, then there's not a lot you can do!).

    I'm thinking that you're not doing a simple Table Input -> Table Output, as that would take really wide rows to cause a heap error.

    Matt Casters did a demo a long time ago, where the dataset processed in a single transform was in the billions of rows... so 80M rows should be possible too.

  8. #8
    Join Date
    Mar 2015
    Posts
    190

    Default

    Quote Originally Posted by gutlez View Post
    32-bit Java on Windows has a limit on the maximum heap space in the area of 1500m.
    I am on the 64-bit version of Java 1.7. Yes, you are correct: when I increase the heap space in spoon.bat up to 4096m, it works fine (on Windows).

    Quote Originally Posted by gutlez View Post
    I'm thinking that you're not doing a simple Table Input -> Table Output, as that would take really wide rows to cause a heap error.
    Yes, as per a traditional warehouse design, I am trying to load the data as below:

    OLTP -> Staging -> Datamart

    OLTP
    Table input -> Text file output, then Text file input -> Table output

    Staging
    Generating a surrogate key for each id from the OLTP database, handling incremental records, and cleansing operations.

    Datamart
    Staging to Datamart
    (case 1: Table input with the Synchronize after merge step -- I hope this is a good approach)
    (case 2: Table input -> Insert/Update step, but this is very slow)
    (case 3: Table input -> Table output with error handling -> Update step for incremental data)


    But even following this approach we are facing Java heap space issues, because in staging we are trying to retrieve all 80 million records at once. That is why we plan to use chunk-wise iteration (i.e. rows 1-1000, 1001-2000, 2001-3000, and so on).
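
    For the chunking itself, here is a quick sketch of a helper query (against my abc example table, assuming a fixed chunk size of 1000) to work out how many iterations the loop needs:

    SELECT CEIL(COUNT(*) / 1000) AS chunk_count
    FROM abc

    The offsets 0, 1000, 2000, ... for each iteration could then be generated in the parent job and passed into the chunk query as parameters, along the lines Mick described above.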
