Hitachi Vantara Pentaho Community Forums

Thread: Handling Network failure in ETL process.

  1. #1
    Join Date
    Jan 2012
    Posts
    12

    Handling Network failure in ETL process.

    Hi,

    I have a situation where I need to read multiple CSV files of about 1 GB each, or a large amount of data from a table, transform it, and load the output into a target table. Running this ETL process takes a very long time. If the process stops midway due to a network failure or some other problem, only half of the data is loaded and half is left. How can I resume the ETL process from the point where it stopped? If I restart it from the beginning by truncating the target table, it is a waste of time to load the same data again.
    Is there any method by which I can resume the ETL process from the point it stopped, so that I do not need to load the same data again? I have read the post "Batch Numbering of file Imports with Postgres 8.x" on the Pentaho Kettle forum, where someone mentions generating batch numbers with file names, but that post is not clear to me. It would be a great help if you could suggest the best possible method for handling this situation, ideally with a document or sample job that I can use to solve the problem.

    Thanks

  2. #2
    Join Date
    Dec 2011
    Posts
    124


    In my experience, this issue arises while reading data from a source database and loading it into a target database (Table Input and Table Output steps); we have not faced it while reading data from CSV files. There are a few other things to check while running ETLs:

    a) your system and server (source and target database) configuration; they should have a minimum of 4 GB of RAM
    b) your network connectivity (at times the network connects and disconnects)
    c) try a commit size of 500 (if you previously used 1000 or more, reduce it to 500 or 600)
    d) if you are using Table Output steps, it is better to load from flat files (text or CSV files)
    e) avoid unwanted indexes on columns, or use combination (composite) indexes
    f) increase the heap size to 1024 MB in all of your Kettle (PDI) application batch files
    example: set PENTAHO_DI_JAVA_OPTIONS="-Xmx1024m" "-XX:MaxPermSize=256m"


    Apply the above suggestions and recheck your ETL operations. If that does not work, please post your system and database configuration details along with samples of your jobs, and we will look into it.

    Thanks
    Last edited by raghavavundavalli; 08-09-2012 at 05:48 AM.

  3. #3
    Join Date
    Jan 2012
    Posts
    12


    Hi,
    Thanks for the reply.
    But my main question is this: my ETL process often stops partway through due to problems such as a DB server failure, and when that happens, can we resume the ETL process from the point where it stopped? Our machines have 8 GB of RAM.
    I basically need a solution to resume my ETL process from the point it stopped. For example, if there are 10,000 records in total and 5,000 have already been inserted when the process fails, I want the ETL to resume reading from record 5,001 in the source; I do not want to begin the ETL from the start, reading from the first record.

    Thanks

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729


    Usually folks implement a CDC-like scenario where they look into the target table to see where they left off last time, typically using an incremental unique ID to know this.
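
    Below is a minimal sketch of that idea in plain JDBC rather than Kettle itself, assuming the rows carry an incrementing numeric key; the names source_table, target_table, id, and payload are placeholders, not anything from the posts above. The job first asks the target for the highest id already loaded, then reads only newer rows from the source, committing in small batches (in the spirit of the commit-size advice in reply #2) so another interruption loses at most one batch of work.

    import java.sql.*;

    public class ResumableLoad {
        public static void main(String[] args) throws Exception {
            try (Connection src = DriverManager.getConnection("jdbc:postgresql://source-host/db", "user", "pass");
                 Connection tgt = DriverManager.getConnection("jdbc:postgresql://target-host/db", "user", "pass")) {

                // 1. Find where the last run left off (0 if the target table is empty).
                long lastId = 0;
                try (Statement st = tgt.createStatement();
                     ResultSet rs = st.executeQuery("SELECT COALESCE(MAX(id), 0) FROM target_table")) {
                    if (rs.next()) lastId = rs.getLong(1);
                }

                // 2. Read only rows that were never loaded, in key order, so that
                //    MAX(id) in the target is always a valid restart point.
                tgt.setAutoCommit(false);
                try (PreparedStatement read = src.prepareStatement(
                         "SELECT id, payload FROM source_table WHERE id > ? ORDER BY id");
                     PreparedStatement write = tgt.prepareStatement(
                         "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {
                    read.setLong(1, lastId);
                    int batch = 0;
                    try (ResultSet rs = read.executeQuery()) {
                        while (rs.next()) {
                            write.setLong(1, rs.getLong(1));
                            write.setString(2, rs.getString(2));
                            write.addBatch();
                            // Commit every 500 rows so an interrupted run loses little work.
                            if (++batch % 500 == 0) { write.executeBatch(); tgt.commit(); }
                        }
                    }
                    write.executeBatch();
                    tgt.commit();
                }
            }
        }
    }

    In Kettle itself the same pattern is usually built with two Table Input steps: the first selects MAX(id) from the target, and the second runs the main source query with a ? placeholder in its WHERE clause, filled in via the "Insert data from step" option.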
