Hitachi Vantara Pentaho Community Forums

Thread: performance and scalability of Kettle

  1. #1

    Hi:

    We are planning to evaluate Kettle for one of the projects in our company. Most importantly, we would like to know how Kettle performs in a real-time data warehouse scenario. By real time, I mean the ETL will be kicked off once every hour, and the average data volume will be around 1 million records per run (give or take 50,000 records).
    I know that data loading performance also depends on the complexity of the workflow and transformations involved, so I'm not looking for a concrete answer here. I would like to hear from experienced folks about the performance and scalability of Kettle. Have any studies been done in this respect?

    Thanks,
    Tapasvi

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    If you sustain more than 300 rows per second, you will finish in time: 1 million rows per hour works out to roughly 278 rows per second. It depends on what you're doing. If really required, you can optimize transformations in a couple of ways, either by starting multiple copies of some steps or by using distribution.
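
The 300 rows per second figure can be checked with a quick back-of-the-envelope calculation. The sketch below is only illustrative: the function name `required_rows_per_second` is made up for this example, and the volumes are the ones quoted in the original question (1 million rows per hourly run, plus up to 50,000 more).

```python
SECONDS_PER_HOUR = 3600


def required_rows_per_second(rows, window_seconds=SECONDS_PER_HOUR):
    """Minimum sustained throughput needed to load `rows` within the window."""
    return rows / window_seconds


# Volumes taken from the question: 1M rows on average, 1.05M at the high end.
average = required_rows_per_second(1_000_000)
worst_case = required_rows_per_second(1_050_000)

print(f"average:    {average:.1f} rows/s")     # average:    277.8 rows/s
print(f"worst case: {worst_case:.1f} rows/s")  # worst case: 291.7 rows/s
```

Both figures stay under 300 rows per second, so a transformation that sustains that rate would finish inside the hourly window, assuming the whole hour is available for the load.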

    Regards,
    Sven

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Personally, I would launch it every 5 or 15 minutes.
    IMHO, the trick in these situations (regardless of the chosen ETL tool) is to keep the load on both the source systems and the target warehouse stable and predictable throughout the day.

    Put differently, you would rather process 200,000 records every 12 minutes instead of 1M every hour.
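
To make the trade-off concrete, here is a small sketch of how the batch size shrinks as the run interval gets shorter. The helper name `rows_per_run` is hypothetical, and it assumes records arrive evenly over the hour, which real source systems may not do.

```python
def rows_per_run(rows_per_hour, interval_minutes):
    """Rows each run has to process if the hourly volume arrives evenly."""
    return rows_per_hour * interval_minutes // 60


# 1M rows per hour, as in the original question, at several run intervals.
batch_sizes = {
    minutes: rows_per_run(1_000_000, minutes) for minutes in (60, 15, 12, 5)
}

for minutes, rows in batch_sizes.items():
    print(f"every {minutes:>2} min: {rows:>9,} rows per run")
```

The total work per hour is the same in every case; only the size of each burst hitting the source systems and the warehouse changes.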

    Matt

  4. #4

    Thanks Matt.

    I'm investigating:

    1. Partitioning data with Kettle, and parallelism (this also requires me to understand RAID, but that's a different story)
    2. Using Kettle inside a web container

    Could you please suggest a starting point (a good document) for these two? I'll take care of the rest.

    Thanks.
    Tapasvi

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    RAID is something completely different, but I talked about (database) partitioning and clustering at the MySQL conference in Santa Clara in April.

    If you skip the first couple of slides, you'll hit the partitioning explanations.
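
For readers without the slides at hand, the general idea of partitioning rows by key can be sketched as follows. This is a generic illustration, not Kettle's implementation: the function `partition_rows`, the `customer_id` field, and the choice of a simple modulo rule on integer keys are all assumptions made for the example.

```python
from collections import defaultdict


def partition_rows(rows, key, n_partitions):
    """Group rows by (key value modulo n_partitions); integer keys assumed."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key] % n_partitions].append(row)
    return dict(partitions)


# Hypothetical sample data: ten rows spread over three partitions.
rows = [{"customer_id": i, "amount": i * 10} for i in range(10)]
parts = partition_rows(rows, "customer_id", 3)

for partition_id in sorted(parts):
    ids = [row["customer_id"] for row in parts[partition_id]]
    print(f"partition {partition_id}: customer_ids {ids}")
```

Because every row with the same key lands in the same partition, each partition can be handled by its own step copy or database partition without the others needing to see its rows.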

    As for the second part, that's a bit outside the scope of this forum, but I know there is plenty of material in the Wiki that explains how to launch Kettle jobs and transformations on the Pentaho platform.
    This one is perhaps a good starting point:

    http://wiki.pentaho.org/display/COM/...ntaho+Solution

    All the best,
    Matt

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.