Hitachi Vantara Pentaho Community Forums

Thread: Time taken to load 1TB worth data

  1. #1
    Join Date
    Jul 2008
    Posts
    12

    Default Time taken to load 1TB worth data

    Hello everyone,
    Any idea how much time it takes to load 1 TB of data into a star schema using Kettle? I understand the short answer to this question is 'it depends on the kind of transformations', but if we categorize the transformation logic as Simple/Medium/Complex, then how much time does it take to load 1 TB of data into 2 stars (each star with 1 fact and 8 dimensions) when the transformation logic is of 'medium' complexity?
    One question we frequently hear from our potential customers is, 'Why Pentaho and why not Informatica?' We all know about the cost factor, rapid application development, clean GUI, ease of learning and use, etc., but are there any pointers on performance?
    Thanks.
    Ramesh.

  2. #2
    Join Date
    Jun 2007
    Posts
    50

    Default Performance

    Hey Ramesh....

    A 1-2 TB volume is not a real challenge for any serious ETL tool. Loading 5 GB per night is not a problem. We have a case (one of the largest banks in Canada) where we load/process 30 GB of flat files in 2 hours. It is a very complex process, much more so than a simple star schema.

    With table loads we can do lots of things in parallel on multiple cores, perform in-memory lookups, etc. Bulk loaders are available too, or are easy to add.
    At Mozilla they are processing/loading data into Vertica at 1.2M rows/sec, going through 500 GB of data in a few hours. 32 cheap nodes is what that takes.
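
    For a rough sense of scale, those anecdotes can be reduced to a single sustained-throughput number. A minimal sketch, assuming 3 hours for "a few hours" in the Mozilla case (that figure is an assumption for illustration, not something stated above):

```python
# Back-of-envelope conversion of the anecdotes above to a sustained MB/s rate.
# ASSUMPTION: "a few hours" for the Mozilla case is taken as 3 hours purely
# for illustration; the post does not give an exact figure.

def sustained_mb_per_s(gigabytes, hours):
    """Average throughput needed to move `gigabytes` in `hours` of wall time."""
    return gigabytes * 1024 / (hours * 3600)

bank = sustained_mb_per_s(30, 2)      # 30 GB of flat files in 2 hours
mozilla = sustained_mb_per_s(500, 3)  # 500 GB in an assumed 3 hours, 32 nodes

print(f"Bank case    : {bank:5.1f} MB/s sustained")      # ~4.3 MB/s
print(f"Mozilla case : {mozilla:5.1f} MB/s sustained")   # ~47 MB/s for the cluster
print(f"  per node   : {mozilla / 32:5.2f} MB/s")        # ~1.5 MB/s per node
```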

    Here is some more for you to review:

    http://pentaho.com/products/data_integration/

    http://www.pentaho.com/informatica_alternative/

    http://www.bayontechnologies.com/bt/...whitepaper.php

    Regards,

    Mike Tarallo
    Pre-Sales Director
    Pentaho

  3. #3
    Join Date
    Jul 2008
    Posts
    12

    Default

    Mike,
    Thanks for the prompt response and the links too. Here is the situation: it is an on-premise implementation and we are looking at a 1 TB initial load and probably 20 GB per week. The client is not interested in a cluster of nodes at this moment. What they need is a guesstimate of the load time for that amount of data on a server with, say, 2 cores.
    Ramesh.

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Hi Ramesh,

    We can only give you an idea as to how PDI can be made to perform.

    I once bulk loaded at 20 MB/s into PostgreSQL on my laptop. Does that mean that my laptop can load 1 TB worth of complex ETL data with lookups, cleansing, calculations, history, etc. in half a day? I don't think so. (I don't even have that much disk in there :-))
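
    For anyone wondering where the "half a day" comes from, it is just the raw bulk-load arithmetic; a quick check, taking the 20 MB/s figure at face value and ignoring all of the real ETL work:

```python
# Raw arithmetic behind the "half a day" remark: 1 TB pushed at a sustained
# 20 MB/s, ignoring lookups, cleansing, history handling and everything else.
total_mb = 1 * 1024 * 1024          # 1 TB expressed in MB
rate_mb_per_s = 20                  # bulk-load rate quoted above
hours = total_mb / rate_mb_per_s / 3600
print(f"Best case: {hours:.1f} hours")   # ~14.6 hours, i.e. roughly half a day
```

    That is a best-case bound for the raw load step alone, which is exactly why it says nothing about the complete ETL.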

    So please understand that nobody can answer a question like yours up front. If we tried, it would only make us look like fools afterwards.

    Please note also that the same goes for ALL the vendors your customer might be considering. From our empirical data so far, it looks like there are many situations where we are faster than tools like OWB and BODI. I have "about the same" data left and right for Informatica, but that's as far as it goes. We *always* recommend that you benchmark yourself and *never* assume Pentaho is slower. :-) That last piece of advice might save you some money here and there.

    Matt

  5. #5
    Join Date
    Jun 2007
    Posts
    476

    Default

    I may also add that we are currently working with a very big telecom, loading CDRs into a DWH. We load around 10 to 20 GB daily, and the LOAD time alone is around 3 hours; the whole ETL process (uncompressing, cleaning duplicates, filtering fields, etc.) takes something like 6 hours. This is on a dev server, a Solaris 10 box with 2 CPUs. We are getting ready to go to production (a Solaris 10 box with 8 CPUs) and we expect that to cut the time at least in half. So, as everyone has said, you can't just tell up front how long it will take.
    Rodrigo Haces
    TAM / Enterprise Architect
    Pentaho
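
    Taking the figures in the post above at face value, a rough extrapolation to the 1 TB initial load and 20 GB weekly feed discussed earlier might look like the sketch below. The 15 GB midpoint and the linear-scaling assumption are assumptions made here, not Rodrigo's numbers, and this is exactly the kind of guesstimate the thread keeps saying should be replaced by a benchmark on the target hardware:

```python
# Rough extrapolation from the observed figures above to the workload in this
# thread (1 TB initial load, ~20 GB per week). ASSUMPTIONS (not Rodrigo's):
# 15 GB as the midpoint of "10 to 20 GB", and throughput that scales linearly
# with volume, which real ETL rarely does.

observed_gb = 15            # midpoint of the 10-20 GB daily feed
full_etl_hours = 6          # whole process: uncompress, dedup, filter, load
load_only_hours = 3         # the LOAD step alone

gb_per_hour_full = observed_gb / full_etl_hours    # ~2.5 GB/h end to end
gb_per_hour_load = observed_gb / load_only_hours   # ~5.0 GB/h load only

initial_load_gb = 1024      # the 1 TB initial load
weekly_gb = 20              # the weekly feed

print(f"1 TB initial load, full ETL : {initial_load_gb / gb_per_hour_full:5.0f} h")
print(f"1 TB initial load, load only: {initial_load_gb / gb_per_hour_load:5.0f} h")
print(f"20 GB weekly feed, full ETL : {weekly_gb / gb_per_hour_full:5.1f} h")
```

    On those purely assumed numbers the initial load alone works out to somewhere between one and three weeks of wall-clock time on comparable 2-CPU hardware, depending on whether you count the full process or only the load step, which is a useful reminder of why nobody upthread will commit to a figure.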

  6. #6
    Join Date
    Jul 2008
    Posts
    12

    Default

    Hey,
    Thanks very much for the response. Those numbers give me a decent idea of load times. I can use them as a base and work from there, and I do understand that the best answer to that question is 'it depends' :-)
    Ramesh.

  7. #7
    DEinspanjer Guest

    Default

    Quote Originally Posted by mtarallo View Post
    At Mozilla they are processing/loading data into Vertica @ 1.2M rows/sec, going through 500GB of data in a few hours. 32 cheap nodes is what that takes.
    Careful with that quote; it isn't quite accurate. The high-volume processing task where I hit 1.2M rows/sec wasn't going into Vertica.

    At Mozilla, we take care to ensure that our Kettle ETL processes are streamlined and that they don't do anything that requires multiple passes through the data. This is important when you are processing more than 20 GB of raw data every hour. The biggest transformation that I run processes more than 10 GB of data in about 10 to 20 minutes. This includes data clean-up, looking up dimension keys, pre-aggregation to reduce the volume of data, and bulk loading into a staging table in Vertica.
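
    The pipeline described above is a Kettle transformation, not hand-written code, but the single-pass shape is easy to sketch. Below is a minimal, generic illustration of that shape in Python (clean, look up dimension keys, pre-aggregate, then hand off to a bulk loader); the field names and CSV input are hypothetical stand-ins, not Mozilla's actual schema:

```python
# Generic single-pass ETL shape: each record is read once and streams through
# clean-up, dimension-key lookup and pre-aggregation before a bulk load.
# Purely illustrative: the real pipeline described above is a Kettle
# transformation, and every field and file name here is hypothetical.
import csv
from collections import defaultdict

def clean(rows):
    """Drop obviously broken records and normalise types in one pass."""
    for row in rows:
        if row.get("url") and str(row.get("bytes", "")).isdigit():
            row["bytes"] = int(row["bytes"])
            yield row

def add_dim_keys(rows, site_dim):
    """Swap natural keys for surrogate keys via an in-memory lookup."""
    for row in rows:
        row["site_key"] = site_dim.get(row["url"], -1)   # -1 = unknown member
        yield row

def pre_aggregate(rows):
    """Collapse detail rows to (site_key, day) totals before loading."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["site_key"], row["day"])] += row["bytes"]
    return totals

def run(path, site_dim):
    with open(path, newline="") as f:
        totals = pre_aggregate(add_dim_keys(clean(csv.DictReader(f)), site_dim))
    # A real pipeline would now hand `totals` to a bulk loader (for example
    # Vertica COPY into a staging table); printing stands in for that here.
    for (site_key, day), total_bytes in sorted(totals.items()):
        print(site_key, day, total_bytes)
```

    The point is that nothing re-reads the input: clean-up, key lookup and aggregation all happen while the data streams past once, which keeps extra passes (and their I/O cost) out of the picture.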

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    It was a sales guy who made the quote, not me.
