Hitachi Vantara Pentaho Community Forums

Thread: Speed of transformation

  1. #1
    Join Date
    Jun 2007
    Posts
    128

    Default Speed of transformation

    Hi,
    Is there any way to increase the speed of a transformation?
    I am trying to load 100 million records with 16 fields into a MySQL database.
    I have my data in multiple files.
    I used the Text File Input and Table Output steps, but it is very slow.
    Bulk load into MySQL is not working on my system.
    So I tried combining the contents of all the files into a single text file and then using MySQL's bulk load feature (LOAD DATA INFILE).
    Now I would like to know: is there any other way to speed up my transformation?
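
    For reference, a minimal sketch of a LOAD DATA INFILE statement of this kind, assuming a tab-delimited file; the file path and table name are placeholders, not taken from this thread:

        -- Placeholder path, table name, and delimiters; adjust them to the real data layout.
        LOAD DATA INFILE '/tmp/all_records.txt'
        INTO TABLE target_table
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n';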

    Thanks
    Sreelatha

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    In one transformation you can start up multiple copies of a step in parallel, e.g. 4 or 5 Table Output steps, so that you max out the I/O (your data must be able to cope with this).

    You can use clustering and partitioning, ...

    If bulk load is not working via Kettle I would first create a tracker, and then, if it's not fixed, do the loading via a script or so... for 100 million rows you're better off with a bulk loader in most cases ;-)
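
    If the loading ends up being done from a plain SQL script, here is a hedged sketch of a wrapper that is often used around such a bulk load; the SET statements are standard MySQL session variables, and the file and table names are placeholders, not from this thread:

        -- Placeholder file and table names; the SET statements are standard MySQL session variables.
        SET autocommit = 0;
        SET unique_checks = 0;        -- skip uniqueness checks on secondary indexes during the load
        SET foreign_key_checks = 0;   -- skip foreign key checks during the load

        LOAD DATA INFILE '/tmp/all_records.txt' INTO TABLE target_table
          FIELDS TERMINATED BY '\t';

        COMMIT;
        SET unique_checks = 1;
        SET foreign_key_checks = 1;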

    Regards,
    Sven
    Last edited by sboden; 09-21-2007 at 02:19 AM.

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Some MySQL tuning will come in handy too.
    For example, remove all indexes besides the PK and rebuild them after loading.
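
    A minimal sketch of that kind of index handling, assuming a hypothetical customers table with two secondary indexes (the table, index, and column names are made up for illustration):

        -- Hypothetical table, index, and column names, for illustration only.
        -- Before the load: drop the secondary indexes, keep the primary key.
        ALTER TABLE customers DROP INDEX idx_last_name;
        ALTER TABLE customers DROP INDEX idx_email;

        -- ... run the transformation / bulk load ...

        -- After the load: rebuild each secondary index in a single pass.
        ALTER TABLE customers ADD INDEX idx_last_name (last_name);
        ALTER TABLE customers ADD INDEX idx_email (email);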

    My tests on version 3.0 show that Kettle is nearly always I/O bound on the MySQL side.
    That's a nice step up from version 2.5, but it shows that we're reaching limits on the MySQL database.
    On my laptop that's around 15,000 rows/sec for customer data.
    At 10k rows/s you are doing 36M rows/hour and that is not too shabby using standard Table Output and CSV Input steps.


    Matt

  4. #4

    Default

    Hello,
    Is there an article / post that shows the usual/average speed at which PDI reads data from a database?
    My average read is 400 rows/s:

    Here is a test I did:
    I have a SQL Server database with a customers table (200,000 rows). All I did was use a Table Input step from that database into
    a MySQL database. The read speed is 4000 rows/s, which is very slow for just reading one table (4 fields).

    What am I doing wrong?
    Or is this the usual speed?
    Thank you

    itamar

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I get around 50k to 150k rows/s from my local MySQL instance.

  6. #6

    Default

    I'm averaging around 2500-5000 rows/s from a networked MySQL 4 database instance; by "on average" I mean across multiple database schemas and datatypes (usually at least 10 columns and a mix of numeric, string, and date datatypes, no joins).

    This is a single input step: the regular Table Input step with standard JDBC settings (and about the same for JNDI with connection pooling and a direct JDBC connection).

    It wouldn't be bad to share people's 'real-world' numbers for some of these common steps, particularly for certain setups (networked vs. running Kettle on the same box as the database).

    In particular, it would be worth noting whether certain things (like increasing the number of parallel input steps) can make a big difference.
