Hitachi Vantara Pentaho Community Forums

Thread: Table Output Performance - MySQL

  1. #1
    Join Date
    Apr 2008
    Posts
    1,770

    Question Table Output Performance - MySQL

    Hi.
    I know this has already been discussed a few times in the past, but I have found the performance of Table Output staggeringly poor.

    I had to upload records to a table on a remote server so I tried with PDI.
    Text input file, Table Output using MySQL.
    The loading speed was 40-50 records/second.

    Since it would have taken hours to complete, I gave up and used Toad instead (http://www.toadworld.com/).
    It loaded all my records in less than an hour!

    Since the PC was the same, same file, same internet connection, same destination database (and table), what is PDI doing differently?

    My configuration:
    PDI 4.4
    Win Server 2003
    MySQL driver: mysql-connector-java-5.1.22-bin.jar

    MySQL version on remote database: 5.1.68-cll

    What should I change in my PDI configuration to match the upload speed of TOAD?

    Mick

  2. #2

    Default

    With "Use batch update for inserts" enabled, you can usually reach a throughput of several thousand records per second on basic infrastructure. You can optimize further by adjusting the commit size (for the batch updates) and enabling parallel writes (right-click the step and set "Change number of copies to start" to a value greater than 1).

    But first validate whether the output step really is the bottleneck. Monitor the "input/output" value in the "Step Metrics" tab while the transformation runs. In your case, a value of "10000/0" should be shown (when using the default rowset size). If not, you may have a problem elsewhere.

    What is the source of your data? Reading from and writing to the same table in one transformation will be a killer when using MySQL with the MyISAM storage engine.
    Last edited by MichaelBieri; 05-08-2013 at 05:19 AM.

  3. #3
    Join Date
    Apr 2008
    Posts
    1,770

    Default

    Hi Michael.
    "Use batch update for inserts" is on by default.

    As I have written (maybe not clearly), I was uploading a text file into a table, a very simple and straightforward process.

    I will try enabling the parallel writes, but changing the commit size did not have any impact at all.

    What I don't understand is the difference in performance between the two tools.
    What does Toad do to get - by default - such good performance?

    My wild guess is that under the hood they are using "load data infile" - even on Windows ;-)

    Mick

  4. #4

    Default

    There are multiple possibilities. Comparing a "bulk load" to a transactional ("table output") way of loading is not fair. Bulk loaders will normally not update indexes during loading, they load without transactions, they ignore constraints, and so on.

    But loading 50 records/s is a really bad number - even when using defaults.

    What is the storage engine of your table? Heavy indexes?
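
    If you are not sure, you can check both from the MySQL client with something like this (my_table is just a placeholder for your table name):

    SHOW TABLE STATUS LIKE 'my_table';  -- the Engine column shows InnoDB / MyISAM
    SHOW INDEX FROM my_table;           -- every index listed here has to be maintained on each insert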

    Btw, there is a MySQL bulk loader in PDI as well.

  5. #5
    Join Date
    Apr 2008
    Posts
    1,770

    Default

    The storage engine should be InnoDB and there are 5 indexes.

    Comparing a "bulk load" to a transactional ("table output") way of loading is not fair
    I agree, but PDI performance is still quite poor.

    Btw; there is a mysql bulk loader in PDI as well
    I know, but it does not work on Windows.

    I am not a programmer, so I don't know what Toad does to get such good performance - keep in mind that input and output were exactly the same.
    Same table, same indexes, same storage engine, same data to upload.

    If they use a bulk loading method, why can't PDI do it somehow?

    Mick.

  6. #6

    Default

    You don't really need to use the bulk load utility. You can also load with the "load data infile" statement by using the Execute SQL step; I think this is what Toad does. It is a simple and fast way to get a well-formed file into a table, but you may run into trouble with error handling or validation if you need them.
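
    Just as a sketch (file path, table and column names are placeholders, the field/line terminators have to match your file, and the server must allow LOCAL INFILE), the statement could look like:

    LOAD DATA LOCAL INFILE 'C:/data/input.txt'
    INTO TABLE my_table
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\r\n'
    IGNORE 1 LINES
    (col1, col2, col3);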

    But I'm really confused about your performance experience - I haven't seen such a bad number in years.
    Are you able to analyze your MySQL instance with "show full processlist" while loading?
    Ensure that all your data types are correct. The JDBC driver goes crazy when it has to cast during loading. (In my case I sent an Integer field from PDI to a VARCHAR column in MySQL. After adjusting the data type of a single attribute, I got a performance gain of more than 300%. Everything worked in both cases - just not at the same speed.)
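
    To check, something like this in the MySQL client helps (my_table is a placeholder again):

    SHOW FULL PROCESSLIST;        -- run while the transformation is loading, to see what the driver actually sends
    SHOW COLUMNS FROM my_table;   -- compare the Type column against the field types in your PDI stream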
    Last edited by MichaelBieri; 05-08-2013 at 11:31 AM.

  7. #7
    Join Date
    Apr 2008
    Posts
    1,770

    Default

    Ensure that all your data types are correct.
    Never thought about that.

    I will do a test after checking data types.

    You can also load with the "load data infile" statement by using the Execute SQL step.
    Yes, another good option, but it's easier (ehm...) to use Table Output and get the mapping done on the fly.

    I'll check your suggestion about data types and post something later.

    Mick

  8. #8
    Join Date
    Apr 2008
    Posts
    1,770

    Default

    Hi.

    I have tried to load data into a test table - but had to do it on a different server.

    The first time I used the default data types, then I added a "Select values" step to convert two fields from Number (with decimals) to String (they are VARCHAR in the DB).

    The metrics were almost identical for both processes.

    I don't know if it's because it is a different version of MySQL and the driver works better with it - but it seems odd.

    Disclaimer: at the moment I cannot do a test on the remote server :-(

    Mick

  9. #9

    Default

    Just noticed this forum thread today... I've been working with Pentaho and MySQL for a few years, and have learned a few tricks and traps. The key is realizing that both are good software, but the JDBC abstraction layer loses a bit in translation because it wasn't designed for a simple API like MySQL.

    The specific cause of the slowdown with Table Output is likely network round-trip activity, which sends one row at a time to the server. Batch inserts don't gain you much, if anything, using MySQL unless you configure certain JDBC properties. Conventional wisdom is that server-side prepared statements are a Good Thing, but with an unconventional database like MySQL there is little or no difference using prepared statements vs. sending a lot of little INSERT statements to the server. Due to the synchronous protocol, the network is the bottleneck in either case.

    To remedy this, in PDI I create a separate, specialized Database Connection I use for batch inserts. Set these two MySQL-specific options on your Database Connection:

    useServerPrepStmts false
    rewriteBatchedStatements true

    Used together, these two options "fake" batch inserts on the client side. Specifically, the insert statements:

    INSERT INTO t (c1,c2) VALUES ('One',1);
    INSERT INTO t (c1,c2) VALUES ('Two',2);
    INSERT INTO t (c1,c2) VALUES ('Three',3);

    will be rewritten into:

    INSERT INTO t (c1,c2) VALUES ('One',1),('Two',2),('Three',3);

    The batched rows are thus inserted with one statement (and one network round-trip). With this simple change, Table Output is very fast and comes close to the performance of the bulk loader steps.
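
    If you'd rather not use the options grid, the same two settings can normally be appended to the JDBC URL of the connection instead (host, port and database name below are just placeholders):

    jdbc:mysql://dbhost:3306/mydb?useServerPrepStmts=false&rewriteBatchedStatements=true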

    In summary, the techniques we've learned for MSSQL or Oracle don't help with MySQL. It's possible to get excellent performance with MySQL but you have to optimize for it. We have a database with several tables of around a billion rows, and these loads are managed entirely with PDI.

    I wish the bulk loaders were easier to use in PDI. (If I ever start writing software again, maybe I will contribute some patches.)

    My current performance problems are not related to bulk loading, but dimension lookup/update operations. Those are very slow with MySQL, and I haven't yet found any useful technique to make them faster. If I discover something I'll share it with this forum.
    Last edited by jsturm; 05-10-2013 at 12:03 PM.

  10. #10
    Join Date
    Apr 2008
    Posts
    1,770

    Default

    Hi jsturm.
    I tested the upload using your suggestion and it works :-)

    The speed went from 1,200 records/sec to 15,000 records/sec!
    Impressive.

    Thanks a lot for your suggestion, it's a trick that should be in the wiki!

    Mick
