Table Output Performance - MySQL



Mick_data
05-08-2013, 04:59 AM
Hi.
I know this has already been discussed a few times in the past, but I have found the performance of the Table Output step staggeringly slow.

I had to upload records to a table on a remote server, so I tried it with PDI.
Text file input, Table Output using MySQL.
The loading speed was 40-50 records/second.

Since it would have taken hours to complete, I gave up and used Toad (http://www.toadworld.com/) instead.
It loaded all my records in less than 1 hour!

Since the PC was the same, same file, same internet connection, same destination database (and table), what is PDI doing differently?

My configuration:
PDI 4.4
Win Server 2003
MySQL driver: mysql-connector-java-5.1.22-bin.jar

MySQL version on remote database: 5.1.68-cll

What should I change in my PDI configuration to match the upload speed of TOAD?

Mick

MichaelBieri
05-08-2013, 05:15 AM
With "Use batch update for inserts" enabled, you can usually reach a throughput of a few thousand records per second on a basic infrastructure. You can optimize further by adjusting the commit size (for the batch updates) and enabling parallel writes (right-click the step and set "Change number of copies to start" to a value greater than 1).

But validate first whether the output is the real bottleneck. Monitor the "input/output" value in the "Step Metrics" tab while the transformation runs. In your case a value of "10000/0" should be shown (when using the default rowset size). If not, you may have a problem elsewhere.

What is the source of your data? Reading from and writing to the same table in a transformation is a killer when using MySQL with the MyISAM storage engine.

Mick_data
05-08-2013, 05:38 AM
Hi Michael.
"Use batch update for inserts" is on by default.

As I wrote (maybe not clearly enough), I was uploading a text file into a table - a very simple and straightforward process.

I will try enabling parallel writes, but changing the commit size did not have any impact at all.

What I don't understand is the difference in performance between the two tools.
What does Toad do to get such good performance by default?

My wild guess is that under the hood they are using "LOAD DATA INFILE" - even on Windows ;-)

Mick

MichaelBieri
05-08-2013, 05:53 AM
There are multiple possibilities. Comparing a bulk load to a transactional ("Table Output") way of loading is not fair. Bulk loaders normally do not update indexes during loading, load without transactions, ignore constraints, and so on.

But loading 50 records/s is a really bad number - even when using the defaults.

What is the storage engine of your table? Heavy indexes?

Btw, there is a MySQL bulk loader in PDI as well ;)

Mick_data
05-08-2013, 06:10 AM
The storage engine should be InnoDB and there are 5 indexes.


Comparing a "bulk load" to a transactional ("table output") way of loading is not fair
I agree, but PDI performance is still quite poor.


Btw; there is a mysql bulk loader in PDI as well
I know, but it does not work on Windows :mad:

I am not a programmer, so I don't know what Toad does to get such good performance - keep in mind that input and output were exactly the same.
Same table, same indexes, same storage engine, same data to upload.

If they use a bulk loading method, why can't PDI do it somehow?

Mick.

MichaelBieri
05-08-2013, 08:39 AM
You don't really need to use the bulk load utility. You can also load with a "LOAD DATA INFILE (http://dev.mysql.com/doc/refman/5.1/de/load-data.html)" statement by using the Execute SQL step; I think this is the way Toad does it. It is a simple and fast way to get a well-formed file into a table, but you may run into trouble if you need error handling or validation.
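
A minimal sketch of such a statement, assuming a semicolon-delimited file with one header row (the table name, file path, delimiters and column list here are placeholders, not anything from this thread):

LOAD DATA LOCAL INFILE '/tmp/my_data.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(col1, col2, col3);

With LOCAL the file is read from the client side; without it, MySQL expects to find the file on the database server itself.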

But I'm really confused about your performance experience - I haven't seen such a bad number in years.
Are you able to analyze your MySQL instance with "show full processlist" while loading?
Ensure that all your data types are correct. The JDBC driver goes crazy when it has to cast during loading. (In my case I sent an "Integer" field from PDI to a VARCHAR field in MySQL. After adjusting the data type of a single attribute, I got a performance gain of > 300%. Everything worked in both cases - just not at the same speed.)

Mick_data
05-08-2013, 09:36 AM
Ensure that all your data types are correct.
Never thought about that.

I will do a test after checking data types.


You can load as well with the "load data infile (http://dev.mysql.com/doc/refman/5.1/de/load-data.html)" statement by using the execute SQL step.
Yes, another good option, but it's easier (ehm..) to use Table Output and get the mapping done on the fly.

I'll check your suggestion about data types and post something later.

Mick

Mick_data
05-08-2013, 11:59 AM
Hi.

I tried to load data into a test table - but had to do it on a different server.

The first time I used the default data types, then I added a "Select values" step to convert 2 fields from Number (with decimals) to String (they were VARCHAR in the DB).

The metrics were almost identical for both processes.

I don't know if it's because it is a different version of MySQL and the driver works better with it - but it seems odd.

Disclaimer: at the moment I cannot do a test on the remote server :-(

Mick

jsturm
05-09-2013, 09:03 AM
Just noticed this forum thread today... I've been working with Pentaho and MySQL for a few years, and have learned a few tricks and traps. The key is realizing that both are good software, but the JDBC abstraction layer loses a bit in translation because it wasn't designed for a simple API like MySQL.

The specific cause of the slowdown with Table Output is likely network round-trip activity, which sends one row at a time to the server. Batch inserts don't gain you much, if anything, using MySQL unless you configure certain JDBC properties. Conventional wisdom is that server-side prepared statements are a Good Thing, but with an unconventional database like MySQL there is little or no difference using prepared statements vs. sending a lot of little INSERT statements to the server. Due to the synchronous protocol, the network is the bottleneck in either case.

To remedy this, in PDI I create a separate, specialized Database Connection I use for batch inserts. Set these two MySQL-specific options on your Database Connection:

useServerPrepStmts false
rewriteBatchedStatements true

Used together, these "fake" batch inserts on the client. Specifically, the insert statements:

INSERT INTO t (c1,c2) VALUES ('One',1);
INSERT INTO t (c1,c2) VALUES ('Two',2);
INSERT INTO t (c1,c2) VALUES ('Three',3);

will be rewritten into:

INSERT INTO t (c1,c2) VALUES ('One',1),('Two',2),('Three',3);

The batched rows are thus inserted with one statement (and one network round-trip). With this simple change, Table Output is very fast and close to the performance of the bulk loader steps.
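
As a side note (this is just standard Connector/J URL syntax, not something specific to this thread): the same two properties can also be appended to the JDBC URL if you define the connection by URL - the host, port and database below are placeholders.

jdbc:mysql://dbhost:3306/mydb?useServerPrepStmts=false&rewriteBatchedStatements=true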

In summary, the techniques we've learned for MSSQL or Oracle don't help with MySQL. It's possible to get excellent performance with MySQL but you have to optimize for it. We have a database with several tables of around a billion rows, and these loads are managed entirely with PDI.

I wish the bulk loaders were easier to use in PDI. (If I ever start writing software again, maybe I will contribute some patches.)

My current performance problems are not related to bulk loading, but dimension lookup/update operations. Those are very slow with MySQL, and I haven't yet found any useful technique to make them faster. If I discover something I'll share it with this forum.

Mick_data
05-10-2013, 12:28 PM
Hi jsturm.
I tested the upload using your suggestion and it works :-)

The speed went from 1,200 records/sec to 15,000 records/sec!
Impressive.

Thanks a lot for your suggestion, it's a trick that should be in the wiki!

Mick

satran
05-30-2013, 08:56 AM
Hi,

Could you please suggest where I have to add those two parameters to test? Inside the feature list or in shared.xml?

satran
05-30-2013, 08:58 AM
Also,

I have certain tables coming from the source - how can I do a MySQL load without specifying the target field mappings? Any idea? Please help me.

Mick_data
05-30-2013, 11:05 AM
Hi.

Could you please suggest where I have to add those two parameters to test? Inside the feature list or in shared.xml?
You have to add them in the connection's options: edit the database connection and look at the Options (or advanced options) panel.

Mick

Mick_data
05-30-2013, 11:07 AM
I have certain tables coming from the source - how can I do a MySQL load without specifying the target field mappings? Any idea? Please help me.

I think that if your source field names exactly match the MySQL table field names, then Table Output (or Insert/Update) should work even if you don't specify the target field mapping.

Mick

satran
05-31-2013, 12:21 AM
Thanks Mick. Yes, Table Output will work. But the MySQL Bulk Loader loads null or empty data into the target when the fields are not specified :( Any workaround? Also, I am trying the following command inside an Execute SQL script step to do a bulk load, where the fields get replaced from a variable during each iteration:

mysqlimport -h 10.237.69.125 -P 3306 -u u1celm01 -p password s1celm01 /tmp/Category.fifo -c categoryid,categoryname,subcategoryid,subcategoryname --fields-terminated-by=\; --fields-optionally-enclosed-by=\"

It does connect to the MySQL database and lists all the options and usage information, but after that no processing takes place and it hangs.
FYI - the MySQL database server and the Pentaho server reside on separate machines.

satran
05-31-2013, 12:24 AM
I have also tried the following command inside a Shell script step:


mysql -h 10.237.69.125 s1celm01 -u u1celm01 -p password -e "LOAD DATA LOCAL INFILE "/tmp/demo_table.fifo" INTO TABLE Category FIELDS TERMINATED BY '|' (categoryid,categoryname,subcategoryid,subcategoryname)"

Mick_data
05-31-2013, 06:40 AM
Hi.
I'm sorry, but I've never tried to use the mysql command from a shell script, so I can't help with that.