View Full Version : performance with postgresql
04-17-2006, 02:12 PM
hey i'm back with new question ;-)
I'm building my ETL solution with kettle, I have to extract data from multiple sources like flat txt files and XML...
I've builded a transformation but my problem is the slowness of any step that interact with database, my database lookup and insert / update steps are running at better case ~8r/s (while xml and txt step running at ~300r/s) and yet I have little tables with less than 10 columns and my examples don't exceed 3000 rows... Is it normal?
I'm running on an IBM T40 1,5GHz pentium M with 512mo of ram and a 7200rpm hard drive under windows XP, with postgresql as RDBMS
04-17-2006, 03:55 PM
it's a good idea to talk about kettle's perfs,in this forum I didn't find any section talkin about that, can kettle's users share with us their experience to have a small idea about kettle's perfs??, so we can define some rules to optimize our ETL process...
thanks for all..
04-17-2006, 05:36 PM
I've reached 2600 rows per second while reading a TEXT File. The file contains around ~2 million rows. My model removes headers and unnecessary rows in the file. The file is 260MB in size. The total running time is 10 to 15 mins. I have 7 to 10 steps in my model with 2 filter steps dummies and input from text and output to database.
My computer is Intel Centrino 1.4 GHz with 256MB Ram. I think you need to have a balanced power between CPU chip and RAM.
I have here a 2.4 GHZ Intel P4 with only 128MB of RAM and my Intel Centrino performs a lot faster than the P4 Computer. The P4 computer have a lot of disk access (Centrino with almost no hard disk access). I strongly believe that the P4 computer will be better if it have at least 256MB of RAM.
04-18-2006, 12:03 AM
If you don't put an index on the lookup table, yes, this is normal.
You can speed up the database lookup with caching etc, but if I would have to guess (and I guess I have to because you give me nothing to work with) that would be the thing to check.
If you can't cache you can try running multiple copies of the database lookup.
I suggest you read the last Weekly Kettle Tip for more information.
04-18-2006, 12:07 AM
Well, this is obviously the case, but thank you for mentioning it.
However, this is true for ALL applications, not just for Kettle.
Note that Spoon can run small transformations in under 30MB of RAM, but that doesn't mean your Windows XP will fly on a 128M system. ;-)
04-18-2006, 03:24 AM
I obviously have an index on the look up keys...
Is there any way to cache in insert / update or database join steps like in database lookup step?
04-18-2006, 03:32 AM
>Is there any way to cache in insert / update or database join
>steps like in database lookup step?
No, not yet. That's because the comparator can be almost anything. For example it's difficult to use caching when you have Between, >, < clauses. I don't have a generic solution yet for this.
>I obviously have an index on the look up keys...
If the correct index is on the table then you should look at the network. Often if the network is very slow, the results are coming in slow as well.
If you *are* on a fast network, then I would insist you take a look at the database. Do some tracing & tuning.
I hope you understand that I have used Insert/Update to populate fact tables at hundreds and thousands of rows per second. However, you do need a well-tuned database and a good network. (as always ;-))
04-18-2006, 05:24 AM
Thanks for your follow-up...
The problem is that i'm running on localhost, so no networking slows down the performance...
And i think the queries have to bee much faster even without tuning the database... Cause running similar ones directly on psql is faster...
do you have any sugestion?
04-18-2006, 06:34 AM
I also tryied direct sql from spoon interface and it seems to be faster than in my tranformation.... I runned a more complex query than the one used for my transformation...
04-18-2006, 07:21 AM
Yes, I have a suggestion: look up the "SELECT" statement that is generated by the lookup step (logging level : Detailed) and run an EXPLAIN or EXPLAIN ANALYZE on it in PostgreSQL. Also, depending on the version of the database you might want ANALYZE database, vacuum, etc.
This will give you some insight into why the select statement runs slow and maybe even offer a solution. Perhaps your systems just needs more RAM, who knows?
In any case, I don't think it's a Kettle problem.
All the best,
04-22-2006, 07:27 AM
Because the performances was too low I understood that was not a Kettle problem.... I tryied on another computer and the performances was much better...
I think a was having windows problems, so I formated and reinstalled it (having no time to waste in shearching the problem), now the performances are greath...
About 700rows/s in insert/update....
But I have remarks about XML performances (Ill open a new thread)
Thanks a lot your help....