Hitachi Vantara Pentaho Community Forums

Thread: PDI performance questions ??

  1. #1

    Default PDI performance questions ??

    Hi All,
    I am loading a large number of files, around 600,000.
    I have some questions regarding the performance of Pentaho.

    Questions:
    1. After loading 60-70 thousand files, Pentaho throws a Java heap space error. I have already increased the heap size in carte.sh to 2048 MB but am still facing this problem, so every time I have to restart the Carte server (this is painful).

    2. From my experience, Pentaho's performance degrades over time. I know this is a Java garbage collection issue; if anyone has a solution, please let me know.

    3. Can anyone tell me how much time Pentaho should normally take to insert 28,000 records into one table and 950,044 into another? I am using DB2, and it is taking 7 hours.
    Please let me know if this is slow or average.
    Any advice on improving performance on the Pentaho and DB2 fronts is welcome.

    Regards,
    chittora

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    IT ALL DEPENDS

    Quote Originally Posted by chittora View Post
    1. After loading 60-70 thousand files, Pentaho throws a Java heap space error. I have already increased the heap size in carte.sh to 2048 MB but am still facing this problem, so every time I have to restart the Carte server (this is painful).
    Depends on what you do with your files... maybe you can rewrite your logic another way. Maybe there's still something wrong in PDI... it's hard to say without more details.

    Quote Originally Posted by chittora View Post
    2. From my experience, Pentaho's performance degrades over time. I know this is a Java garbage collection issue; if anyone has a solution, please let me know.
    Same answer as 1.

    Quote Originally Posted by chittora View Post
    3. Can anyone tell me how much time Pentaho should normally take to insert 28,000 records into one table and 950,044 into another? I am using DB2, and it is taking 7 hours.
    Please let me know if this is slow or average.
    Any advice on improving performance on the Pentaho and DB2 fronts is welcome.
    Depends on your database location... whether PDI is running on the same server as DB2, the database setup, the table DDL, ... From my experience, around 200 rows per second is about the maximum you get with medium DB2 complexity, and 800 rows/second for a very simple insert... but I've also seen complex stuff run at 2 rows/second.
    It's hard to put definite numbers on it... too many variables involved.

    Regards,
    Sven

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Sven, 200 rows/s is horribly low for such a database. We can get to 10-15k rows/s on Oracle, MySQL, etc. The PostgreSQL bulk loader reaches 40k. The column-store database LucidDB gets 20k... all on my laptop.

    Even without bulk loading you should be able to get into the thousands of records per second quite easily. What is it that makes DB2 so slow, do you think?

  4. #4

    Default

    Hi Sven,
    Thanks for the quick reply,

    Here is the complete detail regarding the database performance from my end

    1. DB2 & pentaho are running on the same machine
    2. In my job there are 3 lookups for getting values from diffrent table and 2 database insertion are there.
    3. one file on an average contains 34 records and for inserting one file it is taking 1.1 sec i.e 31.61 records/sec .
    I have already dropped all the indexes on these 2 tables but i am not able to drop primary key index , i dont know why db2 is not allowing me to drop this.

    Compared to the figures you gave for DB record insertion, this is very low.

    4. Have you ever done any benchmarking on Pentaho, such as how much load a Carte server can handle at a time? Please let me know.

    To get rid of the Java heap issue:

    Can I use 2-3 Carte servers on the same machine for the same job so that processing is faster, something like grid computing? I don't know whether it is possible or how much it would help me.
    Can you provide some input on this?

    I read some of the articles in the Pentaho community and found some good tips to increase performance; I will try them.

    Regards,
    chittora

  5. #5
    Join Date
    May 2006
    Posts
    4,882

    Default

    For the speed... the maximum I've ever seen an ETL job (without a bulk loader) run on DB2 was 1200 rows per second (DB2 v8, ETL and database on Solaris)... and that really was a very simple ETL... Oracle gets 8000 rows/sec easily... 15k with some very simple jobs. Why DB2 is so slow, I don't know.

    For the original poster... 34 rows per file means a big per-file processing overhead (and probably a lot of garbage from all the opening/closing)... I would try to merge the files up front, so you have 1 big file, or several bigger files, instead of many small ones.
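
    For what it's worth, a minimal sketch of that merge idea outside of PDI, assuming each small file holds a <records> root with <record> children; the directory names, element names, and batch size below are hypothetical, not taken from the attached transformation:

    // Hypothetical pre-processing step: concatenate many small XML files into a few
    // large ones before PDI reads them, to avoid per-file open/close overhead.
    // Assumes each input file wraps its rows in a <records> element (made-up name).
    import java.io.IOException;
    import java.io.Writer;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class MergeXmlFiles {
        private static final int FILES_PER_BATCH = 10_000;

        public static void main(String[] args) throws IOException {
            Path inDir = Paths.get("incoming");   // hypothetical folder with the ~600,000 small files
            Path outDir = Paths.get("merged");
            Files.createDirectories(outDir);

            List<Path> files;
            try (Stream<Path> s = Files.list(inDir)) {
                files = s.filter(p -> p.toString().endsWith(".xml"))
                         .sorted()
                         .collect(Collectors.toList());
            }

            int batch = 0;
            for (int i = 0; i < files.size(); i += FILES_PER_BATCH) {
                Path out = outDir.resolve("batch_" + (batch++) + ".xml");
                try (Writer w = Files.newBufferedWriter(out)) {
                    w.write("<records>\n");
                    for (Path f : files.subList(i, Math.min(i + FILES_PER_BATCH, files.size()))) {
                        // keep only the row elements; drop each file's XML declaration and wrapper root
                        String xml = Files.readString(f)
                                          .replaceAll("<\\?xml[^>]*\\?>", "")
                                          .replaceAll("</?records[^>]*>", "");
                        w.write(xml.trim());
                        w.write("\n");
                    }
                    w.write("</records>\n");
                }
            }
        }
    }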

    Regards,
    Sven

  6. #6
    Join Date
    Jun 2007
    Posts
    233

    Cool How fast can you read the files?

    Hi There,

    I would be curious to see the transformation that you are using. Are you able to post it here? I am also curious to see how fast you can read the files by themselves. This is a lot of hard drive IO, and maybe while you are reading the file(s) you are also writing to the same drive for the DB2 inserts. It might simply be too much activity for one disk to handle any faster.

    Are you able to dump all the files into a 'central' location and have a second machine read them, while your machine receives the data (from the first, over the network) and places it in the database?

    I am not convinced it is DB2 causing the problem. I had a similar performance issue with Postgres 8.3 and discovered (to my embarrassment) that it was in fact my method that was poor and not the database itself. I was getting about 10-12 r/s, and after I 'fixed' my approach this jumped to around 450 r/s. I am pretty sure that if I used bulk loading I could dramatically increase that again. It's a fairly complex ETL process too, but the complexity wasn't the bottleneck. This is on an old Dell D610 laptop.

    I could also suggest another way to handle all this data: first grab a list of all the filenames, dump them to a table, and add a field in that table to hold the status of the file, such as D for delete, P for processing, C for complete, V for valid, etc., or whatever your logic requires. This way you can process the files one at a time and start again from where you last crashed. Assign batch numbers (some kind of sequence number) to each filename, and when you read the data from a file, dump it to a 'staging' table with the file's batch number on each row, so that you can always tell where it came from.
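
    If it helps, here is a rough, hedged sketch of that control-table idea in plain JDBC. The table name FILE_CONTROL, its columns, the connection URL, and the credentials are all invented and would need to match your own DB2 schema (and the DB2 JDBC driver would need to be on the classpath):

    // Hypothetical "file control table": register every filename with a status flag
    // and a batch number, then update the status as each file is processed so a
    // crashed run can resume from the first file still marked pending.
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class FileControlTable {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:db2://localhost:50000/ETLDB", "etl_user", "secret")) {
                con.setAutoCommit(false);

                // 1. Register every incoming file as Pending ('P') with a sequence/batch number.
                List<Path> files;
                try (Stream<Path> s = Files.list(Paths.get("incoming"))) {
                    files = s.sorted().collect(Collectors.toList());
                }
                String insert = "INSERT INTO FILE_CONTROL (FILE_NAME, BATCH_NO, STATUS) VALUES (?, ?, 'P')";
                try (PreparedStatement ps = con.prepareStatement(insert)) {
                    int batchNo = 0;
                    for (Path f : files) {
                        ps.setString(1, f.getFileName().toString());
                        ps.setInt(2, batchNo++);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                    con.commit();
                }

                // 2. After a file's rows land in the staging table, flip it to Complete ('C').
                String update = "UPDATE FILE_CONTROL SET STATUS = 'C' WHERE FILE_NAME = ?";
                try (PreparedStatement ps = con.prepareStatement(update)) {
                    ps.setString(1, "example_0001.xml");  // hypothetical filename
                    ps.executeUpdate();
                    con.commit();
                }
            }
        }
    }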

    I guess the point I am making is that you may be better off breaking the job up into a series of smaller jobs, and then using a 'master' job to sequence the smaller jobs in a logical order. E.g. first read all the files and dump them to a staging table; second, work through the staging table one batch at a time and validate the data row by row; third, for each validated batch, perform your data transformation logic on it; fourth, for each invalid batch, have a list of files / batch numbers emailed to you / dumped to Excel / whatever, so that you can see what went wrong and act on it.

    This works well for me as an approach, and I am using this type of thinking to handle about 3000 files a week, each with approx. 200k rows, all on my laptop! It works, and it works well. (I push the data into Oracle now - it's all retail sales scanner data.)

    Just my 2 cents

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  7. #7

    Default

    Quote Originally Posted by sboden View Post
    For the speed... the maximum I've ever seen an ETL job (without a bulk loader) run on DB2 was 1200 rows per second (DB2 v8, ETL and database on Solaris)...
    Surely, Sven, you must be confusing the words DB2 and Progress!
    This is a signature.... everyone gets it.

    Join the Unofficial Pentaho IRC channel on freenode.
    Server: chat.freenode.net Channel: ##pentaho

    Please try and make an effort and search the wiki and forums before posting!
    Check out Saiku, the future of Open Source Interactive OLAP (http://analytical-labs.com)

    http://mattlittle.files.wordpress.co...-bananaman.jpg

  8. #8

    Default

    Hi Guys,
    I can't post my transformations on a public forum. If you could provide me your email ID, I will send them to you.

    Regards,
    Diwakar

  9. #9

    Default

    You could use the power of a zip file.
    This is a signature.... everyone gets it.

    Join the Unofficial Pentaho IRC channel on freenode.
    Server: chat.freenode.net Channel: ##pentaho

    Please try and make an effort and search the wiki and forums before posting!
    Check out Saiku, the future of Open Source Interactive OLAP (http://analytical-labs.com)

    http://mattlittle.files.wordpress.co...-bananaman.jpg

  10. #10

    Default how to increase performance

    Hi all,
    Please find attached the transformations and job I am using for my ETL process, along with the XML file I am using as a sample input.

    Please provide input on how I can increase the efficiency.

    Business case: on average I get 600,000 files for loading, and each file contains approximately 34 records to load into DB2.

    Performance of the current system is as follows:
    1. Inserting one file takes 1.1 sec, i.e. 31.61 records/sec, which is really too slow; as Matt and others said, Pentaho can load 8000 rows/sec.

    I tried using multiple copies (distribute rows) of the transformation, something like parallel processing, but the performance is nearly the same, not much difference.

    If I only load all the XML files into memory without doing cleansing, the Carte server throws a Java heap space error (the heap is currently set to 2048 MB); also, if I load more than 70,000 files, the Carte server fails after loading, say, 45,000 files.

    Please let me know how I can increase the performance.

    Regards,
    Chittora
    Attached Files

  11. #11

    Default

    Quote Originally Posted by chittora View Post
    ................
    3. Can anyone tell me how much time Pentaho should normally take to insert 28,000 records into one table and 950,044 into another? I am using DB2, and it is taking 7 hours.
    ................
    If your DB2 is configured with HADR, that could be the cause of such slowness if replication is set up across many servers.

    My 2 cents.
    Pentaho Data Integration CE 5.3.0.x
    JDK 1.7
    OS X Yosemite version 10.10.x
    MySQL 5.5.37
    Amazon Redshift
    Pacific Standard Time

  12. #12

    Default

    I don't work with DB2, but are there different isolation levels and/or cursor options at the connection level that might help out?

    http://publib.boulder.ibm.com/infoce...d/r0010326.htm

    But, before jumping to conclusions:

    *Look at the r/s for EVERY step in the present setup of the job/transformation.

    *CHANGE the output from DB2 to a local text file output. Trust me. Just try this and see whether the r/s changes or not. If it changes significantly, yes, it's the DB2 output. If it does not change, it's a different step that is causing the slowdown, and you aren't reaching the DB2 output limit - the flow of information going to the output is the limiting factor.

    In my experience, you have to remember the ETL runs as a stream/flow of data, so one slow step can slow down the rest of the flow. There are 'buffers' you can set to help level the flow, but those can fill up (and certain steps, like Sort, aren't flow-friendly).

    Also, not sure how this would work out, but raise the COMMIT number on the output step (I think it's called commit size or batch size) -- think transaction batches. I would start at 1k and 10k and play with it until it reaches a good state.
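
    Purely as an illustration of what that commit/batch setting does under the hood, here is a hedged JDBC sketch; the SALES_STAGE table and its columns are invented, and in PDI itself you would simply adjust the commit size and batch update options on the output step rather than write code like this:

    // Rough sketch of batched inserts with a configurable commit size: rows are
    // buffered with addBatch() and committed every `commitSize` rows, cutting the
    // number of round trips and transaction commits.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class BatchedInsert {
        static void insertRows(Connection con, List<String[]> rows, int commitSize) throws SQLException {
            con.setAutoCommit(false);
            String sql = "INSERT INTO SALES_STAGE (FILE_NAME, RECORD_DATA) VALUES (?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                int pending = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                    if (++pending % commitSize == 0) {   // e.g. start with 1,000 or 10,000
                        ps.executeBatch();               // one batched round trip
                        con.commit();                    // one commit per commitSize rows
                    }
                }
                ps.executeBatch();                       // flush the remainder
                con.commit();
            }
        }
    }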
    Last edited by dhartford; 07-08-2009 at 12:39 PM.

  13. #13

    Default

    Hi dhartford,
    I am using the batch update feature, and I have tried 500 and 30k as the commit size.
    I will try writing the data to a file and let you know the performance by tomorrow.

    @acbonnemaison - I am not using the HADR feature of DB2.

    @TheFrog, Matt, bugg_tb, sboden - any comments on the transformation I am using? Please let me know your input.

    Regards,
    Diwakar

  14. #14

    Default

    Hi dhartford & all ,
    I tried writing the data into CSV files instead of DB2 and found it takes almost the same time.
    So DB2 is not the bottleneck for me; it is Pentaho, or the way I am using Pentaho transformations to implement the logic.

    I tried on my own to improve the performance (by using bulk load, caching, and minimizing the JavaScript objects) but didn't find any drastic change in the time Pentaho takes to process the files.

    Can you suggest something on how to increase the performance?

    Is there any possibility that the performance of Pentaho also depends on the slave server configuration?

    Regards,
    Chittora

  15. #15

    Default

    I'm not sure if someone has already written up a 'performance tuning' guideline, but some of the things I've done in the past:

    *If pulling data from a database, create a test dataset that represents the source database(s). Convert it to a text file. Make it big.
    *Change ALL your database inputs/outputs to file inputs/outputs.
    *Run the transformation IN SPOON, where you can see the r/s across all the steps. Watch it from the first step down, and manually watch for when the r/s becomes significantly lower than that of the previous step -- those would be your bottlenecks.

    Anywhere you try to group, sort, flatten, or otherwise work with aggregate/grouped data WILL slow things down - that is the nature of those tasks regardless of the tools used.

    Another way, if you have enough memory, is to put a 'Blocking step' between each step, or around a questionable step; then you can get independent r/s measurements for the steps -- these will NOT reflect live performance, they only help identify slow steps. If you take the 'blocking step' approach, you don't have to change your sources/outputs, but you may want to limit the amount of data coming in to control memory use.
    Last edited by dhartford; 07-09-2009 at 09:34 AM.

  16. #16
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I'm not sure if someone has already written up a 'performance tuning' guideline
    I took a stab at it some time ago.

    http://wiki.pentaho.com/display/COM/...ing+check-list
