Hitachi Vantara Pentaho Community Forums

Thread: CPU usage is reaching 100%

  1. #1
    Join Date
    Aug 2015
    Posts
    313

CPU usage is reaching 100%

    Hi,

    I am using PDI CE 6.0, MySQL 5.6, Java 1.7, a file repository, and Windows Server 2008 R2. Whenever I run ETLs with the initial-load data (that is, the full load, not the incremental data), CPU usage reaches 100%.

    I am running my job from a batch file instead of the Spoon -> Run button.
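
    For reference, I invoke it roughly like this (the paths here are just examples from my setup; Kitchen.bat is PDI's command-line job runner):

        C:\pdi-ce-6.0\data-integration\Kitchen.bat /file:C:\etl\jobs\initial_load.kjb /level:Basic /logfile:C:\etl\logs\initial_load.log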

    The total volume is 9,400,000 records. I previously worked with Oracle and MS SQL Server and never faced this problem, but I am facing it with the MySQL database. When I searched the MySQL forums, I saw many people facing CPU usage problems there. Still, I am hoping the Pentaho forum can help.

    I am distributing the total records in the way shown below. Is this the main reason the CPU reaches 100%?

    Example: (Table Input: Product, 2,000,000 records with 101 columns) --> distributed into 4 steps

    [Attached image: distributing.jpg]


    Thanks,
    Santhi

  2. #2
    Join Date
    Jul 2009
    Posts
    476


    It looks like you are doing a table update for every row in the stream, and if the row isn't found in the table, you create the row in a table output step. During an initial load, every one of your millions of rows will go through the update step, which does a round-trip to the database to look for a matching row, before it goes to the Table Output step.
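
    Conceptually, for every incoming row the Update step sends MySQL something like the following (the table, key, and column names here are just placeholders):

        -- One lookup round trip per stream row (? marks the JDBC bind parameter):
        SELECT product_id FROM product WHERE product_id = ?;
        -- If a match is found, a second statement performs the update:
        UPDATE product SET price = ?, stock = ? WHERE product_id = ?;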

    The initial load would be much more efficient if you remove the Update steps and send the rows directly to the Table Output steps.
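
    With Table Output and a reasonable commit size, MySQL instead receives the rows in batches, roughly like this (again with placeholder names):

        -- Many rows per statement and per commit, instead of two statements per row:
        INSERT INTO product (product_id, price, stock)
        VALUES (1001, 9.99, 5), (1002, 4.50, 12), (1003, 7.25, 0);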

  3. #3
    Join Date
    Aug 2015
    Posts
    313


    Thank you, robj.

    Actually, in most threads people suggested we use the Update step instead of the Insert/Update step, and also that Synchronize After Merge with Merge Rows (diff) is better.

    I tried Synchronize After Merge with Merge Rows (diff), but in my scenario I have to use the target table as my input step, so there is a chance of deadlock and table share-lock issues.

    That's why I used Table Input -> Update -> Table Output (with error handling).

    The ETL is running fast, but the problem is that the MySQL database engine's CPU usage reaches 100%.

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696


    Quote Originally Posted by santhi View Post
    Actually, in most threads people suggested we use the Update step instead of the Insert/Update step
    I think you may have missed something in translation.

    There are at least 3 steps that write to tables:
    Update
    Insert/Update
    Table Output

    When you are expecting a large amount of new information with very few updates, the preferred setup is Table Output with its error handling connected to Update.
    When you are expecting a large amount of updates with very little new information, the preferred setup is Update with its error handling connected to Table Output.
    When you are expecting a balanced mix of new and updated information, use the Insert/Update step.

    Another option (if you are not on Windows) is to use the MySQL Bulk Loader, but that takes things in a completely different direction.
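
    As far as I know, that step drives MySQL's LOAD DATA facility under the hood; a hand-rolled equivalent looks roughly like this (the file path, table, and format options are placeholders for your data):

        -- Bulk-load a delimited file straight into the target table:
        LOAD DATA LOCAL INFILE '/tmp/product.csv'
        INTO TABLE product
        FIELDS TERMINATED BY ',' ENCLOSED BY '"'
        LINES TERMINATED BY '\n'
        IGNORE 1 LINES;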

    If your ETL is running very fast and your MySQL database CPU is hitting 100%, then perhaps your MySQL DB server is your current bottleneck. Remember: it's difficult to completely remove bottlenecks; they usually just move around. If you increase the CPU performance of the MySQL DB server, then your ETL will likely become the bottleneck. When neither the MySQL DB nor the ETL is the bottleneck, then the source DB will be.

    Based on that knowledge, set expectations appropriately. 2M records at 101 columns wide (we don't know the width of the columns...) could be 2 GB of traffic. If you are running on 1 Gb links, you're looking at a few minutes without any bottlenecks or processing. If you have latency on the links, the time goes up. If you have lower bandwidth than 1 Gb (very few internet links are full-throughput 1 Gb!), the time goes up.
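
    To make that concrete, here is the rough arithmetic (the ~10 bytes per column is purely an assumption, since we don't know the real column widths):

        2,000,000 rows x 101 columns x ~10 bytes/column ≈ 2 GB of traffic
        1 Gb/s ≈ 125 MB/s theoretical, so 2 GB / 125 MB/s ≈ 16 s of raw wire time
        add protocol overhead, latency, and per-row round trips, and that stretches to minutes in practice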

  5. #5
    Join Date
    Aug 2015
    Posts
    313


    Thank you very much, gutlez.

    I understand your scenarios, and I plan to use the best-suited step for each one. Yes, my ETL is running fast; for example, 4,000,000 records with 101 columns load within 4 to 4.5 hours.

    My database is in AWS RDS, and its configuration is 2 CPUs and 15.5 GB RAM.

    The main problem is that when I run the ETLs, a few connections are not closing (i.e., they keep holding locks), but when I run the ETLs on an AWS EC2 server, everything works fine.
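
    For anyone hitting the same thing, I am checking the stuck connections on the RDS side with standard MySQL commands such as:

        -- List every connection and what it is currently doing:
        SHOW FULL PROCESSLIST;
        -- Inspect current InnoDB transactions and lock waits:
        SHOW ENGINE INNODB STATUS;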

    I will investigate my issue more deeply. Thank you.
