Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Performance issues in design time

  1. #1
    Join Date
    Sep 2007
    Posts
    6

    Default Performance issues in design time

    I have two exactly same tables. Let we say tableA and tableB with identical structure, in the same database, created using same account etc... tableA is empty, tableB has 4 Milion rows. I have Table Input step using simple 'select ... from tableA / tableB' statement without any where condition, order by etc. There is no other step in the transformation.

    Now when I right click the step and select Show output fields, there are performance differences between empty and non-empty table. On empty table I get the answer immediatelly, on the large table it takes about 5 minutes (!) and CPU utilisation raises to 90% all the time. I can not see any locks on the database (SQL Server 2000). Once the structure has been discovered, there are no performance problems later in the same session (i.e. until restarting spoon), even when changing between tableA and tableB in the from part of the statement.

    Does it mean, that PDI is selecting the data when describing the output structure ?

  2. #2
    Join Date
    Sep 2007
    Posts
    6

    Default

    OK, I analyzed the sql statement sent to the database and really - PDI is selecting the all 4 Million rows !!!!

    FYI - Here is the code to get the last sql statement on MS Sql Server 2000 for session 52

    DECLARE @Handle binary(20)
    SELECT @Handle = sql_handle FROM sysprocesses WHERE spid = 52
    SELECT * FROM ::fn_get_sql(@Handle)

    reports

    dbid objectid number encrypted text
    ------- ----------- --------- ------------ --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    (null) (null) (null) false ()SELECT
    RCLNT
    , GL_SIRID
    , RLDNR
    , RRCTY
    FROM tableB

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    It's highly likely that coming from SQL, there is no other way to find out the layout of the return table.
    By the way, we do pass all sorts of hints to the database to indicate we only want a single row.
    However, "certain" databases simply don't want to listen ;-)

    The reason the query isn't executed twice is because the result is cached. You can clear the cache by clicking right on the database connection and selecting the appropriate option.

    Matt

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    You know what Robert? I just figured out a way around that problem.
    I hope it works on all database, but I have good hopes it does.

    Matt

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    If you are running version 2.5.x, give this kettle.jar a try (2.5.3 preview) : http://kettle.s3.amazonaws.com/kettle.jar
    Copy this jar file into the lib/ directory and see if it still takes 5 minutes.
    (if not, download 2.5.2 and do the same)

    Please make sure to clear your database cache before trying to get the output fields of the "Table Input" step.

    Thanks in advance for your feedback.

    All the best,

    Matt

  6. #6
    Join Date
    Sep 2007
    Posts
    6

    Default

    YES ! It works, tested at my Microsoft SQL Server 2000. Thanks very much for your help.

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    You're welcome. The fix should appear in upcoming releases: 2.5.3 and 3.0.1

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.