Hitachi Vantara Pentaho Community Forums

Thread: Performance difference of tMap in Talend vs. multiple database lookups in PDI

  1. #1
    Join Date: Aug 2011 | Posts: 2

    Performance difference of tMap in Talend vs. multiple database lookups in PDI

    Hello Everyone,

    I encountered a significant performance difference between PDI and Talend while loading a fact table after looking up keys in dimension tables.

    Loading 100,000 records into the fact table took around 15 seconds in Talend, whereas in PDI it took more than 3 minutes.

    In PDI, my fact-loading transformation is laid out as follows:

    Table Input (XXX_source) --> Database Lookup 1 --> Database Lookup 2 --> Database Lookup 3 --> Select Values --> Table Output (fact table)

    I enabled the cache on the Database Lookup steps and set the cache size (in rows) to 0.

    In the Talend ETL job, there is a central tMap component which performs the lookups on multiple dimension tables and loads the records into the fact table.

    I monitored the query logs generated in MySQL by both PDI and Talend and found the following:

    For Talend, the log showed a series of SELECT statements against the different dimension tables, followed by INSERT statements for the fact table.

    It seems that after fetching all the records from the dimension tables, Talend looks up / joins the records in memory and then generates INSERT statements to execute against the database.

    In PDI, a database lookup is performed for each record (multiple times, since there are lookups against multiple dimensions), and dynamic SQL for the INSERTs is generated and executed.
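    To illustrate what I think is happening, here is a rough Java sketch of the two strategies (not actual PDI or Talend code; the table and column names dim_table, natural_key and dim_key are made up):

        import java.sql.*;
        import java.util.HashMap;
        import java.util.Map;

        class LookupSketch {

            // Per-row lookup: roughly what PDI's Database Lookup does when the
            // key is not already cached -- one SELECT round trip per row, per dimension.
            static Long lookupPerRow(Connection con, String naturalKey) throws SQLException {
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT dim_key FROM dim_table WHERE natural_key = ?")) {
                    ps.setString(1, naturalKey);
                    try (ResultSet rs = ps.executeQuery()) {
                        return rs.next() ? rs.getLong(1) : null;
                    }
                }
            }

            // In-memory join: roughly what tMap appears to do -- read each dimension
            // once, build a hash map, then resolve every incoming row in memory.
            static Map<String, Long> preloadDimension(Connection con) throws SQLException {
                Map<String, Long> cache = new HashMap<>();
                try (Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT natural_key, dim_key FROM dim_table")) {
                    while (rs.next()) {
                        cache.put(rs.getString(1), rs.getLong(2));
                    }
                }
                return cache;
            }
        }

    With 100,000 rows and three lookups, the per-row style issues roughly 300,000 SELECTs on top of the INSERTs, which would explain the 15-second vs. 3-minute gap.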

    I rewrote the ETL transformation using the Single Threader step so that it runs in the single-threaded engine, but that did not help much with performance.

    How can I improve the performance of the above transformation? Please help.

    The above operation also needs to be carried out for multiple client records stored in a Clients table. Each client has a source table named clientname_source.

    The main job has been designed as:

    Start --> Get_client_records --> (execute for every input row) --> Fact-loading transformation --> End

    If there are 100 client records, the process loops sequentially 100 times.
    How do I go about executing the transformation in parallel for different clients to get a performance boost?
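    Conceptually, I imagine something like the following (a rough Java sketch of running the per-client loads concurrently; loadFactsForClient is a made-up placeholder for launching the transformation, not a PDI API):

        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;

        class ParallelClientsSketch {

            // Made-up placeholder: run the fact-loading transformation
            // against the clientname_source table of one client.
            static void loadFactsForClient(String clientName) {
                // ... launch the transformation for this client ...
            }

            public static void main(String[] args) throws InterruptedException {
                // Client names as they would come from the Clients table (made up here).
                List<String> clients = List.of("acme", "globex", "initech");
                ExecutorService pool = Executors.newFixedThreadPool(4); // tune to DB capacity
                for (String client : clients) {
                    pool.submit(() -> loadFactsForClient(client));
                }
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
            }
        }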

    Thanks,
    shraj

  2. #2
    Join Date: Nov 1999 | Posts: 9,729


    Talend loads all the data in memory; make sure you do the same thing.

  3. #3
    Join Date: Aug 2011 | Posts: 2


    Thanks, Matt, for the quick response.

    But how do we load all the data in memory?

    I tried the "Load all data from table" option in the Database Lookup step, which pre-loads the cache with all the data present in the lookup table. But the comparisons came out wrong, mainly because of differences in the way the database and the Java engine compare values. How can we work around this?
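    One idea I had: MySQL's default collations compare strings case-insensitively and ignore trailing spaces, while the Java-side cache compares values exactly, so perhaps normalizing the key the same way on both sides (e.g. UPPER/TRIM in the lookup SQL and the same operation on the stream) would make the in-memory comparison behave like the database. A rough sketch of the Java-side normalization (assuming a case-insensitive, trailing-space-ignoring collation):

        import java.util.Locale;

        class KeyNormalizer {
            // Normalize keys identically on the stream side and the cached lookup
            // side, so the in-memory comparison mimics the database collation.
            static String normalizeKey(String key) {
                if (key == null) return null;
                return key.stripTrailing().toUpperCase(Locale.ROOT);
            }
        }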

    Is there any other way of doing that?

    Thanks
    shraj

  4. #4
    Join Date: Nov 1999 | Posts: 9,729


    That's impossible for me to know.
