Hitachi Vantara Pentaho Community Forums

Thread: optimization with spoon and postgresql

  1. #1
    Join Date
    Aug 2014

    Default optimization with spoon and postgresql


    I'm trying to optimize a process that starts from a CSV of ~2,000,000 rows. I can import it into a PostgreSQL table in about 1 minute with the PostgreSQL bulk loader (it's fantastic). Inside PostgreSQL I then use a function that reads each record with a cursor and updates one field:

    total_rows CURSOR FOR SELECT id, state, town, client_type, consume FROM la.clients;
    FOR tr IN total_rows LOOP
      PERFORM la.update_band(tr.id, tr.state::text, tr.town::text, tr.consume, tr.client_type);
    END LOOP;
    The la.update_band function has something like:

    IF client_type = 1 THEN
          SELECT INTO band_assined * FROM la.bands WHERE (consume BETWEEN banda_low AND banda_up) AND type_of_client = client_type AND band_state like '%' || state || '%' AND band_town like '%' || town || '%';                                        
    END IF;
    ... (and there are similar IFs for each client_type; there are only 4 of them). At the end of the update_band function, once a band has been selected for the client, it updates the table la.clients with it:
    UPDATE la.clients SET co_band = band_assined.co_band WHERE id = client_id;

    The method above works, but it's -too slow-: 100,000 rows take about 2 hours, 600,000 rows about 11 hours, and the 1,200,000-row run is still going right now (16 hours and counting...). The function la.update_band appears to be the bottleneck. The table la.bands has only 150 rows (it's small) and is used to select the band for each client based on state, town, client type and consumption.

    The question is how to reduce the processing time. I think it could be faster to use only Spoon (not the DBMS) to assign the client band before uploading to PostgreSQL with the bulk loader, but I don't know how to manage (or map) the table la.bands in Spoon so I can avoid the per-row SELECT against the database and work only with the CSV. I'm also not sure whether it would actually be faster.

    Any ideas? Thanks for your help and have a nice day.

    Last edited by mga; 11-18-2014 at 10:59 AM.

  2. #2
    Join Date
    Feb 2011


    I think your problem is that you are doing an update for every row. That means a TON of transactions. Once the data is loaded, you may be able to write a single UPDATE statement using CASE in it and modify all rows as 1 transaction.
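    To illustrate the idea, here is a minimal sketch of a single set-based UPDATE that picks the band with a correlated subquery instead of calling a function per row. The table layouts and sample data are made up for illustration (only the column names from the thread are reused), and it uses Python's sqlite3 as a self-contained stand-in for PostgreSQL — the same UPDATE shape works in Postgres:

    ```python
    import sqlite3

    # In-memory stand-in for the PostgreSQL tables; columns follow the thread,
    # the data is invented for the example.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE clients (id INTEGER PRIMARY KEY, state TEXT, town TEXT,
                          client_type INTEGER, consume REAL, co_band TEXT);
    CREATE TABLE bands (co_band TEXT, banda_low REAL, banda_up REAL,
                        type_of_client INTEGER, band_state TEXT, band_town TEXT);
    INSERT INTO clients VALUES (1, 'CA', 'Springfield', 1, 50.0, NULL),
                               (2, 'CA', 'Springfield', 1, 500.0, NULL);
    INSERT INTO bands VALUES ('B1', 0, 100, 1, 'CA', 'Springfield'),
                             ('B2', 100, 1000, 1, 'CA', 'Springfield');
    """)

    # One set-based UPDATE instead of a per-row cursor loop: the band is chosen
    # by a correlated subquery, and every row changes in a single transaction.
    conn.execute("""
    UPDATE clients
    SET co_band = (SELECT b.co_band FROM bands b
                   WHERE clients.consume BETWEEN b.banda_low AND b.banda_up
                     AND b.type_of_client = clients.client_type
                     AND b.band_state LIKE '%' || clients.state || '%'
                     AND b.band_town  LIKE '%' || clients.town  || '%')
    """)
    conn.commit()

    print(conn.execute("SELECT id, co_band FROM clients ORDER BY id").fetchall())
    # → [(1, 'B1'), (2, 'B2')]
    ```

    With only 150 rows in la.bands, the planner can keep the lookup cheap, and you avoid two million round trips through PL/pgSQL.
    
    
    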

  3. #3
    njain111 Guest


    I agree that it should be much faster if you do your band lookup logic in PDI/Spoon instead of Postgres.

    After your CSV Input step, you can use a Switch/Case step on the client type.
    Each client type can have its own Database Lookup step; in that step you look up the band, and make sure to check 'Load all data from table' so the small la.bands table is cached in memory.

    Then join the outputs of the various lookups with a Dummy step and feed the result to the bulk loader.
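    In plain code, the approach above amounts to loading the 150-row bands table into memory once and matching each CSV row against it before bulk loading — which is roughly what the cached Database Lookup step does. A minimal sketch, with invented sample data and a hypothetical assign_band helper that mirrors the matching rules in la.update_band:

    ```python
    import csv
    import io

    # Hypothetical in-memory copy of la.bands, loaded once (the equivalent of
    # checking 'Load all data from table' in the Database Lookup step).
    bands = [
        # (co_band, banda_low, banda_up, type_of_client, band_state, band_town)
        ("B1", 0, 100, 1, "CA", "Springfield"),
        ("B2", 100, 1000, 1, "CA", "Springfield"),
    ]

    def assign_band(state, town, client_type, consume):
        """Replicate la.update_band's matching against the cached band list."""
        for co_band, low, up, b_type, b_state, b_town in bands:
            # 'state in b_state' mirrors band_state LIKE '%' || state || '%'
            if (low <= consume <= up and b_type == client_type
                    and state in b_state and town in b_town):
                return co_band
        return None

    # Stream the client CSV and attach the band before bulk loading.
    csv_text = ("id,state,town,client_type,consume\n"
                "1,CA,Springfield,1,50\n"
                "2,CA,Springfield,1,500\n")
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["co_band"] = assign_band(row["state"], row["town"],
                                     int(row["client_type"]),
                                     float(row["consume"]))
        rows.append(row)

    print([(r["id"], r["co_band"]) for r in rows])
    # → [('1', 'B1'), ('2', 'B2')]
    ```

    Because the lookup table is tiny and cached, each row costs a handful of in-memory comparisons instead of a database round trip, so two million rows become a streaming pass.
    
    
    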


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.