Hitachi Vantara Pentaho Community Forums

Thread: transformation gets progressively slower.

  1. #1
    Join Date
    Oct 2007
    Posts
    10

    Default transformation gets progressively slower.

    I'm transforming a 50 million row table to a file. I only have to do this once, and technically I can live with the speed issue. It seems that the transformation starts off fast and then gets progressively slower. The CPU utilization is 100%. Within this transformation I'm doing about 6 dimension lookups on a Postgres database. I'm using version 3.0.

    It's not a huge deal; as I said, I will only have to transform the table once, and then the updates will involve a very small number of rows. I was just wondering if I'm missing some type of setting or if there is something I could be doing differently.

  2. #2
    Join Date
    Oct 2007
    Posts
    10

    Default

    So, my transformation crawled to a stop at around 14 million rows. It started off doing about 20K rows per second. One thing I noticed is that on the combination lookup dimension the following query is run:

    SELECT column_sk FROM dim_table WHERE ( ( column_bk = 'junk' ) OR ( column_bk IS NULL AND 'junk' IS NULL ) );

    This is going to result in a sequential scan instead of an index scan on column_bk, even with the default btree index in place, because the OR'd IS NULL check prevents the planner from using the index.

    Query cost (after running it several times):
    Total runtime: 133.765 ms

    If I remove the OR ( column_bk IS NULL AND 'junk' IS NULL ) clause:

    Total runtime: 0.109 ms

    Any workaround? Is this a bug?
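
    For reference, the difference can be checked with EXPLAIN ANALYZE. This is only a sketch reusing the table and column names from the query above; the actual plans depend on your data and setup:

    -- the statement Kettle generates (null-safe lookup); on a NOT NULL business key
    -- this typically ends up as a sequential scan
    EXPLAIN ANALYZE
    SELECT column_sk FROM dim_table
    WHERE ( ( column_bk = 'junk' ) OR ( column_bk IS NULL AND 'junk' IS NULL ) );

    -- the plain equality lookup, which can use a btree index on column_bk
    EXPLAIN ANALYZE
    SELECT column_sk FROM dim_table
    WHERE column_bk = 'junk';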

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I don't think an index is scanned sequentially ;-) It's probably the table.
    Make sure your database "sees" the index if there is one (analyze the table, vacuum, that sort of thing).

    Finally, there is an alternative: the hashcode option. We create a hashcode and store it in the database.
    An index is put on that column, and you should see improvements, especially if you have larger numbers of columns to junk away.
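
    A minimal sketch of what that looks like on the database side (the table and column names here are just illustrative, and the hash value itself is computed by Kettle, not by the database):

    CREATE TABLE dim_junk (
        junk_sk   BIGSERIAL PRIMARY KEY,
        hashcode  BIGINT NOT NULL,    -- hash over all the junk attributes, filled in by the ETL step
        attr_1    VARCHAR(50),
        attr_2    VARCHAR(50)
    );
    CREATE INDEX idx_dim_junk_hashcode ON dim_junk (hashcode);

    -- lookup goes through the small hashcode index first,
    -- then the real columns are compared to weed out hash collisions
    SELECT junk_sk
    FROM dim_junk
    WHERE hashcode = 123456789
      AND attr_1 = 'foo'
      AND attr_2 = 'bar';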

    All the best,

    Matt

  4. #4
    Join Date
    Oct 2007
    Posts
    10

    Default

    Sorry, of course I meant the table was sequentially scanned. My point was that Kettle was using an OR in the SQL statement to detect null values (the column is declared as NOT NULL). This will result in a sequential scan instead of an index scan, regardless of whether the index is there and analyzed. In this case a sequential scan is much slower than an index scan. If the dimension were smaller it wouldn't matter as much.

    I will of course look into the hashcode option. I'll post my results. Thanks, Matt.

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Well, RDBMSs like Oracle are smart enough to use the index on "column_bk" anyway, so it depends on the optimizer whether the index is used or not.
    However, even in that case, you run into trouble in situations like these:

    1) You have a very large junk dimension.
    I once had to store millions upon millions of transactions with an address associated with them (no key).
    The index on all the fields in the table used up more space than the table itself (bitmap index), and it was slow to insert the data.

    2) You have a large number of columns in the junk dimension.
    Certain databases are limited to a maximum number of columns that can be placed in an index; on Oracle this was 32, IIRC.

    Both problematic situations can be solved by using a hashcode on the complete row, stored in the table.
    It's an optimization that "maps" the complete row (without the key) onto a 64-bit integer.
    Sure, you can and will have hash-misses, but that number is usually limited to 4 or 5 in a table with tens of millions of rows of data.
    That way, the database has to compare only a few rows of data besides the index.

    If your database has a problem even in this situation, you should consider switching to another RDBMS ;-)
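
    If you want to see how many hash-misses you actually have, something along these lines works (table and column names are illustrative, as in the sketch above):

    -- rows that share a hashcode with at least one other row;
    -- these are the only rows the database has to compare column-by-column
    SELECT hashcode, COUNT(*) AS rows_sharing_hash
    FROM dim_junk
    GROUP BY hashcode
    HAVING COUNT(*) > 1
    ORDER BY rows_sharing_hash DESC;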

    All the best,

    Matt

  6. #6
    Join Date
    Oct 2007
    Posts
    10

    Thumbs up

    The speed of the hash is great. Thanks. Small duplication errors are occurring that will have to be cleaned up manually, but it's worth it for the initial transformation.

    I'm not sure why it's missing the index... there is likely some logical reason. Maybe I'm overlooking something; I'll go back over it. Honestly, I don't see the point of null in these types of dimensions anyhow...
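
    In case it helps anyone doing the same cleanup, a rough sketch of how the duplicates could be spotted, reusing the dim_table / column_bk / column_sk names from the earlier query (assuming column_bk is the business key):

    -- business keys that received more than one surrogate key during the bulk load
    SELECT column_bk, COUNT(*) AS dup_count, MIN(column_sk) AS sk_to_keep
    FROM dim_table
    GROUP BY column_bk
    HAVING COUNT(*) > 1;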

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    For reference, the core of the problem is that databases never evaluate "null = null" as true.
    However, most databases don't store null values in an index.
    As such, it depends on the cleverness of the optimizer whether it makes sense to use the index or not.
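
    A quick illustration of that behaviour (PostgreSQL syntax):

    -- a plain equality never matches two NULLs
    SELECT NULL = NULL;             -- returns NULL ("unknown"), never true
    SELECT 1 WHERE NULL = NULL;     -- returns no rows
    -- the null-safe form Kettle generates does match when both sides are NULL
    SELECT 1 WHERE ( NULL = NULL ) OR ( NULL IS NULL AND NULL IS NULL );   -- returns one row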

    Matt
