Hitachi Vantara Pentaho Community Forums

Thread: Quickly denormalize a very large dataset & only retain unique rows

  1. #1
    Join Date
    Oct 2007
    Posts
    255

    Default Quickly denormalize a very large dataset & only retain unique rows

    I have a large source dataset with ~10-15 columns and, at least for the example I'm working with, ~30 million rows in a single ETL run. Real-world conditions would vary significantly, so I'm trying to make this as efficient as I can.

    Once the source data is stored in the target DB, I could extract the two columns of interest (fieldA and fieldB) and get the unique values using something akin to:

    Code:
    select fieldA as field_int
      from thetable
      group by fieldA
    union
    select fieldB as field_int
      from thetable
      group by fieldB
    I've tried this, and compared to doing it in a transformation it's very slow. So I've opted to add a separate path within the transformation, fed from the source system, whose purpose is to produce what the above query would. The whole transformation gets the job done in about the same time as it takes with this secondary path disabled, so I feel confident this is at least a good approach, if not the best one.

    What I'm looking for is advice on how best to streamline what I've accomplished so far. I'm using a 'sort rows' step to eliminate duplicate rows, then a denormalize step, then another 'sort rows' to eliminate duplicate entries for this dimension. I'm sorting twice because if I skip the first sort I end up with 60M+ rows by the time I reach the second one, whereas if I eliminate duplicates first I have about 1.5M rows, which denormalizes to 3M, and after the second sort the output is ~100,000 rows.

    Daniel Einspanjer has offered to outline a solution, and wanted this posted publicly so others could benefit.

    example.zip

    -Brian
    Last edited by Phantal; 09-08-2008 at 04:22 PM.

  2. #2
    DEinspanjer Guest

    Default

    Looking at the example, I'm really not sure what the purpose of either your SQL above or the denormalize step truly is.

    It *looks* like the fieldA is some sort of key and the fieldB is some sort of value. Currently I believe your denormalize step doesn't work at all since it never puts any of the fieldB values into field_int. Do you really want a unioned list of all the fieldA and fieldB values?

  3. #3
    DEinspanjer Guest

    Default

    Here is an example of using a JavaScript step to cache the values in a HashSet, which will result in a unique list.

    Hope it helps.
    Attached Files
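
    In rough outline, the attached transformation does something like the following. This is only a sketch: it assumes the two input columns are named fieldA and fieldB as above and that the new output field is declared on the JavaScript step's Fields tab; the real example may differ in its details.

    Init tab (runs once, before the first row):
    Code:
    // One HashSet shared across all rows; add() ignores values it has already seen.
    var set = new java.util.HashSet();
    Process rows tab:
    Code:
    // Collect every fieldA and fieldB value; the set silently drops duplicates.
    set.add(new java.lang.Long(fieldA));
    set.add(new java.lang.Long(fieldB));

    // The incoming rows themselves are not passed downstream.
    trans_Status = SKIP_TRANSFORMATION;
    A script that runs after the last row then walks the set and emits one row per distinct value.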

  4. #4
    Join Date
    Oct 2007
    Posts
    255

    Default asdf

    Daniel,

    Thank you. I haven't checked in the last few minutes, but the one I attached should've done what I needed -- that is, fieldA & fieldB both contain the same type of value, and I just wanted to produce a unique row for any value contained in either field, ignoring everything else. I'm fairly certain it worked before I uploaded it, so if it doesn't work, well, you get the gist.

    -Brian

  5. #5
    Join Date
    Oct 2007
    Posts
    255

    Default asdf

    Daniel,

    That is just cool. I had no idea some of that stuff was possible in javascript steps.

    I made a few changes, the most notable being that I eliminated the script that outputs the rows when the script is finished, and added an if statement that only adds a new row when set.add() returns true:

    Code:
    // fieldA and fieldB hold the same kind of value, so treat them as one stream.
    temp = [new java.lang.Long(fieldA), new java.lang.Long(fieldB)];

    for (i = 0; i < 2; i++) {
      // set.add() returns true only the first time a value is seen.
      if (set.add(temp[i])) {
        newrow = createRowCopy(getOutputRowMeta().size());
        newrow[getInputRowMeta().size()] = temp[i];
    //    newrow = [null, null, temp[i]];
        putRow (newrow);
      }
    }

    // Never pass the original input row through.
    trans_Status = SKIP_TRANSFORMATION;
    I used createRowCopy, getOutputRowMeta, and getInputRowMeta because this transformation uses a stored procedure to extract data from the database instead of a normal select statement, and Kettle isn't able to get field metadata from stored procedures, so the JavaScript step doesn't play well with this data. When I was using [null, null, temp[i]], I received NullPointerExceptions, an ArrayIndexOutOfBoundsException, and I think one or two others, so I decided to construct the new row dynamically.
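
    (If I understand the step's API correctly, getOutputRowMeta().size() already counts the extra output field declared on the Fields tab, so the copy that createRowCopy() returns has one empty slot at index getInputRowMeta().size() where the new value can go.)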

    -Brian
    Last edited by Phantal; 09-09-2008 at 04:00 PM.

  6. #6
    DEinspanjer Guest

    Default Something similar

    Working off of what I did in this thread, I wrote something similar to serve as a unique-rows step.

    It isn't quite the same because I didn't need to evaluate the union of two separate fields for uniqueness, which makes this one a lot simpler.

    My javascript step has a main tab and an init tab.

    Process Rows tab:
    Code:
    // Wrap the current row so the HashSet hashes and compares it via Key (defined on the Init Set tab).
    var key = new java.lang.Comparable(new Key(row));

    // Keep the row only the first time its key is seen; skip later duplicates.
    trans_Status = set.add(key) ? CONTINUE_TRANSFORMATION : SKIP_TRANSFORMATION;
    Init Set tab:
    Code:
    // Lives for the whole run of the step; holds one Key per distinct row seen so far.
    var set = new java.util.HashSet();

    // Wraps a row so the HashSet can hash and compare it through the row metadata.
    function Key(keyRow)
    {
        this.row = keyRow;
        this.inputRowMeta = getInputRowMeta();
        this.hash = this.inputRowMeta.hashCode(keyRow);
    }

    Key.prototype.equals = function(other)
    {
        // Compare only the key fields (field indexes 0 and 1 here).
        return this.inputRowMeta.equals(this.row, other.getMyRow(), [0,1]);
    }
    Key.prototype.hashCode = function()
    {
        return this.hash;
    }
    Key.prototype.getMyRow = function()
    {
        return this.row;
    }
    Key.prototype.compareTo = function(other)
    {
        return this.inputRowMeta.compare(this.getMyRow(), other.getMyRow());
    }
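
    The java.lang.Comparable wrapper is there so the HashSet sees a Java object whose equals(), hashCode(), and compareTo() come from the Key functions above, comparing rows through the row metadata rather than by plain array identity. The [0,1] array is the list of field indexes that make up the key, so that's the part to change if your key columns are different.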

  7. #7
    Join Date
    Oct 2007
    Posts
    255

    Default asdf

    Back when I was working on this, I ended up redoing it with existing steps in Kettle and got much better performance than the JavaScript step with the hash table(s), so at least as far as that particular problem goes, there's a much faster way to do it.

    Aside from that, the script(s) you have there are interesting; if I get a few minutes today I feel compelled to tinker.

    I ended up normalizing the table, adding a 'type' column to indicate whether each value was a MAC or an IP address, after stripping duplicate rows. Then I sorted by (key, type), telling the sort step to remove duplicates. Treating the IP & MAC as the same field sped this up tremendously.

    I've been wanting to tinker with steps in Kettle, either adding my own or modifying an existing one; maybe I'll investigate a good way to do this through Java.

    -Brian
    Last edited by Phantal; 10-06-2008 at 10:24 AM.
