Hitachi Vantara Pentaho Community Forums
Results 1 to 14 of 14

Thread: Calculating Rank and/or DenseRank in Kettle

  1. #1
    Join Date
    Oct 2006
    Posts
    8

    Default Calculating Rank and/or DenseRank in Kettle

    I may be trying to do something in PDI that I should be doing elsewhere.

    Quick version of my question: Is it possible to calculate Rank on a sorted data set in PDI? If so, why can't I figure it out?

    Longer explanation:

    I have a table of detail transaction records in a database. In a PDI transformation, I aggregate that data to get Average Daily Transaction Count by User (one row per User).

    Then I sort the data descending on Average Transaction Count so that I can apply a rank to each user.

    The Rank is the part I can't figure out. Adding a sequence doesn't work because it doesn't handle "ties" correctly. I tried getProcessCount('w') in a JavaScript transformation, but I can't figure out how to handle ties.

    I feel like this should be easier but I also think that PDI may not be the right place since this calculation is more of a SQL function: RANK() OVER( ORDER BY ... DESC).

    Is there a way to calculate Rank in PDI? Or am I trying to hammer with a saw?

    Thanks,
    Nathan

  2. #2

    Default

    I have an idea, but I can't guarantee this is the best solution, or even if it works at all!

    So if your data is sorted, and you used a Sequence, you should try the LAG or LEAD funciton of Analytic Query step. Use e.g LAG 1 row based on your sequence. Than you can compare if your metricts are equal (with a Filter step). Then you can reduce the sequence.

    Hmmm. Now I realy wonder if it can work at all.

  3. #3
    Join Date
    Oct 2007
    Posts
    107

    Default

    Hello Nathan,

    You can easily achieve this by using the Group By step. You need to make sure that your data is properly sorted (the way you want your rank to occurs) either within your SQL statement or by using the SORT step. Pass the stream to the Group By where you need to check the "Include all rows", "Add line number, restart in each group" and specify a field name for your rank within the "Line number field name".
    In the group field section, specify the fields you want your Rank over by to occurs on.
    Simple as that :-)
    Have fun.

    Regards,
    C.

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I thought about that one Chamel, but it doesn't take into account identical values. So think Nathan wants to see identical ranks for identical values in the group.
    I guess we could need some kind of ranking method in the group by step. In the mean time perhaps you can write a small script to create the rank.

  5. #5
    Join Date
    Oct 2006
    Posts
    8

    Default

    Yes, line numbers do not take ties into account. And the LAG to compare to the prior value doesn't take into account a three-way tie (or four, or five, or 50 which is very likely in my case for the lowest value of zero).

    If I need to write a script, are you suggesting a JavaScript script? Or a different kind of script?

    I think I can build a JavaScipt step that uses a variable and incrementing it by 1 or 0 combined with either another counter value or getProcessCount('w').

    Do variables in JavaScript step exit beyond the row that it is working on? In other words, if I say:

    if (getProcessCount('w') == 0 ) {
    i = 0;
    } else {
    i++
    }

    Will the value of i carry on from row to row? You know what... scratch that question. I'll just try it out and post back the results.


    I'll post the solution if I get it figured out.

  6. #6
    Join Date
    Oct 2006
    Posts
    8

    Default

    OK. Here is one way to rank in PDI 4.1. Keep in mind that I'm not really an ETL specialist, so this may be a lot uglier than it needs to be. It takes 4 Steps.

    I think Rank/DenseRank should be built into PDI as a Step. It is essentially a sequence that repeats if a row value(s) matches a prior row value(s).

    To answer my earlier question about variables being remembered from step to step... yes, they are remembered. Just be careful how you initialize the variable so you don't erase the value.

    First Step, Sort your data in the order you want to rank.

    Second Step, add a Sequence step to provide your initial Rank.

    If you know you won't have any "ties" then you're done (That was easy). But if you might have a tie, read on.

    Third Step, add an Analytic Query step to get the value(s) 1 row back (LAG where N = 1).

    Last Step, add a JavaScript step with the code to compare the current metric value(s) to the prior metric value(s). It should look something like this:


    //Script here

    var initRank = Rank_Metric; // From Sequence Step
    var priorRowMetric = Prior_Metric_for_ranking;
    var curRowMetric = Metric_for_ranking;

    if (getProcessCount('w') == 0 ) {
    var tieCounter = 0;
    }

    if ( Math.round(curRowMetric * 10000) == Math.round(priorRowMetric * 10000) ) {
    // Have to Round because Number data types don't want to match even when they do

    tieCounter++;
    } else {
    tieCounter = 0;
    };

    var Rank_Metric = initRank - tieCounter;

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    So you found the answer that variables in JS are global and not discarded after each step.
    I'm glad you found a solution. Do others a favor though and file a feature request in JIRA (http://jira.pentaho.org) for the ranking functionality in the "Group By" step.

    Thanks in advance,

    Matt

  8. #8
    Join Date
    Oct 2006
    Posts
    8

    Default

    Rank / Dense Rank added as a new feature request to the Pentaho Jira site.

    http://jira.pentaho.com/browse/PDI-6425

  9. #9
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Thanks!

  10. #10
    Join Date
    Apr 2011
    Posts
    6

    Default

    Hi,

    what do you think of this approach :

    - Sort rows
    - Get unique rows and redirect duplicates
    - Add sequence to unique rows
    - Lookup the rank for duplicate from unique rows rank
    - Merge streams

    See this sample : http://cyberdust.free.fr/pentaho/ran..._duplicate.ktr

  11. #11
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Great idea.

  12. #12
    Join Date
    Dec 2013
    Posts
    2

    Default How is Parallelism achieved

    Hi,

    I was going through the forums to find a way to find dense rank and have landed here.

    I am new to Pentaho and JavaScript concepts but am aware of ETL processes as such.

    The logic you are explaining is clear to me but I am unable to understand how Pentaho achieves this using parallelism. In other words, wouldn't every record be processed 'one at a time' and not 'many at a time' and thus reduce the throughput?

    Thanks
    Pearl.Shr

    Quote Originally Posted by nwenzel View Post
    OK. Here is one way to rank in PDI 4.1. Keep in mind that I'm not really an ETL specialist, so this may be a lot uglier than it needs to be. It takes 4 Steps.

    I think Rank/DenseRank should be built into PDI as a Step. It is essentially a sequence that repeats if a row value(s) matches a prior row value(s).

    To answer my earlier question about variables being remembered from step to step... yes, they are remembered. Just be careful how you initialize the variable so you don't erase the value.

    First Step, Sort your data in the order you want to rank.

    Second Step, add a Sequence step to provide your initial Rank.

    If you know you won't have any "ties" then you're done (That was easy). But if you might have a tie, read on.

    Third Step, add an Analytic Query step to get the value(s) 1 row back (LAG where N = 1).

    Last Step, add a JavaScript step with the code to compare the current metric value(s) to the prior metric value(s). It should look something like this:


    //Script here

    var initRank = Rank_Metric; // From Sequence Step
    var priorRowMetric = Prior_Metric_for_ranking;
    var curRowMetric = Metric_for_ranking;

    if (getProcessCount('w') == 0 ) {
    var tieCounter = 0;
    }

    if ( Math.round(curRowMetric * 10000) == Math.round(priorRowMetric * 10000) ) {
    // Have to Round because Number data types don't want to match even when they do

    tieCounter++;
    } else {
    tieCounter = 0;
    };

    var Rank_Metric = initRank - tieCounter;

  13. #13
    Join Date
    Apr 2007
    Posts
    2,010

    Default

    an individual step is not typically parallelised - if it is then you have to use "partitioning" to work around these issues - essentially ensuring all the right records go to the right thread.

  14. #14
    Join Date
    Dec 2013
    Posts
    2

    Default

    Thanks codek!

    Quote Originally Posted by codek View Post
    an individual step is not typically parallelised - if it is then you have to use "partitioning" to work around these issues - essentially ensuring all the right records go to the right thread.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.