# Thread: Calculating Rank and/or DenseRank in Kettle

1. Junior Member
Join Date
Oct 2006
Posts
8

## Calculating Rank and/or DenseRank in Kettle

I may be trying to do something in PDI that I should be doing elsewhere.

Quick version of my question: Is it possible to calculate Rank on a sorted data set in PDI? If so, why can't I figure it out?

Longer explanation:

I have a table of detail transaction records in a database. In a PDI transformation, I aggregate that data to get Average Daily Transaction Count by User (one row per User).

Then I sort the data descending on Average Transaction Count so that I can apply a rank to each user.

The Rank is the part I can't figure out. Adding a sequence doesn't work because it doesn't handle "ties" correctly. I tried getProcessCount('w') in a JavaScript transformation, but I can't figure out how to handle ties.

I feel like this should be easier but I also think that PDI may not be the right place since this calculation is more of a SQL function: RANK() OVER( ORDER BY ... DESC).

Is there a way to calculate Rank in PDI? Or am I trying to hammer with a saw?

Thanks,
Nathan

2. Member
Join Date
Aug 2009
Posts
56
I have an idea, but I can't guarantee this is the best solution, or even if it works at all!

So if your data is sorted, and you used a Sequence, you should try the LAG or LEAD funciton of Analytic Query step. Use e.g LAG 1 row based on your sequence. Than you can compare if your metricts are equal (with a Filter step). Then you can reduce the sequence.

Hmmm. Now I realy wonder if it can work at all.

3. Senior Member
Join Date
Oct 2007
Posts
107
Hello Nathan,

You can easily achieve this by using the Group By step. You need to make sure that your data is properly sorted (the way you want your rank to occurs) either within your SQL statement or by using the SORT step. Pass the stream to the Group By where you need to check the "Include all rows", "Add line number, restart in each group" and specify a field name for your rank within the "Line number field name".
In the group field section, specify the fields you want your Rank over by to occurs on.
Simple as that :-)
Have fun.

Regards,
C.

4. Chief Data Integration
Join Date
Nov 1999
Posts
9,729
I thought about that one Chamel, but it doesn't take into account identical values. So think Nathan wants to see identical ranks for identical values in the group.
I guess we could need some kind of ranking method in the group by step. In the mean time perhaps you can write a small script to create the rank.

5. Junior Member
Join Date
Oct 2006
Posts
8
Yes, line numbers do not take ties into account. And the LAG to compare to the prior value doesn't take into account a three-way tie (or four, or five, or 50 which is very likely in my case for the lowest value of zero).

If I need to write a script, are you suggesting a JavaScript script? Or a different kind of script?

I think I can build a JavaScipt step that uses a variable and incrementing it by 1 or 0 combined with either another counter value or getProcessCount('w').

Do variables in JavaScript step exit beyond the row that it is working on? In other words, if I say:

if (getProcessCount('w') == 0 ) {
i = 0;
} else {
i++
}

Will the value of i carry on from row to row? You know what... scratch that question. I'll just try it out and post back the results.

I'll post the solution if I get it figured out.

6. Junior Member
Join Date
Oct 2006
Posts
8
OK. Here is one way to rank in PDI 4.1. Keep in mind that I'm not really an ETL specialist, so this may be a lot uglier than it needs to be. It takes 4 Steps.

I think Rank/DenseRank should be built into PDI as a Step. It is essentially a sequence that repeats if a row value(s) matches a prior row value(s).

To answer my earlier question about variables being remembered from step to step... yes, they are remembered. Just be careful how you initialize the variable so you don't erase the value.

First Step, Sort your data in the order you want to rank.

If you know you won't have any "ties" then you're done (That was easy). But if you might have a tie, read on.

Third Step, add an Analytic Query step to get the value(s) 1 row back (LAG where N = 1).

Last Step, add a JavaScript step with the code to compare the current metric value(s) to the prior metric value(s). It should look something like this:

//Script here

var initRank = Rank_Metric; // From Sequence Step
var priorRowMetric = Prior_Metric_for_ranking;
var curRowMetric = Metric_for_ranking;

if (getProcessCount('w') == 0 ) {
var tieCounter = 0;
}

if ( Math.round(curRowMetric * 10000) == Math.round(priorRowMetric * 10000) ) {
// Have to Round because Number data types don't want to match even when they do

tieCounter++;
} else {
tieCounter = 0;
};

var Rank_Metric = initRank - tieCounter;

7. Chief Data Integration
Join Date
Nov 1999
Posts
9,729
So you found the answer that variables in JS are global and not discarded after each step.
I'm glad you found a solution. Do others a favor though and file a feature request in JIRA (http://jira.pentaho.org) for the ranking functionality in the "Group By" step.

Matt

8. Junior Member
Join Date
Oct 2006
Posts
8
Rank / Dense Rank added as a new feature request to the Pentaho Jira site.

http://jira.pentaho.com/browse/PDI-6425

9. Chief Data Integration
Join Date
Nov 1999
Posts
9,729
Thanks!

10. Junior Member
Join Date
Apr 2011
Posts
6
Hi,

what do you think of this approach :

- Sort rows
- Get unique rows and redirect duplicates
- Add sequence to unique rows
- Lookup the rank for duplicate from unique rows rank
- Merge streams

See this sample : http://cyberdust.free.fr/pentaho/ran..._duplicate.ktr

11. Chief Data Integration
Join Date
Nov 1999
Posts
9,729
Great idea.

12. Junior Member
Join Date
Dec 2013
Posts
2

## How is Parallelism achieved

Hi,

I was going through the forums to find a way to find dense rank and have landed here.

I am new to Pentaho and JavaScript concepts but am aware of ETL processes as such.

The logic you are explaining is clear to me but I am unable to understand how Pentaho achieves this using parallelism. In other words, wouldn't every record be processed 'one at a time' and not 'many at a time' and thus reduce the throughput?

Thanks
Pearl.Shr

Originally Posted by nwenzel
OK. Here is one way to rank in PDI 4.1. Keep in mind that I'm not really an ETL specialist, so this may be a lot uglier than it needs to be. It takes 4 Steps.

I think Rank/DenseRank should be built into PDI as a Step. It is essentially a sequence that repeats if a row value(s) matches a prior row value(s).

To answer my earlier question about variables being remembered from step to step... yes, they are remembered. Just be careful how you initialize the variable so you don't erase the value.

First Step, Sort your data in the order you want to rank.

If you know you won't have any "ties" then you're done (That was easy). But if you might have a tie, read on.

Third Step, add an Analytic Query step to get the value(s) 1 row back (LAG where N = 1).

Last Step, add a JavaScript step with the code to compare the current metric value(s) to the prior metric value(s). It should look something like this:

//Script here

var initRank = Rank_Metric; // From Sequence Step
var priorRowMetric = Prior_Metric_for_ranking;
var curRowMetric = Metric_for_ranking;

if (getProcessCount('w') == 0 ) {
var tieCounter = 0;
}

if ( Math.round(curRowMetric * 10000) == Math.round(priorRowMetric * 10000) ) {
// Have to Round because Number data types don't want to match even when they do

tieCounter++;
} else {
tieCounter = 0;
};

var Rank_Metric = initRank - tieCounter;

13. Senior Member
Join Date
Apr 2007
Posts
2,010
an individual step is not typically parallelised - if it is then you have to use "partitioning" to work around these issues - essentially ensuring all the right records go to the right thread.

14. Junior Member
Join Date
Dec 2013
Posts
2
Thanks codek!

Originally Posted by codek
an individual step is not typically parallelised - if it is then you have to use "partitioning" to work around these issues - essentially ensuring all the right records go to the right thread.

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•