Thread: Dimensional modeling problem - multivalued dimensions and avg aggregation function

1. Junior Member
Join Date
Dec 2016
Posts
19

Dimensional modeling problem - multivalued dimensions and avg aggregation function

Hello,

I'm facing a problem regarding dimensional modeling, in which I need to model a cube that has a multivalued dimension, and a fact measure whose aggregation function is the Average. I'll give a detailed explanation of my problem.

l'm using the pentaho bi-server and the saiku analytics plugin to model data coming from a school. The problem is quiet simple: I need to store which grade every student has received for each course they have taken. So the dimensions are student and course, and the measure of the fact table is the grade. The aggregation function to the grade measure is the average function. This way, if I filter only by students, I get the overall average grade of that student, considering all courses. The problem is the requirement for a third dimension: hobbies.

According to the business rules, one student can have multiple hobbies, and there is no fixed static number of hobbies. This is a problem of multivalued dimension (http://www.kimballgroup.com/2014/05/...table-detours/).

As I need to provide the possibility for the user to drill-down per hobby, I can't just use a single string to describe all hobbies for a student like this: "Gaming, Sports, Reading". For example, the use may need to know the average grade of students whose hobbies includes Gaming, and compare that grade with the average grade obtained by students that enjoy Reading.

Does anyone know how to solve this problem?

Edit:I just forgot to write that I tried the bridge solution described in the link mentioned, and modeled the fact table using a weight measure column. However the average aggregation function does not work as properly as the sum function, when the user performs drill-down operations or filtering for 2 or 3 specific hobbies. I've attached two images, one that describes the bridge table configuration, and other that shows a constructed view, necessary for analyzing using the Saiku Plugin, in the pentaho bi-server platform.

Bridge table configuration:

View corresponding to the join of bridge table

Thank you,
Bruno
Last edited by amaralbf; 03-10-2017 at 05:04 PM.

2. Junior Member
Join Date
Feb 2017
Posts
18
Hello Bruno,

you can try to model the Bridge-Table as a parent-child hierarchy in Mondrian.
http://mondrian.pentaho.com/document...ld_hierarchies Here you can see an example where the "employee" is linked to his parent the "supervisor". I guess it would be easy to apply this schema to link the group/bridge table to their respective data or to say to get the hobby table into a parent-child relationship with the student table.
If you want to know how to build a bridge table with Pentaho DI look at this link here:
Be aware that especially with Bridge Tables and your measurements you must prevent users from joining the tables all together from outside to inside. That can leave you with wrong results.

With kind regards

underscores

3. Junior Member
Join Date
Dec 2016
Posts
19

You're saying that I should try to model my student dimension this way?:

Hierarchy:
student_name -> hobby_name

This way, my dimension would look like this:

dimension_id | student_id | student_name | hobby_name
-------------------------------------------------------------------------------------------
1 1 Bob Games
2 1 Bob Playing Guitar
3 2 Paul Reading
4 3 Lilian Games

The problem with this approach is that using the Saiku Analytics plugin, I don't think I can aggregate the fact measure over one or more sublevel values through the graphical interface. For example, the user could not perform the following query using only the graphical interface of Saiku Analytics: give me the average grade of students that enjoy gaming [and playing the guitar]. Can this be done using mondrian queries?

I think one alternative to this problem would be creating a new dimension just for Hobby, which results in my initial example, with the thinner granularity fact table (grade per student per hobby).

Edit: another question: if I got it right, and the dimension table actually looks like that, wouldn't the granularity of the fact table remain the same (grade per student per hobby), since a single tuple of the student dimension no longer represents one student, but a hobby of a student?

And I didn't understand the following statement:

Be aware that especially with Bridge Tables and your measurements you must prevent users from joining the tables all together from outside to inside
Thanks!

Best regards,
Bruno
Last edited by amaralbf; 03-14-2017 at 03:30 PM.

4. Junior Member
Join Date
Dec 2016
Posts
19

Problem with AVG aggregate function and multi-valued dimension

This is a better explained and more focused on the problem replacement for my other post (http://forums.pentaho.com/showthread...ation-function). I apologize for posting twice about the same problem, but in this one I'll expose the problem much more clearly.

The problem is: I have the grade that every student from a school has received for each course he/she has taken. One small example is shown below:

Bob received grade 7 for the Math course.
Bob received grade 3 for the Physics course.
Lilian received grade 8 for the Math course.

The solution for this problem is a simple star-schema with 1 fact table and the student and course dimension tables.

However, each student can have none, one or more hobbies. For the rest of the post, I'll assume my initial solution, which is to change the fact table granularity from "per student per course" to "per student per hobby per course".

This way, my fact table look like this:

My first attempt was to simply replicate the grade that every student has received for every hobby he/she has. The aggregate function for this measure is the average.

This first attempt works when I try to query the average grade received per hobby:

For example, for the Gaming hobby, there is a grade 7 for Math (from Bob), a grade 3 for Physics (also from Bob), and a grade 8 for Math (from Lilian). Thus, Avg(8, 3, 7) = 6 is the correct value for the Gaming hobby.

However, when I try to query the average grade per course, I got the following results:

The results are incorrect, because for the Math course, we have a 7 and an 8, which result in an average of 7.5. But, due to the fact that Bob has 2 hobbies and Lilian has 3, the average is calculated considering the 5 fact tuples, resulting in the following calculation: (7 + 7 + 8 + 8 + 8) / 5 = 7.6.

As a workaround for this case, I tried a different approach, which consists in weighting each grade according to the number of hobbies each student has. The new fact_table is shown below:

Then, I defined a new measure for the average calculation:
Code:
`Grade (attempt 2) = sum(weighted_grade) / sum(weight)`
We can see that now the results to the query (average grade per course) are correct:

This works, because first we sum up all weighted_grade values, then sum up all weight values, and finally divide the former by the latter. For the Math course, for example, the average grade is calculated as: (3.5 + 3.5 + 2.667 + 2.667 + 2.667) / (0.5 + 0.5 + 0.333 + 0.333 + 0.333) = (7 + 8) / (1 + 1) = 7.5.

(continued in the next post. I had to create two posts due to the attachment count limit of 5 per post)

5. Junior Member
Join Date
Dec 2016
Posts
19
(continuing from last post....)

The problem is that the measure Grade (attempt 2) does not work when hobby is request in the select clause:

The calculation for Gaming, in this query, was: (3.5 + 1.5 + 2.667) / (0.5 + 0.5 + 0.3333) = (7.667) / (1.333) = 5.75. This is incorrect because it consists in a weighted average calculation, in which the 8, that has the lower denominator (0.333 compared to 0.5) receives a smaller weight.

If I only had these two cases, I could (as I actually did in my tests) create a calculate measure that shows Grade (attempt 1) in one case, and Grade (attempt 2) in the other case. However, there's a third (there must be more, actually) case where none of these measures works. This is the case when we do not require hobby values in the SELECT clause, but in the WHERE clause.

The following results are displayed when I query the average grade per course, but only considering the grades obtained by students that enjoy Gaming and Reading.

Calculation for measure Grade (attempt 1): Avg(7, 7, 8) = 7.33

Calculation for measure Grade (attempt 2): (3.5 + 3.5 + 1.5 + 1.5 + 2.667) / (0.5 + 0.5 + 0.5 + 0.5 + 0.333) = (12.667) / (2.333) = 7.25

The correct result should be (7 + 8) / 2 = 7.5

Using SQL I could solve this problem by using a subquery. In the subquery I first apply any filter such as "students that enjoy Gaming and Reading" and/or "courses Math and Biology". Only then I select (using DISTINCT) the grades corresponding to these cases. Since I use DISTINCT, I cannot get the same grade more than once. Performing the average calculation in this resulting set of grades works perfectly.

Finally, I have two problems:

1) I cannot construct the corresponding query in Mondrian (due to my lack of knowledge);

2) If I find a way to construct such query, I'll need to find a way to put the needed calculation in a Calculated Measure, in the cube definition, because I need it to be available for the end user to analyze it through the Saiku Plugin. So, just the query is not valid for me, because I'm not using it in a report or something similar. I need it to be interactive.

Well, that is my problem. Any thoghts, insights, and (of course) solutions are welcome.

My best regards,
Bruno
Last edited by amaralbf; 03-15-2017 at 12:50 PM.

6. Junior Member
Join Date
Feb 2017
Posts
18
Hello Bruno

I imagined it similar to the employee tables in the example
TableStudent:

StudentID|StudentName
1 underscores
2 Bruno
TableHobbies:
HobbyID|HobbyName
1 Fishing
2 ETL Development
3 Bike Driving

Closure Table:

StudentID|HobbyID (|HobbyName
1 1 (Fishing
1 2 (ETL Development
2 2 (ETL Development
2 3 (Bike Driving

So what you do is you create a relationship table that represents the transitive relationship between your entries. You can directly map the employee-supervisor example. Now you need to apply this structure to the Mondrian Schema as shown in the link. At least that would be my first thought if I had such a problem.
My last statement refers to the fact that if you physically build this bridge table in your database. Someone could come and start joining the group table with the bridge table with the dimension with the fact table. And that does not have to lead to necessarly correct data if you think about it.

Originally Posted by amaralbf

You're saying that I should try to model my student dimension this way?:

Hierarchy:
student_name -> hobby_name

This way, my dimension would look like this:

dimension_id | student_id | student_name | hobby_name
-------------------------------------------------------------------------------------------
1 1 Bob Games
2 1 Bob Playing Guitar
3 2 Paul Reading
4 3 Lilian Games

The problem with this approach is that using the Saiku Analytics plugin, I don't think I can aggregate the fact measure over one or more sublevel values through the graphical interface. For example, the user could not perform the following query using only the graphical interface of Saiku Analytics: give me the average grade of students that enjoy gaming [and playing the guitar]. Can this be done using mondrian queries?

Bruno

7. Junior Member
Join Date
Dec 2016
Posts
19
Hi underscores,

I'll think about your proposed solution and give it a try.

If you want to see how I have tried to solve the problem, I wrote a new post (http://forums.pentaho.com/showthread...lued-dimension), much more verbose, in which I explain in details of what I'm trying to achieve.

I'll try to model my problem with parent-child hierarchies and will report it back here.

Thanks!
Bruno

8. Junior Member
Join Date
Dec 2016
Posts
19
Hi underscores,

I don't think my problem can be solved with closure tables, because there is no different depth levels in the hierarchy. If you look at the closure table example, you can see that the distance column is essential. In my case, the distance would be always 1, which means I do not need a closure table at all.

Without closure tables, and modeling the hierarchy for the students dimension as: student_name -> hobby_name, I got the following results:

The results are correct. But how can I query grouping by the sub-level "hobby_name"? The corresponding query generated by Saiku plugin is:

Code:
```WITH
SET [~ROWS] AS
Hierarchize({{[student].[student_name].Members}, {[student].[hobby_name].Members}})
SELECT
NON EMPTY {[Measures].[grade]} ON COLUMNS,
NON EMPTY [~ROWS] ON ROWS
I can query only by hobby_name, using:
Code:
```WITH
SET [~ROWS] AS
{[student].[hobby_name].Members}
SELECT
NON EMPTY {[Measures].[grade]} ON COLUMNS,
NON EMPTY [~ROWS] ON ROWS
Which results in:

This way, the students are omitted, but I cannot have the average grade aggregated by hobby. The two Gaming elements actually have different ancestors ([student].[Bob].[Gaming]) and ([student].[Lilian][Gaming]).
Is there a way to group by hobbies even though they have distinct ancestors?

Best regards,
Bruno

Originally Posted by underscores
Hello Bruno

I imagined it similar to the employee tables in the example
TableStudent:

StudentID|StudentName
1 underscores
2 Bruno
TableHobbies:
HobbyID|HobbyName
1 Fishing
2 ETL Development
3 Bike Driving

Closure Table:

StudentID|HobbyID (|HobbyName
1 1 (Fishing
1 2 (ETL Development
2 2 (ETL Development
2 3 (Bike Driving

So what you do is you create a relationship table that represents the transitive relationship between your entries. You can directly map the employee-supervisor example. Now you need to apply this structure to the Mondrian Schema as shown in the link. At least that would be my first thought if I had such a problem.
My last statement refers to the fact that if you physically build this bridge table in your database. Someone could come and start joining the group table with the bridge table with the dimension with the fact table. And that does not have to lead to necessarly correct data if you think about it.

9. Junior Member
Join Date
Dec 2016
Posts
19
Update I've achieved the same result as described in my post, using two hierarchies to the same dimension containing the relation between student and hobby.

I created a student dimension table as follows:

student_id | student_name | hobby_name
--------------------------------------------------------
1 | Bob | Gaming
1 | Bob | Reading
2 | Lilian | Gaming
2 | Lilian | Jogging
2 | Lilian | Writing

Then, I defined the student dimension in the Mondrian schema with two hierarchies of 1 level each (actually 2, if you count the all member).

Student hierarchy: all -> student_name
Hobby hierarchy: all -> hobby_name

This way, querying by all the following conditions worked perfectly:

- course;
- student;
- hobby;
- course + hobby;
- student + hobby;
- student + course;
- student + course + hobby;

However, when I filter for two or more hobbies and do not include them in the select clause, I get the wrong results, as shown in the print screen from Saiku Analytics plugin:

I hope this additional information will be helpful.

My best regards,
Bruno

10. Junior Member
Join Date
Feb 2017
Posts
18
Hello
I guess that does in fact overcome my current knowledge of Mondrian however I found this interesting take on multi-valued dimensions. They use the Virtual Cube here. What a Virtual Cube does however is simply put: combines two or more cubes that have shared dimensions (or non-shared too) with different granularity.
http://lists.pentaho.org/pipermail/m...ay/005138.html maybe this helps a bit better than what I could give you I'm also just a starter in this kind of work. However I do not ultimately see the problem with the missing field in the closure table. I will think about it a bit later on again.
With kind regards,

underscores

PS: Always think about what you want to analyze with your cube. It might not always be possible to create a swiss army knife of a cube with Mondrian that can solve all requests at once.

11. Junior Member
Join Date
Dec 2016
Posts
19
Hello underscores,

thank you for your reply! That for sure does overcome my knowledge too!

I've already thought about the virtual cubes. It indeed solves the problem of looking only for the grades per student, or look for the grades when considering the hobbies. But I guess the main problem is, when viewing the grades per student per hobby, if a select only the grades associated with two or more hobbies, it will inevitably count the same grade more than once, if the student has more than one of the hobbies required as the filter.

I'll keep searching for a solution, of course, but I suspect, as you said, that maybe it can't be done using Mondrian queries and the Saiku Plugin.

Best regards,
Bruno

12. Junior Member
Join Date
Feb 2017
Posts
18
Originally Posted by amaralbf
Hello underscores,

thank you for your reply! That for sure does overcome my knowledge too!

I've already thought about the virtual cubes. It indeed solves the problem of looking only for the grades per student, or look for the grades when considering the hobbies. But I guess the main problem is, when viewing the grades per student per hobby, if a select only the grades associated with two or more hobbies, it will inevitably count the same grade more than once, if the student has more than one of the hobbies required as the filter.

I'll keep searching for a solution, of course, but I suspect, as you said, that maybe it can't be done using Mondrian queries and the Saiku Plugin.

Best regards,
Bruno
Hello Bruno,

I really hope someone else will comment here and give you better help and that you achieve your goal.

With kind regards,

underscores

13. Junior Member
Join Date
Feb 2017
Posts
18
I wanted to add something to the virtual cube.
Create your Schema like this:
Cube StudentxHobby:
dimension:
Top -> Student -> Hobby
Fact: Average Grade By Student
Cube: Hobby:
Dimension:
Top -> Hobby
Fact: Average Grade By Hobby

Virtual Cube: Student And Hobby
Virtual Measure: Average Grade By Hobby
Virtual Measure: Average Grade By Student
Dimensions: Shared Dimensions (Time/Date, etc.)
Dimensions: StudentxHobby
Dimensions: Hobby
Than you can select the Dimensions you need with the Measure you need in Saiku and get both results properly. That would solve all your problems through a backdoor easily and wont reduce effiency too much. If this is an assignment you might just justify your approach like this:
Mondrian does not allow muilti-valued dimensions that are not reflexive. So you decided to calculate two measures for each to get legal results and combine them through a Virtual Cube which is a common way to combine different granularity, detail grade, dimensionality of values. Since both are different busines measures with different goals (a hobby does not study but someone with a hobby might study).

14. Junior Member
Join Date
Dec 2016
Posts
19
Hi underscores,

This approach is valid for getting the correct results when selecting by student as well as when selecting by hobby (and combining them with other dimensions). This is the result I already have, using the approach described in the posts #4 and #5 of this thread.

Said that, I don't see how this approach can solve the problem of calculating the following query: average of grades per course, filtering by 2 of 3 hobbies. It is the same query I mention on the post #5.

Thanks and best regards,

Bruno

15. Junior Member
Join Date
Dec 2016
Posts
19
Just an update, I've discussed the problem with Diethard Steiner, and this discussion led to the elaboration of a very good and informative post on his blog. The URL to the post is: https://diethardsteiner.github.io/mo...Attribute.html

I recommend checking his blog out, as I've been following his blog for a long time and have found lots of useful information regarding Business Intelligence and the use of Mondrian and Pentaho Community.

Best regards,
Bruno

Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•