Hitachi Vantara Pentaho Community Forums
Results 1 to 15 of 15

Thread: Dimensional modeling problem - multivalued dimensions and avg aggregation function

  1. #1
    Join Date
    Dec 2016
    Posts
    19

    Default Dimensional modeling problem - multivalued dimensions and avg aggregation function

    Hello,


    I'm facing a problem regarding dimensional modeling, in which I need to model a cube that has a multivalued dimension, and a fact measure whose aggregation function is the Average. I'll give a detailed explanation of my problem.


    l'm using the pentaho bi-server and the saiku analytics plugin to model data coming from a school. The problem is quiet simple: I need to store which grade every student has received for each course they have taken. So the dimensions are student and course, and the measure of the fact table is the grade. The aggregation function to the grade measure is the average function. This way, if I filter only by students, I get the overall average grade of that student, considering all courses. The problem is the requirement for a third dimension: hobbies.


    According to the business rules, one student can have multiple hobbies, and there is no fixed static number of hobbies. This is a problem of multivalued dimension (http://www.kimballgroup.com/2014/05/...table-detours/).


    As I need to provide the possibility for the user to drill-down per hobby, I can't just use a single string to describe all hobbies for a student like this: "Gaming, Sports, Reading". For example, the use may need to know the average grade of students whose hobbies includes Gaming, and compare that grade with the average grade obtained by students that enjoy Reading.

    Does anyone know how to solve this problem?

    Edit:I just forgot to write that I tried the bridge solution described in the link mentioned, and modeled the fact table using a weight measure column. However the average aggregation function does not work as properly as the sum function, when the user performs drill-down operations or filtering for 2 or 3 specific hobbies. I've attached two images, one that describes the bridge table configuration, and other that shows a constructed view, necessary for analyzing using the Saiku Plugin, in the pentaho bi-server platform.

    Bridge table configuration:
    Name:  modeling_example.jpg
Views: 853
Size:  26.3 KB

    View corresponding to the join of bridge table
    Name:  modeling_example_2.jpg
Views: 628
Size:  22.1 KB

    Thank you,
    Bruno
    Last edited by amaralbf; 03-10-2017 at 05:04 PM.

  2. #2
    Join Date
    Feb 2017
    Posts
    18

    Default

    Hello Bruno,

    you can try to model the Bridge-Table as a parent-child hierarchy in Mondrian.
    http://mondrian.pentaho.com/document...ld_hierarchies Here you can see an example where the "employee" is linked to his parent the "supervisor". I guess it would be easy to apply this schema to link the group/bridge table to their respective data or to say to get the hobby table into a parent-child relationship with the student table.
    If you want to know how to build a bridge table with Pentaho DI look at this link here:
    http://type-exit.org/adventures-with...bridge-tables/
    Be aware that especially with Bridge Tables and your measurements you must prevent users from joining the tables all together from outside to inside. That can leave you with wrong results.

    With kind regards

    underscores

  3. #3
    Join Date
    Dec 2016
    Posts
    19

    Default

    Thank you for your reply, underscores.

    You're saying that I should try to model my student dimension this way?:

    Hierarchy:
    student_name -> hobby_name

    This way, my dimension would look like this:

    dimension_id | student_id | student_name | hobby_name
    -------------------------------------------------------------------------------------------
    1 1 Bob Games
    2 1 Bob Playing Guitar
    3 2 Paul Reading
    4 3 Lilian Games

    The problem with this approach is that using the Saiku Analytics plugin, I don't think I can aggregate the fact measure over one or more sublevel values through the graphical interface. For example, the user could not perform the following query using only the graphical interface of Saiku Analytics: give me the average grade of students that enjoy gaming [and playing the guitar]. Can this be done using mondrian queries?

    I think one alternative to this problem would be creating a new dimension just for Hobby, which results in my initial example, with the thinner granularity fact table (grade per student per hobby).

    Edit: another question: if I got it right, and the dimension table actually looks like that, wouldn't the granularity of the fact table remain the same (grade per student per hobby), since a single tuple of the student dimension no longer represents one student, but a hobby of a student?

    And I didn't understand the following statement:

    Be aware that especially with Bridge Tables and your measurements you must prevent users from joining the tables all together from outside to inside
    Thanks!

    Best regards,
    Bruno
    Last edited by amaralbf; 03-14-2017 at 03:30 PM.

  4. #4
    Join Date
    Dec 2016
    Posts
    19

    Default Problem with AVG aggregate function and multi-valued dimension

    This is a better explained and more focused on the problem replacement for my other post (http://forums.pentaho.com/showthread...ation-function). I apologize for posting twice about the same problem, but in this one I'll expose the problem much more clearly.

    The problem is: I have the grade that every student from a school has received for each course he/she has taken. One small example is shown below:

    Bob received grade 7 for the Math course.
    Bob received grade 3 for the Physics course.
    Lilian received grade 8 for the Math course.

    The solution for this problem is a simple star-schema with 1 fact table and the student and course dimension tables.

    However, each student can have none, one or more hobbies. For the rest of the post, I'll assume my initial solution, which is to change the fact table granularity from "per student per course" to "per student per hobby per course".

    This way, my fact table look like this:

    Name:  fact_table.png
Views: 540
Size:  7.6 KB


    My first attempt was to simply replicate the grade that every student has received for every hobby he/she has. The aggregate function for this measure is the average.

    This first attempt works when I try to query the average grade received per hobby:

    Name:  hobby_attempt1.png
Views: 526
Size:  4.8 KB


    For example, for the Gaming hobby, there is a grade 7 for Math (from Bob), a grade 3 for Physics (also from Bob), and a grade 8 for Math (from Lilian). Thus, Avg(8, 3, 7) = 6 is the correct value for the Gaming hobby.

    However, when I try to query the average grade per course, I got the following results:

    Name:  course_attempt1.png
Views: 524
Size:  3.6 KB


    The results are incorrect, because for the Math course, we have a 7 and an 8, which result in an average of 7.5. But, due to the fact that Bob has 2 hobbies and Lilian has 3, the average is calculated considering the 5 fact tuples, resulting in the following calculation: (7 + 7 + 8 + 8 + 8) / 5 = 7.6.

    As a workaround for this case, I tried a different approach, which consists in weighting each grade according to the number of hobbies each student has. The new fact_table is shown below:

    Name:  new_fact_table.png
Views: 532
Size:  11.4 KB


    Then, I defined a new measure for the average calculation:
    Code:
    Grade (attempt 2) = sum(weighted_grade) / sum(weight)
    We can see that now the results to the query (average grade per course) are correct:

    Name:  course_attempt2.png
Views: 526
Size:  3.7 KB


    This works, because first we sum up all weighted_grade values, then sum up all weight values, and finally divide the former by the latter. For the Math course, for example, the average grade is calculated as: (3.5 + 3.5 + 2.667 + 2.667 + 2.667) / (0.5 + 0.5 + 0.333 + 0.333 + 0.333) = (7 + 8) / (1 + 1) = 7.5.

    (continued in the next post. I had to create two posts due to the attachment count limit of 5 per post)

  5. #5
    Join Date
    Dec 2016
    Posts
    19

    Default

    (continuing from last post....)

    The problem is that the measure Grade (attempt 2) does not work when hobby is request in the select clause:

    Name:  hobby_attempt2.png
Views: 529
Size:  4.9 KB


    The calculation for Gaming, in this query, was: (3.5 + 1.5 + 2.667) / (0.5 + 0.5 + 0.3333) = (7.667) / (1.333) = 5.75. This is incorrect because it consists in a weighted average calculation, in which the 8, that has the lower denominator (0.333 compared to 0.5) receives a smaller weight.

    If I only had these two cases, I could (as I actually did in my tests) create a calculate measure that shows Grade (attempt 1) in one case, and Grade (attempt 2) in the other case. However, there's a third (there must be more, actually) case where none of these measures works. This is the case when we do not require hobby values in the SELECT clause, but in the WHERE clause.

    The following results are displayed when I query the average grade per course, but only considering the grades obtained by students that enjoy Gaming and Reading.

    Name:  course_attempt3.png
Views: 534
Size:  4.2 KB


    Calculation for measure Grade (attempt 1): Avg(7, 7, 8) = 7.33

    Calculation for measure Grade (attempt 2): (3.5 + 3.5 + 1.5 + 1.5 + 2.667) / (0.5 + 0.5 + 0.5 + 0.5 + 0.333) = (12.667) / (2.333) = 7.25

    The correct result should be (7 + 8) / 2 = 7.5

    Using SQL I could solve this problem by using a subquery. In the subquery I first apply any filter such as "students that enjoy Gaming and Reading" and/or "courses Math and Biology". Only then I select (using DISTINCT) the grades corresponding to these cases. Since I use DISTINCT, I cannot get the same grade more than once. Performing the average calculation in this resulting set of grades works perfectly.

    Finally, I have two problems:

    1) I cannot construct the corresponding query in Mondrian (due to my lack of knowledge);

    2) If I find a way to construct such query, I'll need to find a way to put the needed calculation in a Calculated Measure, in the cube definition, because I need it to be available for the end user to analyze it through the Saiku Plugin. So, just the query is not valid for me, because I'm not using it in a report or something similar. I need it to be interactive.

    Well, that is my problem. Any thoghts, insights, and (of course) solutions are welcome.

    My best regards,
    Bruno
    Last edited by amaralbf; 03-15-2017 at 12:50 PM.

  6. #6
    Join Date
    Feb 2017
    Posts
    18

    Default

    Hello Bruno

    I imagined it similar to the employee tables in the example
    TableStudent:

    StudentID|StudentName
    1 underscores
    2 Bruno
    TableHobbies:
    HobbyID|HobbyName
    1 Fishing
    2 ETL Development
    3 Bike Driving

    Closure Table:

    StudentID|HobbyID (|HobbyName
    1 1 (Fishing
    1 2 (ETL Development
    2 2 (ETL Development
    2 3 (Bike Driving

    So what you do is you create a relationship table that represents the transitive relationship between your entries. You can directly map the employee-supervisor example. Now you need to apply this structure to the Mondrian Schema as shown in the link. At least that would be my first thought if I had such a problem.
    My last statement refers to the fact that if you physically build this bridge table in your database. Someone could come and start joining the group table with the bridge table with the dimension with the fact table. And that does not have to lead to necessarly correct data if you think about it.

    Quote Originally Posted by amaralbf View Post
    Thank you for your reply, underscores.

    You're saying that I should try to model my student dimension this way?:

    Hierarchy:
    student_name -> hobby_name

    This way, my dimension would look like this:

    dimension_id | student_id | student_name | hobby_name
    -------------------------------------------------------------------------------------------
    1 1 Bob Games
    2 1 Bob Playing Guitar
    3 2 Paul Reading
    4 3 Lilian Games

    The problem with this approach is that using the Saiku Analytics plugin, I don't think I can aggregate the fact measure over one or more sublevel values through the graphical interface. For example, the user could not perform the following query using only the graphical interface of Saiku Analytics: give me the average grade of students that enjoy gaming [and playing the guitar]. Can this be done using mondrian queries?


    Bruno

  7. #7
    Join Date
    Dec 2016
    Posts
    19

    Default

    Hi underscores,

    I'll think about your proposed solution and give it a try.

    If you want to see how I have tried to solve the problem, I wrote a new post (http://forums.pentaho.com/showthread...lued-dimension), much more verbose, in which I explain in details of what I'm trying to achieve.

    I'll try to model my problem with parent-child hierarchies and will report it back here.

    Thanks!
    Bruno

  8. #8
    Join Date
    Dec 2016
    Posts
    19

    Default

    Hi underscores,

    I don't think my problem can be solved with closure tables, because there is no different depth levels in the hierarchy. If you look at the closure table example, you can see that the distance column is essential. In my case, the distance would be always 1, which means I do not need a closure table at all.

    Without closure tables, and modeling the hierarchy for the students dimension as: student_name -> hobby_name, I got the following results:

    Name:  student_hobby_hierarchy.png
Views: 555
Size:  4.9 KB


    The results are correct. But how can I query grouping by the sub-level "hobby_name"? The corresponding query generated by Saiku plugin is:

    Code:
    WITH
    SET [~ROWS] AS
        Hierarchize({{[student].[student_name].Members}, {[student].[hobby_name].Members}})
    SELECT
    NON EMPTY {[Measures].[grade]} ON COLUMNS,
    NON EMPTY [~ROWS] ON ROWS
    FROM [grade_fact]
    I can query only by hobby_name, using:
    Code:
    WITH
    SET [~ROWS] AS
        {[student].[hobby_name].Members}
    SELECT
    NON EMPTY {[Measures].[grade]} ON COLUMNS,
    NON EMPTY [~ROWS] ON ROWS
    FROM [grade_fact]
    Which results in:

    Name:  hobby_names.png
Views: 559
Size:  3.8 KB

    This way, the students are omitted, but I cannot have the average grade aggregated by hobby. The two Gaming elements actually have different ancestors ([student].[Bob].[Gaming]) and ([student].[Lilian][Gaming]).
    Is there a way to group by hobbies even though they have distinct ancestors?

    Best regards,
    Bruno

    Quote Originally Posted by underscores View Post
    Hello Bruno

    I imagined it similar to the employee tables in the example
    TableStudent:

    StudentID|StudentName
    1 underscores
    2 Bruno
    TableHobbies:
    HobbyID|HobbyName
    1 Fishing
    2 ETL Development
    3 Bike Driving

    Closure Table:

    StudentID|HobbyID (|HobbyName
    1 1 (Fishing
    1 2 (ETL Development
    2 2 (ETL Development
    2 3 (Bike Driving

    So what you do is you create a relationship table that represents the transitive relationship between your entries. You can directly map the employee-supervisor example. Now you need to apply this structure to the Mondrian Schema as shown in the link. At least that would be my first thought if I had such a problem.
    My last statement refers to the fact that if you physically build this bridge table in your database. Someone could come and start joining the group table with the bridge table with the dimension with the fact table. And that does not have to lead to necessarly correct data if you think about it.

  9. #9
    Join Date
    Dec 2016
    Posts
    19

    Default

    Update I've achieved the same result as described in my post, using two hierarchies to the same dimension containing the relation between student and hobby.

    I created a student dimension table as follows:

    student_id | student_name | hobby_name
    --------------------------------------------------------
    1 | Bob | Gaming
    1 | Bob | Reading
    2 | Lilian | Gaming
    2 | Lilian | Jogging
    2 | Lilian | Writing

    Then, I defined the student dimension in the Mondrian schema with two hierarchies of 1 level each (actually 2, if you count the all member).

    Student hierarchy: all -> student_name
    Hobby hierarchy: all -> hobby_name

    This way, querying by all the following conditions worked perfectly:

    - course;
    - student;
    - hobby;
    - course + hobby;
    - student + hobby;
    - student + course;
    - student + course + hobby;

    However, when I filter for two or more hobbies and do not include them in the select clause, I get the wrong results, as shown in the print screen from Saiku Analytics plugin:

    Name:  saiku.jpg
Views: 571
Size:  17.2 KB


    I hope this additional information will be helpful.

    My best regards,
    Bruno

  10. #10
    Join Date
    Feb 2017
    Posts
    18

    Default

    Hello
    I guess that does in fact overcome my current knowledge of Mondrian however I found this interesting take on multi-valued dimensions. They use the Virtual Cube here. What a Virtual Cube does however is simply put: combines two or more cubes that have shared dimensions (or non-shared too) with different granularity.
    http://lists.pentaho.org/pipermail/m...ay/005138.html maybe this helps a bit better than what I could give you I'm also just a starter in this kind of work. However I do not ultimately see the problem with the missing field in the closure table. I will think about it a bit later on again.
    With kind regards,

    underscores

    PS: Always think about what you want to analyze with your cube. It might not always be possible to create a swiss army knife of a cube with Mondrian that can solve all requests at once.

  11. #11
    Join Date
    Dec 2016
    Posts
    19

    Default

    Hello underscores,

    thank you for your reply! That for sure does overcome my knowledge too!

    I've already thought about the virtual cubes. It indeed solves the problem of looking only for the grades per student, or look for the grades when considering the hobbies. But I guess the main problem is, when viewing the grades per student per hobby, if a select only the grades associated with two or more hobbies, it will inevitably count the same grade more than once, if the student has more than one of the hobbies required as the filter.

    I'll keep searching for a solution, of course, but I suspect, as you said, that maybe it can't be done using Mondrian queries and the Saiku Plugin.

    Best regards,
    Bruno

  12. #12
    Join Date
    Feb 2017
    Posts
    18

    Default

    Quote Originally Posted by amaralbf View Post
    Hello underscores,

    thank you for your reply! That for sure does overcome my knowledge too!

    I've already thought about the virtual cubes. It indeed solves the problem of looking only for the grades per student, or look for the grades when considering the hobbies. But I guess the main problem is, when viewing the grades per student per hobby, if a select only the grades associated with two or more hobbies, it will inevitably count the same grade more than once, if the student has more than one of the hobbies required as the filter.

    I'll keep searching for a solution, of course, but I suspect, as you said, that maybe it can't be done using Mondrian queries and the Saiku Plugin.

    Best regards,
    Bruno
    Hello Bruno,

    I'm really sorry that I could not properly help you. A Virtual Cube is something very simple. You would create two measurements one for the Students and Hobbies and one for the Students alone. You could than use these measures in your virtual cube. Right now I do an internship to create a simple OLAP Solution based on Mondrian for a small company as part of a mandatory university internship and suffer from similar problems as you do. However in my opinion to receive a proper result with Mondrian you really have to ensure the 1:n relationship or use a Closure Table. However the last solution requires you to create a transitive, reflexive closure. Another way would be to denormalize the student-hobby information and create a hobby group for each Student. However that would not satisfy the requirement to see what hobbies have the best average grades (a somewhat funny thought however to relate hobbies to grades. but again here a calculation over student and hobbies in two cubes might work). In my current work I had 3 processes hidden in a single source database. However the source database design used a "fake-key" to create an entry for every process. So while non of these 3 processes had any actual relationship, different number of rows their "keys" were all collected in a single "center-table". During my work I found that these processes create different value-IDs and therefor need different dimensions and have therefor a different granularity. I created 3 cubes with their respective measures and combined them in a virtual cube over their shared dimension (customer, machine, date). This however created large problems when trying to drill-down to the source data. It is simply impossible. As I found out by now this is due to the indexing that got handled incorrectly. You see I struggle a bit too (however I'm also a bad student and probably mentally very plain). You might be able to achieve your goal with one of these tricks however. You should remember: If the result you see is correct, no one cares what you did to achieve that. And in most companies you will be one of the very few to understand OLAP, Big Data to the extend you already have, believe me.
    I really hope someone else will comment here and give you better help and that you achieve your goal.

    With kind regards,

    underscores

  13. #13
    Join Date
    Feb 2017
    Posts
    18

    Default

    I wanted to add something to the virtual cube.
    Create your Schema like this:
    Cube StudentxHobby:
    dimension:
    Top -> Student -> Hobby
    Fact: Average Grade By Student
    Cube: Hobby:
    Dimension:
    Top -> Hobby
    Fact: Average Grade By Hobby

    Virtual Cube: Student And Hobby
    Virtual Measure: Average Grade By Hobby
    Virtual Measure: Average Grade By Student
    Dimensions: Shared Dimensions (Time/Date, etc.)
    Dimensions: StudentxHobby
    Dimensions: Hobby
    Than you can select the Dimensions you need with the Measure you need in Saiku and get both results properly. That would solve all your problems through a backdoor easily and wont reduce effiency too much. If this is an assignment you might just justify your approach like this:
    Mondrian does not allow muilti-valued dimensions that are not reflexive. So you decided to calculate two measures for each to get legal results and combine them through a Virtual Cube which is a common way to combine different granularity, detail grade, dimensionality of values. Since both are different busines measures with different goals (a hobby does not study but someone with a hobby might study).

  14. #14
    Join Date
    Dec 2016
    Posts
    19

    Default

    Hi underscores,

    This approach is valid for getting the correct results when selecting by student as well as when selecting by hobby (and combining them with other dimensions). This is the result I already have, using the approach described in the posts #4 and #5 of this thread.

    Said that, I don't see how this approach can solve the problem of calculating the following query: average of grades per course, filtering by 2 of 3 hobbies. It is the same query I mention on the post #5.

    Thanks and best regards,

    Bruno

  15. #15
    Join Date
    Dec 2016
    Posts
    19

    Default

    Just an update, I've discussed the problem with Diethard Steiner, and this discussion led to the elaboration of a very good and informative post on his blog. The URL to the post is: https://diethardsteiner.github.io/mo...Attribute.html

    I recommend checking his blog out, as I've been following his blog for a long time and have found lots of useful information regarding Business Intelligence and the use of Mondrian and Pentaho Community.

    Best regards,
    Bruno

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.