[Mondrian] All Members vs Aggregate



Thiyagu Palanisamy
04-24-2007, 07:50 AM
At present Mondrian issues multiple SQL queries to fetch the data for an MDX
query that references members at different levels. We are working on changing
this behavior so that these multiple SQL statements are combined into a single one.

For example, this MDX generates two SQL statements, one for All Gender and
another for Male and Female:

select gender.members on 0 from [sales 2] where measures.[Sales Count];

sql:
1. select "time_by_day"."the_year" as "c0", "customer"."gender" as "c1",
count("sales_fact_1997"."product_id") as "m0"
from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
"customer" "customer"
where
"sales_fact_1997"."time_id" = "time_by_day"."time_id" and
"time_by_day"."the_year" = 1997 and
"sales_fact_1997"."customer_id" = "customer"."customer_id"
group by "time_by_day"."the_year", "customer"."gender"

2. select "time_by_day"."the_year" as "c0",
count("sales_fact_1997"."product_id") as "m0"
from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997"
where
"sales_fact_1997"."time_id" = "time_by_day"."time_id" and
"time_by_day"."the_year" = 1997
group by
"time_by_day"."the_year"


With our prototype, these SQL statements are combined into one (this query will
only work with databases that support grouping sets):

1. select "time_by_day"."the_year" as "c0", "customer"."gender" as "c1",
count("sales_fact_1997"."product_id") as "m0"
from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
"customer" "customer"
where
"sales_fact_1997"."time_id" = "time_by_day"."time_id" and
"time_by_day"."the_year" = 1997 and
"sales_fact_1997"."customer_id" = "customer"."customer_id"
group by grouping sets(("time_by_day"."the_year", "customer"."gender"),
("time_by_day"."the_year"))

Although the grouping set implementation looks fine and works in most
cases, there is a tricky problem associated with it: the "all" (aggregate)
value obtained by rolling up the grouping sets equals the aggregate of the
members involved, not the "all" value Mondrian currently returns.

To reproduce it, alter the fact table and insert rows with null customers:

--modify fact table to allow nulls for customer_id
alter table "sales_fact_1997" modify "customer_id" null
--add sales records with no customers
insert into "sales_fact_1997" values (337,371,null,0,2,6.3455,2.3445,4)
insert into "sales_fact_1997" values (337,440,null,0,2,6.3355,2.335,2)

Results:
With the grouping-sets SQL (unchanged by the orphan rows):
Axis #0:
{[Measures].[Sales Count]}
Axis #1:
{[Gender].[All Gender]}
{[Gender].[All Gender].[F]}
{[Gender].[All Gender].[M]}
Row #0: 86,837
Row #0: 42,831
Row #0: 44,006

Without grouping sets (Mondrian's current behavior):
....
Row #0: 86,839 // includes the 2 records that don't belong to any gender
Row #0: 42,831
Row #0: 44,006

In summary "All" in mondrian is not really an aggregate of all the
members, it also includes data from rows that don't join to fact table.
This behavior is not consistent with AS 2000 (MOLAP).
How should we handle this issue?
One of the options that we can think of is, to use outer joins. The
problem with this is it returns addition row for "null" gender.
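
For illustration, a minimal sketch of the outer-join variant for the detail
grouping set (hand-written, not generated by Mondrian today); the extra row with
a null "c1" is the "null gender" row mentioned above:

select "time_by_day"."the_year" as "c0", "customer"."gender" as "c1",
count("sales_fact_1997"."product_id") as "m0"
from "sales_fact_1997" "sales_fact_1997"
join "time_by_day" "time_by_day"
on "sales_fact_1997"."time_id" = "time_by_day"."time_id"
left outer join "customer" "customer"
on "sales_fact_1997"."customer_id" = "customer"."customer_id"
where "time_by_day"."the_year" = 1997
group by "time_by_day"."the_year", "customer"."gender"
-- returns an extra row (1997, null, 2) for the two orphan fact rows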



Thanks,
Thiyagu & Raghu



Richard Emberson
04-24-2007, 09:30 AM
In our local usage this has been known for a while: dimensions can have
only partial coverage of the base fact table. If you have orphan rows in
the fact table for a given dimension, then for an aggregate the sum across
a lower level will not equal the sum at the top (all) level. This is
because for lower levels the SQL joins from the fact table foreign key to
the dimension table primary key, but for the top-level result there is an
optimization(?) and the generated SQL does not ensure that only fact table
rows with matching entries in the dimension table make up the result.
One might wish there were a property that controlled this behavior, so
that, if one wished, no orphan rows would be counted. The alternative is
to have your DBs make sure there is no bad data in the fact tables (we
have experienced that, as with much of life, dirt happens; over time bad
data creeps into fact tables).

Richard



Thiyagu Palanisamy
04-27-2007, 09:40 AM
Hi,

For now we are leaving the All Members vs Aggregate issue open and proceeding
with getting the grouping sets functionality working.

We have developed a proof of concept for the grouping sets functionality;
with it we have managed to get all but two tests passing. The two failing
tests (in InlineTableTests) fail because All-member queries pick up orphan
rows.

Limitations of our POC:
1. The "All" value will not be consistent in cases where there are extra fact
rows (i.e. rows that don't connect to the dimension of interest).

2. The implementation doesn't reduce the number of queries in cases like
select {member, member.children} ...
This is because it becomes very tricky to identify that member.children
rolls up to member, since the roll-up logic relies heavily on the constraint
bit key to identify batches that can be grouped into a single SQL query.

Note: the POC is a hacky solution; it doesn't consider issues like
concurrency.

Logic for identifying the batches that can be grouped (see the SQL sketch
below):
1. Batch A's constraint column list is a superset of Batch B's constraint
columns, and
2. the matching columns of Batch A and Batch B have the same values, and
3. the non-matching columns of the superset have "All" values.
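
As a sketch of these conditions (restating the gender example, with the year
values purely illustrative): a batch constrained on ("the_year", "gender") with
the_year = 1997 and a batch constrained on ("the_year") with the_year = 1997
satisfy all three conditions and collapse into one statement, whereas a batch
constrained to the_year = 1998 would fail condition 2 and stay separate:

-- batches {the_year = 1997, gender} and {the_year = 1997} combine:
select "time_by_day"."the_year" as "c0", "customer"."gender" as "c1",
count("sales_fact_1997"."product_id") as "m0"
from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
"customer" "customer"
where "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
"time_by_day"."the_year" = 1997 and
"sales_fact_1997"."customer_id" = "customer"."customer_id"
group by grouping sets(("time_by_day"."the_year", "customer"."gender"),
("time_by_day"."the_year"))
-- a batch constrained on the_year = 1998 cannot share this WHERE clause,
-- so it would be issued as a separate statement.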

We are sending this for early feedback; if you see any issues with our
approach please let us know. We will be starting the actual implementation
next week.

The tar file created using packChanges, containing our POC, is attached to the
forum post: http://forums.pentaho.org/showthread.php?t=53644

Thanks,
Thiyagu & Raghu






Thiyagu Palanisamy
05-08-2007, 10:50 AM
Hello,

To implement SQL with grouping sets we need to change the way the following
classes interact during loading of cell request data.

1. fastBatchingCellReader.loadAggregations() -> batch.loadAggregation()
2. batch.loadAggregation() -> aggregationManager.loadAggregation()
3. aggregationManager.loadAggregation() -> aggregation.load()
4. aggregation.load() -> Segment.load()

Whether to use SQL with grouping sets needs to be decided at the Batch
level; once we identify the batches that can be grouped, we need to tag
them together and pass this information down to Segment.load() so that a
single SQL statement is fired for multiple batches.

The static Segment.load() method needs information about all the Segment
instances that are going to be queried together, along with some additional
information from the Aggregation/Batch classes.

So we need to gather all the Segment instances from the grouped batches and
pass them to Segment.load().

We are thinking of the following approach to achieve this:

Instead of aggregation.load() calling Segment.load() with its own segment
list, aggregation.load() would return each batch's segment list to the
FastBatchingCellReader (or a class representing the grouped batch).
The FastBatchingCellReader would then collect the segments of the grouped
batches and call the static Segment.load() method, which fires a single SQL
statement with grouping sets and populates all of those segments.

We are close to finishing a working prototype based on the above approach,
and would like your early feedback to make sure we are on the right track.

The code posted at http://forums.pentaho.org/showthread.php?t=53644 is a POC
of the logic for grouping batches and for distributing the results of a
grouping-sets SQL statement to the segments of multiple batches.

Thanks,
Thiyagu




Julian Hyde
05-08-2007, 02:50 PM
Thiyagu,

The details are there, but some of the larger design issues still need
to be solved. In short, we need a top-down design. Here are a few ideas.

To start with, I would like to see a description of the design that a
typical DBA would understand. 'Here is a simple MDX statement, and here
is the SQL with GROUPING SETS it generates'.

A slightly more complex case: consider the case where we load CA, OR,
and all cities in WA. What will the SQL look like for that query? I
don't think it's possible to constrain some grouping sets without
constraining others, so we would end up loading all cities in CA and OR.
That might be a problem; maybe we can generate SQL to avoid it, or just
not merge the batches together (see the sketch below).
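
To make the concern concrete, here is a hand-written sketch (using the FoodMart
store hierarchy; not output of the prototype, and the grouping() column is
illustrative) of a single grouping-sets statement for that pair of batches.
Because the WHERE clause is shared by both grouping sets, the (state, city) set
cannot be restricted to WA without also restricting the (state) set, so it
returns every city in CA and OR as well:

select "store"."store_state" as "c0", "store"."store_city" as "c1",
count("sales_fact_1997"."product_id") as "m0",
grouping("store"."store_city") as "g0"
from "store" "store", "sales_fact_1997" "sales_fact_1997"
where "sales_fact_1997"."store_id" = "store"."store_id" and
"store"."store_state" in ('CA', 'OR', 'WA')
group by grouping sets(("store"."store_state", "store"."store_city"),
("store"."store_state"))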

What happens if a query requests batches A, B, and C, and B is already
being loaded?

What happens if a query requests batches A, B, C, D, and B and D are
better satisfied using aggregate tables? Suppose further that B and D
can be satisfied using the same aggregate table; should we generate
GROUPING SETS syntax against aggregate tables?
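
If we did decide to generate GROUPING SETS against aggregate tables, the
statement might look something like the following sketch; "agg_gender_year" and
its columns are hypothetical names used only for illustration, not an actual
FoodMart aggregate table:

select "agg_gender_year"."the_year" as "c0",
"agg_gender_year"."gender" as "c1",
sum("agg_gender_year"."sales_count") as "m0"
from "agg_gender_year" "agg_gender_year"
where "agg_gender_year"."the_year" = 1997
group by grouping sets(("agg_gender_year"."the_year", "agg_gender_year"."gender"),
("agg_gender_year"."the_year"))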

If a query only has one grouping set, will we use the GROUPING SETS
syntax, or just a regular GROUP BY?
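
For a single grouping set the two forms are equivalent, so either would be
correct; for example (fragments shown for illustration only):

-- these two GROUP BY clauses produce the same result
... group by grouping sets(("time_by_day"."the_year", "customer"."gender"))
... group by "time_by_day"."the_year", "customer"."gender"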

Concurrency control is a huge issue to be designed. Previously, load
would lock an Aggregation. That is no longer feasible, because a load will
affect more than one Aggregation. The POC code continues to lock only the
first Aggregation.


Several places in the code take an array of Aggregations. I think that
they should take an array of Segments instead. This is valid because - I
think - the Segments in a given load will all belong to different
Aggregations. This array of Segments could be refactored into a class,
which could hold some of the newly-static methods.

You have added tests in CmdRunnerTest and FastBatchingCellReaderTest. I
think the MDX-level tests should be in BasicQueryTest - you are not
testing new functionality in CmdRunner, after all.
FastBatchingCellReaderTest could be useful; I look forward to seeing it
filled out. If I were writing this feature, I would be adding tests very
similar to those in TestAggregationManager.

Specific code reviews.

AggQuerySpec:

* Use collections (e.g. lists) for modern code. They are definitely
clearer than 2D arrays. For example, 'SqlQuery.ClauseList[]
getGroupingSets(RolapStar.Column[][] columns, SqlQuery sqlQuery)' would
certainly benefit from this.

Aggregation:

* The method
  public synchronized void load(
      RolapStar.Column[] columns,
      RolapStar.Measure[] measures,
      StarColumnPredicate[] predicates,
      RolapAggregationManager.PinSet pinnedSegments,
      List<Aggregation> grouping)
is effectively static now that you have added the 'grouping' parameter,
which means that the 'synchronized' keyword is now dubious: if you're
loading segments A, B, and C it is not valid to just lock A.

* Check the formatting of parameters in this method and the other new
methods in this file.

* setColumns: Making columns mutable seems to be asking for trouble. I
note that setColumns is called from FastBatchingCellReader.merge, but
what if the aggregation concerned has existing segments? This needs to
be re-thought.

* A lot of Aggregation methods have become static. I don't like static
methods. My hunch is that we can refactor out a new class which
represents a collection of segments being loaded at the same time.

* getSegment is poorly named, since it returns more than one segment.

AggregationManager:

* loadAggregation has a parameter constrainedColumnsBitKey, and uses it
to call RolapStar.lookupOrCreateAggregation. That is no longer
appropriate: the aggregations in the grouping will each have a different
bit key. This is a symptom of a larger problem: the aggregations in the
grouping should be treated symmetrically, and there should be no
'primary' aggregation to call 'load' against.

* 'spec.generateSqlQuery(grouping)': shouldn't the grouping be part of
the QuerySpec?

Segment:

* exec() and getObjects() should return Object[][], not Object[]. Even
better, return List<Object[]>.

* getObjects: Why does this throw QueryCanceledException? There are many
reasons why ResultSet.next() might throw.

DrillThroughQuerySpec:

* Should assert that the groupingSet array is empty.

CmdRunnerTest:

* Move tests out of CmdRunnerTest into BasicQueryTest or similar. You're
not really testing functionality in CmdRunner.

SqlQuery:

* Rather than making ClauseList public, I suggest you create a specific
inner class GroupingSetClause. Some of the other helper methods could be
moved to this class.

* SqlQuery.groupingSets should accept a StringBuilder rather than return
a String.

* 'Grouping set' makes the terminology confusing - not your fault - but
I suggest that you rename 'setGroupingSet(ClauseList[] groupingSets)' to
'setGroupingSetList(List<ClauseList> groupingSetList)'.

* Grouping set names are identifiers. They need to be quoted according
to the dialect.

General:

* Use StringBuilder rather than StringBuffer for all new code.

* Use foreach-style 'for' loops wherever possible.

* Add javadoc.

Julian

