Hitachi Vantara Pentaho Community Forums
Results 1 to 5 of 5

Thread: [Mondrian] All Members vs Aggregate

  1. #1
    Thiyagu Palanisamy Guest

    Default [Mondrian] All Members vs Aggregate

    As of now mondrian uses multiple sql queries to get data for a MDX with
    members at different levels. We are working on changing this behavior to
    combine these multiple sqls to single one.

    For an example this MDX generates 2 sqls, one for all gender and another
    for male,female

    select gender.members on 0 from [sales 2] where measures.[Sales Count];

    sql:
    1. select "time_by_day"."the_year" as "c0", "customer"."gender" as "c1",
    count("sales_fact_1997"."product_id") as "m0"
    from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
    "customer" "customer"
    where
    "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    "time_by_day"."the_year" = 1997 and
    "sales_fact_1997"."customer_id" = "customer"."customer_id"
    group by "time_by_day"."the_year", "customer"."gender"

    2. select "time_by_day"."the_year" as "c0",
    count("sales_fact_1997"."product_id") as "m0"
    from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997"
    where
    "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    "time_by_day"."the_year" = 1997
    group by
    "time_by_day"."the_year"


    with our prototype, these sqls are grouped into one (this query will only
    work with DBs that support grouping sets):

    1. select "time_by_day"."the_year" as "c0", "customer"."gender" as "c1",
    count("sales_fact_1997"."product_id") as "m0"
    from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
    "customer" "customer"
    where
    "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    "time_by_day"."the_year" = 1997 and
    "sales_fact_1997"."customer_id" = "customer"."customer_id"
    group by grouping sets(("time_by_day"."the_year", "customer"."gender"),
    ("time_by_day"."the_year"))

    Although the grouping set implementation looks fine and works in most
    cases, there is a tricky problem associated with it. The problem is
    grouping "all" (aggregate) value got from rolling up is equal to aggregate
    of the members involved and not the all value

    to reproduce it, alter fact table and insert rows for null customers:

    --modify fact table to allow nulls for customer_id
    alter table "sales_fact_1997" modify "customer_id" null
    --add sales records with no customers
    insert into "sales_fact_1997" values (337,371,null,0,2,6.3455,2.3445,4)
    insert into "sales_fact_1997" values (337,440,null,0,2,6.3355,2.335,2)

    results:
    remains same with grouping sql:
    Axis #0:
    {[Measures].[Sales Count]}
    Axis #1:
    {[Gender].[All Gender]}
    {[Gender].[All Gender].[F]}
    {[Gender].[All Gender].[M]}
    Row #0: 86,837
    Row #0: 42,831
    Row #0: 44,006

    without grouping (mondrian's current behavior):
    ....
    Row #0: 86,839 //includes 2 records that don't belong to any
    gender
    Row #0: 42,831
    Row #0: 44,006

    In summary "All" in mondrian is not really an aggregate of all the
    members, it also includes data from rows that don't join to fact table.
    This behavior is not consistent with AS 2000 (MOLAP).
    How should we handle this issue?
    One of the options that we can think of is, to use outer joins. The
    problem with this is it returns addition row for "null" gender.



    Thanks,
    Thiyagu & Raghu


    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  2. #2
    Richard Emberson Guest

    Default Re: [Mondrian] All Members vs Aggregate

    In our local usage, this has been know for a while;
    dimensions can have
    only partial coverage of the base fact table. If you
    have orphan rows in the fact table for a given
    dimension, then for an
    aggregate the sum across a lower level will not equal
    the sum at the top, all level. This is because for
    lower levels the sql is joining across from fact table
    foreign key to dimension table primary key. But for
    the top level result, there is an optimization(?)
    and the generated sql does not assure that only fact
    table rows with matching entries in the dimension table
    make up the result.
    One might wish that there was a property that controled
    this behavior, so that, if one wished, that no
    orphan rows be counted. The alternative is to have
    your dbs make sure that there is no bad data in the
    fact tables (we have experienced that, as with much of
    life, dirt happens; over time bad data creeps into
    fact tables).

    Richard


    Thiyagu Palanisamy wrote:
    >
    > As of now mondrian uses multiple sql queries to get data for a MDX with
    > members at different levels. We are working on changing this behavior
    > to combine these multiple sqls to single one.
    >
    > For an example this MDX generates 2 sqls, one for all gender and another
    > for male,female
    >
    > select gender.members on 0 from [sales 2] where measures.[Sales Count];
    >
    > sql:
    > 1. /select "time_by_day"."the_year" as "c0", "customer"."gender" as
    > "c1", count("sales_fact_1997"."product_id") as "m0"
    > from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
    > "customer" "customer"
    > where
    > "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    > "time_by_day"."the_year" = 1997 and
    > "sales_fact_1997"."customer_id" = "customer"."customer_id"
    > group by "time_by_day"."the_year", "customer"."gender" /
    >
    > 2. /select "time_by_day"."the_year" as "c0",
    > count("sales_fact_1997"."product_id") as "m0"
    > from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997"
    > where
    > "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    > "time_by_day"."the_year" = 1997
    > group by
    > "time_by_day"."the_year"
    > /
    >
    > with our prototype, these sqls are grouped into one (this query will
    > only work with DBs that support grouping sets):
    >
    > 1. /select "time_by_day"."the_year" as "c0", "customer"."gender" as
    > "c1", count("sales_fact_1997"."product_id") as "m0"
    > from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
    > "customer" "customer"
    > where
    > "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    > "time_by_day"."the_year" = 1997 and
    > "sales_fact_1997"."customer_id" = "customer"."customer_id"
    > group by grouping sets(("time_by_day"."the_year", "customer"."gender"),
    > ("time_by_day"."the_year")) /
    >
    > Although the grouping set implementation looks fine and works in most
    > cases, there is a tricky problem associated with it. The problem is
    > grouping "all" (aggregate) value got from rolling up is equal to
    > aggregate of the members involved and not the all value
    >
    > to reproduce it, alter fact table and insert rows for null customers:
    >
    > --modify fact table to allow nulls for customer_id
    > alter table "sales_fact_1997" modify "customer_id" null
    > --add sales records with no customers
    > insert into "sales_fact_1997" values (337,371,null,0,2,6.3455,2.3445,4)
    > insert into "sales_fact_1997" values (337,440,null,0,2,6.3355,2.335,2)
    >
    > results:
    > remains same with grouping sql:
    > Axis #0:
    > {[Measures].[Sales Count]}
    > Axis #1:
    > {[Gender].[All Gender]}
    > {[Gender].[All Gender].[F]}
    > {[Gender].[All Gender].[M]}
    > Row #0: 86,837
    > Row #0: 42,831
    > Row #0: 44,006
    >
    > without grouping (mondrian's current behavior):
    > ...
    > Row #0: 86,839 //includes 2 records that don't belong to any gender
    > Row #0: 42,831
    > Row #0: 44,006
    >
    > In summary "All" in mondrian is not really an aggregate of all the
    > members, it also includes data from rows that don't join to fact table.
    > This behavior is not consistent with AS 2000 (MOLAP).
    > How should we handle this issue?
    > One of the options that we can think of is, to use outer joins. The
    > problem with this is it returns addition row for "null" gender.
    >
    >
    >
    > Thanks,
    > Thiyagu & Raghu
    >
    >
    > ------------------------------------------------------------------------
    >
    > _______________________________________________
    > Mondrian mailing list
    > Mondrian (AT) pentaho (DOT) org
    > http://lists.pentaho.org/mailman/listinfo/mondrian



    --
    Quis custodiet ipsos custodes:
    This email message is for the sole use of the intended recipient(s) and
    may contain confidential information. Any unauthorized review, use,
    disclosure or distribution is prohibited. If you are not the intended
    recipient, please contact the sender by reply email and destroy all
    copies of the original message.
    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  3. #3
    Thiyagu Palanisamy Guest

    Default Re: [Mondrian] All Members vs Aggregate

    Hi,

    For now we are leaving All Members vs Aggregate issue open and proceeding
    with getting Grouping sets functionality working.

    We have developed a Proof of concept to achieve the Grouping sets
    functionality, with this we have managed to get all the tests except 2
    tests passing. Two failing tests are because of All member queries getting
    orphan rows(InlineTableTests).

    Limitations of our POC:
    1. "All value" will not be consistent in cases where there are extra fact
    rows (i.e, rows that don't connect to the dimension of interest).

    2. The implementation doesn't reduce the number of queries in cases like
    select {member,member.children} ...
    This is because, it becomes very tricky to identify that members.children
    roll up to member, since the roll-up logic heavily relies on constraint
    bit key to identify batch that can be grouped into a single sql query

    Note: POC is a hacky solution, it doesn't consider issues like
    Concurrency..

    Logic for identifying the Batches which can be grouped :
    1. If Batch A's constraint columns list is super set of Batch B's
    constraint columns
    and
    2. If matching columns of the Batch A and Batch B has the same value
    and
    3. If Non matching columns of super set have All values

    We are sending this for an early feed back, if you see any issues with our
    approach please let us know, we will be starting our actual implementation
    next week.

    Attached the tar file created using packChanges containing our POC in the
    forum post : http://forums.pentaho.org/showthread.php?t=53644

    Thanks,
    Thiyagu & Raghu





    Richard Emberson <remberson (AT) edgedynamics (DOT) com>
    Sent by: mondrian-bounces (AT) pentaho (DOT) org
    04/24/2007 07:02 PM
    Please respond to
    Mondrian developer mailing list <mondrian (AT) pentaho (DOT) org>


    To
    Mondrian developer mailing list <mondrian (AT) pentaho (DOT) org>
    cc

    Subject
    Re: [Mondrian] All Members vs Aggregate






    In our local usage, this has been know for a while;
    dimensions can have
    only partial coverage of the base fact table. If you
    have orphan rows in the fact table for a given
    dimension, then for an
    aggregate the sum across a lower level will not equal
    the sum at the top, all level. This is because for
    lower levels the sql is joining across from fact table
    foreign key to dimension table primary key. But for
    the top level result, there is an optimization(?)
    and the generated sql does not assure that only fact
    table rows with matching entries in the dimension table
    make up the result.
    One might wish that there was a property that controled
    this behavior, so that, if one wished, that no
    orphan rows be counted. The alternative is to have
    your dbs make sure that there is no bad data in the
    fact tables (we have experienced that, as with much of
    life, dirt happens; over time bad data creeps into
    fact tables).

    Richard


    Thiyagu Palanisamy wrote:
    >
    > As of now mondrian uses multiple sql queries to get data for a MDX with
    > members at different levels. We are working on changing this behavior
    > to combine these multiple sqls to single one.
    >
    > For an example this MDX generates 2 sqls, one for all gender and another


    > for male,female
    >
    > select gender.members on 0 from [sales 2] where measures.[Sales Count];
    >
    > sql:
    > 1. /select "time_by_day"."the_year" as "c0", "customer"."gender" as
    > "c1", count("sales_fact_1997"."product_id") as "m0"
    > from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
    > "customer" "customer"
    > where
    > "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    > "time_by_day"."the_year" = 1997 and
    > "sales_fact_1997"."customer_id" = "customer"."customer_id"
    > group by "time_by_day"."the_year", "customer"."gender" /
    >
    > 2. /select "time_by_day"."the_year" as "c0",
    > count("sales_fact_1997"."product_id") as "m0"
    > from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997"
    > where
    > "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    > "time_by_day"."the_year" = 1997
    > group by
    > "time_by_day"."the_year"
    > /
    >
    > with our prototype, these sqls are grouped into one (this query will
    > only work with DBs that support grouping sets):
    >
    > 1. /select "time_by_day"."the_year" as "c0", "customer"."gender" as
    > "c1", count("sales_fact_1997"."product_id") as "m0"
    > from "time_by_day" "time_by_day", "sales_fact_1997" "sales_fact_1997",
    > "customer" "customer"
    > where
    > "sales_fact_1997"."time_id" = "time_by_day"."time_id" and
    > "time_by_day"."the_year" = 1997 and
    > "sales_fact_1997"."customer_id" = "customer"."customer_id"
    > group by grouping sets(("time_by_day"."the_year", "customer"."gender"),


    > ("time_by_day"."the_year")) /
    >
    > Although the grouping set implementation looks fine and works in most
    > cases, there is a tricky problem associated with it. The problem is
    > grouping "all" (aggregate) value got from rolling up is equal to
    > aggregate of the members involved and not the all value
    >
    > to reproduce it, alter fact table and insert rows for null customers:
    >
    > --modify fact table to allow nulls for customer_id
    > alter table "sales_fact_1997" modify "customer_id" null
    > --add sales records with no customers
    > insert into "sales_fact_1997" values (337,371,null,0,2,6.3455,2.3445,4)
    > insert into "sales_fact_1997" values (337,440,null,0,2,6.3355,2.335,2)
    >
    > results:
    > remains same with grouping sql:
    > Axis #0:
    > {[Measures].[Sales Count]}
    > Axis #1:
    > {[Gender].[All Gender]}
    > {[Gender].[All Gender].[F]}
    > {[Gender].[All Gender].[M]}
    > Row #0: 86,837
    > Row #0: 42,831
    > Row #0: 44,006
    >
    > without grouping (mondrian's current behavior):
    > ...
    > Row #0: 86,839 //includes 2 records that don't belong to any

    gender
    > Row #0: 42,831
    > Row #0: 44,006
    >
    > In summary "All" in mondrian is not really an aggregate of all the
    > members, it also includes data from rows that don't join to fact table.
    > This behavior is not consistent with AS 2000 (MOLAP).
    > How should we handle this issue?
    > One of the options that we can think of is, to use outer joins. The
    > problem with this is it returns addition row for "null" gender.
    >
    >
    >
    > Thanks,
    > Thiyagu & Raghu
    >
    >
    > ------------------------------------------------------------------------
    >
    > _______________________________________________
    > Mondrian mailing list
    > Mondrian (AT) pentaho (DOT) org
    > http://lists.pentaho.org/mailman/listinfo/mondrian



    --
    Quis custodiet ipsos custodes:
    This email message is for the sole use of the intended recipient(s) and
    may contain confidential information. Any unauthorized review, use,
    disclosure or distribution is prohibited. If you are not the intended
    recipient, please contact the sender by reply email and destroy all
    copies of the original message.
    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian


    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  4. #4
    Thiyagu Palanisamy Guest

    Default Re: [Mondrian] All Members vs Aggregate

    Hello,

    For implementing Sql with Grouping Set we need to change the way the
    following classes interact during loading of Cell request data.

    1. fastBatchingCellReader.loadAggregations() -> batch.loadAggregation()
    2. batch.loadAggregation() -> aggregationManager.loadAggregation()
    3. aggregationManager.loadAggregation() -> aggregation.load()
    4. aggregation.load() -> Segment.load()

    Using SQL with Grouping Set function needs to be decided at the Batch
    level, once we identify the batches that can be grouped then we need to
    tag them together and pass this information down to Segment.load() to fire
    single SQL for multiple batches.

    Segment.load() static method needs information about all the segment
    instances which are going to be queried together along with some
    additional information from Aggregation/Batch class.

    So we need to put together all the segment instances from multiple grouped
    batches and pass it to Segment.load().

    We are thinking of the following approach to achieve this,

    Instead of aggregation.load() calling Segment.load() with its Segments
    list, aggregation.load() need to return Segment list of each batch to the
    FBCR(or class representing Grouped Batch).
    Then FBCR will group the segments of grouped batches and make a call to
    Segment.load static method to fire single sql with grouping sets and
    populate all those Segments.

    We are close to finishing a working prototype based on the above approach.
    Hence would like to get your early feedback to ensure that we are on right
    track.

    Code posted at http://forums.pentaho.org/showthread.php?t=53644 is POC for
    logic of grouping the Batch and extracting information from a grouping set
    sql to multiple segments of multiple batches.

    Thanks,
    Thiyagu



    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  5. #5
    Julian Hyde Guest

    Default RE: [Mondrian] All Members vs Aggregate

    Thiyagu,

    The details are there, but some of the larger design issues still need
    to be solved. In short, we need a top-down design. Here are a few ideas.

    To start with, I would like to see a description of the design that a
    typical DBA would understand. 'Here is a simple MDX statement, and here
    is the SQL with GROUPING SETS it generates'.

    A slightly more complex case. Consider the case where we load CA, OR,
    and all cities in WA. What will the SQL look like for that query? I
    don't think it's possible to constrain some grouping sets without
    constraining others, so we would end up loading all cities in CA and OR.
    That might be problem; maybe we can generate SQL to avoid that, or just
    not merge the batches together.

    What happens if a query requests batches A, B, and C, and B is already
    being loaded?

    What happens if a query requests batches A, B, C, D, and B and D are
    better satisfied using aggregate tables? Suppose further that B and D
    can be satisfied using the same aggregate table; should we generate
    GROUPING SETS syntax against aggregate tables?

    If a query only has one GROUPING SET, will be use the GROUPING SET
    syntax, or just the regular GROUP BY?

    Concurrency control is a huge issue to be designed. Previously, load
    would lock an Aggregate. That is not longer feasible, because load will
    affect more than one Aggregate. The POC code continues to lock the first


    Several places in the code take an array of Aggregations. I think that
    they should take an array of Segments instead. This is valid because - I
    think - the Segments in a given load will all belong to different
    Aggregations. This array of Segments could be refactored into a class,
    which could hold some of the newly-static methods.

    You have added tests in CmdRunnerTest and FastBatchingCellReaderTest. I
    think the MDX-level tests should be in BasicQueryTest - you are not
    testing new functionality in CmdRunner, after all.
    FastBatchingCellReaderTest could be useful; I look forward to seeing it
    filled out. If I were writing this feature, I would be adding tests very
    similar to those in TestAggregationManager

    Specific code reviews.

    AggQuerySpec:

    *

    Use collections (e.g. lists) for modern code. They are
    definitely clearer than 2D arrays. For example, 'SqlQuery.ClauseList[]
    getGroupingSets(RolapStar.Column[][] columns, SqlQuery sqlQuery)' would
    certainly benefit from this.

    Aggregation:

    *

    The method
    public synchronized void load(
    RolapStar.Column[] columns,
    RolapStar.Measure[] measures,
    StarColumnPredicate[] predicates,
    RolapAggregationManager.PinSet pinnedSegments,
    List<Aggregation> grouping
    )
    is effectively static now you have added the 'grouping' parameter. Which
    means that the 'synchronized' keyword is now dubious: if you're loading
    segments A, B, and C it is not valid to just lock A.
    *

    Check the formatting for parameters in this method and other new
    methods in this file.
    *

    setColumns: Making columns mutable seems to be asking for
    trouble. I note that setColumns is called from
    FastBatchingCellReader.merge, but what if the aggregation concerned has
    existing segments? This needs to be re-thought.
    *

    A lot of Aggregation methods have become static. I don't like
    static methods. My hunch is that we can refactor out a new class which
    represents a collection of segments being loaded at the same time.
    *

    getSegment is poorly named since it returns more than one
    segment

    AggregationManager:

    *

    loadAggregation has a parameter constrainedColumnsBitKey, and
    uses it to call Rolapstar.lookupOrCreateAggregation. That is no longer
    appropriate: the aggregations in the grouping will each have a different
    bit key. This is a symptom of a larger problem: the aggregations in the
    grouping should be treated symmetrically, and there should be no
    'primary' aggregation to call 'load' against.
    *

    'spec.generateSqlQuery(grouping)' : shouldn't the grouping be
    part of the QuerySpec?

    Segment:

    *

    exec() and getObjects() should return Object[][] not Object[].
    Even better, return List<Object[]>.
    *

    getObjects: Why does this throw QueryCanceledException? There
    are many reasons why ResultSet.next() might throw.

    DrillThroughQuerySpec:

    *

    should assert that groupingSet array is empty

    CmdRunnerTest:

    *

    move tests out of CmdRunnerTest into BasicQueryTest or similar.
    You're not really testing functionality in CmdRunner.

    SqlQuery:

    *


    Rather than making ClauseList public, I suggest you create a
    specific inner class GroupingSetClause. Some of the other helper methods
    could be moved to this class.
    *

    SqlQuery.groupingSets should accept StringBuilder rather than
    return String.
    *

    'grouping set' makes terminology confusing - not your fault -
    but I suggest that you rename 'setGroupingSet(ClauseList[]
    groupingSets)' to 'setGroupingSetList(List<ClauseList> groupingSetList)
    * Grouping set names are identifiers. They need to be quoted
    according to the dialect.

    General:

    *

    use StringBuilder rather than StringBuffer for all new code
    *

    use foreach-style 'for' loops wherever possible
    *

    javadoc

    Julian


    _____

    From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org]
    On Behalf Of Thiyagu Palanisamy
    Sent: Tuesday, May 08, 2007 7:30 AM
    To: mondrian (AT) pentaho (DOT) org
    Subject: Re: [Mondrian] All Members vs Aggregate



    Hello,

    For implementing Sql with Grouping Set we need to change the way the
    following classes interact during loading of Cell request data.

    1. fastBatchingCellReader.loadAggregations() -> batch.loadAggregation()
    2. batch.loadAggregation() -> aggregationManager.loadAggregation()
    3. aggregationManager.loadAggregation() -> aggregation.load()
    4. aggregation.load() -> Segment.load()

    Using SQL with Grouping Set function needs to be decided at the Batch
    level, once we identify the batches that can be grouped then we need to
    tag them together and pass this information down to Segment.load() to
    fire single SQL for multiple batches.

    Segment.load() static method needs information about all the segment
    instances which are going to be queried together along with some
    additional information from Aggregation/Batch class.

    So we need to put together all the segment instances from multiple
    grouped batches and pass it to Segment.load().

    We are thinking of the following approach to achieve this,

    Instead of aggregation.load() calling Segment.load() with its Segments
    list, aggregation.load() need to return Segment list of each batch to
    the FBCR(or class representing Grouped Batch).
    Then FBCR will group the segments of grouped batches and make a call to
    Segment.load static method to fire single sql with grouping sets and
    populate all those Segments.

    We are close to finishing a working prototype based on the above
    approach.
    Hence would like to get your early feedback to ensure that we are on
    right track.

    Code posted at http://forums.pentaho.org/showthread.php?t=53644 is POC
    for logic of grouping the Batch and extracting information from a
    grouping set sql to multiple segments of multiple batches.

    Thanks,
    Thiyagu





    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.