Hitachi Vantara Pentaho Community Forums

Thread: [Mondrian] RE: Interest in aggregate tables and their utilization

  1. #1
    Julian Hyde Guest

    [Mondrian] RE: Interest in aggregate tables and their utilization

    Justin,

    I'm pleased that you are implementing aggregate tables for MySQL. There is
    definitely a huge need. Without some kind of materialization, any database
    can only answer a top-level ROLAP query by doing a full table scan. That
    effort tends to grow linearly with the data size, and MySQL runs out of
    steam fast.

    There are several difficult problems with aggregate tables:
    1. Choosing the right set of aggregate tables to instantiate;
    2. Generating the DDL for those aggregate tables;
    3. Initially populating them;
    4. Incrementally populating them;
    5. Formulating queries so that the aggregate tables are used.

    All of these problems are hard. The good news is that Mondrian's approach is
    complementary to yours. We knew that MySQL's optimizer couldn't do #5 and
    was unlikely to do so in the near future, so we explicitly included the
    aggregate definitions in mondrian's schema and changed the SQL we generate.
    AggGen was a weak attempt at #1, #2 and #3, and you are doing an excellent
    job at #4 (also #3).
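
    To make #5 concrete, here is a minimal sketch of the rewrite (table and
    column names are invented for illustration; this shows the idea rather than
    the exact SQL mondrian emits). Instead of scanning the fact table, mondrian
    reads the declared aggregate table:

    -- without an aggregate table: full scan of the fact table
    SELECT time.year, SUM(sales.dollars) AS total_dollars
    FROM sales JOIN time ON sales.day = time.day
    GROUP BY time.year;

    -- with an aggregate table agg_sales_year (year, dollars_sum, fact_count)
    -- declared in the schema, mondrian generates something like:
    SELECT agg.year, SUM(agg.dollars_sum) AS total_dollars
    FROM agg_sales_year AS agg
    GROUP BY agg.year;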

    Even better news: there is a superior tool for recommending and creating
    aggregate tables (#1, #2, #3). It is called Pentaho Aggregate Designer [ see
    http://ci.pentaho.com/view/Analysis/...designer-core/ and
    http://ci.pentaho.com/view/Analysis/job/PAD/ ].

    PAD is an aggregate table recommender, exposed via a user interface. The
    recommendation algorithm is inspired by "Implementing Data Cubes
    Efficiently" (Harinarayan, Rajaraman, and Ullman, 1995).
    http://ilpubs.stanford.edu:8090/102/, and I used a randomized algorithm to
    tame the beastly polynomial running time.

    The algorithm is pluggable. You can drop in a completely different
    algorithm, if you have the urge to write one, provided that it works on the
    same inputs and produces a set of aggregate tables. The tool is also
    extensible: you can write your own input plugin to read a schema, or output
    plugin to generate DDL. The algorithm can even run from the command line,
    and has been embedded inside a stored procedure in LucidDB.

    The algorithm works on a static analysis of the schema (structure of the
    star schema, observed cardinalities of rows and column values) but does not
    today take query patterns into account. Strictly speaking, if you want to be
    driven purely by the actual queries, AggGen is the superior algorithm, because
    it generates the aggregates for precisely the queries the end user entered.
    But PAD generates much better coverage.

    I suggest you take a look at the PAD source code and see whether it would be
    possible to write a plugin to generate stored procedure calls. Also download
    and run PAD; playing with the UI gives you an idea of what it can do.

    The algorithm could certainly do a good job at generating a near-optimal set
    of aggregate tables for a given amount of available disk space.

    Julian

    > -----Original Message-----
    > From: Justin Swanhart [mailto:greenlion (AT) gmail (DOT) com]
    > Sent: Wednesday, June 09, 2010 12:35 PM
    > To: jhyde (AT) pentaho (DOT) com
    > Cc: mondrian (AT) pentaho (DOT) org
    > Subject: Interest in aggregate tables and their utilization
    >
    > Hi,
    >
    > First, I should preface this with a note that I have very limited
    > experience with Mondrian, but extensive experience with MySQL and SQL
    > in general. Also, sorry for the length of this post, but aggregate
    > tables and maintaining them is a complex topic.
    >
    > I am interested in aggregate tables, not because I personally have a
    > large data set which I am reporting on, but instead because I have a
    > project called Flexviews (flexviews.sourceforge.net) which adds
    > fast (incrementally) refreshable materialized views to MySQL. These
    > views work similarly to fast refresh views in Oracle, materialized query
    > tables in DB2, or indexed views in SQL Server. Flexviews has three
    > parts: a data dictionary (MySQL tables), an external CDC tool (which
    > makes table change logs) and a set of stored procedures for
    > manipulating the dictionary (to create or modify views) and for
    > refreshing the materialized views from the table change logs. It also
    > supports completely rebuilding views from scratch.
    >
    > Flexviews supports the major aggregate functions: SUM, COUNT, AVG,
    > MIN, MAX, COUNT DISTINCT, and joins. It should be able to
    > incrementally maintain any of the aggregate tables that are defined by
    > AggGen.
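    >
    > To illustrate why even AVG is incrementally maintainable (a rough sketch
    > with made-up table names, not the exact storage Flexviews creates): the
    > materialized view only has to store SUM and COUNT, and AVG is derived on
    > read, so applying deltas to SUM and COUNT is enough:
    >
    > -- hypothetical aggregate storage for AVG(amount) grouped by prod_id
    > CREATE TABLE sales_agg (
    >   prod_id INT PRIMARY KEY,
    >   sum_amount DECIMAL(18,2) NOT NULL,
    >   cnt BIGINT NOT NULL
    > );
    >
    > -- AVG is recomputed on read from the maintained SUM and COUNT
    > SELECT prod_id, sum_amount / cnt AS avg_amount FROM sales_agg;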
    >
    > One of the difficulties with Flexviews is that incrementally
    > maintaining aggregate tables is only one part of the problem.
    > Effectively using them requires a query rewrite strategy to
    > automatically use the aggregate tables when they are available. As
    > far as I know, Mondrian is the only open source software that has
    > similar functionality. This pairs well with Flexviews, which is the
    > only open source (LGPL) project which supports incrementally
    > refreshing materialized views.
    >
    > I am thinking about modifying AggGen to support the stored procedure
    > API to create Flexviews materialized views directly. I was wondering
    > if you had any thoughts or suggestions about
    > such modifications. I was also wondering if any progress had been
    > made on the other tools which are mentioned in the Aggregate Tables
    > section of the manual such as the recommender. Another challenge with
    > Flexviews is of course, choosing the best views to materialize.
    >
    > For those curious how Flexviews works:
    >
    > Flexviews is based primarily on:
    > http://pages.cs.wisc.edu/~beyer/pape...w_sigmod00.pdf
    > (How to Roll a Join)
    > http://citeseerx.ist.psu.edu/viewdoc...10.1.1.54.8573
    > (Maintenance of Data Cubes and Summary Tables in a Warehouse)
    >
    > The first paper covers the incremental computation for SPJ queries. It
    > should be noted that Flexviews doesn't currently implement rolling
    > propagation, but only the simpler propagate function (both are defined
    > by pseudocode in Beyer's paper). The better algorithm will be
    > implemented in the future, but the lesser-performing version is still
    > orders of magnitude faster than a complete refresh. Flexviews
    > implements a recursive stored procedure version of the algorithm, and
    > communicates with the external CDC program which is processing the
    > database transaction logs (binary logs). The paper does not address
    > aggregation and instead points the reader to the second paper.
    > One last note: the algorithm features "two-phase propagation". That
    > is, computation of deltas is completely orthogonal to the application
    > of the deltas, and they can be performed asynchronously. This means
    > that deltas can be calculated periodically during daily processing,
    > and applied to the view at a later time in bulk. This provides
    > another nice benefit in the multiple views can be refreshed to the
    > same exact transactional point in time providing a consistent
    > aggregated data model.
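    >
    > Conceptually the two phases look something like this (a simplified sketch
    > with made-up table and column names; the real work is driven by the
    > Flexviews dictionary and change logs):
    >
    > -- phase 1: compute deltas from the captured changes into a delta table
    > -- (dml_sign is +1 for an inserted row and -1 for a deleted row)
    > INSERT INTO sales_agg_delta (prod_id, d_sum_amount, d_cnt)
    > SELECT prod_id, SUM(amount * dml_sign), SUM(dml_sign)
    > FROM sales_changelog
    > GROUP BY prod_id;
    >
    > -- phase 2 (later, in bulk): apply the deltas to the materialized view
    > INSERT INTO sales_agg (prod_id, sum_amount, cnt)
    > SELECT prod_id, d_sum_amount, d_cnt
    > FROM sales_agg_delta
    > ON DUPLICATE KEY UPDATE
    >   sum_amount = sum_amount + VALUES(sum_amount),
    >   cnt = cnt + VALUES(cnt);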
    >
    > The second paper provides a method of using 'summary aggregate tables'
    > and it goes into detail about how transformations on various aggregate
    > functions can be performed in order to compute incremental results.
    > For instance, COUNT(*) can be turned into SUM(1), etc. The paper
    > covers the general rewrites necessary to support maintaining aggregate
    > tables. The ideas in this paper pair pretty well with the rolling
    > join propagation algorithm.
    >
    > Instead of handling non-distributable aggregate functions directly,
    > Flexviews creates an ancillary view which projects out the expressions
    > as group-by attributes:
    > select a,b,sum(c),min(d)
    > from xyz
    > group by a,b;
    >
    > will create a second view:
    > select a,b,d
    > from xyz
    > group by a,b,d
    >
    > This has the side effect of making all single-table views
    > self-maintainable, even those with non-distributable aggregate functions,
    > at the cost of extra space and maintenance effort.
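    >
    > For example, if rows are deleted from xyz, the new MIN(d) for the affected
    > (a,b) groups can be recomputed from the ancillary view rather than from xyz
    > itself (a sketch, with xyz_min_anc standing in for the ancillary view and
    > xyz_delta for the set of changed (a,b) groups):
    >
    > SELECT a, b, MIN(d) AS min_d
    > FROM xyz_min_anc
    > WHERE (a, b) IN (SELECT a, b FROM xyz_delta)
    > GROUP BY a, b;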
    >
    >
    > --
    > Justin Swanhart
    >


    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  2. #2
    Justin Swanhart Guest

    [Mondrian] Re: Interest in aggregate tables and their utilization

    Hi,

    I haven't been having much luck working directly with the aggregate
    designer. I can't get the source code to compile in Eclipse, and when
    I try to use the RC version on sourceforge, everything works great
    until I try to preview or export the aggregate scripts (see below).
    Exporting the modified mondrian schema XML file works though.

    I tried this on both a Vista machine with the Sun JVM and on an FC8
    machine with OpenJDK.

    22:00:32,262 ERROR [SwingButton] Error calling oncommand event: exportHandler.showPreview()
    org.pentaho.ui.xul.XulException: Error invoking method: exportHandler.showPreview()
    at org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDomContainer.java:327)
    at org.pentaho.ui.xul.swing.tags.SwingButton$OnClickRunnable.run(SwingButton.java:58)
    at java.awt.event.InvocationEvent.dispatch(Unknown Source)
    at java.awt.EventQueue.dispatchEvent(Unknown Source)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
    at java.awt.Dialog$1.run(Unknown Source)
    at java.awt.Dialog$3.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.awt.Dialog.show(Unknown Source)
    at java.awt.Component.show(Unknown Source)
    at java.awt.Component.setVisible(Unknown Source)
    at java.awt.Window.setVisible(Unknown Source)
    at java.awt.Dialog.setVisible(Unknown Source)
    at org.pentaho.ui.xul.swing.tags.SwingDialog.show(SwingDialog.java:234)
    at org.pentaho.aggdes.ui.form.controller.ExportHandler.openDialog(ExportHandler.java:144)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDomContainer.java:323)
    at org.pentaho.ui.xul.swing.tags.SwingButton$OnClickRunnable.run(SwingButton.java:58)
    at java.awt.event.InvocationEvent.dispatch(Unknown Source)
    at java.awt.EventQueue.dispatchEvent(Unknown Source)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.run(Unknown Source)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDomContainer.java:323)
    ... 30 more
    Caused by: java.lang.NullPointerException
    at mondrian.rolap.RolapStar$Column.generateExprString(RolapStar.java:1168)
    at mondrian.rolap.RolapStar.generateSql(RolapStar.java:925)
    at org.pentaho.aggdes.model.mondrian.MondrianSchema.generateAggregateSql(MondrianSchema.java:273)
    at org.pentaho.aggdes.output.impl.PopulateTableGenerator.generate(PopulateTableGenerator.java:59)
    at org.pentaho.aggdes.output.impl.AbstractGenerator.generateFull(AbstractGenerator.java:34)
    at org.pentaho.aggdes.output.impl.OutputServiceImpl.getFullArtifact(OutputServiceImpl.java:152)
    at org.pentaho.aggdes.ui.form.controller.ExportHandler.showPreview(ExportHandler.java:212)
    ... 35 more

    Clicking 'execute' on the CREATE scripts doesn't generate an
    exception, but it triggers a SQL error about a duplicate column on the
    15th view:
    StatementCallback; bad SQL grammar [CREATE TABLE `dw_Sales_Cube_15` (
    `customer_State` CHAR(2),
    `time_Year` YEAR(4),
    `time_Quarter` INT(11),
    `time_Month` INT(11),
    `product_Brand` CHAR(20),
    `sales_total_sale` DOUBLE,
    `sales_Fact_Count` INTEGER,
    `sales_fact_count` INTEGER)]; nested exception is
    java.sql.BatchUpdateException: Duplicate column name
    'sales_fact_count'

    I can give you my database, my mondrian schema, or both. The database is
    randomly generated data for the snowflake schema described in the
    Aggregate Tables portion of the manual: sales, mfr, product, customer,
    day. There could be a problem in my schema file.
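
    At a glance the duplicate looks like a case-sensitivity collision: MySQL
    column names are case-insensitive, so the generated `sales_Fact_Count` and
    `sales_fact_count` are treated as the same column. A manual workaround
    (assuming the two columns really would carry the same fact count) is to edit
    the generated DDL down to a single count column, e.g.:

    CREATE TABLE `dw_Sales_Cube_15` (
    `customer_State` CHAR(2),
    `time_Year` YEAR(4),
    `time_Quarter` INT(11),
    `time_Month` INT(11),
    `product_Brand` CHAR(20),
    `sales_total_sale` DOUBLE,
    `sales_fact_count` INTEGER);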

    --Justin

    On Sat, Jun 12, 2010 at 12:32 AM, Julian Hyde <jhyde (AT) pentaho (DOT) com> wrote:
    > Justin,
    >
    > I'm pleased that you are implementing aggregate tables for MySQL. There is
    > definitely a huge need.
  3. #3
    Justin Swanhart Guest

    [Mondrian] Re: Interest in aggregate tables and their utilization

    Hi,

    I have the same problem with the stable release. However, I
    discovered what triggers the problem: when I put 100 in as the maximum
    number of aggregate tables to generate, the tool produces 15 tables
    and I cannot preview the output. Interestingly, when I enter 15 as
    the maximum limit, only 12 tables are generated and there are no errors.

    I have updated the Flexviews "converter" script at
    http://flexviews.sourceforge.net/convert.php
    It can now take the 'populate' SQL script created by the aggregate
    designer and convert the INSERT ... SELECT statements into stored
    procedure calls. Here is an example of the input and output (it can
    support any number of tables at once):
    INPUT:
    -- Populate aggregate table dw_Sales_Cube_12
    INSERT INTO `dw_Sales_Cube_12` (
    `customer_State`,
    `time_Year`,
    `time_Quarter`,
    `time_Month`,
    `product_Brand`,
    `sales_total_sale`,
    `sales_fact_count`)
    select
    `customer`.`state` as `customer_State`,
    `time`.`year` as `time_Year`,
    `time`.`quarter` as `time_Quarter`,
    `time`.`month` as `time_Month`,
    `product`.`brand` as `product_Brand`,
    sum(units * dollars) as `sales_total_sale`,
    count(*) as `sales_fact_count`
    from
    `dw`.`sales` as `sales`,
    `dw`.`customer` as `customer`,
    `dw`.`time` as `time`,
    `dw`.`product` as `product`
    where
    `sales`.`custid` = `customer`.`custid` and
    `sales`.`day` = `time`.`day` and
    `sales`.`prodid` = `product`.`prodid`
    group by
    `customer`.`state`,
    `time`.`year`,
    `time`.`quarter`,
    `time`.`month`,
    `product`.`brand`;

    OUTPUT:

    # REFRESH_TYPE: INCREMENTAL
    CALL flexviews.create('test', 'dw_Sales_Cube_12', 'INCREMENTAL');
    SET @mvid := LAST_INSERT_ID();
    CALL flexviews.add_table(@mvid,'dw', 'sales','sales', NULL);
    CALL flexviews.add_table(@mvid,'dw', 'customer','customer', ' ');
    CALL flexviews.add_table(@mvid,'dw', 'time','time', ' ');
    CALL flexviews.add_table(@mvid,'dw', 'product','product', ' ');
    CALL flexviews.add_expr(@mvid,'GROUP', 'customer.state', 'customer_State');
    CALL flexviews.add_expr(@mvid,'GROUP', 'time.year', 'time_Year');
    CALL flexviews.add_expr(@mvid,'GROUP', 'time.quarter', 'time_Quarter');
    CALL flexviews.add_expr(@mvid,'GROUP', 'time.month', 'time_Month');
    CALL flexviews.add_expr(@mvid,'GROUP', 'product.brand', 'product_Brand');
    CALL flexviews.add_expr(@mvid,'SUM', 'units * dollars', 'sales_total_sale');
    CALL flexviews.add_expr(@mvid,'COUNT', '*', 'sales_fact_count');
    CALL flexviews.add_expr(@mvid,'WHERE','sales.custid = customer.custid
    and sales.day = time.day and sales.prodid = product.prodid' ,
    'where_clause');
    CALL flexviews.enable(@mvid);
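
    Once the view is enabled, it is kept current by periodically refreshing it
    from the change logs; roughly (see the Flexviews manual for the exact
    refresh() arguments):

    -- compute and apply deltas in one step
    CALL flexviews.refresh(@mvid, 'BOTH', NULL);

    -- or, two-phase: compute deltas now, apply them later in bulk
    CALL flexviews.refresh(@mvid, 'COMPUTE', NULL);
    CALL flexviews.refresh(@mvid, 'APPLY', NULL);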

    --Justin

    On Sun, Jun 13, 2010 at 10:13 PM, Justin Swanhart <greenlion (AT) gmail (DOT) com> wrote:
    > Hi,
    >
    > I haven't been having much luck working directly with the aggregate
    > designer.
