Hitachi Vantara Pentaho Community Forums
Results 1 to 8 of 8

Thread: [Mondrian] Multi-threading SQL execution

  1. #1
    Matt Campbell Guest

    Default [Mondrian] Multi-threading SQL execution

    We've been looking into ways to improve performance when running against
    large fact tables. One candidate we identified was to multi-thread SQL
    query execution. As things stand today, in Mondrian, SQL queries are
    executed sequentially. This means that if an MDX query involves 2 fact
    tables, and each fact table takes 5 seconds to query, that it takes Mondrian
    at least 10 seconds to get results. Running the two queries simultaneously
    would likely produce much better speed (of course dependent on the RDBMS
    capacity).

    Has anyone else considered multi-threading SQL execution?

    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  2. #2
    Julian Hyde Guest

    Default RE: [Mondrian] Multi-threading SQL execution

    Option #1 (ROLLUP/CUBE BY) is viable and useful. If there are differences
    between DBMS vendors in how they implement this support, let's stick to the
    letter of the SQL:2003 standard.

    To implement option #1, someone will have to get their hands dirty
    understanding how cell requests are turned into SQL queries. The hardest
    part is to look at a collection of cell requests and figure out whether they
    can be satisfied using the same query.

    Is there a chance that a ROLLUP query will compute exponentially more
    results than individual GROUP BY queries? If so, we will need to do a
    cost:benefit analysis before issuing a ROLLUP query.

    Option #2 (parallel query execution) is also viable, and is useful if option
    #1 is implemented, because certain queries, especially those on virtual
    cubes, may generate queries which are not a rollup of each other.

    Implementing option #2 it requires a modest amount of coding, mainly
    introducing a multi-threaded request queue, and a significant amount of
    testing for threading issues.

    A third option is to support rollup within cache. If mondrian notices that
    there is a request for ([Time].[1997].[Q1], ... [Q4], [Product].[Beer]) and
    also a request for ([Time].[1997], [Product].[Beer]) then it should execute
    request #1 then answer request #2 by rolling up the results of request #1.

    ALL of these options will benefit mondrian and each offers something that
    the other two do not, so it's difficult to choose between them. My instinct
    is that option #2 is slightly less work than option #1, but has less
    benefit. Take your pick!

    Julian

    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  3. #3
    Pappyn Bart Guest

    Default RE: [Mondrian] Multi-threading SQL execution

    As Michael already mentioned, Option #2 could have troubles with
    database transactions.
    When running against a dynamic database, option #2 will make it even
    more difficult to keep the cache integrity.

    While I think it would be better that a central cache manager would have
    multiple threads to load the results (both aggregates as hierarchies)
    and
    would run separately from the mdx query executing threads (just making
    requests to the cache). There could still
    be problems when multiple threads would read the same cube data using
    different transactions. In some cases
    you might allow each cube (when using virtual cubes) to have a single
    thread, but still, in most cases, this would lead
    to problems.

    For example : When calculating aggregate tables at night, it is possible
    that one sql query depends on a newer
    version (an aggregate table already updated), and another query
    depending on an aggregate table yet to be updated.

    While database transactions are not there yet, it would make future work
    in that direction more difficult, it might be impossible.

    I see a lot of movement in different directions for the moment, about
    scalability, multi user access, integrity, dynamic databases,
    huge results, small real time results... But I notice that some
    decisions don't cover the whole idea or are not compatible with each
    other.

    I think it would be useful if there was an analysis made of the whole
    idea, with a roadmap supporting this vision, or at least each direction
    that is not compatible with other usages of mondrian, should be made
    configurable. So it would be possible - at cube level - or at
    mondrian.properties level to configure how mondrian will behave.

    Bart
    ________________________________

    From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org]
    On Behalf Of Julian Hyde
    Sent: dinsdag 13 februari 2007 2:38
    To: 'Mondrian developer mailing list'
    Subject: RE: [Mondrian] Multi-threading SQL execution


    Option #1 (ROLLUP/CUBE BY) is viable and useful. If there are
    differences between DBMS vendors in how they implement this support,
    let's stick to the letter of the SQL:2003 standard.

    To implement option #1, someone will have to get their hands dirty
    understanding how cell requests are turned into SQL queries. The hardest
    part is to look at a collection of cell requests and figure out whether
    they can be satisfied using the same query.

    Is there a chance that a ROLLUP query will compute exponentially more
    results than individual GROUP BY queries? If so, we will need to do a
    cost:benefit analysis before issuing a ROLLUP query.

    Option #2 (parallel query execution) is also viable, and is useful if
    option #1 is implemented, because certain queries, especially those on
    virtual cubes, may generate queries which are not a rollup of each
    other.

    Implementing option #2 it requires a modest amount of coding, mainly
    introducing a multi-threaded request queue, and a significant amount of
    testing for threading issues.

    A third option is to support rollup within cache. If mondrian notices
    that there is a request for ([Time].[1997].[Q1], ... [Q4],
    [Product].[Beer]) and also a request for ([Time].[1997],
    [Product].[Beer]) then it should execute request #1 then answer request
    #2 by rolling up the results of request #1.

    ALL of these options will benefit mondrian and each offers something
    that the other two do not, so it's difficult to choose between them. My
    instinct is that option #2 is slightly less work than option #1, but has
    less benefit. Take your pick!

    Julian

    ______________________________________________________________________
    This email has been scanned by the Email Security System.
    ______________________________________________________________________


    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  4. #4
    Matt Campbell Guest

    Default Re: [Mondrian] Multi-threading SQL execution

    I agree, Bart, if we did implement Option #2 it would have to be
    configurable.

    I hadn't known until I read the thread Michael posted that using
    java.lang.Thread is not a good idea in a servlet. I'm still a bit fuzzy on
    why that is, but it seems clear that Option #2 will require some serious
    thought to do correctly.

    For option #1 Julian asked if we might inadvertantly pull back much more
    data than we need, and suggested we somehow do a cost/benefit analysis
    before deciding to use ROLLUP or CUBE. I think that's a very good
    point--blindly doing a ROLLUP or CUBE in all cases could result in much
    worse performance, particularly when a query involves several dimensions.



    On 2/13/07, Pappyn Bart <Bart.Pappyn (AT) vandewiele (DOT) com> wrote:
    >
    > As Michael already mentioned, Option #2 could have troubles with database
    > transactions.
    > When running against a dynamic database, option #2 will make it even more
    > difficult to keep the cache integrity.
    >
    > While I think it would be better that a central cache manager would have
    > multiple threads to load the results (both aggregates as hierarchies) and
    > would run separately from the mdx query executing threads (just making
    > requests to the cache). There could still
    > be problems when multiple threads would read the same cube data using
    > different transactions. In some cases
    > you might allow each cube (when using virtual cubes) to have a single
    > thread, but still, in most cases, this would lead
    > to problems.
    >
    > For example : When calculating aggregate tables at night, it is possible
    > that one sql query depends on a newer
    > version (an aggregate table already updated), and another query depending
    > on an aggregate table yet to be updated.
    >
    > While database transactions are not there yet, it would make future work
    > in that direction more difficult, it might be impossible.
    >
    > I see a lot of movement in different directions for the moment, about
    > scalability, multi user access, integrity, dynamic databases,
    > huge results, small real time results... But I notice that some decisions
    > don't cover the whole idea or are not compatible with each other.
    >
    > I think it would be useful if there was an analysis made of the whole
    > idea, with a roadmap supporting this vision, or at least each direction
    > that is not compatible with other usages of mondrian, should be made
    > configurable. So it would be possible - at cube level - or at
    > mondrian.properties level to configure how mondrian will behave.
    >
    > Bart
    > ------------------------------
    > *From:* mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org]
    > *On Behalf Of *Julian Hyde
    > *Sent:* dinsdag 13 februari 2007 2:38
    > *To:* 'Mondrian developer mailing list'
    > *Subject:* RE: [Mondrian] Multi-threading SQL execution
    >
    > *Option #1 (ROLLUP/CUBE BY) *is viable and useful. If there are
    > differences between DBMS vendors in how they implement this support, let's
    > stick to the letter of the SQL:2003 standard.
    >
    > To implement option #1, someone will have to get their hands dirty
    > understanding how cell requests are turned into SQL queries. The hardest
    > part is to look at a collection of cell requests and figure out whether they
    > can be satisfied using the same query.
    >
    > Is there a chance that a ROLLUP query will compute exponentially more
    > results than individual GROUP BY queries? If so, we will need to do a
    > cost:benefit analysis before issuing a ROLLUP query.
    >
    > *Option #2 (parallel query execution) *is also viable, and is useful if
    > option #1 is implemented, because certain queries, especially those on
    > virtual cubes, may generate queries which are not a rollup of each other.
    >
    > Implementing option #2 it requires a modest amount of coding, mainly
    > introducing a multi-threaded request queue, and a significant amount of
    > testing for threading issues.
    >
    > A *third option is to support rollup within cache*. If mondrian notices
    > that there is a request for ([Time].[1997].[Q1], ... [Q4], [Product].[Beer])
    > and also a request for ([Time].[1997], [Product].[Beer]) then it should
    > execute request #1 then answer request #2 by rolling up the results of
    > request #1.
    >
    > ALL of these options will benefit mondrian and each offers something that
    > the other two do not, so it's difficult to choose between them. My instinct
    > is that option #2 is slightly less work than option #1, but has less
    > benefit. Take your pick!
    >
    > Julian
    >
    > ______________________________________________________________________
    > This email has been scanned by the Email Security System.
    > ______________________________________________________________________
    >
    > _______________________________________________
    > Mondrian mailing list
    > Mondrian (AT) pentaho (DOT) org
    > http://lists.pentaho.org/mailman/listinfo/mondrian
    >
    >


    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  5. #5
    Julian Hyde Guest

    Default RE: [Mondrian] Multi-threading SQL execution

    I agree with the sentiment regarding threads. If we implement #2 (parallel
    SQL query execution) I want mondrian to be able to run in a mode where
    everything is done in one thread.

    (If you read my blog post What's
    <http://julianhyde.blogspot.com/2006/...m-criticized-f
    or.html> in a name, you will know that I am militantly agnostic on whether
    mondrian is a server. A repository is one sure sign that a software library
    is growing into a server, but a thread pool is another.)

    When we do allocate threads, we will use java.util.concurrent.ThreadFactory
    - that should give enough flexibility for various environments. (I have a
    feeling that ThreadFactory was introduced because people wanted a safe way
    to create threads inside their EJBs, so we can use if for the same purpose.)

    I want to challenge Michael's assertion that "almost no JDBC driver will
    allow parallel queries". It may be true, but if so, it will cost us a lot of
    effort to do things another way, so I want to see some evidence. Michael,
    can you give an example of a JDBC driver which cannot execute two statements
    in the same connection (and transaction context)?

    Julian




    _____

    From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org] On
    Behalf Of michael bienstein
    Sent: Tuesday, February 13, 2007 2:08 PM
    To: Mondrian developer mailing list
    Subject: Re : [Mondrian] Multi-threading SQL execution


    Matt,

    The reason that threads shouldn't be created in Servlets is conceptually
    simple. Servlets run in a managed environment. The control over the usage
    of system resources should be handled by the environment without
    interference from applications being managed. Threads are one of those
    resources. To give an idea of this, there is actually no restriction on the
    number of Threads you create in Java. However if you go over 8 threads per
    CPU you get serious performance degradation because of the switching the CPU
    has to do between the threads (at least on Windows Sun JDK 1.4). So it's
    better if you set up your application server and configure how many threads
    get used for all the applications in the same JVM and no application changes
    this.

    More generally, Java has very bad semantics for enforcing this sort of
    control over system resources. Consider memory. I didn't know that Java 5
    allowed some control over this as Richard has shown us recently. Look at
    the problems though: Derby runs in the same JVM as Mondrian and all of a
    sudden we get lots of failures because we are told that we're running out of
    memory. How can you limit how much memory a WAR can use? You can't. With
    File access you probably could do it with a custom SecurityManager. With
    Threads that's probably true too, but no major app server does it.

    The second reason has nothing to do with Mondrian. EJB 1 and 2 focussed on
    being able to call business logic without worrying about where that logic
    was implemented. It could be in a different JVM or even on another
    computer. Using threads (or static variables) is therefore forbidden even
    though if you know you are only running locally then it will still all work.

    Lastly, I know commercial BI software that *breaks* this rule and launches a
    thread in a Servlet init() method. It is truly astounding how many bad
    things pop up when the language or framework doesn't constrain you.

    Michael

    PS: Congrats on 2.3 feature complete. Really happy to see cache flushing in
    there. But did Bart's bug on virtual cube get fixed? I'm really curious as
    to what causes it.


    _____

    D

  6. #6
    John V. Sichi Guest

    Default Re: [Mondrian] Multi-threading SQL execution

    Julian Hyde wrote:
    > I want to challenge Michael's assertion that "almost no JDBC driver will
    > allow parallel queries". It may be true, but if so, it will cost us a
    > lot of effort to do things another way, so I want to see some evidence.
    > Michael, can you give an example of a JDBC driver which cannot execute
    > two statements in the same connection (and transaction context)?


    http://www.mysql.com/news-and-events...000000154.html

    http://download-east.oracle.com/docs...90211/tips.htm

    JVS
    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  7. #7
    Julian Hyde Guest

    Default RE: [Mondrian] Multi-threading SQL execution

    > > I want to challenge Michael's assertion that "almost no
    > JDBC driver will
    > > allow parallel queries". It may be true, but if so, it will

    > cost us a
    > > lot of effort to do things another way, so I want to see

    > some evidence.
    > > Michael, can you give an example of a JDBC driver which

    > cannot execute
    > > two statements in the same connection (and transaction context)?

    >
    > http://www.mysql.com/news-and-events...000000154.html
    >
    >

    http://download-east.oracle.com/docs....901/a90211/ti
    ps.htm


    Thanks John. That's good enough evidence.

    Let me also challenge the assumption that in the typical mondrian
    installation, the contents of the DBMS are dynamically changing. My
    impression is that in 95% of cases, data is loaded nightly or weekly. There
    are clearly exceptions - as Michael and Bart have both explained - but we
    should not treat the exceptions as the norm.

    Option #2 (parallel query execution) is viable and valuable to the majority
    of the mondrian community, and we shouldn't be too hasty in ruling it out.
    Option #1 (ROLLUP) has limitations in its applicability too, because not all
    databases support that syntax.

    Julian

    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  8. #8
    Pappyn Bart Guest

    Default RE: [Mondrian] Multi-threading SQL execution

    In fact, my project has in fact both cases, data changing at night and
    data changing all the time.

    On the first - which I call archived data - there are aggregate tables
    and cache enabled.

    On the second - which I call hot data, I turned off aggregate tables and
    cache. The data is cached
    so frequently that maintaining cache or aggregate tables makes no sense.

    Both are identical in layout and share all dimensions. Both are merged
    in one virtual cube.

    But even if the system is only changing at a certain time (at night),
    the period the data is changing, mondrian results are not reliable. So
    atomic cache population becomes important.

    Since the exception to the 95% rule is handled by cubes not maintaining
    cache, then the solution falls apart in two tasks :

    A) Cubes not maintaining cache should use one atomic operation for all
    sql queries launched in that thread. There can be running
    other queries in parallel using a different transaction, since that
    cache is thread local (since my changes recently).

    B) Cubes maintaining cache. I think Julian's blog article tells the
    direction here. The only thing I need to add there, is
    the fact I also like a plugin to be able to flush the cache in the same
    way.

    -> A) is easy to implement since there is no possibility to corrupt
    other sessions.

    -> B) Can be more easy if mondrian knows when data has changed. It is
    still up to the designer of the database and the software filling the
    database to make sure his changes to the database are atomic. My plugin
    I recently have added to mondrian can be useful. The way I implemented
    it, I created a table in the database, maintaining a list of tables and
    the last time they have changed. The modifications table is changed in
    the same atomic operation of the actual table change. Mondrian asks the
    plugin if a table has changed. If mondrian knows when the tables are
    changed and - requirement - the tables are changed in an atomic
    operation, than mondrian does not need long lasting transactions or
    read-consistent views of the database. It just needs to implement the
    same behavior as A). Mondrian asks each time before a query is started
    what data has changed, it flushes cache, load agg thread local in an
    atomic operation and when finished, it should check in results in global
    cache (when other sessions are not using the cache any more).
    Everything is already in place, only the transactions should be there.

    But when the new cache control feature is used, then flushing of cache
    is not in sync with mondrian queries, nor with the changing database,
    which make things much more difficult. I think it might be useful if
    both principles are merged into one, using cache control in the plugin.

    Telling mondrian, "the database is changing", requires the database
    process being able to access the jvm where mondrian is running, which is
    not always the case, and still, it would be not in sync. When the plugin
    is used, it can be an in-between. The process can communication using
    special tables in the database, with mondrian. Or the plugin can
    communicate in other means, soap, whatever with the external process
    responsible for modifying the database. The plugin should be able to
    use the same db transaction as the mdx query session.

    I think forcing the plugin as a default way of implementing real time
    olap is not a huge requirement, since the usage is already very
    specific. Most people will be satisfied using the cache control
    implementation as it is now, since data changes at night and the system
    is not very likely being used at that time. Most users just want to
    keep their application server up and running, and want means to flush
    the cache. This was already possible, but the cache control feature
    reduces the cost of the operation.

    Bart
    -----Original Message-----
    From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org]
    On Behalf Of Julian Hyde
    Sent: woensdag 14 februari 2007 3:17
    To: mondrian (AT) pentaho (DOT) org
    Subject: RE: [Mondrian] Multi-threading SQL execution


    > > I want to challenge Michael's assertion that "almost no

    > JDBC driver will
    > > allow parallel queries". It may be true, but if so, it will

    > cost us a
    > > lot of effort to do things another way, so I want to see

    > some evidence.
    > > Michael, can you give an example of a JDBC driver which

    > cannot execute
    > > two statements in the same connection (and transaction context)?

    >
    > http://www.mysql.com/news-and-events...a0000000154.ht
    > ml
    >
    >

    http://download-east.oracle.com/docs...java.901/a9021
    1/ti
    ps.htm


    Thanks John. That's good enough evidence.

    Let me also challenge the assumption that in the typical mondrian
    installation, the contents of the DBMS are dynamically changing. My
    impression is that in 95% of cases, data is loaded nightly or weekly.
    There are clearly exceptions - as Michael and Bart have both explained -
    but we should not treat the exceptions as the norm.

    Option #2 (parallel query execution) is viable and valuable to the
    majority of the mondrian community, and we shouldn't be too hasty in
    ruling it out.
    Option #1 (ROLLUP) has limitations in its applicability too, because not
    all databases support that syntax.

    Julian

    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

    ______________________________________________________________________
    This email has been scanned by the Email Security System.
    ______________________________________________________________________
    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.