US and Worldwide: +1 (866) 660-7555
Results 1 to 10 of 10

Thread: Mondrian produces inefficient SQL

  1. #1

    Question Mondrian produces inefficient SQL

    Hi all,

    I am using Mondrian to do ROLAP on a fact table which has ~4M rows, and 7 dimension tables. A bunch of aggregate tables have been created to speed up queries, but I noticed Mondrian does not use them very efficiently at times.

    Case in point is a very simple query which returns measures (named A and B here) for every dimension_key (with no nameColumn, so just the integer is returned):
    Code:
    SELECT {[Measures].[A], [Measures].[B]} ON COLUMNS,
    [Dimension].[Key].MEMBERS ON ROWS
    FROM Fact
    The following SQL would get the desired result from an existing aggregate table which has the data aggregated for every dimension_key, per year:

    Code:
    SELECT
      dimension_key, sum(a), sum(b)
    FROM
      agg_1
    GROUP BY
      dimension_key
    ORDER BY
      dimension_key
    The GROUP BY is to sum over all years. This runs in ~5ms. However, while Mondrian does recognize the aggregate table it decides to create a bunch of SQL queries with the following format:

    Code:
    select
        "dimension"."dimension_key" as "c0",
        sum("agg_1"."a") as "m0",
        sum("agg_1"."b") as "m1"
    from
        "public"."dimension" as "dimension",
        "public"."agg_1" as "agg_1"
    where
        "agg_1"."dimension_key" = "dimension"."dimension_key"
    and
         "dimension"."dimension_key" in (1,2,3 .. 2500) <-- Note I abbreviated it, Mondrian drops 2500 numbers here
    group by
         "dimension"."dimension_key"
    As there are 22384 distinct dimension_key values in the dimension table, 9 of these queries are executed (1..2500, 2501..5000, etc). While each query would run relatively speedy as well (around or under 5 ms), I don't understand why it is done this way as it does take unnecessarily longer. I have looked at the mondrian.result.highCardChunkSize and mondrian.result.limit properties but setting these to high values does not change anything.

    There is absolutely no reason to check if dimension_key has a certain value, there does not even need to be a join with the dimension table ('WHERE agg_1.dimension_key = dimension.dimension_key') as the aggregate table contains all the dimension_keys present in the fact table so I just want them all to return. Does anyone know why Mondrian chooses this query plan? Please note whenever the amount of unique keys is less than 2500 Mondrian will execute a query similar to the one I gave, not joining the dimension table when it is not required and not adding the 'dimension_key IN (1..2500)' clause.

    Basically my question comes down to this; when does Mondrian split up queries like this, why is it done and how can it be prevented (if it does not have any benefits which it seemingly does not)?

    Another thing, Mondrian does not handle a Parent-Child hierarchy + Closure properly in that aggregate tables are not used at all. If I take away the Closure the aggregate table IS used, but Mondrian takes ages to determine all descendants of every node (which it would read from the closure table very quickly) so this is not an option. I have read numerous posts on this forum and in jira about issues with Parent-Child / Closure hierarchies and aggregate tables and it does not appear to have been fixed yet. I have followed the documentation carefully which essentially tells the user to rely on a "trick in Mondrian's internals" but this does not seem to work.

    Thank you in advance for any suggestions.

    -- Daniel

  2. #2

    Default

    Did you ever get an answer?
    thanks

  3. #3
    Join Date
    Feb 2013
    Posts
    14

    Default

    For what concerns the way queries are split, I'm having the same issue (see thread).
    The only workaround I found is to downgrade mondrian to 3.3, if you managed to solve this issue please let me know.

  4. #4
    Join Date
    Dec 2010
    Posts
    247

    Default

    Hi dknippers,
    I think that Mondrian executes that kind of queries to populate its cache.
    I got better performances by increasing the value of mondrian.rolap.maxConstraints (in my case 10000) and mondrian.rolap.cellBatchSize (in my case 300000).

  5. #5

    Default

    Hi,
    The problem with the mondrian.rolap.maxConstraints is that it is DB dependent... Oracle will allow only 1000.
    Here, the issue is not coming from the fact that there is a limit on the where condition size, the problem is that mondrian should not use the where condition for the specified column as it is not filtered !
    Thanks


  6. #6
    Join Date
    Jan 2013
    Posts
    357

    Default

    I'm surprised to hear that queries are being run in batches in dknippers case. Based on my understanding exceeding maxConstraints *should* cause Mondrian to drop the IN list, not batch up a series of queries the way he describes. If anyone has a reproducible case (ideally with Foodmart), log a Jira case.

    Another possible factor in dknippers case is that it looks like the dimension_key field on the aggregate table is both a foreign key for non-collapsed levels, as well as a dimension level itself. That's may explain why it's joining in the dimension table, even though in his case it doesn't need to.

  7. #7
    Join Date
    Oct 2012
    Posts
    18

    Default

    I'm seeing the exact same problem, for some reason mondrian iterates over the dimension values and each time executes a query on the fact table. it doesn't make any sense and I tried playing with the different parameters with no luck.

    Downgrading to 3.3 solved the issue.

    Any help will be greatly appreciated.

    Thanks

  8. #8
    Join Date
    Jul 2012
    Posts
    4

    Default

    wanted to just say that i also have this same issue. downgrading isn't an option for me but i've found a workaround for the time being..

    thoughts on what's going on:
    -i believe each query is loading an individual segment, thus maybe adjusting mondrian.rolap.SparseSegmentDensityThreshold or mondrian.rolap.SparseSegmentValueThreshold could affect the segment loading behavior somehow?

    how i've gotten around it:
    -don't NON EMPTY on any axis
    -instead, NonEmptyCrossJoin() everything into a named set and then Extract() the tuples you want on each axis

    i've found the above strategy works best with:
    mondrian.native.crossjoin.enable=true
    mondrian.native.nonempty.enable=false
    mondrian.rolap.cellBatchSize=PickAReallyBigNumber

  9. #9
    Join Date
    Jan 2013
    Posts
    357

    Default

    Hi ceason,
    Thanks for exploring workarounds. Can you try an experiment for me? Can you verify that you don't see the sequence of SQL queries with different IN lists (the way dknippers describes below) by just changing cellBatchSize to a big number? That is, even with the native.* properties set to defaults and without the Extract() trick?

    BTW, that Extract() trick is clever. Native crossjoin evaluation is good at reducing the size of the tuple list that needs to be evaluated, but that doesn't do you much good if your attributes are spread across axes. You effectively found a way to benefit from native crossjoin and still place the attributes on different axes. Nice.

    -matt

  10. #10
    Join Date
    Jun 2013
    Posts
    44

    Default

    i will not say that I'm another added number to stretching the list of same victims instead just have a wish to get the solution to the trouble which any one here have met yet...

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •