Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Mondrian: Star Schema vs Single Fact Table w/No Dimension Tables

  1. #1
    Join Date
    Jan 2007
    Posts
    485

    Default Mondrian: Star Schema vs Single Fact Table w/No Dimension Tables

    I have been using DWs with “single fact table” which includes all my dimensions. When working with small-sized tables (up to 7-10 dimensions + 3-4 measures, and up to 500,000 records), perfomance is not a problem.

    I currently have a “single fact table” that has over 1.7 millon rows, some 22 dimensions and 12 measures, and of course performance is an issue. For my testing/learning, I am using a small sample (some 72,000 records) and given the # of dim/measures, it still stakes some 5-7 minutes to get my 1st general view of my sample cube.

    Following the MDX/Mondrian documentation, I am reviewing the cmd.exe after I change the MDX Query Editor to resemble the views I anticipate my users will be wanting to try so that I can start designing some aggregate tables based upon the "Segment.load" Sqls mondrian performs. (…I know that I have an infinite number of aggregates -in the realm of 22^12 or 12^22 number of possible aggregates- but in any event, some agg tables might help me out …)

    Sorry for the long background… here’s are my questions:

    a) Is it -technically speaking” more advisable to build star-schemas (with corresponding dim tables) rather than relying on “single-fact” tables with 0 dim tables?
    b) Does Mondrian work better (speedier, etc.) when one builds the DW based on a “star-schema” or is it “feasable/logical” to work with a single-fact table (no dim tables)?

    Besides wanting some advise, the reason I ask is because I notice that the majority of sql’s Mondrian performs have to do with the “SqlMemberSource.getMemberChildren”, and among these, the longest running ones result in “null” (at least on the first run of the “generic” / bare cube), because Mondrian (apparently) tries to determine Children by comparing "dimensions" against each other (... select "nn" from [table] where dim1 = dim2..., being dim1 and dim2 unrelated). My guess is that if I had separate “dim tables”, the sql’s would probably be faster because Mondrian would recognize these not being related to one-another??.

    As usual, many thanks in advance for you advise/response.

  2. #2
    Join Date
    Jan 2007
    Posts
    485

    Question Please could anyone help!! Star Schema vs Single Fact Table w/No Dimension

    I'd very much appreciate somebody's help on this... Thanks, DMurray3

  3. #3
    Join Date
    Mar 2007
    Posts
    216

    Smile

    Hi,

    I don't know how works the Mondrian Engine but I am almost certain that using a Star Schema instead of a Single Fact Table w/No Dimension Tables will be faster. I see many reasons for this :

    -considering that the fact table is always read and that a dimension table is only read when displayed,
    --displaying a dimension in a Star Schema will only read the Star Schema's Fact Table ( typically a fact identifier, the measures, and the FKs to Dimension Tables ) and the wanted Dimension Table content,
    --displaying a dimension in a Single Fact Table w/No Dimension Table will access all data whatever dimension you want to display.

    -considering that the Star Schema's Fact Table has less text data than a Single Fact Table w/No Dimension Table, e.g. for Country Dimension,
    --in a Star Schema's Fact Table you will have the numeric attribute CodeCountry in each rows, it is a FK for the PK of Dimension Table 'CountryDimension' which has numeric attribute CountryCode and text attribute CountryName.
    --in a Single Fact Table w/No Dimension you will have the text attribute CountryName in each row.

    -considering that reading numeric data is faster than text data,

    I would say that Star Schema will be faster.

    a+, =)
    -=Clément=-

  4. #4
    Join Date
    Jan 2007
    Posts
    485

    Default

    Thanks a bundle.... I'll restructure my fact table to a star schema, and will let you know the outcome.... If you don't here from me in the next couple of weeks, have a Merry Xmas....

  5. #5
    Join Date
    Nov 2007
    Posts
    19

    Default

    Daniel,
    could you please give detailes about the test you are talking about, related to 72,000 records case?
    How many dimensions and measures/calculatedmember?
    Which hardware were used for mondrian run?
    Which container (JBoss?)?
    Which database?

    Maybe there is some wiki page on mondrian documented execution test?
    Thanks and Regards.
    Andrea

  6. #6
    Join Date
    Jan 2007
    Posts
    485

    Default

    andreacolpo,

    The 72,000 records is a sample of my larger single-fact table with all dimensions colapsed. As mentiones this sample has 22 dimensions & 12 measures (no calculated messures to srat off with..).

    I am running my test databases and PCI 1.2.0.534-GA "as is" on an HP Pavillon zd8000 with 2 processors with 2gb RAM and my data resides in MySQL.

    My question to the forum was more of an academic one, since I have simply resorted to have a single fact table with all dimensions collaped in it. That was the case while I was using small dbs... but now that I wanted to do the same with a db with 1.7 millon records, I started to run into performance problems. These I understand are not only on account of the single fact table, but also the consequence of no having agg tables and further the result of my hardware...

    I have had to interrupt my pentaho testing and am recently coming back to it... I almost have the 1.7 records arranged in a star schema and will be building some agg tables... this should help out on the performance..but then again, I will always have some performance issue due to the HW...

    I'll keep you posted on how I resolve this academic problem....

    Thanks for you concerns.... Daniel

  7. #7
    Join Date
    Feb 2008
    Posts
    3

    Default

    I am also wondering what the fundamental principle is behind the use of star schemas. Is there any theoretical foundation behind their use for obtaining better performance, versus using a fact table with dimensions defined against particular columns in the fact table (ie, what Mondrian calls degenerate dimensions) ?

    I fail to see why queries against multiple dimension tables followed by a star join should be any faster than simply querying directly on the fact table. I have a suspicion that the star schema was designed to work around the limits of older SQL-DBMSs, where index combination on B-Tree indexes was not possible. But with the advent of bitmap indices, and assuming your DBMS will use all indices applicable to fields being joined, why can't the filtering be done directly on the fact table ?

    Sorry for the beginner question. Pointers to some good theoretical exposition of star schemas would be appreciated.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.