PDA

View Full Version : [Mondrian] Lots of Dimensions & Lots of Rows



Paul Ramsey
02-06-2007, 07:41 PM
Hi All,

I recently stumbled onto the fact that the problem space I am working on
is in fact a somewhat common and well understood one... the one where
Mondrian is a solution. :)

The problem I am working on is providing a form of OLAP for multiple
dimensions of spatial information in the province of BC. For our
design, each square hectare (a square 100m on a side) in the province is
a fact entry. That fact entry will have some measurements: length of
roads, area of water, average slope, minimum winter temperature. It will
also have some categorizations: what watershed is it in, what land use
classification is it in, what forest administration area is it in.

It looks an awful lot like a standard OLAP problem. The only thing that
might make it non-standard that I can see is the data volumes and the
query requirements:

Volumes:
- 110 million hectares in the province (so, 110 million facts)
- 100s of dimensions (as many as people provide)

Query requirements:
- What is the area of logging on steep slopes in the Blue creek watershed?
- What is the road density (length of road / (total area - water area))
in the Blue creek watershed?

My concern is that the number of dimensions on top of the number of
facts will make the whole thing unworkable, since people are going to
expect to query the thing through a *lot* of different paths. As an
added bonus, the deployment environment will be PostgreSQL.

Any thoughts from the experts?

Paul

--

Paul Ramsey
Refractions Research
http://www.refractions.net
pramsey (AT) refractions (DOT) net
Phone: 250-383-3022
Cell: 250-885-0632
_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Paul Ramsey
02-06-2007, 07:41 PM
Sorry, I neglected to add that for people interested there is background
information here:

http://office.refractions.net/wiki/fields/habc/index.php


Paul Ramsey wrote:
> I recently stumbled onto the fact that the problem space I am working on
> is in fact a somewhat common and well understood one... the one where
> Mondrian is a solution. :)



--

Paul Ramsey
Refractions Research
http://www.refractions.net
pramsey (AT) refractions (DOT) net
Phone: 250-383-3022
Cell: 250-885-0632
_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Julian Hyde
02-06-2007, 10:04 PM
> My concern is that the number of dimensions on top of the number of
> facts will make the whole thing unworkable, since people are going to
> expect to query the thing through a *lot* of different paths. As an
> added bonus, the deployment environment will be PostgreSQL.

The number of dimensions is not a huge problem per se. If mondrian operating
in a ROLAP mode (that is, generating a SQL query for each set of cells) then
each dimension is a POTENTIAL thing to slice on but it's only the dimensions
ACTUALLY sliced on which affect the performance of the SQL.

If you create aggregate tables -- and you probably will need to, for that
data volume -- a large number of dimensions becomes more of a problem -
because you will need a correspondingly large number of aggregate tables.

There may be some tricks you can use when designing your aggregate tables.
If your DBMS supports special indexes for GIS (just the kind of thing that
PostgreSQL does very well) you should try to design the agg tables so that
those indexes get used.

Also, if a lot of your queries are localized (e.g. queries for data within
10 km of a given town) index your fact table so that this data set can be
readily retrieved.

Databases -- mondrian included -- don't handle ranges as well as they handle
discrete values. So, splitting spatial coordinates into the integral and
fractional part (e.g. 34.56 N, 123.45 W becomes lat_whole=34
lat_fraction=.56 long_whole=-123 long_fraction=.45) is a trick which might
tend to create the right number and kind of 'buckets' in mondrian's
workspace.

There has been some research to extend mondrian for GIS applications: see
"An open source and web based framework for geographic and multidimensional
processing" (da Silva, Times, Salgado, 2006),
http://portal.acm.org/citation.cfm?id=1141292

Julian

_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Matt Campbell
02-07-2007, 11:10 AM
To add some anecdotal info-- we've successfully run Mondrian with nearly 700
dimensions using fact tables with 100s of millions of rows. We've noticed
no performance degradation or instability from the number of dimensions.

On 2/6/07, Julian Hyde <julianhyde (AT) speakeasy (DOT) net> wrote:
>
>
>
> > My concern is that the number of dimensions on top of the number of
> > facts will make the whole thing unworkable, since people are going to
> > expect to query the thing through a *lot* of different paths. As an
> > added bonus, the deployment environment will be PostgreSQL.
>
> The number of dimensions is not a huge problem per se. If mondrian
> operating
> in a ROLAP mode (that is, generating a SQL query for each set of cells)
> then
> each dimension is a POTENTIAL thing to slice on but it's only the
> dimensions
> ACTUALLY sliced on which affect the performance of the SQL.
>
> If you create aggregate tables -- and you probably will need to, for that
> data volume -- a large number of dimensions becomes more of a problem -
> because you will need a correspondingly large number of aggregate tables.
>
> There may be some tricks you can use when designing your aggregate tables.
> If your DBMS supports special indexes for GIS (just the kind of thing that
> PostgreSQL does very well) you should try to design the agg tables so that
> those indexes get used.
>
> Also, if a lot of your queries are localized (e.g. queries for data within
> 10 km of a given town) index your fact table so that this data set can be
> readily retrieved.
>
> Databases -- mondrian included -- don't handle ranges as well as they
> handle
> discrete values. So, splitting spatial coordinates into the integral and
> fractional part (e.g. 34.56 N, 123.45 W becomes lat_whole=34
> lat_fraction=.56 long_whole=-123 long_fraction=.45) is a trick which might
> tend to create the right number and kind of 'buckets' in mondrian's
> workspace.
>
> There has been some research to extend mondrian for GIS applications: see
> "An open source and web based framework for geographic and
> multidimensional
> processing" (da Silva, Times, Salgado, 2006),
> http://portal.acm.org/citation.cfm?id=1141292
>
> Julian
>
> _______________________________________________
> Mondrian mailing list
> Mondrian (AT) pentaho (DOT) org
> http://lists.pentaho.org/mailman/listinfo/mondrian
>

_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Paul Ramsey
02-07-2007, 02:00 PM
Matt, thanks for the anecdote, that is amazing (and great news)! What
kind of hardware do you have backing your system?

Julian, thanks for the detailed response, it will be interesting trying
to fit our problem neatly into the OLAP space, but I see lots of
advantages to doing so in terms of pre-existing functionality. On the
spatial side, by the time stuff hits the fact table, it is essentially
non-spatial. All the spatial relationships (cell->region,
cell->watershed) are expressed in the dimensions at that point.

Yours,

Paul

Matt Campbell wrote:
> To add some anecdotal info-- we've successfully run Mondrian with nearly
> 700 dimensions using fact tables with 100s of millions of rows. We've
> noticed no performance degradation or instability from the number of
> dimensions.
>
> On 2/6/07, *Julian Hyde* <julianhyde (AT) speakeasy (DOT) net
> <mailto:julianhyde (AT) speakeasy (DOT) net>> wrote:
>
>
>
> > My concern is that the number of dimensions on top of the number of
> > facts will make the whole thing unworkable, since people are going to
> > expect to query the thing through a *lot* of different paths. As an
> > added bonus, the deployment environment will be PostgreSQL.
>
> The number of dimensions is not a huge problem per se. If mondrian
> operating
> in a ROLAP mode (that is, generating a SQL query for each set of
> cells) then
> each dimension is a POTENTIAL thing to slice on but it's only the
> dimensions
> ACTUALLY sliced on which affect the performance of the SQL.
>
> If you create aggregate tables -- and you probably will need to, for
> that
> data volume -- a large number of dimensions becomes more of a problem -
> because you will need a correspondingly large number of aggregate
> tables.
>
> There may be some tricks you can use when designing your aggregate
> tables.
> If your DBMS supports special indexes for GIS (just the kind of
> thing that
> PostgreSQL does very well) you should try to design the agg tables
> so that
> those indexes get used.
>
> Also, if a lot of your queries are localized ( e.g. queries for data
> within
> 10 km of a given town) index your fact table so that this data set
> can be
> readily retrieved.
>
> Databases -- mondrian included -- don't handle ranges as well as
> they handle
> discrete values. So, splitting spatial coordinates into the integral
> and
> fractional part (e.g. 34.56 N, 123.45 W becomes lat_whole=34
> lat_fraction=.56 long_whole=-123 long_fraction=.45) is a trick which
> might
> tend to create the right number and kind of 'buckets' in mondrian's
> workspace.
>
> There has been some research to extend mondrian for GIS
> applications: see
> "An open source and web based framework for geographic and
> multidimensional
> processing" (da Silva, Times, Salgado, 2006),
> http://portal.acm.org/citation.cfm?id=1141292
>
> Julian
>
> _______________________________________________
> Mondrian mailing list
> Mondrian (AT) pentaho (DOT) org <mailto:Mondrian (AT) pentaho (DOT) org>
> http://lists.pentaho.org/mailman/listinfo/mondrian
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Mondrian mailing list
> Mondrian (AT) pentaho (DOT) org
> http://lists.pentaho.org/mailman/listinfo/mondrian


--

Paul Ramsey
Refractions Research
http://www.refractions.net
pramsey (AT) refractions (DOT) net
Phone: 250-383-3022
Cell: 250-885-0632
_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Matt Campbell
02-07-2007, 02:30 PM
We were running Mondrian on a dual core Win2003 server with 3Gig of memory.
The RDBMs for the tests with 100s of millions of fact rows was Teradata,
which gives very good performance for the sorts of queries that Mondrian
fires.


On 2/7/07, Paul Ramsey <pramsey (AT) refractions (DOT) net> wrote:
>
> Matt, thanks for the anecdote, that is amazing (and great news)! What
> kind of hardware do you have backing your system?
>
> Julian, thanks for the detailed response, it will be interesting trying
> to fit our problem neatly into the OLAP space, but I see lots of
> advantages to doing so in terms of pre-existing functionality. On the
> spatial side, by the time stuff hits the fact table, it is essentially
> non-spatial. All the spatial relationships (cell->region,
> cell->watershed) are expressed in the dimensions at that point.
>
> Yours,
>
> Paul
>
> Matt Campbell wrote:
> > To add some anecdotal info-- we've successfully run Mondrian with nearly
> > 700 dimensions using fact tables with 100s of millions of rows. We've
> > noticed no performance degradation or instability from the number of
> > dimensions.
> >
> > On 2/6/07, *Julian Hyde* <julianhyde (AT) speakeasy (DOT) net
> > <mailto:julianhyde (AT) speakeasy (DOT) net>> wrote:
> >
> >
> >
> > > My concern is that the number of dimensions on top of the number
> of
> > > facts will make the whole thing unworkable, since people are
> going to
> > > expect to query the thing through a *lot* of different paths. As
> an
> > > added bonus, the deployment environment will be PostgreSQL.
> >
> > The number of dimensions is not a huge problem per se. If mondrian
> > operating
> > in a ROLAP mode (that is, generating a SQL query for each set of
> > cells) then
> > each dimension is a POTENTIAL thing to slice on but it's only the
> > dimensions
> > ACTUALLY sliced on which affect the performance of the SQL.
> >
> > If you create aggregate tables -- and you probably will need to, for
> > that
> > data volume -- a large number of dimensions becomes more of a
> problem -
> > because you will need a correspondingly large number of aggregate
> > tables.
> >
> > There may be some tricks you can use when designing your aggregate
> > tables.
> > If your DBMS supports special indexes for GIS (just the kind of
> > thing that
> > PostgreSQL does very well) you should try to design the agg tables
> > so that
> > those indexes get used.
> >
> > Also, if a lot of your queries are localized ( e.g. queries for data
> > within
> > 10 km of a given town) index your fact table so that this data set
> > can be
> > readily retrieved.
> >
> > Databases -- mondrian included -- don't handle ranges as well as
> > they handle
> > discrete values. So, splitting spatial coordinates into the integral
> > and
> > fractional part (e.g. 34.56 N, 123.45 W becomes lat_whole=34
> > lat_fraction=.56 long_whole=-123 long_fraction=.45) is a trick which
> > might
> > tend to create the right number and kind of 'buckets' in mondrian's
> > workspace.
> >
> > There has been some research to extend mondrian for GIS
> > applications: see
> > "An open source and web based framework for geographic and
> > multidimensional
> > processing" (da Silva, Times, Salgado, 2006),
> > http://portal.acm.org/citation.cfm?id=1141292
> >
> > Julian
> >
> > _______________________________________________
> > Mondrian mailing list
> > Mondrian (AT) pentaho (DOT) org <mailto:Mondrian (AT) pentaho (DOT) org>
> > http://lists.pentaho.org/mailman/listinfo/mondrian
> >
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > Mondrian mailing list
> > Mondrian (AT) pentaho (DOT) org
> > http://lists.pentaho.org/mailman/listinfo/mondrian
>
>
> --
>
> Paul Ramsey
> Refractions Research
> http://www.refractions.net
> pramsey (AT) refractions (DOT) net
> Phone: 250-383-3022
> Cell: 250-885-0632
> _______________________________________________
> Mondrian mailing list
> Mondrian (AT) pentaho (DOT) org
> http://lists.pentaho.org/mailman/listinfo/mondrian
>

_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Paul Ramsey
02-07-2007, 02:50 PM
Matt, how much does Teradata cost? Ballpark?

Matt Campbell wrote:
>
> We were running Mondrian on a dual core Win2003 server with 3Gig of
> memory. The RDBMs for the tests with 100s of millions of fact rows was
> Teradata, which gives very good performance for the sorts of queries
> that Mondrian fires.
>
>
> On 2/7/07, *Paul Ramsey* <pramsey (AT) refractions (DOT) net
> <mailto:pramsey (AT) refractions (DOT) net>> wrote:
>
> Matt, thanks for the anecdote, that is amazing (and great news)! What
> kind of hardware do you have backing your system?
>
> Julian, thanks for the detailed response, it will be interesting trying
> to fit our problem neatly into the OLAP space, but I see lots of
> advantages to doing so in terms of pre-existing functionality. On the
> spatial side, by the time stuff hits the fact table, it is essentially
> non-spatial. All the spatial relationships (cell->region,
> cell->watershed) are expressed in the dimensions at that point.
>
> Yours,
>
> Paul
>
> Matt Campbell wrote:
> > To add some anecdotal info-- we've successfully run Mondrian with
> nearly
> > 700 dimensions using fact tables with 100s of millions of
> rows. We've
> > noticed no performance degradation or instability from the number of
> > dimensions.
> >


--

Paul Ramsey
Refractions Research
http://www.refractions.net
pramsey (AT) refractions (DOT) net
Phone: 250-383-3022
Cell: 250-885-0632
_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Matt Campbell
02-08-2007, 06:20 PM
It's very expensive. We've tested on a development license, but I found the
following snippet in an article: "a standard, two-server data warehouse
able to store 300 billion bytes starts at $486,000. "

On 2/7/07, Paul Ramsey <pramsey (AT) refractions (DOT) net> wrote:
>
> Matt, how much does Teradata cost? Ballpark?
>
> Matt Campbell wrote:
> >
> > We were running Mondrian on a dual core Win2003 server with 3Gig of
> > memory. The RDBMs for the tests with 100s of millions of fact rows was
> > Teradata, which gives very good performance for the sorts of queries
> > that Mondrian fires.
> >
> >
> > On 2/7/07, *Paul Ramsey* <pramsey (AT) refractions (DOT) net
> > <mailto:pramsey (AT) refractions (DOT) net>> wrote:
> >
> > Matt, thanks for the anecdote, that is amazing (and great news)!
> What
> > kind of hardware do you have backing your system?
> >
> > Julian, thanks for the detailed response, it will be interesting
> trying
> > to fit our problem neatly into the OLAP space, but I see lots of
> > advantages to doing so in terms of pre-existing functionality. On
> the
> > spatial side, by the time stuff hits the fact table, it is
> essentially
> > non-spatial. All the spatial relationships (cell->region,
> > cell->watershed) are expressed in the dimensions at that point.
> >
> > Yours,
> >
> > Paul
> >
> > Matt Campbell wrote:
> > > To add some anecdotal info-- we've successfully run Mondrian with
> > nearly
> > > 700 dimensions using fact tables with 100s of millions of
> > rows. We've
> > > noticed no performance degradation or instability from the number
> of
> > > dimensions.
> > >
>
>
> --
>
> Paul Ramsey
> Refractions Research
> http://www.refractions.net
> pramsey (AT) refractions (DOT) net
> Phone: 250-383-3022
> Cell: 250-885-0632
> _______________________________________________
> Mondrian mailing list
> Mondrian (AT) pentaho (DOT) org
> http://lists.pentaho.org/mailman/listinfo/mondrian
>

_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian