Ann: Mondrian Extension for Biology
10-15-2005, 09:38 PM
I have launched a new project on sourceforge to extend Mondrian for use with large biology data sets.
What may be of general interest is that we are using Mondrian to aggregate non-traditional data
types such as strings, graphs, images, etc., which are very common in biology data. Also, JPivot works nicely for reporting non-numeric data, such as strings or images, in the cells. I can imagine other applications that might want graphs or other kinds of non-numeric data in their measure cells, so some of the modifications we are doing here may be of interest.
10-16-2005, 06:13 AM
This looks great!
On the pure Mondrian side, I have a few questions and observations.
1. It seems to me that we need to extend Mondrian to have pluggable aggregator functions in the same way as we now have pluggable MDX functions. Something like this in the schema:
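A hypothetical sketch of what such a declaration could look like (the `UserDefinedAggregator` element and its attributes are assumptions for illustration, not part of the actual Mondrian schema):

```xml
<!-- Hypothetical: not an actual Mondrian schema element -->
<UserDefinedAggregator
    name="median"
    className="com.example.MedianAggregator"/>

<!-- A Measure could then reference it by name in its aggregator attribute -->
<Measure name="Median Length" column="seq_length" aggregator="median"/>
```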
We would have to change how the aggregate attribute on the Measure tag is managed, but that should be minor.
2. You mentioned initial performance issues with 44K sequences. Was this 44K fact table rows? Was the performance problem related to the new aggregate function?
It is good to see that the aggregate tables (Oracle materialized views) worked well for you. Do you see issues with moving to a larger data set, like the 2M-sequence iProClass database?
I have some other comments, but they are related to JPivot, so I will do that on the JPivot forum.
10-16-2005, 03:11 PM
Sherman - thanks for the positive feedback! I'm hoping these extensions will be of more general use - plus I think Mondrian's ability to be extended to nonstandard data types is a fairly unique advantage among the tools I looked at.
>1. It seems to me that we need to extend Mondrian to have pluggable aggregator functions in the same way as we now have pluggable MDX functions.
I would love to see an extension to Mondrian that allowed new user-defined functions without having to dig into the code the way I do now. The code changes are small enough that it seems possible; I just don't know how to do it. One thing I would add is that defining a new function actually has two parts:
A. The name of the function (assumed to be the same as in the DB - though there may be reasons to allow DB-specific aliasing, since DBs already define functions like variance under different names).
B. The name of the "rollup aggregator" (getRollup). This is the function that is called when aggregate tables are used to roll up values. Here I've had a bit of trouble, because I want the default to be "no rollup" - i.e., only use values in a pre-calculated table if they are populated; otherwise call the database. Currently the code assumes that functions roll up with themselves (i.e. a sum function can add up sub-sums). This is true for some functions, but it does not hold for others like average, and it can't be assumed for user-defined functions.
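The two parts above might be modeled roughly as follows. This is a hypothetical Java sketch, not Mondrian's actual API: the interface and method names (other than getRollup, which the post mentions) are assumptions. The key point is that sum can roll up with itself, while average returns null from getRollup to signal "recompute from the fact table".

```java
import java.util.List;

public class AggregatorSketch {
    /** A user-defined aggregator: what SQL to emit against base rows,
     *  and how (if at all) to roll up already-aggregated cells. */
    interface UserAggregator {
        /** SQL function name to emit, e.g. "SUM". */
        String getSqlName();
        /** Aggregator used to combine pre-aggregated cells, or null for
         *  "no rollup": fall through to the base fact table instead. */
        UserAggregator getRollup();
        double aggregate(List<Double> values);
    }

    /** SUM rolls up with itself: a sum of sub-sums is the total sum. */
    static final UserAggregator SUM = new UserAggregator() {
        public String getSqlName() { return "SUM"; }
        public UserAggregator getRollup() { return this; }
        public double aggregate(List<Double> values) {
            double total = 0;
            for (double v : values) total += v;
            return total;
        }
    };

    /** AVG cannot be rebuilt from sub-averages without row counts, so its
     *  rollup is null, meaning "go back to the base data". */
    static final UserAggregator AVG = new UserAggregator() {
        public String getSqlName() { return "AVG"; }
        public UserAggregator getRollup() { return null; }
        public double aggregate(List<Double> values) {
            double total = 0;
            for (double v : values) total += v;
            return values.isEmpty() ? 0 : total / values.size();
        }
    };

    public static void main(String[] args) {
        List<Double> xs = java.util.Arrays.asList(1.0, 2.0, 3.0);
        System.out.println(SUM.aggregate(xs));      // 6.0
        System.out.println(AVG.aggregate(xs));      // 2.0
        System.out.println(AVG.getRollup() == null); // true: no rollup
    }
}
```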
> 2. You mentioned initial performance issues with 44K sequences.
Well, there are two issues here. The first is that the user aggregation function I defined is still slower than simple built-in functions like count(). The second is that the GO hierarchy I incorporated is actually a fairly tough test. The GO hierarchy of protein functions is a big directed acyclic graph where proteins can take part in multiple functions and are represented in a parent-child hierarchy (fortunately with a big closure table, as can be seen here).
This is OK, but the effect is that when you try to compute an aggregation table to speed things up, rather than just being able to do the first level, you end up pre-calculating all the available levels at once! So the result ends up being rather large and perhaps less than ideal... Anyway, given this, I think Mondrian performed quite well and was able to give good response times with a few aggregations.
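To see why a closure table forces all levels to be pre-computed at once, here is a small illustrative sketch (the term names are invented, not taken from the real GO database): the closure of a DAG contains every (descendant, ancestor) pair at every depth, so an aggregation table built over it necessarily spans all levels simultaneously.

```java
import java.util.*;

public class ClosureSketch {
    /** Return every "descendant -> ancestor" pair implied by a DAG of
     *  child-to-parents edges, i.e. the transitive closure. */
    static Set<String> closureOf(Map<String, List<String>> parents) {
        Set<String> closure = new TreeSet<>();
        for (String child : parents.keySet()) {
            // walk upward from each child, collecting ancestors at all depths
            Deque<String> stack = new ArrayDeque<>(parents.get(child));
            while (!stack.isEmpty()) {
                String anc = stack.pop();
                if (closure.add(child + " -> " + anc)) {
                    stack.addAll(parents.getOrDefault(anc, Collections.emptyList()));
                }
            }
        }
        return closure;
    }

    public static void main(String[] args) {
        // Invented example terms; a node may have multiple parents in GO
        Map<String, List<String>> parents = new HashMap<>();
        parents.put("binding", Arrays.asList("molecular_function"));
        parents.put("dna_binding", Arrays.asList("binding"));
        parents.put("enzyme", Arrays.asList("molecular_function"));
        // dna_binding pairs with BOTH binding and molecular_function:
        // the closure spans every level of the hierarchy at once
        closureOf(parents).forEach(System.out::println);
    }
}
```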