Fellow Mondrian developers and users,

One month has already passed since the new year festivities, and while most
of you have been trying to renew your gym memberships or hold on to your
new year's resolutions as best you could, so has the Mondrian team. Our
resolutions, although they require no personal sacrifices, are nonetheless
starting to bear fruit.

For you see, our resolution for the year was to provide Mondrian developers
and integrators with the means to achieve better understanding, scalability
and control. We have many ideas on how to reach those goals. Some of them
are still in their infancy, yet some have already been committed to the
source. Last month, we worked on the first phase: we added the means for
system architects to externalize and share segments through a pluggable
segment cache. What does this mean exactly? Let's take a step back in
order to better understand.

Internally, Mondrian splits tuples into segments. A typical segment can be
described as a measure crossjoined by a series of predicates. As an
example, a textual representation of a segment's contents could be:

Measure = [ Sales ]
Predicates = {
[ Products = * ],
[ State = California ],
[ Gender = Male ] }
Data = [ 1346.34, 234.00, ... ]
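As a rough sketch, the segment above might map to a plain data structure
like the following. The class and field names here are illustrative, not
Mondrian's actual internal classes:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative segment structure: one measure, a set of predicates
// (dimension -> constraint, "*" meaning all members), and a dense
// primitive array holding the cell values.
public class SegmentSketch {
    public final String measure;
    public final Map<String, String> predicates;
    public final double[] data;

    public SegmentSketch(String measure, Map<String, String> predicates,
                         double[] data) {
        this.measure = measure;
        this.predicates = predicates;
        this.data = data;
    }

    public static void main(String[] args) {
        Map<String, String> predicates = new LinkedHashMap<>();
        predicates.put("Products", "*");
        predicates.put("State", "California");
        predicates.put("Gender", "Male");
        SegmentSketch segment = new SegmentSketch(
            "Sales", predicates, new double[] {1346.34, 234.00});
        System.out.println(segment.measure + " " + segment.predicates
            + " " + Arrays.toString(segment.data));
    }
}
```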

In the case above, the segment represents the Sales data of all males in
California, for all products. It is a lot more efficient to deal with
these data structures than with individual cells. If Mondrian were to
represent each data cell individually, the unique identifier of a cell
would be larger than the data itself, creating a whole lot of problems in
terms of data efficiency. This is why Mondrian deals with groups of cells,
which it loads in batches rather than individually. There is a lot of
voodoo magic and heuristics in the background trying to figure out how
best to group those segments and how to reduce the number of segments to
load, ultimately reducing the number of SQL queries to be executed.
Mondrian will group all segments that share the same predicates but have a
different measure into a segment group. Mondrian will also tend to remove
as many predicates as it can in order to optimize the data payload. Let's
say that a segment covers all products except a single one: Mondrian will
still include that product in the segment, but filter it out when a
specific query requires it.
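To illustrate that last point, here is a sketch, with made-up product names
and values, of answering an "all products but one" request from the wider
segment by filtering in memory instead of loading a second, narrower
segment:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PredicateFilterSketch {
    // Segment loaded once with the relaxed predicate [Products = *];
    // product -> sales value (names and figures are made up).
    static final Map<String, Double> SEGMENT = new LinkedHashMap<>();
    static {
        SEGMENT.put("Widget", 1346.34);
        SEGMENT.put("Gadget", 234.00);
        SEGMENT.put("Gizmo", 512.50);
    }

    // Answer "all products except one" by filtering the wider segment
    // in memory, rather than issuing a narrower SQL query.
    public static double sumExcept(String excluded) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : SEGMENT.entrySet()) {
            if (!e.getKey().equals(excluded)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumExcept("Gadget"));
    }
}
```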

Once those segments are populated, Mondrian keeps them in a collection of
weak references in local memory. All required segment references are
pinned down while a particular query is being resolved, but as soon as the
query is done executing, the references return to their weak state, ready
to be garbage collected if needed. This simple mechanism allows Mondrian
to answer just about any query, as long as the allocated memory is big
enough for that particular query. In fact, this works really well, since
in most small deployments the maximum amount of memory is never reached.
And if it ever does fill up, old segments are evicted to make room for the
new ones.
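A minimal sketch of that weak-reference-plus-pinning idea, assuming
illustrative class names rather than Mondrian's real ones:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Segments are held via weak references, so the GC may reclaim them;
// a strong "pin" is held only while a query is being resolved.
public class WeakSegmentCache {
    private final Map<String, WeakReference<double[]>> cache = new HashMap<>();
    // Strong references held for the duration of the current query.
    private final List<double[]> pinned = new ArrayList<>();

    public void put(String key, double[] segment) {
        cache.put(key, new WeakReference<>(segment));
    }

    // Pin a segment: take a strong reference so the GC cannot collect
    // it while the query executes. Returns null if already collected
    // or never cached.
    public double[] pin(String key) {
        WeakReference<double[]> ref = cache.get(key);
        double[] segment = (ref == null) ? null : ref.get();
        if (segment != null) {
            pinned.add(segment);
        }
        return segment;
    }

    // Release all pins once the query is done; segments return to
    // their weak state and become eligible for collection.
    public void unpinAll() {
        pinned.clear();
    }
}
```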

Now, there are obvious gotchas. First off, what if it takes a long time
for the RDBMS to populate a segment? If that segment ever gets picked up
by the garbage collector, the same MDX query sent to Mondrian *might* take
much longer to execute, depending on whether its segments are still in the
cache or not. This is not acceptable, simply because it makes all
performance predictions impossible.

This is where the SegmentCache SPI comes in. It is essentially a pluggable
cache for segments. The algorithm behind the segment loader becomes this:

- Look up segments in the local cache and pin those required.
- Optimize / group the segments.
- Look up the remaining segments in the SPI cache.
- Load the segments found in the SPI cache.
- Populate the remaining unloaded segments from the RDBMS.
- Put the segments that came from the RDBMS into the SPI cache.
- Pin all loaded segments.
- Resolve the query.
- Unpin all segments in the local cache.
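The steps above can be sketched as a simplified loader; every name and
type here is an illustrative stand-in for Mondrian's internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LoaderSketch {
    // For each requested segment, record which tier satisfied it,
    // following the lookup order described above: local cache first,
    // then the SPI cache, then the RDBMS (which also back-fills the
    // SPI cache).
    public static List<String> load(List<String> requested,
                                    Set<String> localCache,
                                    Set<String> spiCache) {
        List<String> sources = new ArrayList<>();
        for (String segment : requested) {
            if (localCache.contains(segment)) {
                sources.add("local");      // found and pinned locally
            } else if (spiCache.contains(segment)) {
                sources.add("spi");        // loaded from the SPI cache
                localCache.add(segment);
            } else {
                sources.add("rdbms");      // populated by a SQL query
                spiCache.add(segment);     // then put into the SPI cache
                localCache.add(segment);
            }
        }
        return sources;
    }
}
```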

But wait! There is more! The SegmentCache SPI is trivial to implement.

Future<Boolean> contains(SegmentHeader header);

Future<SegmentBody> get(SegmentHeader header);

Future<Boolean> put(SegmentHeader header, SegmentBody body);
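As a sketch of how trivial an implementation can be, here is a toy
in-memory version of those three methods. The Header and Body classes
below are placeholders standing in for Mondrian's real SegmentHeader and
SegmentBody; CompletableFuture satisfies the Future return types, with
each future completing immediately against a local map:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class InMemorySegmentCache {
    // Placeholder types standing in for SegmentHeader / SegmentBody.
    public static class Header {
        public final String key;
        public Header(String key) { this.key = key; }
    }
    public static class Body {
        public final double[] data;
        public Body(double[] data) { this.data = data; }
    }

    private final Map<String, Body> store = new ConcurrentHashMap<>();

    public CompletableFuture<Boolean> contains(Header header) {
        return CompletableFuture.completedFuture(store.containsKey(header.key));
    }

    public CompletableFuture<Body> get(Header header) {
        return CompletableFuture.completedFuture(store.get(header.key));
    }

    public CompletableFuture<Boolean> put(Header header, Body body) {
        store.put(header.key, body);
        return CompletableFuture.completedFuture(Boolean.TRUE);
    }
}
```

A real implementation would of course complete the futures asynchronously
against a remote or disk-backed store rather than a local map.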

There are two assumptions made about implementations. The first, obvious
one is that the cache must assume that many Mondrian instances might
access it concurrently, from different threads. We therefore recommend
using the Actor pattern, or something similar, to enforce thread safety.
The second is that SegmentCache implementations will be instantiated very
often. We therefore recommend using a facade object which relays calls to
the actual segment cache code.
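A minimal sketch of that facade idea, with illustrative names: each facade
instance is cheap and stateless, and simply relays to a single shared,
long-lived backing store, so frequent instantiation costs next to nothing:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SegmentCacheFacade {
    // One shared backing store per JVM; facade instances are cheap
    // shells that relay every call to it.
    private static final Map<String, double[]> SHARED = new ConcurrentHashMap<>();

    public boolean contains(String key) { return SHARED.containsKey(key); }
    public double[] get(String key)     { return SHARED.get(key); }
    public boolean put(String key, double[] body) {
        SHARED.put(key, body);
        return true;
    }
}
```

In a real facade the shared delegate would be the actual cache code (for
example, an actor thread owning the connection to a remote cache), not a
static map.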

As for the storage of the SegmentHeader and SegmentBody objects, we tried
to make them as simple and flexible as possible. Both objects are fully
serializable and immutable. They are also specially crafted to use dense
arrays of primitive data types, and we make extensive use of native Java
functions when copying data to and from the cache within Mondrian's
internals.
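For illustration only: the kind of native copy in question is presumably
along the lines of System.arraycopy over a dense primitive array, which
avoids boxing each cell value. The helper below is hypothetical, not a
Mondrian method:

```java
public class ArrayCopySketch {
    // Copy a dense segment payload with System.arraycopy, a JVM
    // intrinsic that moves the primitive array in bulk.
    public static double[] copySegment(double[] segment) {
        double[] copy = new double[segment.length];
        System.arraycopy(segment, 0, copy, 0, segment.length);
        return copy;
    }

    public static void main(String[] args) {
        double[] copy = copySegment(new double[] {1346.34, 234.00});
        System.out.println(copy.length);
    }
}
```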

The bottom line is that from now on the Mondrian community will be free to
implement segment caches to fit their needs. We will be rolling out a few
default implementations and examples, obviously. One neat implementation
could page the segments to a super fast array of SSD drives. Another could
store the segments in Terracotta, Ehcache or Infinispan, or just about any
scalable caching system out there. So if any of you are interested in
implementing this SPI for your business and would like to either share
your experiences or contribute those implementations, don't hesitate to
contact us. Or me directly.

There is more goodness to come, but that's it for now. Stay tuned!

_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian