Hitachi Vantara Pentaho Community Forums

Thread: [Mondrian] Mondrian SegmentCache SPI

  1. #1
    Luc Boudreau Guest

    [Mondrian] Mondrian SegmentCache SPI

    Fellow Mondrian developers and users,

    One month has already passed since the new year festivities, and while most
    of you have been trying to renew your gym memberships or hold on to your
    new year's resolutions as best you could, so has the Mondrian team. Our
    resolutions, although not requiring personal sacrifices, are nonetheless
    starting to bear fruit.

    You see, our resolution for the year was to provide Mondrian developers
    and integrators with the means to achieve better understanding, scalability
    and control. We have many ideas on how to reach those goals. Some are still
    in their infancy, yet some have already been committed to the source. Last
    month, we worked on the first phase: we added the means for system
    architects to externalize and share a pluggable segment cache. What does
    this mean exactly? Let's take a step back in order to better understand.

    Internally, Mondrian splits the tuples into segments. A typical segment
    can be described as a measure crossjoined by a series of predicates. As an
    example, a textual representation of a segment's contents could be:

    Measure = [ Sales ]
    Predicates = {
    [ Products = * ],
    [ State = California ],
    [ Gender = Male ] }
    Data = [ 1346.34, 234.00, ... ]

    In the case above, the segment represents the Sales data of all males in
    California, for all products. It is a lot more efficient to deal with
    these data structures. If Mondrian were to represent each data cell
    individually, the unique identifier of a cell would be larger than the
    data itself, creating a whole new set of problems in terms of data
    efficiency. This is why Mondrian deals with groups of cells, which it
    loads in batches, rather than with individual cells. There is a lot of
    voodoo magic and heuristics in the background trying to figure out how
    best to group those segments and how to reduce the number of segments to
    load, ultimately reducing the number of SQL queries to be executed.
    Mondrian will group all segments with the same predicates but a different
    measure into a segment group. Mondrian will also tend to remove as many
    predicates as it possibly can in order to optimize the data payload. Say a
    segment covers all products except a single one; Mondrian will still
    include that product in the segment but filter it out when a specific
    query requires it.
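    To make the structure concrete, here is a minimal sketch of a segment as a
    header-plus-data pair. The class and field names are illustrative
    stand-ins, not Mondrian's actual SegmentHeader/SegmentBody classes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for a segment: one measure, a predicate per column
// ("*" meaning unconstrained), and a dense array of cell values.
public final class SegmentSketch {
    final String measure;
    final Map<String, String> predicates;
    final double[] data;

    SegmentSketch(String measure, Map<String, String> predicates, double[] data) {
        this.measure = measure;
        this.predicates = new LinkedHashMap<>(predicates);
        this.data = data.clone();
    }

    public static void main(String[] args) {
        Map<String, String> preds = new LinkedHashMap<>();
        preds.put("Products", "*");           // wildcard: all products
        preds.put("State", "California");
        preds.put("Gender", "Male");
        SegmentSketch s = new SegmentSketch(
                "Sales", preds, new double[] {1346.34, 234.00});
        System.out.println(s.measure + " covers " + s.data.length + " cells");
    }
}
```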

    Once those segments are populated, Mondrian keeps them in a collection of
    weak references in local memory. All required segment references are
    pinned down while a particular query is being resolved, but as soon as the
    query is done executing, the references are returned to their weak state,
    ready to be garbage collected if needed. This simple mechanism allows
    Mondrian to answer just about any query, as long as the memory allocated
    is big enough for that particular query. This works really well in
    practice, since in most small deployments the maximum amount of memory is
    never reached. And if memory ever does fill up, old segments are evicted
    to make room for the new ones.
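    A minimal sketch of this weak-reference scheme (not Mondrian's actual
    code) might look like the following, with pinning modeled as a strong
    reference held for the duration of a query:

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical weak-reference segment store: entries may be garbage
// collected unless they are pinned (held strongly) by a running query.
public final class WeakSegmentStore<K, V> {
    private final Map<K, WeakReference<V>> cache = new HashMap<>();
    private final Set<V> pinned = new HashSet<>();  // strong refs during a query

    public void put(K key, V segment) {
        cache.put(key, new WeakReference<>(segment));
    }

    public V get(K key) {
        WeakReference<V> ref = cache.get(key);
        return ref == null ? null : ref.get();  // null once collected
    }

    public V pin(K key) {        // called while resolving a query
        V segment = get(key);
        if (segment != null) pinned.add(segment);
        return segment;
    }

    public void unpinAll() {     // called once the query is done executing
        pinned.clear();
    }

    public static void main(String[] args) {
        WeakSegmentStore<String, double[]> store = new WeakSegmentStore<>();
        store.put("Sales/CA/Male", new double[] {1346.34, 234.00});
        double[] segment = store.pin("Sales/CA/Male");  // safe from GC for now
        System.out.println(segment != null);
        store.unpinAll();                               // eligible for GC again
    }
}
```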

    Now, there are obvious gotchas. First off, what if it takes a long time
    for a segment to be populated by the RDBMS? If a particular segment ever
    gets picked up by the garbage collector, the MDX query sent to Mondrian
    *might* take longer to execute, whether the segment had previously been
    cached or not. This is not acceptable, simply because it makes all
    performance predictions impossible.

    This is where the SegmentCache SPI comes in. It is essentially a pluggable
    cache for segments. The algorithm behind the segment loader becomes this:

    - Look up segments in the local cache and pin those required
    - Optimize / group the segments
    - Look up the segments in the SPI cache
    - Load the segments found in the SPI cache
    - Populate the remaining unloaded segments from the RDBMS
    - Put the segments which came from the RDBMS into the SPI cache
    - Pin all loaded segments
    - Resolve the query
    - Unpin all segments in the local cache
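    The steps above can be sketched as a toy walk-through, with plain maps
    standing in for the local cache, the SPI cache and the RDBMS (all names
    here are illustrative, not Mondrian's internal API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy version of the loader algorithm: local cache first, then the SPI
// cache, then the RDBMS; fresh RDBMS results are written back to the SPI
// cache so other Mondrian instances can reuse them.
public final class LoaderSketch {
    static final Map<String, double[]> LOCAL = new HashMap<>();
    static final Map<String, double[]> SPI = new HashMap<>();

    static double[] load(String header) {
        double[] segment = LOCAL.get(header);     // look up in the local cache
        if (segment == null) {
            segment = SPI.get(header);            // look up in the SPI cache
            if (segment == null) {
                segment = queryRdbms(header);     // populate from the RDBMS
                SPI.put(header, segment);         // share via the SPI cache
            }
            LOCAL.put(header, segment);           // keep locally (pinned)
        }
        return segment;                           // ready to resolve the query
    }

    static double[] queryRdbms(String header) {   // stand-in for SQL execution
        return new double[] {1346.34, 234.00};
    }

    public static void main(String[] args) {
        load("Sales/CA/Male");
        System.out.println(SPI.containsKey("Sales/CA/Male"));  // written back
    }
}
```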

    But wait! There is more! The SegmentCache SPI is trivial to implement.

    Future<Boolean> contains(SegmentHeader header);

    Future<SegmentBody> get(SegmentHeader header);

    Future<Boolean> put(SegmentHeader header, SegmentBody body);
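    As a sketch of how small an implementation can be, here is a hypothetical
    in-memory cache backing the three methods with a ConcurrentHashMap and
    already-completed futures. The SegmentHeader and SegmentBody records below
    are stand-ins for the real classes:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Future;

// Hypothetical in-memory SegmentCache: every operation completes
// immediately, so the futures are returned already resolved.
public final class InMemorySegmentCache {
    // stand-ins for Mondrian's SegmentHeader / SegmentBody
    record SegmentHeader(String id) {}
    record SegmentBody(double[] data) {}

    private final ConcurrentMap<SegmentHeader, SegmentBody> store =
            new ConcurrentHashMap<>();

    public Future<Boolean> contains(SegmentHeader header) {
        return CompletableFuture.completedFuture(store.containsKey(header));
    }

    public Future<SegmentBody> get(SegmentHeader header) {
        return CompletableFuture.completedFuture(store.get(header));
    }

    public Future<Boolean> put(SegmentHeader header, SegmentBody body) {
        store.put(header, body);
        return CompletableFuture.completedFuture(Boolean.TRUE);
    }

    public static void main(String[] args) throws Exception {
        InMemorySegmentCache cache = new InMemorySegmentCache();
        SegmentHeader header = new SegmentHeader("Sales/CA/Male");
        cache.put(header, new SegmentBody(new double[] {1346.34})).get();
        System.out.println(cache.contains(header).get());
    }
}
```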

    There are two assumptions made about implementations. The first, obvious
    one is that the cache must assume that many Mondrian instances might
    access it concurrently, from different threads. We therefore recommend
    using the Actor pattern, or anything similar, to enforce thread safety.
    The second is that SegmentCache implementations will be instantiated very
    often. We therefore recommend using a facade object which relays calls to
    the actual segment cache code.
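    One way to satisfy both recommendations at once, sketched here under
    assumed names, is to make the facade cheap to construct while routing
    every call through one shared single-threaded executor that owns the real
    state, actor-style:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical facade: instances are cheap to create (Mondrian may create
// them often), while all state lives behind one single-threaded "actor",
// so the real cache code never sees two threads at once.
public final class SegmentCacheFacade {
    private static final ExecutorService ACTOR =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "segment-cache-actor");
                t.setDaemon(true);   // don't keep the JVM alive
                return t;
            });
    private static final Map<String, double[]> STORE = new HashMap<>();

    public Future<Boolean> put(String header, double[] body) {
        return ACTOR.submit(() -> {
            STORE.put(header, body);   // only the actor thread touches STORE
            return Boolean.TRUE;
        });
    }

    public Future<double[]> get(String header) {
        return ACTOR.submit(() -> STORE.get(header));
    }

    public static void main(String[] args) throws Exception {
        SegmentCacheFacade a = new SegmentCacheFacade();  // instantiated often...
        SegmentCacheFacade b = new SegmentCacheFacade();  // ...but share one actor
        a.put("Sales/CA/Male", new double[] {1346.34}).get();
        System.out.println(b.get("Sales/CA/Male").get() != null);
    }
}
```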

    As for the storage of the SegmentHeader and SegmentBody objects, we tried to
    make it as simple and flexible as possible. Both objects are fully
    serializable and are immutable. They are also specially crafted to use dense
    arrays of primitive data types. We also tried to make extensive use of Java
    native functions when copying the data to / from the cache within Mondrian
    internals.

    The bottom line is that from now on the Mondrian community will be free to
    implement segment caches to fit their needs. We will be rolling out a few
    default implementations and examples, obviously. One neat implementation
    could be one which pages the segments to a super fast array of SSD drives.
    Another one could store the segments in Terracotta, Ehcache or
    Infinispan, or just about any scalable caching system out there. So
    if any of you out there are interested in implementing this SPI for your
    business and would like to either share your experiences or contribute those
    implementations, don't hesitate to contact us. Or me directly.

    There is more goodness to come, but that's it for now. Stay tuned!

    _______________________________________________
    Mondrian mailing list
    Mondrian (AT) pentaho (DOT) org
    http://lists.pentaho.org/mailman/listinfo/mondrian

  2. #2
    Julian Hyde Guest

    RE: [Mondrian] Mondrian SegmentCache SPI

    Luc,

    This is a great start to the pluggable-cache project. The SPI is clear and
    high-level. Thanks for seizing the initiative.

    I think we will need one more method:

    Future<List<SegmentHeader>> listSegments()

    This will allow Mondrian to connect to a cache and see what it contains. (An
    external cache may have been running longer than the Mondrian server.)

    This method is necessary because of a peculiarity of Mondrian's caching
    strategy, wherein there is not a simple mapping from a cell to the segment
    that contains it. For example, consider a more conventional cache: a CPU's
    L3 (level 3) cache, which caches 64K blocks of RAM. The byte at address
    0xABCD1234 belongs to one and only one cache block, the one that starts at
    0xABCD0000 and ends at 0xABCDFFFF.
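    In that conventional scheme the mapping is pure arithmetic: masking off
    the low 16 bits of an address yields the one 64K-aligned block that
    contains it, as this small demo shows:

```java
// Demonstrates the fixed byte-to-block mapping of a conventional cache:
// clearing the low 16 bits of an address gives the start of its 64K block.
public final class CacheBlockDemo {
    static long blockStart(long address) {
        return address & 0xFFFF0000L;               // clear the low 16 bits
    }

    static long blockEnd(long address) {
        return blockStart(address) | 0x0000FFFFL;   // last byte of the block
    }

    public static void main(String[] args) {
        long address = 0xABCD1234L;
        System.out.println(Long.toHexString(blockStart(address)));  // abcd0000
        System.out.println(Long.toHexString(blockEnd(address)));    // abcdffff
    }
}
```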

    Now consider Mondrian's cache. On one day, the cell ([Unit Sales], [CA],
    [2010]) might be in the segment ([Unit Sales], {[CA], [OR]}, {[2009],
    [2010]}); the next day, it might be in ([Unit Sales], {[CA]}, {year=*}).
    What segments exist depends on what queries have run earlier in the day.
    This is different from a typical cache, but it works well, and is
    absolutely appropriate for a ROLAP system. The listSegments method makes
    the cache work; Mondrian can then index the segments and quickly find the
    ones that contain a given cell.

    If a cache can add and remove segments without Mondrian knowing about it, we
    may also need to give the cache some way to notify Mondrian about changes to
    the list of segments.
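    A hypothetical sketch of the proposed method, with a stand-in header type,
    showing how a freshly started Mondrian instance could discover what an
    external, longer-running cache already holds:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

// Hypothetical cache exposing listSegments(); SegmentHeader here is a
// stand-in for Mondrian's real class.
public final class ListableCache {
    record SegmentHeader(String measure, String predicates) {}

    private final Set<SegmentHeader> store = ConcurrentHashMap.newKeySet();

    public void put(SegmentHeader header) {
        store.add(header);
    }

    public Future<List<SegmentHeader>> listSegments() {
        // snapshot of everything the cache currently holds
        return CompletableFuture.completedFuture(new ArrayList<>(store));
    }

    public static void main(String[] args) throws Exception {
        ListableCache external = new ListableCache();  // outlived the server
        external.put(new SegmentHeader("Unit Sales", "{[CA]}, {year=*}"));
        // a newly started Mondrian instance indexes what is already cached
        System.out.println(external.listSegments().get().size());
    }
}
```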

    Julian




    _____

    From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org] On
    Behalf Of Luc Boudreau
    Sent: Thursday, February 03, 2011 1:31 PM
    To: Mondrian developer mailing list
    Subject: [Mondrian] Mondrian SegmentCache SPI


