PDA

View Full Version : [Mondrian] RE: List of Members memory usage



Julian Hyde
01-09-2007, 12:01 AM
We discussed this change over the phone this afternoon, but I wanted to
write up what we decided.

There's no question that mondrian has problems dealing with large datasets.
The problem requires a number of solutions, some tactical and some
strategic. Solutions include: more efficient structures for representing
lists (such as your MemberList); use of iterators, both in the
implementation and mondrian's public API; and paging to disk. These
solutions are not orthogonal; in particular, iterators tend to make
disk-based algorithms more efficient, because they encourage locality of
<http://en.wikipedia.org/wiki/Locality_of_reference> reference.

Also, I claim that the use case of a large result set is less common in
practice than the use case of a large intermediate set - for the simple
reason that a result set needs to be read and comprehended by a human being.

1. mondrian.olap.Result

We decided to change the mondrian.olap.Result API. Currently, Result
contains a number of Axis objects, each of which has a member 'Position[]
positions', and each Position has a member 'Member[] members'. We decided to
make both Axis and Position interfaces:

interface Result {
Axes[] getAxes();
...
}

interface Axis {
List<Position> getPositions();
}

interface Position extends List<Member> {
}

This will give us more flexibility in how results are implemented. In
particular, if an axis is huge, say 1M positions each with 3 members, we
won't necessarily have to allocate ~3M elements of space. We will also be
able to use an implementation which returns an iterator, and make it appear
as a list from the client's perspective.

We will examine large axes again in the olap4j specification. I suspect that
olap4j will allow the client to choose between list-based access,
unidirectional cursors and bidirectional cursors.

2. Iterators

Mondrian's expression evaluation framework represents MDX sets as
java.util.List objects: List<Member> for sets of members, List<Member[]> for
sets of tuples.

Richard pointed out that the result of CrossJoin( <set of 1000 products>,
<set of 400 customers> ) requires ~800K (= 1000 x 400 x 2) units of memory,
an suggested an innovative custom implementation of List could just store
the original lists, and use ~1.4K (= 1000 + 400) units of memory.

Julian pointed out that an iterator implementation of CrossJoin would be
better: it would require ~1K (= max(1000, 400)) units of memory in the worst
case.

If the result of the CrossJoin is the ultimate result, then CrossJoinList is
more efficient. But CrossJoinIter allows us to start introducing
iterator-based algorithms, incrementally, throughout mondrian.

ExpCompiler has an enum ResultStyle { ANY, MUTABLE_LIST, LIST, VALUE }.
Let's add a new value ITERABLE, and a new ExpCompiler method

IterCalc compileIter(Exp exp)

This method will implement an expression as an Iterable if possible,
otherwise implement it as a List an return an iterator over that list. In
other words, the consumer is calling the shots, and the producer does its
best to comply. If the client wants to say 'what works best for you?', it
should call ExpCompiler.compile(Exp,ResultStyle[]) and give a ranked list of
styles.

Also, let's make Axis.getPositions() return different implementations of
List depending on whether the result is (a) a list, or (b) an Iterable.

Julian




> -----Original Message-----
> From: Richard Emberson [mailto:remberson (AT) edgedynamics (DOT) com]
> Sent: Friday, January 05, 2007 12:44 PM
> To: Julian Hyde
> Subject: List of Members memory usage
>
> Julian,
>
> I was looking into memory use and the aggregate cache and
> got side tracked.
> Last Summer we had a query that for our medium-size customer test
> database would always run out of memory - it had a large crossjoin.
> So, I started looking at how one might return lists of Members
> and lists of arrays of Members using less memory.
> Attached is a collection of files (please read the README file)
> that demonstrates a possible approach. I used just such an approach
> and the size of data structure returned by the crossjoin function
> went from 712072200 bytes to 401824 bytes.
>
> The actual class I use to test this idea in the Mondrian code derives
> from List<Member[]> and I only changed the CrossJoinFunDef,
> RolapResult and Position classes (and all classes that
> directly accessed
> a Position's instance variables). To implement it across the
> board would
> require changing the type returned by the evaluateList method.
>
> That said, the modified Mondrian code passes all junits except those
> where code is involved that assumes that the Member array returned
> by the list of Members for a given index is always the same array;
> such code uses the Member array as a Map key and with the new
> MemoryList
> classes, it may not be the case that two calls for the same
> list of Members
> element returns the same object.
>
> Whats nice about having the list of Members, Position and,
> possibly, the
> array of Positions created in RolapResult be based upon
> interfaces is that
> one can adjust which implementation to use base upon available memory
> (and in java 5 one can use memory notifications to change the
> implementation as it is being used).
>
> I certainly was not going to make such a sweeping change without
> a little discussion and more testing. Thoughts?
>
> Richard
>
> --
> This email message is for the sole use of the intended
> recipient(s) and
> may contain confidential information. Any unauthorized review, use,
> disclosure or distribution is prohibited. If you are not the intended
> recipient, please contact the sender by reply email and destroy all
> copies of the original message.
>


_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Richard Emberson
01-09-2007, 11:30 AM
Concerning the new Position interface:

interface Position extends List<Member> {
}

we also agreed that it would be an unmodifiable list
and that most of its methods would throw an
UnsupportedOperationException.

Specifically, the only methods supported would be:

boolean isEmpty();
int size();
Iterator<Member> iterator();
Member get(int index);

and possibly

ListIterator<Member> listIterator();
ListIterator<Member> listIterator(int index);

where the iterators would be readonly.

Concerning ITERABLE:

1) ITERABLEs are readonly,

2) it should be noted that when a 'sort' is required,
then the collection of Members must be materialized
because a sort requires the ability to modify the
order of the Members in the collection.
[With xslt's one can stream the transformation of
a document unless one does a 'sort', in which case, the
whole document has to be read into memory at the same
time. There is a similar issue when using LISTs
or ITERABLEs when a sort is encountered.]

3) Iterable is a Java5 interface:

public interface Iterable<T> {
Iterator<T> iterator();
}

which means that the consumer can not tell how many members it
has without iterating through them.

Richard



Julian Hyde wrote:
> We discussed this change over the phone this afternoon, but I wanted to
> write up what we decided.
>
> There's no question that mondrian has problems dealing with large
> datasets. The problem requires a number of solutions, some tactical and
> some strategic. Solutions include: more efficient structures for
> representing lists (such as your MemberList); use of iterators, both in
> the implementation and mondrian's public API; and paging to disk. These
> solutions are not orthogonal; in particular, iterators tend to make
> disk-based algorithms more efficient, because they encourage locality of
> reference <http://en.wikipedia.org/wiki/Locality_of_reference>.
>
> Also, I claim that the use case of a large /result set /is less common
> in practice than the use case of a large /intermediate /set - for the
> simple reason that a result set needs to be read and comprehended by a
> human being.
>
> 1. mondrian.olap.Result
>
> We decided to change the mondrian.olap.Result API. Currently, Result
> contains a number of Axis objects, each of which has a member
> 'Position[] positions', and each Position has a member 'Member[]
> members'. We decided to make both Axis and Position interfaces:
>
> interface Result {
> Axes[] getAxes();
> ...
> }
>
> interface Axis {
> List<Position> getPositions();
> }
>
> interface Position extends List<Member> {
> }
>
> This will give us more flexibility in how results are implemented. In
> particular, if an axis is huge, say 1M positions each with 3 members, we
> won't necessarily have to allocate ~3M elements of space. We will also
> be able to use an implementation which returns an iterator, and make it
> appear as a list from the client's perspective.
>
> We will examine large axes again in the olap4j specification. I suspect
> that olap4j will allow the client to choose between list-based access,
> unidirectional cursors and bidirectional cursors.
>
> 2. Iterators
>
> Mondrian's expression evaluation framework represents MDX sets as
> java.util.List objects: List<Member> for sets of members, List<Member[]>
> for sets of tuples.
>
> Richard pointed out that the result of CrossJoin( <set of 1000
> products>, <set of 400 customers> ) requires ~800K (= 1000 x 400 x 2)
> units of memory, an suggested an innovative custom implementation of
> List could just store the original lists, and use ~1.4K (= 1000 + 400)
> units of memory.
>
> Julian pointed out that an iterator implementation of CrossJoin would be
> better: it would require ~1K (= max(1000, 400)) units of memory in the
> worst case.
>
> If the result of the CrossJoin is the ultimate result, then
> CrossJoinList is more efficient. But CrossJoinIter allows us to start
> introducing iterator-based algorithms, incrementally, throughout mondrian.
>
> ExpCompiler has an enum ResultStyle { ANY, MUTABLE_LIST, LIST, VALUE }.
> Let's add a new value ITERABLE, and a new ExpCompiler method
>
> IterCalc compileIter(Exp exp)
>
> This method will implement an expression as an Iterable if possible,
> otherwise implement it as a List an return an iterator over that
> list. In other words, the consumer is calling the shots, and the
> producer does its best to comply. If the client wants to say 'what works
> best for you?', it should call ExpCompiler.compile(Exp,ResultStyle[])
> and give a ranked list of styles.
>
> Also, let's make Axis.getPositions() return different implementations of
> List depending on whether the result is (a) a list, or (b) an Iterable.
>
> Julian
>
>
>
>
> > -----Original Message-----
> > From: Richard Emberson [mailto:remberson (AT) edgedynamics (DOT) com]
> > Sent: Friday, January 05, 2007 12:44 PM
> > To: Julian Hyde
> > Subject: List of Members memory usage
> >
> > Julian,
> >
> > I was looking into memory use and the aggregate cache and
> > got side tracked.
> > Last Summer we had a query that for our medium-size customer test
> > database would always run out of memory - it had a large crossjoin.
> > So, I started looking at how one might return lists of Members
> > and lists of arrays of Members using less memory.
> > Attached is a collection of files (please read the README file)
> > that demonstrates a possible approach. I used just such an approach
> > and the size of data structure returned by the crossjoin function
> > went from 712072200 bytes to 401824 bytes.
> >
> > The actual class I use to test this idea in the Mondrian code derives
> > from List<Member[]> and I only changed the CrossJoinFunDef,
> > RolapResult and Position classes (and all classes that
> > directly accessed
> > a Position's instance variables). To implement it across the
> > board would
> > require changing the type returned by the evaluateList method.
> >
> > That said, the modified Mondrian code passes all junits except those
> > where code is involved that assumes that the Member array returned
> > by the list of Members for a given index is always the same array;
> > such code uses the Member array as a Map key and with the new
> > MemoryList
> > classes, it may not be the case that two calls for the same
> > list of Members
> > element returns the same object.
> >
> > Whats nice about having the list of Members, Position and,
> > possibly, the
> > array of Positions created in RolapResult be based upon
> > interfaces is that
> > one can adjust which implementation to use base upon available memory
> > (and in java 5 one can use memory notifications to change the
> > implementation as it is being used).
> >
> > I certainly was not going to make such a sweeping change without
> > a little discussion and more testing. Thoughts?
> >
> > Richard
> >
> > --
> > This email message is for the sole use of the intended
> > recipient(s) and
> > may contain confidential information. Any unauthorized review, use,
> > disclosure or distribution is prohibited. If you are not the intended
> > recipient, please contact the sender by reply email and destroy all
> > copies of the original message.
> >
>


--
This email message is for the sole use of the intended recipient(s) and
may contain confidential information. Any unauthorized review, use,
disclosure or distribution is prohibited. If you are not the intended
recipient, please contact the sender by reply email and destroy all
copies of the original message.
_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian