[Mondrian] Multithreading etc



michael bienstein
03-09-2007, 06:42 AM
I sent this to the list but it bounced because I attached the code in a zip file. How do I send code through without checking it in? It is still orthogonal to the codebase.

Michael
----

Well, I have code that works for multi-threading infrastructure so I
would like to know if it is worth continuing with this or not.

As for ROLLUP/CUBE my thoughts are:
1) Either we keep the codebase simple by sticking to a standard (SQL2003)
even if this standard is not yet implemented widely and certain
databases have better special features than others, or we allow a
per-database SQL generation system. The argument for the second makes
sense only if the developer resources to write and maintain each
dialect come from the database vendor or their community. Mondrian is
probably at a stage where such discussions can be undertaken with the
database vendors.
2) Architecturally this implies loading multiple
Aggregations from one SQL query. That requires a rethink of the way
the cell cache loading is done because at the moment an Aggregation is
loaded one at a time and in a synchronized block on the Aggregation.
Similar concerns have to be dealt with for in-memory rollups. I think
that synchronized is too forceful. We need something more like a Lock
from java.util.concurrent so we can do tryLock(). Look at the TxLock
idea I have in the code I'm attaching.
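
To make the contrast with synchronized concrete, here is a minimal sketch using the standard java.util.concurrent.locks.ReentrantLock. It is not the TxLock from the attached code, which is not shown in this thread; the class GuardedAggregation and its methods are hypothetical names used only for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for an aggregation guarded by an explicit lock. */
final class GuardedAggregation {
    private final ReentrantLock lock = new ReentrantLock();
    private int loadCount; // guarded by lock

    /**
     * Attempts a load without blocking. Returns false at once if another
     * thread holds the lock, where a synchronized block would wait.
     */
    boolean tryLoad() {
        if (!lock.tryLock()) {
            return false; // caller can do other work and retry later
        }
        try {
            loadCount++; // placeholder for the real cell loading
            return true;
        } finally {
            lock.unlock();
        }
    }

    int loadCount() {
        lock.lock();
        try {
            return loadCount;
        } finally {
            lock.unlock();
        }
    }

    /** Demo helper: holds the lock here while a second thread tries to load. */
    boolean tryLoadFromOtherThreadWhileHeld() {
        final boolean[] result = new boolean[1];
        lock.lock();
        try {
            Thread other = new Thread(() -> result[0] = tryLoad());
            other.start();
            try {
                other.join(); // join makes the other thread's write visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        } finally {
            lock.unlock();
        }
        return result[0];
    }

    public static void main(String[] args) {
        GuardedAggregation agg = new GuardedAggregation();
        System.out.println("uncontended: " + agg.tryLoad());
        System.out.println("contended:   " + agg.tryLoadFromOtherThreadWhileHeld());
    }
}
```

The point of the sketch is only that the contended caller gets an immediate answer and can decide what to do next, rather than being parked until the loader finishes.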

As for multi-threading:
I have only written most of the base infrastructure, not the cell
loading. To integrate would require a significant amount of work in
Mondrian's code to pass all interaction with Mondrian through
TxSystem.runWithTx().

Basic concerns are:
1) Threads should be able to share data related to the request across
the threads.
2) A Thread should be loaned to a request and returned in a way that is
well-nigh fail-safe (i.e. the thread shouldn't keep running if the
request fails in some way).

Laurent Valdes
03-09-2007, 08:20 AM
Hi,

As far as I'm concerned, it would be a very good idea to use threads
more extensively.
Would it be possible to provide a schematic of Mondrian's cache architecture?

Regarding Tasks and JDBC, is there some thread subscription code that lets
them get data from a pool of objects?

Please forgive me, as I'm a newbie with Mondrian.
Have a good day!

Laurent.



Julian Hyde
03-11-2007, 03:33 AM
_____

From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org]
On Behalf Of michael bienstein
Sent: Friday, March 09, 2007 2:36 AM
To: Mondrian developer mailing list
Subject: [Mondrian] Multithreading etc


I sent this to the list but it gets bounced because I attached the code
in a zip file. How do I send code through without checking it in
because it is still orthogonal to the codebase?

Have you tried attaching a zip file to a forum thread?

Alternatively, you could send to mondrian-devel. That list still works,
although it's not much used anymore.



Well, I have code that works for multi-threading infrastructure so I
would like to know if it is worth continuing with this or not.

As for ROLLUP/CUBE my thoughts are:
1) Either we keep the codebase simple by sticking to a standard
(SQL2003) even if this standard is not yet implemented widely and
certain databases have better special features than others, or we allow
a per-database SQL generation system. The argument for the second makes
sense only if the developer resources to write and maintain each dialect
come from the database vendor or their community. Mondrian is probably
at a stage where such discussions can be undertaken with the database
vendors.


I've been talking with Matt Campbell (mkambol) about how this could be
implemented. Apparently Oracle, DB2 and Teradata (the main platforms of
interest to Matt) implement the "GROUP BY GROUPING SETS" construct which
we will need, with the same syntax.

Grouping sets are good because they allow us to specify exactly which
groups we want the DBMS to return. If we had used the ROLLUP construct,
we would have had to write logic in mondrian to figure out which
aggregations could be grouped together in the same query. But with
GROUPING SETS, the DBMS can figure out which aggregations can be
computed by rolling up others.

We will also need the GROUPING function.

Since these three databases support what we need, I am inclined to stick
to the standard. I haven't checked whether other databases support this
syntax, but I am hopeful that they do, or soon will.
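
As an illustration of the kind of statement being discussed, here is a small sketch that builds a GROUP BY GROUPING SETS query, with the GROUPING function on each column, from a list of column sets. The class GroupingSetsSql, the table sales_fact and the column names are hypothetical; this is not mondrian's SQL generator, and identifiers are assumed to be trusted.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper; not part of mondrian. */
final class GroupingSetsSql {
    /** Builds one query that returns several aggregation levels at once. */
    static String build(String table, String measure, List<List<String>> sets) {
        // Every column used by any grouping set, in first-seen order.
        Set<String> columns = new LinkedHashSet<>();
        sets.forEach(columns::addAll);

        List<String> select = new ArrayList<>();
        for (String c : columns) {
            select.add(c);
            // GROUPING(c) is 1 on rows where c has been rolled up, which
            // tells a rolled-up NULL apart from a NULL in the data.
            select.add("GROUPING(" + c + ") AS g_" + c);
        }
        select.add("SUM(" + measure + ") AS m");

        String groups = sets.stream()
            .map(s -> "(" + String.join(", ", s) + ")")
            .collect(Collectors.joining(", "));

        return "SELECT " + String.join(", ", select)
            + " FROM " + table
            + " GROUP BY GROUPING SETS (" + groups + ")";
    }

    public static void main(String[] args) {
        System.out.println(build("sales_fact", "unit_sales",
            List.of(List.of("the_year", "state"), List.of("the_year"), List.of())));
    }
}
```

Each inner list becomes one grouping set, so a single round trip can feed several aggregations; the g_ columns let the reader of the result set decide which aggregation a row belongs to.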



2) Architecturally this implies loading multiple Aggregations from one
SQL query. That requires a rethink of the way the cell cache loading is
done because at the moment an Aggregation is loaded one at a time and in
a synchronized block on the Aggregation. Similar concerns have to be
dealt with for in-memory rollups. I think that synchronized is too
forceful. We need something more like a Lock from java.util.concurrent
so we can do tryLock(). Look at the TxLock idea I have in the code I'm
attaching.

Yes, this issue came to light in our design discussions also.

I look forward to reading your code, but it occurs to me that we can
leverage aggregations' state of 'ready' or 'loading'. We could upgrade
this to a lock, so another thread can wait for a loading aggregation to
become ready.

Synchronized will still need to be used, and carefully, to ensure that
no thread ever sees the system in an inconsistent state.
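
One possible shape for a 'loading'/'ready' state that other threads can wait on, sketched with a CountDownLatch. The class LoadableSegment and its methods are hypothetical and are not mondrian's Aggregation API; the sketch also assumes a waiter simply times out if the loader never publishes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical cell container with a waitable loading/ready state. */
final class LoadableSegment {
    private final AtomicBoolean claimed = new AtomicBoolean(false);
    private final CountDownLatch ready = new CountDownLatch(1);
    private volatile int[] cells;

    /** Returns true for exactly one caller, which must then call publish(). */
    boolean claimLoad() {
        return claimed.compareAndSet(false, true);
    }

    /** Called by the loading thread once the cells are complete. */
    void publish(int[] loaded) {
        cells = loaded;
        ready.countDown(); // releases every waiting thread
    }

    boolean isReady() {
        return ready.getCount() == 0;
    }

    /** Waits up to the timeout; returns null if still loading or interrupted. */
    int[] awaitCells(long timeoutMillis) {
        try {
            return ready.await(timeoutMillis, TimeUnit.MILLISECONDS) ? cells : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        LoadableSegment segment = new LoadableSegment();
        if (segment.claimLoad()) {
            segment.publish(new int[] {1, 2, 3});
        }
        System.out.println("ready: " + segment.isReady());
    }
}
```

Readers never see a half-loaded array because the array reference is only written once, just before the latch opens.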



As for multi-threading:
I have only written most of the base infrastructure, not the cell
loading. To integrate would require a significant amount of work in
Mondrian's code to pass all interaction with Mondrian through
TxSystem.runWithTx().

Basic concerns are:

1) Threads should be able to share data related to the request
across the threads.
2) A Thread should be loaned to a request and returned in a way
that is well-nigh fail-safe (i.e. the thread shouldn't keep running if
the request fails in some way).
3) We should be able, via a parameter of some sort, to decide NOT to
use threads at all.
4) The number of threads should be configurable.
5) There should be an independence from the rest of the code base.
6) We should be able to make use of custom thread pools or use
managed thread pools from the application server.
7) Then there is a relatively minor issue with read-consistency for
near-real-time data that turns out to be a real head-ache. This can be
done by either: using the transaction semantics of the underlying data
store or modifying all SQL requests and cache interactions with a
timestamp and/or transaction id of some sort. E.g. when an MDX request
begins it asks the underlying data store for the id of the last
completed transaction that modified data and keeps this in a
request-scope available to all threads. Then it appends "changedTxId <=
${lastTxIdWhenFirstEntered}" to each WHERE clause. If however we use
the underlying data store's transactions then we must keep open the JDBC
Connection for the duration of the request reusing it on the same thread
for each interaction with that data store.
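
The second option in point 7 can be sketched as follows, assuming every fact row carries a changedTxId column as in the example predicate. The class SnapshotQuery and the table and column names other than changedTxId are hypothetical, and real code would more likely use a bind parameter than string concatenation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that pins every query of a request to one snapshot. */
final class SnapshotQuery {
    /**
     * Adds the snapshot predicate to the query's conditions so that all
     * queries issued for the same request see the same set of rows.
     */
    static String withSnapshot(String selectFrom, List<String> conditions,
                               String groupBy, long snapshotTxId) {
        List<String> all = new ArrayList<>(conditions);
        all.add("changedTxId <= " + snapshotTxId);
        String sql = selectFrom + " WHERE " + String.join(" AND ", all);
        return groupBy.isEmpty() ? sql : sql + " GROUP BY " + groupBy;
    }

    public static void main(String[] args) {
        // The id would be read once, when the MDX request begins.
        long lastTxIdWhenFirstEntered = 4711L;
        System.out.println(withSnapshot(
            "SELECT the_year, SUM(unit_sales) FROM sales_fact",
            List.of("the_year = 2007"), "the_year", lastTxIdWhenFirstEntered));
    }
}
```

Keeping the conditions as a list, rather than editing a finished SQL string, avoids having to find where the WHERE clause ends.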
Now, I think that the best way to take advantage of multiple threads in
the storage system is NOT to launch multiple SQL statements against
different aggregations of the same star schema, but rather to use
partitioning of data. That is, to segment the cell data (and maybe
dimension data) based
on values of certain columns. For example year<2007 and year=2007 in
two different partitions. This can be introduced slowly by simply
making a RolapStar one Partition for the moment. Having said that
aggregation tables are also a type of Partition and hitting two of them
at once should be quite easy.
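
A minimal sketch of the partitioning idea, assuming each partition (for example year<2007 and year=2007) can be loaded by an independent task whose results are then combined. PartitionedSum is a hypothetical name; the thread count is a parameter, and a value of 0 runs everything on the calling thread, in the spirit of concerns 3 and 4 above.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper that sums one measure across data partitions. */
final class PartitionedSum {
    static long sum(List<Callable<Long>> partitions, int threads) {
        if (threads <= 0) {
            // Threading switched off: run each partition on the caller's thread.
            long total = 0;
            for (Callable<Long> p : partitions) {
                total += call(p);
            }
            return total;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long total = 0;
            for (Future<Long> f : pool.invokeAll(partitions)) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading partitions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("partition load failed", e.getCause());
        } finally {
            pool.shutdownNow(); // no worker keeps running after the request ends
        }
    }

    private static long call(Callable<Long> task) {
        try {
            return task.call();
        } catch (Exception e) {
            throw new IllegalStateException("partition load failed", e);
        }
    }

    public static void main(String[] args) {
        // Stand-ins for "year<2007" and "year=2007" partition queries.
        List<Callable<Long>> partitions = List.of(() -> 10L, () -> 32L);
        System.out.println(sum(partitions, 2));
    }
}
```

A real implementation would borrow threads from a shared or application-server-managed pool instead of creating one per call; the per-call pool here only keeps the sketch self-contained.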
So the design I am introducing has the following features:
1) A scope for "request" or "interaction" that is larger than the Thread
that begins it. Since this is similar to a transaction I've called it a
Tx. See the mondrian.tx package. Each sub-system in Mondrian can
enlist a representation of itself in the Tx.
2) Break up the different tasks performed into Task objects that can be
run potentially in parallel. Allow a set of Tasks to be tied to the
same Thread so that the same JDBC Connection can be used for all of them
for read-consistency and cleaned up at the end of the Tx. This is done
declaratively so the implementation can be changed easily. The
implementation can also ensure that the J2EE context is passed onto
separate threads (JNDI, context class loader etc).
3) A system of fail-quick locks at the Tx scope rather than just Thread
scope.
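
To illustrate feature 3, here is one way a fail-quick lock could be owned by a request scope rather than by a thread. This is a guess at the general idea only: the mondrian.tx package is not shown in this thread, so the classes Tx and TxScopedLock below are hypothetical and are not Michael's implementation.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical request scope shared by all threads serving one request. */
final class Tx {
    private final String name;

    Tx(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "Tx(" + name + ")";
    }
}

/** Hypothetical fail-quick lock whose owner is a Tx, not a thread. */
final class TxScopedLock {
    private final AtomicReference<Tx> owner = new AtomicReference<>();

    /**
     * Never blocks. Succeeds if the lock is free or already held by the
     * same Tx, whichever thread is asking on that Tx's behalf.
     */
    boolean tryLock(Tx tx) {
        Objects.requireNonNull(tx, "tx");
        return owner.get() == tx || owner.compareAndSet(null, tx);
    }

    /** Releases the lock if the given Tx holds it. Holds are not counted. */
    boolean unlock(Tx tx) {
        Objects.requireNonNull(tx, "tx");
        return owner.compareAndSet(tx, null);
    }

    public static void main(String[] args) {
        TxScopedLock lock = new TxScopedLock();
        Tx first = new Tx("first");
        Tx second = new Tx("second");
        System.out.println(first + " acquires: " + lock.tryLock(first));
        System.out.println(second + " acquires: " + lock.tryLock(second));
    }
}
```

Because ownership is tied to the Tx object, several worker threads loaned to the same request can all pass the lock, while a competing request is turned away immediately.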

If this is worth pursuing as a design for the next version then good.
If not I'll stop now.


This definitely sounds plausible... I'd like to read through your code
before I answer in detail.

Julian

_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Pappyn Bart
03-13-2007, 04:40 AM
Hi Michael,



I don't see any problems for you to make changes to mondrian. But I have some concerns:

I think the changes you are about to make are quite huge and will have an impact on how mondrian behaves. Since this is the first source contribution you are about to make, I urge you not to check anything into perforce before it is actually working and passing all regression tests.

I think most developers of mondrian have ongoing projects that are using mondrian, and I think this is becoming a more and more important issue.

For me: it must be possible to flush aggregates and member cache using the plug-in, and cubes not maintaining cache should be able to load their own data without messing with the global cache.

And since a dynamic database cannot easily be simulated in a regression test, I think if you are serious about tackling the read-consistency for near-real-time data, you need a realistic (dynamic) database to test against. And the database must be large enough to be able to see realistic performance. It is also advisable to test with virtual cubes, cubes maintaining cache (with aggregate tables) in combination with cubes not maintaining cache (without aggregate tables), shared dimensions and so on...

I released my project 4 weeks ago, not even using the latest version from perforce, since at a given point mondrian-head was completely broken for me. While I know software always has some bugs that need to be patched, things that are not tested and are breaking mondrian should not be checked in. All too often I had to sync with perforce to solve a bug and this ended up in a nightmare, spending most of my time finding out what change was causing mondrian to break.

When mondrian 2.3 is released, it is most likely that there will be some 2.3.x version containing some patches. I think it must be possible to make those patches without having to drag huge new features along.



Bart



Julian Hyde
03-16-2007, 01:11 PM
If changes to mondrian are breaking your application, I sympathize. How
can we prevent that from happening? Unless we restrict ourselves to
trivial enhancements, we obviously need to test the new
functionality against existing apps, or at least against tests which
exercise existing apps' requirements.

Ideally, these tests would be in the standard regression suite. But
since some tests are too complicated to be in the standard suite,
testing which cannot be done nightly at least has to be done once per
release.

We already have a process in place for much of this. For example:

* Developers ensure that code changes don't break the regression
  suite.
* I run the regression suite nightly, in a wide set of
  configurations, and let the developers know next day if things break.
* All code changes - bug fixes, enhancements and ad hoc features -
  must be accompanied by a regression test which exercises the change.
  (That means it should fail if the change is not present.)

The extra things we need to do:

* If you are using mondrian in your apps, contribute test suites
  which exercise your app's functionality. LucidEra have already done
  this (ClearTestSuite) and Thomson-Medstat are working on it. Sure it's
  a lot of work, but it's less work than taking a new release where things
  have stopped working. It's an insurance policy.
* Would it help if we invented the notion of 'QA partner'
  companies? We could run through the test suite of each QA partner before
  each release, as a pre-condition for making that release.
* Write better tests for new features. Complex features require
  complex tests.
* Set up continuous integration (e.g. CruiseControl) to ensure
  that the code always builds and compiles. Thiyagu is already working on
  this.

Bart, on point #3: The feature you are interested in, for working on top
of dynamic databases, is VERY complex to test. We have spent a lot of
time discussing how to implement this feature, but very little time
designing a testing infrastructure. It should be little surprise that
the feature is fragile at this point.

Julian





Pappyn Bart
03-21-2007, 07:20 AM
Julian,

Over the last few weeks, I have been busy thinking about ways to contribute a test suite that works as an insurance policy for me. Since I can only contribute a small part of my time to mondrian, I cannot watch each change to ensure my application is still working. All the extra suggestions you have made are definitely very helpful.

1) Test suite:

My application now has a schema of more than 5000 lines. It uses many (almost all) of the features that are present in mondrian, and it combines them in ways that I am not sure are all covered by existing regression tests. It also does things that are not covered by the foodmart test case (like combining two cubes of identical layout in one virtual cube).

I think I will try to do three things:

* Check against the foodmart database whether the features I use are covered by a test and, if not, add one to the standard regression test suite. This is a hard one, since most features are already tested, but not in every combination with other features. I have noticed in the past that most things that silently fail in my application (without triggering the regression tests) are due to a combination of many factors (virtual cubes, properties, user defined functions, cache turned on/off, complex format expressions,...).

* Create a regression test suite that runs here, using my schema and my database. This, in combination with continuous integration, could alert me if anything is failing.

* Try to contribute a database and schema to test against a dynamic database. While the test will never be able to simulate a real dynamic environment, I will try to change the database in between two queries. I will try out the change listener plug-in, in order to see whether things are flushing and results are correct. It may be a good place to test your new cache control part.

I think I would implement it using a database that is copied before the test (so the test starts with the same view each time) and doing JDBC calls to fill the database. If you have a better idea, please tell me. I am not sure what kind of database I should choose (Access, ...). I noticed that the test suite contains something to load a DB from CSV or XML, but I am not sure how this works or how I can modify the data on the fly.

2) QA partner

Sounds like a good idea

3) Continuous integration

Looking forward to seeing this. I hope the setup for e.g. CruiseControl is made available, so I can use it to set up an environment to test my own cube and regression tests.

4) Dynamic database

Indeed, this is very complex to test. But one needs to start somewhere, so I will try to contribute a test environment (see 1)) that will behave the same way as my application does.

And yes, there are many pitfalls, but due to a specific setup of the database most things are working for me, with some known issues that are acceptable in my kind of application.

Most things I did for 2.3 are about making it possible to:

* Let cubes that do not maintain a cache leave the global cache alone.

* Let cubes that load the cache do so without interfering with concurrent threads (only checking into the global cache when all other threads are done).

* Flush both aggregate and member cache using a plug-in.

Bart


________________________________

From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org] On Behalf Of Julian Hyde
Sent: vrijdag 16 maart 2007 18:04
To: 'Mondrian developer mailing list'
Subject: RE: [Mondrian] Multithreading etc


If changes to mondrian are breaking your application, I sympathize. How can we prevent that from happening? Unless we restrict ourselves to trivial enhancements, the we obviously need to test the new functionality against existing apps, or at least tests which exercise existing apps' requirements.

Ideally, these tests would be in the standard regression suite. But since some tests are too complicated to be in the standard suite, to testing which cannot be done nightly at least has to be done once per release.

We already have a process in place for much of this. For example:

*
Developers ensure that code changes don't break the regression suite.
*
I run the regression suite nightly, in a wide set of configurations, and let the developers know next day if things break.
*
All code changes - bug fixes, enhancements and ad hoc feature - must be accompanied by a regression test which exercises the change. (That means it should fail if the change is not present.)

The extra things we need to do:

*
If you are using mondrian in your apps, contribute test suites which exercise your app's functionality. LucidEra have already done thist (ClearTestSuite) and Thomson-Medstat are working on it. Sure it's a lot of work, but it's less work than taking a new release where things have stopped working. It's an insurance policy.
*
Would it help if we invented the notion of 'QA partner' companies? We could run through the test suite of each QA partner before each release, as a pre-condition for making that release.
*
Write better tests for new features. Complex features require complex tests.
*
Set up continuous integration (E.g. CruiseControl) to ensure that the code always builds and compiles. Thiyagu is already working on this.

Bart, On point #3: The feature you are interested in, for working on top of dynamic databases, is VERY complex to test. We have spent a lot of time discussing how to implement this feature, but very little time designing a testing infrastructure. It should be little surprise that the feature is fragile at this point.

Julian




________________________________

From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org] On Behalf Of Pappyn Bart
Sent: Tuesday, March 13, 2007 1:34 AM
To: Mondrian developer mailing list
Subject: RE: [Mondrian] Multithreading etc


Hi Michael,



I don't see any problems for you to make changes to mondrian. But I have some concerns:



I think the changes you are about to make are quite huge and will have an impact of how mondrian will behave. Since this is the first source contribution you are about to make, I urge you not to check anything into perforce before it is actually working and passing all regression tests.



I think most developers of mondrian have ongoing projects that are using mondrian, I think this is

becoming more and more an important issue.



For me: it must be able to flush aggregates and member cache using the plug-in and cubes not maintaining cache should be able to load their own data, without messing with global cache.



And since a dynamic database cannot easily be simulated in a regression test, I think if you are serious about tackling the read-consistency for near-real-time data, you need a realistic (dynamic) database to test against. And the database must be large enough to be able to see realistic performance. It is also advised to test with virtual cubes, cubes maintaining cache (with aggregate tables) in combination with cubes not maintaining cache (without aggregate tables), shared dimensions and so on...



I released my project 4 weeks ago, not even using the latest version of perforce, since at a given point mondrian-head was completely broken for me. While I know software always has some bugs that need to be patched, things that are not tested and are breaking mondrian should not be checked in. All too often I had to sync with perforce to solve a bug and this ended up in a nightmare, spending most of my time finding out what change was causing mondrian to break.



When mondrian 2.3 will be released, it is most likely that there will be some 2.3.x version containing some patches. I think it must be possible to make those patches without having to drag new huge features along.



Bart


________________________________

From: mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org] On Behalf Of michael bienstein
Sent: vrijdag 9 maart 2007 11:36
To: Mondrian developer mailing list
Subject: [Mondrian] Multithreading etc


I sent this to the list but it gets bounced because I attached the code in a zip file. How do I send code through without checking it in because it is still orthogonal to the codebase?

Michael
----

Well, I have code that works for multi-threading infrastructure so I would like to know if it is worth continuing with this or not.

As for ROLLUP/CUBE my thoughts are:
1) Either we keep the codebase simple by sticking to a standard (SQL:2003) even if this standard is not yet implemented widely and certain databases have better special features than others, or we allow a per-database SQL generation system. The argument for the second makes sense only if the developer resources to write and maintain each dialect come from the database vendor or their community. Mondrian is probably at a stage where such discussions can be undertaken with the database vendors.
2) Architecturally this implies loading multiple Aggregations from one SQL query. That requires a rethink of the way the cell cache loading is done because at the moment an Aggregation is loaded one at a time and in a synchronized block on the Aggregation. Similar concerns have to be dealt with for in-memory rollups. I think that synchronized is too forceful. We need something more like a Lock from java.util.concurrent so we can do tryLock(). Look at the TxLock idea I have in the code I'm attaching.
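
To illustrate the tryLock() point above, here is a minimal sketch. The TxLock class from the attachment is not shown in this thread, so this uses a plain java.util.concurrent ReentrantLock, and AggregationSketch and tryLoad() are hypothetical names, not Mondrian code.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for an Aggregation; not Mondrian's real class.
class AggregationSketch {
    final ReentrantLock lock = new ReentrantLock();
    private int loads = 0;

    /**
     * Loads only if the lock is free right now. Unlike a synchronized block,
     * a busy lock makes the caller return immediately instead of waiting.
     */
    boolean tryLoad() {
        if (!lock.tryLock()) {
            return false; // another thread is loading; caller can do other work
        }
        try {
            loads++; // placeholder for running SQL and filling the cell cache
            return true;
        } finally {
            lock.unlock();
        }
    }

    int loads() {
        return loads;
    }
}
```

A caller that gets false back can move on to another aggregation and come back later, which a synchronized block does not allow.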

As for multi-threading:
I have only written most of the base infrastructure, not the cell loading. To integrate would require a significant amount of work in Mondrian's code to pass all interaction with Mondrian through TxSystem.runWithTx().

Basic concerns are:

1) Threads should be able to share data related to the request across the threads.
2) A Thread should be loaned to a request and returned in a way that is well-nigh fail-safe (i.e. the thread shouldn't keep running if the request fails in some way).
3) We should be able, via a parameter of some sort, to decide NOT to use threads at all.
4) The number of threads should be configurable.
5) There should be an independence from the rest of the code base.
6) We should be able to make use of custom thread pools or use managed thread pools from the application server.
7) Then there is a relatively minor issue with read-consistency for near-real-time data that turns out to be a real headache. This can be done by either: using the transaction semantics of the underlying data store, or modifying all SQL requests and cache interactions with a timestamp and/or transaction id of some sort. E.g. when an MDX request begins it asks the underlying data store for the id of the last completed transaction that modified data and keeps this in a request scope available to all threads. Then it appends "changedTxId <= ${lastTxIdWhenFirstEntered}" to each WHERE clause. If however we use the underlying data store's transactions then we must keep the JDBC Connection open for the duration of the request, reusing it on the same thread for each interaction with that data store.
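
The transaction-id variant in point 7 could look roughly like the sketch below. RequestSnapshot and SnapshotQuery are hypothetical names made up for illustration (this is not Mondrian's SQL generator), and the changedTxId column is assumed to exist on every table queried.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical request-scoped snapshot: the id of the last completed
// transaction, read once when the MDX request begins.
final class RequestSnapshot {
    final long lastTxIdWhenFirstEntered;

    RequestSnapshot(long lastTxId) {
        this.lastTxIdWhenFirstEntered = lastTxId;
    }
}

// Hypothetical minimal query builder that keeps WHERE conditions in a list.
final class SnapshotQuery {
    private final String selectFrom;
    private final List<String> conditions = new ArrayList<>();

    SnapshotQuery(String selectFrom) {
        this.selectFrom = selectFrom;
    }

    SnapshotQuery where(String condition) {
        conditions.add(condition);
        return this;
    }

    /** Renders the SQL, always appending the snapshot predicate last. */
    String toSql(RequestSnapshot snapshot) {
        List<String> all = new ArrayList<>(conditions);
        all.add("changedTxId <= " + snapshot.lastTxIdWhenFirstEntered);
        return selectFrom + " WHERE " + String.join(" AND ", all);
    }
}
```

Because every thread serving the request renders its SQL against the same RequestSnapshot, they all see the same cut-off regardless of when each statement actually runs.
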
Now, I think that the best way to take advantage of multiple threads in the storage system is NOT to launch multiple SQL statements against the same star schema for different aggregations, but rather to use partitioning of data. That is, to segment the cell data (and maybe dimension data) based on values of certain columns. For example, year<2007 and year=2007 in two different partitions. This can be introduced slowly by simply making a RolapStar one Partition for the moment. Having said that, aggregation tables are also a type of Partition and hitting two of them at once should be quite easy.
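
As a rough sketch of the partitioning idea, the snippet below sums a measure with one task per partition predicate and then combines the partial results. PartitionSketch is a hypothetical name, and in-memory rows of {year, amount} stand in for what would really be one SQL statement per partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

class PartitionSketch {
    /** Sums the amount of rows {year, amount}, one task per partition. */
    static long sum(int[][] rows, List<IntPredicate> partitions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (IntPredicate partition : partitions) {
                partials.add(pool.submit(() -> {
                    long subtotal = 0;
                    for (int[] row : rows) {
                        if (partition.test(row[0])) {
                            subtotal += row[1];
                        }
                    }
                    return subtotal;
                }));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // wait for each partition in turn
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow(); // do not leave threads running after the call
        }
    }
}
```

With partitions year<2007 and year==2007 each row is counted exactly once, so the combined total matches an unpartitioned sum as long as the predicates are disjoint and cover all rows.
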
So the design I am introducing has the following features:
1) A scope for "request" or "interaction" that is larger than the Thread that begins it. Since this is similar to a transaction I've called it a Tx. See the mondrian.tx package. Each sub-system in Mondrian can enlist a representation of itself in the Tx.
2) Break up the different tasks performed into Task objects that can be run potentially in parallel. Allow a set of Tasks to be tied to the same Thread so that the same JDBC Connection can be used for all of them for read-consistency and cleaned up at the end of the Tx. This is done declaratively so the implementation can be changed easily. The implementation can also ensure that the J2EE context is passed onto separate threads (JNDI, context class loader etc).
3) A system of fail-quick locks at the Tx scope rather than just Thread scope.
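
A minimal sketch of features 1) and 2), under loud assumptions: the real mondrian.tx package and TxSystem.runWithTx() are not shown in this thread, so TxSketch, TxRunnerSketch and TxTask below are hypothetical names and this signature is invented for illustration. It shows a request scope shared across threads, tasks that may run in parallel, and a switch to use no threads at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical request scope, visible to every thread serving one request.
final class TxSketch {
    final Map<String, Object> attributes = new ConcurrentHashMap<>();
}

final class TxRunnerSketch {
    /** A unit of work that receives the shared request scope. */
    interface TxTask<T> {
        T run(TxSketch tx) throws Exception;
    }

    /** Runs all tasks in one Tx; threads <= 0 means run inline. */
    static <T> List<T> runWithTx(List<TxTask<T>> tasks, int threads) {
        TxSketch tx = new TxSketch();
        List<T> results = new ArrayList<>();
        if (threads <= 0) {
            for (TxTask<T> task : tasks) {
                results.add(call(task, tx)); // caller's thread only
            }
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (TxTask<T> task : tasks) {
                Callable<T> work = () -> call(task, tx);
                futures.add(pool.submit(work));
            }
            for (Future<T> future : futures) {
                results.add(future.get()); // results keep task order
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("request failed", e);
        } finally {
            pool.shutdownNow(); // threads never outlive the request
        }
    }

    private static <T> T call(TxTask<T> task, TxSketch tx) {
        try {
            return task.run(tx);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A real implementation would borrow threads from a configurable or application-server-managed pool rather than creating one per request; the per-call pool here only keeps the sketch self-contained.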

If this is worth pursuing as a design for the next version then good. If not I'll stop now.

Michael


Laurent Valdes
03-21-2007, 08:50 AM
Would it be possible to create another SVN branch for the Mondrian
multithreaded version ?
Afterwards, it would be possible to merge it with the main branch ;)

Have a good day,

Laurent

2007/3/21, Pappyn Bart <Bart.Pappyn (AT) vandewiele (DOT) com>:
>
> Julian,
>
> The last few weeks, I have been busy thinking about ways to contribute a
> test suite that works as an insurance policy for me. Since I can only
> contribute a small part of my time to mondrian, I cannot watch each change
> to ensure my application is still working. All extra suggestions you have
> made are definitely very helpful.
>
> 1) Test suite :
>
> My application now has a schema of more than 5000 lines. It uses many
> (almost all) features that are present in mondrian. And it combines them
> in ways that I am not sure are all covered by existing regression
> tests. It also does things that are not covered by the foodmart test case
> (like combining two cubes of identical layout in one virtual cube).
>
> I think I will try to do three things :
>
> - Try to check against the foodmart database whether the features I
> use are supported by a test; if not, add one to the standard regression test
> suite. This is a hard one, since most features are already tested, but not
> in every combination with other features. I noticed in the past that most things
> that silently fail in my application (without triggering the regression
> tests) are due to a combination of many factors (virtual cubes,
> properties, user defined functions, cache turned on/off, complex format
> expressions,...)
> - Create a regression test suite that runs here, using my schema and
> my database. This in combination with continuous integration could alert me
> if anything is failing.
> - Try to contribute a database and schema to test against a dynamic
> database. While the test will never be able to simulate a real dynamic
> environment, I will try to change the database in between two queries. I
> will try out the change listener plug-in, in order to see if things are
> flushing and results are correct. Maybe a good place to test your new cache
> control part.
>
> I think I would implement it using a database that is copied before
> the test (so the test starts with the same view each time) and doing JDBC
> calls to fill the database. If you have a better idea, please tell me. I
> am not sure what kind of database I should choose (access, ...). I noticed
> that the test suite contains something to load a DB from csv or xml, but I
> am not sure how this works or how I can modify the data on the fly.
>
> 2) QA partner
>
> Sounds like a good idea
>
> 3) Continuous integration
>
> Looking forward to seeing this. I hope the setup environment for e.g.
> CruiseControl is made available, so I can use it to set up an environment
> to test my own cube and regression tests.
>
> 4) Dynamic database
>
> Indeed, this is very complex to test. But one needs to start somewhere,
> so I will try to contribute a test environment (see 1)) that will behave the
> same way as my application does.
>
> And yes, there are many pitfalls, but due to a specific setup of the
> database most things are working for me, with some known issues that are
> acceptable in my kind of application.
>
> Most things I did for 2.3 are about making it possible to :
>
> * Let cubes that do not maintain cache avoid messing with the global cache.
>
> * Let cubes that are loading cache avoid interfering with concurrent threads
> (only check into the global cache when all other threads are done).
>
> * Be able to flush both the aggregate and member caches using a plug-in.
>
> Bart
>
> ------------------------------
> *From:* mondrian-bounces (AT) pentaho (DOT) org [mailto:mondrian-bounces (AT) pentaho (DOT) org]
> *On Behalf Of *Julian Hyde
> *Sent:* vrijdag 16 maart 2007 18:04
> *To:* 'Mondrian developer mailing list'
> *Subject:* RE: [Mondrian] Multithreading etc
>
> If changes to mondrian are breaking your application, I sympathize. How
> can we prevent that from happening? Unless we restrict ourselves to trivial
> enhancements, then we obviously need to test the new functionality against
> existing apps, or at least tests which exercise existing apps' requirements.
>
> Ideally, these tests would be in the standard regression suite. But since
> some tests are too complicated to be in the standard suite, testing which
> cannot be done nightly at least has to be done once per release.
>
> We already have a process in place for much of this. For example:
>
> - Developers ensure that code changes don't break the regression
> suite.
> - I run the regression suite nightly, in a wide set of
> configurations, and let the developers know next day if things break.
> - All code changes - bug fixes, enhancements and ad hoc features -
> must be accompanied by a regression test which exercises the change. (That
> means it should fail if the change is not present.)
>
> The extra things we need to do:
>
> - If you are using mondrian in your apps, contribute test suites
> which exercise your app's functionality. LucidEra have already done this
> (ClearTestSuite) and Thomson-Medstat are working on it. Sure it's a lot of
> work, but it's less work than taking a new release where things have stopped
> working. It's an insurance policy.
> - Would it help if we invented the notion of 'QA partner' companies?
> We could run through the test suite of each QA partner before each release,
> as a pre-condition for making that release.
> - Write better tests for new features. Complex features require
> complex tests.
> - Set up continuous integration (E.g. CruiseControl) to ensure that
> the code always builds and compiles. Thiyagu is already working on this.
>
> Bart, On point #3: The feature you are interested in, for working on top
> of dynamic databases, is VERY complex to test. We have spent a lot of time
> discussing how to implement this feature, but very little time designing a
> testing infrastructure. It should be little surprise that the feature is
> fragile at this point.
>
> Julian
>


--
We are drowning in information, but starved for knowledge.
«Germain» @<http://www.le-valdo.com>
_______________________________________________
Mondrian mailing list
Mondrian (AT) pentaho (DOT) org
http://lists.pentaho.org/mailman/listinfo/mondrian

Thiyagu Palanisamy
04-02-2007, 09:00 AM
Hi Michael, Julian,

Where are we with the multithreaded SQL execution?

Michael, have you made any further progress on this?


- Thiyagu
