PDA

View Full Version : OLAP'ing (bonsai) CVS'db



maxcsaucdk
02-22-2003, 12:56 AM
Hi!
I've just discovered Mondrian (the only open source OLAP project I could find !), and I'm in the process of learning to use it - and for that purpose I've chosen to try to OLAP'ing our MySQL database that contains every add,delete,commit,tag,branching of our CVS (simply using ViewCVS and it's database integration).
As I'm quite a newbie regarding "transforming" a standard OLTP relational database schema to a OLAP'ed version I would like to get some hints and pointer on how you would attack this one :)
The CVSdb schema goes as follows (the second "group" is an example of data for this table):
people(id,who) (1, "max")
dirs(id,dir) (10, "devel/mondrian")
files(id,file) (4, "build.xml")
descs(id,desc, hash) (7, "Added loging", 13)
repositories(id, repository) (1, "/home/cvsroot")
branches(id, branch) (2, "release-1_1")
All of the above I see as simple lookup tables, the core data is in the checkins table which look like this:
checkins(type, ci_when, whoid, repositoryid, dirid, fileid, revision, stickytag, branchid, addedlines, removedlines, descid)
Example:
("Commit", "2002-06-12 12:07:01", 1, 1, 10, 4, "1.2", NULL, 2, 23, 5, 7)
This describes that at "2002-06-12 12:07:01", max commited devel/mondrian/build.xml with tag 1.2 and he added 23 lines and removed 5 - and the comment was "Added loging".
I guess the lookup tables could be used out of the box in Mondrian + I should create a datatime table to be able to use the ci_when usefully, but which kind of new tables/extracts of the checkins table would you make for such a "beast" ? :)
Do you even have a suggestion for the xml file needed for mondrian in such a case ? (small examples is enough - i'll complete the rest if I can :)
Appreciate your comments.

maxcsaucdk
02-22-2003, 02:07 AM
A more specific question:
How would you model the recursive nature of the FileSystem ?
One way is to add a recursive foregin key "parentdir" to the dirs table, but can mondrian support such an recursive relationship with no definite levels (but still with a single fixpoint (the root "/")).
(I saw a reference to MemberSource in the docs which one could implement for such a thing but I could not find that interface in the javadocs :(

jhyde
02-22-2003, 11:14 AM
I'll answer both of your questions in one posting.
Looks like a good multidimensional schema. But it's always good to double-check by looking at things from the end-user's perspective: figure out the top 3 questions you'd like the database to answer, figure out how it should appear in the pivot table, and design your cube(s) accordingly.
You might discover that you need other hierarchies, like 'file type' (derived from the file suffix), and 'bug' (derived by mining descriptions for text like 'bug 12345').
Yes, you should create a time dimension. (At some point I will add a time dimension generator to Mondrian, which will generate the necessary members in memory. But it's not there yet.)
Even though your timestamps are at second granularity, your time dimension should probably be at the daily level. (If one of your pressing questions is 'At what hour of the day do most checkins occur?', you'll need to do extra work.)
As regards writing schema XML files, did you take a look at
http://apoptosis.dyndns.org:8080/open/mondrian/doc/schema.html
? There are some holes in it, but I am to answer the common questions.
Regarding file hierarchies, what you're asking for is called a parent-child hierarchy, which we haven't implemented yet. Right now, mondrian only supports hierarchies with a fixed number of levels. You'll have make do with a two level hierarchy (file, directory) for now.
By the way, how are you going to connect the 'files' table to 'dirs'? You seem to be missing a foreign key.
Please let me know how you get on with this project, and if I can be of any more help. It would be great to post your CVSdb.xml alongside FoodMart.xml, when you have it working.

maxcsaucdk
02-22-2003, 07:03 PM
The schema is actually just the one bonsai and ViewCVS is running on - so if I could go about Data mining this schema without to many "revolutinizing" changes it would certainly make for an intersting project for the opensource people using CVS (IMHO :)
Is the basis for creating a "time-dimension-generator" in place ? Do you have an "on-the-fly"-dimension-generator interface I could implement and then have mondrian support it ?
My timedimension I would probably like to have at time level (maybe even minutes, but that might be stretching it ...), but isn't that just done by having a very large time dimension table with columns defining which "period" they are in ? (e.g. "morning", "noon", "evening", "night", "Q1", "Q2", etc. etc.)
I did take a look at the schema.html page and it kicked me in the right direction - so, i'll start testing as soon as posible :)
A two-level hiearchy is probably not enough. Or at least the file in dir hiearchy is to low a detail. I'll probably just realize a dir in some "toplevel dir" to be able to categorize the CVS checkins in smaller pieces. This is needed at our place since we got One CVS repository for all our projects and they are just "sepearted" by being in different toplevel dirs.
And no, i'm not missing a foreign key in the files table. The files only contains unique file names and can thus appear in multiple directories. e.g. "build.xml" is only present in the files table once.
The connection between dirs and files is done in the checkins table where it points to both a file and a dir.
And yes, I'll gladly provide you with my CVSdb.xml when I have something to provide :)
And thanks for your comments!

jhyde
02-24-2003, 09:28 AM
The schema is actually just the one bonsai and
> ViewCVS is running on - so if I could go about
> Data mining this schema without to many
> "revolutinizing" changes it would certainly make
> for an intersting project for the opensource
> people using CVS (IMHO :)
Absolutely. People get much more excited about "their own" data than about some hypothetical supermarket. :)
> Is the basis for creating a "time-dimension-generator"
> in place ? Do you have an "on-the-fly"-dimension-generator
> interface I could implement and then have
> mondrian support it ?
The interface is mondrian.rolap.MemberReader. The idea is that you write a class which implements this interface, and define a dimension with this class as its reader. For example, see the dimension "Has bought dairy", in FoodMart.xml. This example works, but isn't very substantial.
You're welcome to try building a time dimension generator, and I'll give you all the help I can.
If you decide to go for a real time dimension table, you may have practical problems if you resolve down to the minute: you would have a dimension table ~2 orders of magnitude larger than your fact table, and that is unusual, to say the least. Better to create a 'day_number' column in the fact table, join to a time dimension table at day granularity, and make the final level of the time dimension the timestamp in the fact table.
> A two-level hiearchy is probably not enough. Or
> at least the file in dir hiearchy is to low a detail.
> I'll probably just realize a dir in some "toplevel
> dir" to be able to categorize the CVS checkins in
> smaller pieces. This is needed at our place since
> we got One CVS repository for all our projects
> and they are just "sepearted" by being in different
> toplevel dirs.
That would work ok. I hope we'll have parent-child hierarchies in the near future, so then you can move to that.

jhyde
04-10-2003, 08:42 PM
Don't know whether I mentioned... parent-child hierarchies were implemented in change 406 (2003/3/29).