Data federation

08-02-2012, 07:31 PM
Dear Kettle friends,
For a while now we’ve been getting requests from users to support a system called “Data Federation” a.k.a. a “Virtual Database”. Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs the transformation on the spot to get to the data.
The problem? A real virtual database has to understand SQL, and a data transformation engine typically doesn’t: it’s usually great at generating SQL, but not so great at parsing it.
So after a lot of consideration and hesitation (you don’t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it’s a nice challenge.
The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.
Here is the architecture we need:
In other words, here’s what the user should be able to do:

He/she should be able to create any kind of transformation that generates rows of data, coming from any sort of data source.
It should be possible to use any kind of software that understands the JDBC and SQL standards.
It should have a minimal set of library dependencies.
Data should be streamed, to allow massive amounts of data to be passed from server to client.
The server should understand basic SQL, including advanced WHERE, GROUP BY, ORDER BY and HAVING clauses (anything that an OLAP engine needs).

Not for the first time, I thought to myself (and told the patient ##pentaho community on IRC): “This can’t be that hard!!”. After all, you only need to parse SQL that gets data from a single (virtual) database table, since joining and so on can be done in the service transformation.
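To see why single-table SQL keeps the problem manageable, here is a minimal conceptual sketch in plain Python (this is NOT Kettle’s actual code, and all names and data are made up for illustration): the service transformation streams flat, pre-joined rows, so the SQL layer only has to apply WHERE, GROUP BY and HAVING logic on top of them.

```python
# Conceptual sketch only -- not Kettle's implementation. It shows why SQL over a
# single virtual table is tractable: joins happen inside the service
# transformation, so the SQL layer merely filters and groups flat rows.
from itertools import groupby
from operator import itemgetter


def run_single_table_query(rows, where, group_key, aggregate, having=None):
    """Evaluate WHERE -> GROUP BY -> HAVING over rows streamed from a service step."""
    filtered = [r for r in rows if where(r)]
    filtered.sort(key=itemgetter(group_key))          # groupby needs sorted input
    groups = [aggregate(key, list(grp))
              for key, grp in groupby(filtered, key=itemgetter(group_key))]
    return [g for g in groups if having is None or having(g)]


# Rows as a service step might deliver them (made-up data):
rows = [
    {"region": "EMEA", "sales": 100},
    {"region": "APAC", "sales": 40},
    {"region": "EMEA", "sales": 60},
]

# Roughly: SELECT region, SUM(sales) FROM Service WHERE sales > 50 GROUP BY region
result = run_single_table_query(
    rows,
    where=lambda r: r["sales"] > 50,
    group_key="region",
    aggregate=lambda key, grp: {"region": key,
                                "sum_sales": sum(r["sales"] for r in grp)},
)
print(result)  # [{'region': 'EMEA', 'sum_sales': 160}]
```

ORDER BY would simply be one more sort over the grouped result; the point is that none of this requires a full relational engine once joins are out of the picture.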
So I started pounding on my keyboard for a few weeks (rudely interrupted by a week of vacation in France) and a solution is now more or less ready for more testing…
You can read all the details on the following wiki page: http://wiki.pentaho.com/display/EAI/The+Thin+Kettle+JDBC+driver

The cool thing about Kettle data federation is that anyone can test it in half an hour by following a few simple steps:

Download a recent 5.0-M1 development build from our CI system (http://ci.pentaho.com/job/Kettle/). (Any remaining failed unit tests are harmless, but they are a reminder that you are dealing with unstable, in-development software.)
Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source
Create a Carte configuration file as described in the Server Configuration chapter on the driver page (http://wiki.pentaho.com/display/EAI/The+Thin+Kettle+JDBC+driver) specifying

The name of the service (for example “Service”)
The transformation file name
The name of the step that will deliver the data

Start Carte with this configuration file
Configure your client as indicated on the driver page.
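For reference, the Carte configuration file described above might look roughly like this. This is a sketch based on the Server Configuration chapter of the driver page; the element names, file path, port and step name here are illustrative and should be checked against that page:

```xml
<slave_config>
  <!-- The Carte slave server itself -->
  <slaveserver>
    <name>slave1</name>
    <hostname>localhost</hostname>
    <port>8082</port>
  </slaveserver>
  <!-- Transformations exposed as virtual database tables -->
  <services>
    <service>
      <name>Service</name>                      <!-- the virtual table name -->
      <filename>/path/to/service.ktr</filename> <!-- the transformation file -->
      <service_step>Output</service_step>       <!-- step that delivers the data -->
    </service>
  </services>
</slave_config>
```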

For example, I created a test transformation that delivers some simple static data:
I have been testing with Mondrian on the EE BI Server 4.1.0-GA and, as indicated on the driver page, simply replaced all the Kettle jar files in the server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/ folder.
Then you can do everything from inside the user interface.

Create the data source database connection:
Follow the data source wizard, select “Reporting and Analyses” at the bottom:
Select one table only and specify that table as the fact table:
Then you are about ready to start the reporting & analyses action. Simply keep the default model (you can customize it later)…
You are now ready to create interactive reports…
… and analyzer views:
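Of course, you don’t need the BI server at all: any JDBC-capable client can query the service. As one hedged sketch, here is how it could look from Python through the third-party JayDeBeApi bridge. The driver class name and jdbc:pdi URL pattern are taken from the driver wiki page; the host, port, credentials, jar path and service name below are assumptions for illustration and must be adjusted to your own Carte configuration.

```python
# Sketch of querying a Kettle data service over JDBC from Python, using the
# third-party JayDeBeApi bridge (pip install jaydebeapi; needs a JVM).
# Host, port, credentials, jar paths and the "Service" table name are
# illustrative assumptions -- check them against your Carte configuration.

def thin_jdbc_url(host, port, database="kettle"):
    """Build the thin-driver JDBC URL, e.g. jdbc:pdi://localhost:8082/kettle."""
    return f"jdbc:pdi://{host}:{port}/{database}"


def query_service(host, port, sql, kettle_jars, user="cluster", password="cluster"):
    """Run a SQL statement against the virtual database served by Carte."""
    import jaydebeapi  # imported lazily: only needed when actually connecting

    conn = jaydebeapi.connect(
        "org.pentaho.di.core.jdbc.ThinDriver",  # thin Kettle JDBC driver class
        thin_jdbc_url(host, port),
        [user, password],                       # default Carte credentials
        kettle_jars,                            # path(s) to the Kettle jar files
    )
    try:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()


# Example (requires a running Carte server and the Kettle jars on disk):
# rows = query_service("localhost", 8082, 'SELECT * FROM "Service"',
#                      ["/path/to/kettle-core.jar"])
```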
So get started and make sure to give us plenty of feedback: your success stories as well as your failures. You can comment on the driver page or in the corresponding JIRA case PDI-8231 (http://jira.pentaho.com/browse/PDI-8231).
The future plans are:

Offer easy integration with the unified repository for our EE users, so that they won’t have to edit XML or restart a server when they want to add or change services. (Arguably an important requirement for anyone seriously considering running this in production.)
Implement service and SQL data caching on the server.
Allow writable services and “INSERT INTO” statements from the JDBC client.


More... (http://www.ibridge.be/?p=210)