Dear Kettle friends,
For a while now we’ve been getting requests from users to support a system called “Data Federation” a.k.a. a “Virtual Database”. Even though it has been possible for a while to create reports on top of a Kettle transformation, this system could hardly be considered a virtual anything since the Pentaho reporting engine runs the transformation on the spot to get to the data.
The problem? A real virtual database has to understand SQL, and a data transformation engine typically doesn’t. It’s usually great at generating SQL; parsing it, not so much.
So after a lot of consideration and hesitation (you don’t really want to spend too much time in the neighborhood of SQL/JDBC code unless you want to go insane) we decided to build this anyway, mainly because folks kept asking about it and because it’s a nice challenge.
The ultimate goal is to create a virtual database that is clever enough to understand the SQL that the Mondrian ROLAP engine generates.
Here is the architecture we need:

In other words, here’s what the user should be able to do:

  • You should be able to create any kind of transformation that generates rows of data, coming from any sort of database.
  • It should be possible to use any kind of software that understands the JDBC and SQL standards.
  • It should have a minimal set of dependencies as far as libraries are concerned.
  • Data should be streamed, to allow massive amounts of data to be passed from server to client.
  • The service should understand basic SQL, including advanced WHERE, GROUP BY, ORDER BY and HAVING clauses (anything an OLAP engine needs).
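To give an idea of what that last point means in practice, here is the kind of single-table query the service should be able to handle (the table and column names below are made up for illustration; “Service” stands in for a data service name):

```sql
SELECT region, SUM(sales) AS total_sales
FROM "Service"
WHERE year = 2012
GROUP BY region
HAVING SUM(sales) > 10000
ORDER BY total_sales DESC
```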

Not for the first time, I thought to myself (and told the patient ##pentaho community on IRC): “This can’t be that hard!” After all, you only need to parse SQL that gets data from a single (virtual) database table, since joining and so on can be done in the service transformation.
So I started pounding on my keyboard for a few weeks (rudely interrupted by a week of vacation in France) and a solution is now more or less ready for more testing…
You can read all details about it on the following wiki page:
The cool thing about Kettle data federation is that anyone can test it in half an hour by following these simple steps:

  • Download a recent 5.0-M1 development build from our CI system (any remaining failed unit tests are harmless, but they are an indication that you are in fact dealing with unstable software in development)
  • Create a simple transformation (in .ktr file format) reading from a spreadsheet or some other nice and simple data source
  • Create a Carte configuration file, as described in the Server Configuration chapter on the driver page, specifying:
    • the name of the service (for example “Service”)
    • the transformation file name
    • the name of the step that will deliver the data

  • Start Carte
  • Finally, configure your client as indicated on the driver page.
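As a sketch of what the Carte configuration might look like, the fragment below declares one data service; the exact element names and options are described in the Server Configuration chapter on the driver page, and the file path, step name and slave server details here are placeholders:

```xml
<slave_config>
  <!-- Data services exposed by this Carte instance -->
  <services>
    <service>
      <name>Service</name>
      <filename>/path/to/service.ktr</filename>
      <service_step>Output</service_step>
    </service>
  </services>
  <!-- Regular Carte slave server definition -->
  <slaveserver>
    <name>slave1</name>
    <hostname>localhost</hostname>
    <port>8080</port>
  </slaveserver>
</slave_config>
```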

For example, I created a transformation to test with that delivered some simple static data:

I have been testing with Mondrian on the EE BI Server 4.1.0-GA; as indicated on the driver page, I simply replaced all the Kettle jar files in the server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/ folder.
Then you can do everything from inside the user interface.
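Of course, the BI Server is just one possible client: any JDBC-aware tool can talk to the service. As a minimal sketch, a plain Java client might look like the following; the driver class name and URL format are those of the thin Kettle JDBC driver described on the driver page, while the host, port, default “cluster” credentials and the service name “Service” are assumptions from the setup above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThinDriverExample {

    // The thin driver uses a "jdbc:pdi://host:port/kettle" style URL.
    static String buildUrl(String host, int port) {
        return "jdbc:pdi://" + host + ":" + port + "/kettle";
    }

    public static void main(String[] args) throws Exception {
        // Register the thin driver shipped with the Kettle 5.0 jars.
        Class.forName("org.pentaho.di.core.jdbc.ThinDriver");
        String url = buildUrl("localhost", 8080);
        // "cluster"/"cluster" are Carte's default credentials; change as needed.
        try (Connection conn = DriverManager.getConnection(url, "cluster", "cluster");
             Statement stmt = conn.createStatement();
             // "Service" is the virtual table exposed by the data service.
             ResultSet rs = stmt.executeQuery("SELECT * FROM \"Service\"")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```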

Create the data source database connection:

Follow the data source wizard, select “Reporting and Analyses” at the bottom:

Select one table only and specify that table as the fact table:

Then you are just about ready to start the reporting and analysis action. Simply keep the default model (you can customize it later)…

You are now ready to create interactive reports…

… and analyzer views:

So get started and make sure to give us a lot of feedback, successes and failures alike. You can comment on the driver page or on the corresponding JIRA case, PDI-8231.
Future plans include:

  • Offer easy integration with the unified repository for our EE users, so that they won’t have to enter XML or restart a server when they want to add or change the services list (arguably an important requirement for anyone seriously considering running this in production).
  • Implement service and SQL data caching on the server.
  • Allow writable services and “insert into” statements on the JDBC client.