Dear Kettlers,

A couple of years ago I wrote a post about key/value tables and how they can ruin the day of any honest person who wants to create BI solutions. The obvious advice I gave back then was not to use those tables in the first place if you’re serious about a BI solution.

However, there are occasions where you need to query such a source system and get some reports going on it. Let’s take a look at an example:

mysql> select * from person;
+----+-------+----------+
| id | name  | lastname |
+----+-------+----------+
|  1 | Lex   | Luthor   |
|  2 | Clark | Kent     |
|  3 | Lois  | Lane     |
+----+-------+----------+
3 rows in set (0.00 sec)

mysql> select * from person_attribute;
+----+-----------+---------------+------------+
| id | person_id | attr_key      | attr_value |
+----+-----------+---------------+------------+
|  1 |         1 | GENDER        | M          |
|  2 |         2 | GENDER        | M          |
|  3 |         3 | GENDER        | F          |
|  4 |         1 | NATURE        | EVIL       |
|  6 |         2 | NATURE        | KIND       |
|  7 |         3 | NATURE        | KIND       |
|  8 |         1 | PEOPLE_SAVED  | 0          |
|  9 |         2 | PEOPLE_SAVED  | 394239324  |
| 10 |         3 | PEOPLE_SAVED  | 263403     |
| 11 |         1 | PEOPLE_HARMED | 983439     |
| 12 |         2 | PEOPLE_HARMED | 0          |
| 13 |         3 | PEOPLE_HARMED | 29         |
+----+-----------+---------------+------------+
12 rows in set (0.00 sec)

mysql> select * from attribute_description;
+----+---------------+-------------------------+-----------+
| id | attr_key      | attr_description        | attr_type |
+----+---------------+-------------------------+-----------+
|  1 | GENDER        | Gender (M/F)            | String    |
|  2 | NATURE        | Nature                  | String    |
|  3 | PEOPLE_SAVED  | Number of people saved  | Integer   |
|  4 | PEOPLE_HARMED | Number of people harmed | Integer   |
+----+---------------+-------------------------+-----------+

It was Pentaho partner OpenBI that gave me the use case, but the data is obviously fictive. In the real-world case, the “person_attribute” table contains over one hundred million rows of data.

The challenge is that you want to have the following columns in your query: person ID, name, last name, gender, nature, number of people saved and number of people harmed. However, it has to be implemented in such a way that if a new attribute is added to the “attribute_description” table, a new column automatically appears in the output. It has to be 100% maintenance free.
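To appreciate what “maintenance free” buys you, here is a minimal sketch of the static pivot query you would otherwise have to write and keep editing by hand against this schema (the output column names gender, nature, people_saved and people_harmed are just illustrative):

-- Static pivot sketch: every attribute is hard-coded as its own column
select p.id,
       p.name,
       p.lastname,
       max(case when a.attr_key = 'GENDER'        then a.attr_value end)                   as gender,
       max(case when a.attr_key = 'NATURE'        then a.attr_value end)                   as nature,
       max(case when a.attr_key = 'PEOPLE_SAVED'  then cast(a.attr_value as unsigned) end) as people_saved,
       max(case when a.attr_key = 'PEOPLE_HARMED' then cast(a.attr_value as unsigned) end) as people_harmed
  from person p
  left join person_attribute a on a.person_id = p.id
 group by p.id, p.name, p.lastname;

Every new row in “attribute_description” means another hand-written line in this query, which is exactly the maintenance we want to avoid.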

Fortunately, we’ve recently done some interesting work on dynamic transformations that allows us to create the following transformation:



This transformation reads the metadata from the “attribute_description” table and injects that into the following template:



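Conceptually, the metadata that drives the injection is nothing more than the result of a query along these lines (a sketch of the idea; in the driving transformation above it is presumably read by a table input step):

-- Read the attribute metadata that will be injected into the template
select attr_key, attr_description, attr_type
  from attribute_description
 order by id;

Each row that comes back turns into one de-normalized column in the template.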
The Inject step is configured with the list of columns to de-normalize and instructed to take the resulting output rows from that specific step:



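To give you an idea of what the template itself is chewing on: its input is simply the joined key/value stream, sorted on the grouping key, which the de-normalizing step then pivots using the injected column list. A sketch of that input (the real query lives inside the template):

-- One row per person/attribute pair, sorted so the de-normalizing step can group on id
select p.id, p.name, p.lastname, a.attr_key, a.attr_value
  from person p
  left join person_attribute a on a.person_id = p.id
 order by p.id;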
As you can see in the next screenshot, taken from a preview of the data, all defined attributes are de-normalized correctly with the appropriate data types:



Now we can add an attribute to see if our transformation takes it into account:

mysql> insert into attribute_description(attr_key, attr_description, attr_type) values('NUM_FRIENDS', 'Number of friends', 'Integer');
Query OK, 1 row affected (0.06 sec)

mysql> insert into person_attribute(person_id, attr_key, attr_value) values(1, 'NUM_FRIENDS', '5');
Query OK, 1 row affected (0.00 sec)

mysql> insert into person_attribute(person_id, attr_key, attr_value) values(2, 'NUM_FRIENDS', '34');
Query OK, 1 row affected (0.00 sec)

mysql> insert into person_attribute(person_id, attr_key, attr_value) values(3, 'NUM_FRIENDS', '12');
Query OK, 1 row affected (0.00 sec)
After updating our tables like this, we can preview the exact same step again. As you can tell from the following preview screenshot, the extra column is automatically picked up:



Now obviously, from a maintenance perspective this helps a great deal when you are building a data warehouse. However, it also allows us to start thinking about reporting and analysis on these kinds of dynamic data sources. How can we explain to Pentaho Metadata that attributes should be picked up automatically? How can we refresh Mondrian schema information on the fly when a new attribute is introduced? While the answers to those questions are not yet 100% clear, I’m sure it’s going to be an interesting discussion.

Until next time,

Matt

P.S. Please note you’ll need a very recent build of PDI 4.2.0-M2 to be able to pass along data from a dynamic transformation.


