PDA

View Full Version : Parse nasty XLS with dynamic ETL



MattCasters
02-25-2011, 02:00 PM
Dear Kettle friends,

Last year, right after the summer in version 4.1 of Pentaho Data Integration, we introduced the notion of dynamically inserted ETL metadata (Youtube video here (http://www.youtube.com/watch?v=EjzgzOanq1o)). Since then we received a lot of positive feedback on this functionality which encouraged me to extend it to a few more steps. Already with support for “CSV Input” and “Select Values” we could do a lot of dynamic things. However, we can clearly do a lot better by extending our initiative to a few more steps: “Microsoft Excel Input” (which can also read ODS by the way), “Row Normalizer” and “Row De-normalizer”.

Below I’ll describe an actual (obfuscated) example that you will probably recognize as it is equally hideous as simple in it’s horrible complexity.

Take a look at this file:

http://www.ibridge.be/images/dynamic-xls/nasty-xls.png

Let’s assume that this spreadsheet describes the number of products sold on a given date. Spreadsheets like these are usually automatically generated by some kind of pivoting program. Because of that they contains a varying number of columns with a dimension value in the column header. In our sample we have one column for each date since the beginning of this year. Not all dates are listed however, that would too simple, only days on which products were actually sold are listed. In our sample that means: no weekend days are present in the spreadsheet. In short: this spreadsheet will look different all year round.

Surely, this can’t be too hard to read in by an ETL tool, right? You just want to load this data into a database somewhere and be done with it.

Now hang on. Databases don’t support tables with varying number of columns. So we actually want to only get 3 columns from this spreadsheet: The product code, the date of sales and the number of goods sold (the metric).

In order to do that, we need to actually know the layout of the spreadsheet before we do the ETL. We also need to un-pivot or normalize the data. Again, to be able to do that, we need to know the exact layout of the file.

Once at this point any serious ETL consultant will be doing one of the following:



Convince the customer that this is a “bad file” and that the un-pivoted source data is needed. Days will be spend convincing and acquiring this data… when in fact it’s already there.
Ask the file to be delivered in CSV file format so that some clever Perl-Python-JavaScript-Ruby-awk-bash script can be applied to the problem… Days will be spent by the customer to convert the file and weeks to get the script right, to debug and maintain that. The ones having heard of the PDI “Split Field to rows” step might know (from our forum) how to solve the problem by reading the whole line and by splitting it into un-pivoted form with a wad of JavaScript.
A seniour ETL consultant (with evil tendencies) will try to convince the customer that the requirement is “off scope” or shift it to the next “iteration”. It’s probably something along those lines.
Ignore the problem, it’s clearly too hard to do.
Pull out hair (not an option in my case)
Any of the above.

Now, most open source software start with this realization: “There has got to be a better way!” This is no different…

If only we could somehow determine the layout of the file automatically, extract the date for each column and then inject that information into another transformation. Mmm…

http://www.ibridge.be/images/dynamic-xls/top-level-job.pngThis job describes the 2 main transformations used to solve our pesky little problem. Both consist of a transformation that feeds ETL metadata into a transformation template.

To determine the layout of the file we need to read the header of the spreadsheet file. Here’s how that should work:

http://www.ibridge.be/images/dynamic-xls/determine-layout-template.png

What this transformation (template) does is read the PROD_CODE column from the spreadsheet as well as a fixed number of columns from the first line of the spreadsheet. Then the data is normalized. Finally we only need to retain those columns where there is a non-null date

The only problem with this transformation is that the number of columns is not dynamic so we need to either update it every day or you need to type in a a lot of columns up-front, say a thousand or so. Since we don’t like to type we want to generate that ETL metadata dynamically. You could try to manually “generate” this data in a spreadsheet and then copy it over into the step grids or … this way:

http://www.ibridge.be/images/dynamic-xls/determine-layout.png

This transformation generates 1001 rows of field ETL metadata for 2 steps: “Read Spreadsheet” and “Row Normaliser”. The output of the “Append streams” step looks exactly like the Fields section in the “Read Spreadsheet” step:

http://www.ibridge.be/images/dynamic-xls/determine-layout-preview-append.png

This fields metadata is mapped and injected into our transformation template with the “ETL Metadata Injection” step:

http://www.ibridge.be/images/dynamic-xls/determine-layout-inject.png

So our transformation to determine the layout of a spreadsheet actually executes another templated transformation after we injected the appropriate ETL metadata. The resulting “layout.txt” file contains the following data:


typefield;date
0001;2011-01-03
0002;2011-01-04
0003;2011-01-05
0004;2011-01-06
0005;2011-01-07
0006;2011-01-10
0007;2011-01-11



So with this transformation we learned which column after the product code contains which date. We can use this in the next transformation template to correspond our column number with the correct date:

http://www.ibridge.be/images/dynamic-xls/read-spreadsheet-template.pngThe layout of the spreadsheet and the row normaliser metadata can be injected into the template with the following transformation:

http://www.ibridge.be/images/dynamic-xls/read-spreadsheet.png

The file that comes out of this can obviously be stored into a database but for now it looks like this:


PROD_CODE;date;metric
PROD001;2011-01-03;5
PROD001;2011-01-04;7
PROD001;2011-01-05;9
PROD001;2011-01-06;1
PROD001;2011-01-07;3
PROD001;2011-01-10;9
PROD001;2011-01-11;1



PROD030;2011-02-11;4
PROD030;2011-02-14;10
PROD030;2011-02-15;2
PROD030;2011-02-16;4
PROD030;2011-02-17;6
PROD030;2011-02-18;8
PROD030;2011-02-21;4
PROD030;2011-02-22;6
PROD030;2011-02-23;8
PROD030;2011-02-24;10

This is the format we wanted. We did have to solve a few complex problems and as usual you can’t oversimplify solutions. However, our solution will last forever, is low maintenance and is fairly generic in the sense that you could parameterize the whole job to read more than just this one spreadsheet.

If you want to play with this example yourself, check out the attached zip (http://jira.pentaho.com/secure/attachment/24920/PDI-5422.zip) test-case of PDI-5422 (http://jira.pentaho.com/browse/PDI-5422). Please note that you need a recent (as in the one from today) nightly build (http://ci.pentaho.com/job/Kettle/) of Kettle/PDI to run.

As always, let me know what you think and what other steps you need to support ETL Metadata Injection.

Until next time,

Matt



More... (http://www.ibridge.be/?p=194)