Resource exporter



MattCasters
03-10-2009, 09:50 PM
Dear Kettle fans,

One of the things that’s been on my TODO list for a while was the creation of a resource exporter (http://wiki.pentaho.com/display/EAI/Exporting+resources)…

Resource exporter?

It’s called “Resource exporter” and not “Job exporter” or “Transformation exporter” because it is intended to export more than just a single job or transformation. It exports all linked resources of a job or transformation.

This means that if you have a job with 5 transformation job entries, you will be exporting 6 resources (1 job and 5 transformations). If those transformations use 3 sub-transformations (mappings), you will export 9 resources in total.

The whole idea behind this exercise is to be able to create a package (for example to send to someone) that has all needed resources contained in a single zip file.

Let’s look at an example! We have a job to load/update a complete data warehouse. It loads source files, updates dimensions and a fact table, in total 31 transformations and jobs.

The top level job we want to export is the “Load data warehouse.kjb” (can be in a repository too!). Thanks to the very recently added “export” option in Kitchen, we can run this:


sh kitchen.sh -file='/parking/TDWI/PDI/Load data warehouse.kjb' -export=/tmp/foo.zip

This generates the file “/tmp/foo.zip” that contains all the used resources. Please note you can also do this in Spoon under the “File” menu.
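A quick sanity check on the export is to count the jobs and transformations that ended up in the archive. The snippet below is only a sketch: the function name is made up for this example, and it assumes the Info-ZIP "unzip" tool, whose listing ends each line with the entry name.

```shell
# Count job (.kjb) and transformation (.ktr) entries in an "unzip -l"
# listing read from standard input.
# count_exported_resources is a hypothetical helper, not part of Kettle.
count_exported_resources() {
  grep -c -E '\.(kjb|ktr)$'
}

# Usage (assumes the /tmp/foo.zip export created above):
#   unzip -l /tmp/foo.zip | count_exported_resources
```

For the data warehouse example, the count should match the number of jobs and transformations you expect to be linked from the top level job.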

What about job and transformation filenames?

If you look in the ZIP file with “unzip -l” you will notice entries like this one:


33107 03-10-09 18:31 Update_Customer_Dimension_023.ktr
Originating file : /parking/TDWI/PDI/Update Customer Dimension.ktr (file:///parking/TDWI/PDI/Update Customer Dimension.ktr)

This resource gets called in the “Update dimensions” job, so let’s look inside of the generated XML to see how this is solved. We see that the entries have been replaced by the correct link:


${Internal.Job.Filename.Directory}/Update_Customer_Dimension_023.ktr

This is interesting, because the originating transformation could have been located anywhere. Once it’s exported to the zip file, it’s referenced with a relative path (using PDI internal variables). That in turn means you can locate the zip file anywhere, even on a remote web server and it would still be executable. In fact, Kitchen gives us advice on how to run the “Load data warehouse” job in the ZIP file:


This resource can be executed inside the exported ZIP file without extraction. You can do this by executing the following command:

sh kitchen.sh -file='zip:file:///tmp/foo.zip!Load_data_warehouse_001.kjb'

What about input file names?

Obviously, you can’t go about zipping input files that can sometimes be quite large. So we opted to create a set of named parameters that you can use to define the location of the input files.

In our example, we have a set of files read with “CSV Input” and “Text File Input” steps that are located in 2 folders: “/parking/TDWI/” and “/parking/TDWI/Source Data”. During the export, the step metadata will be changed to read:


${DATA_PATH_x}/

In this specific case we then create 2 parameters in the job, sub-jobs and sub-transformations:


DATA_PATH_1, default=/parking/TDWI/Source Data
DATA_PATH_2, default=/parking/TDWI

These named parameters can then be used during execution with Kitchen. If you send the “foo.zip” file to someone else along with the data in “/bar” and “/bar/Source Data”, you can execute the job as follows:


sh kitchen.sh -file='zip:file:///bar/foo.zip!Load_data_warehouse_001.kjb' \
  -param:DATA_PATH_1="/bar/Source Data" \
  -param:DATA_PATH_2="/bar"
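Since the receiver of the zip file will put it wherever they like, it can help to build the “zip:file://” location from the archive path and the entry name instead of typing it by hand. The helper below is a rough sketch; zip_url is a made-up name for this example and not something Kitchen provides.

```shell
# Build the "zip:file://<archive>!<entry>" location used in the commands above.
# zip_url is a hypothetical helper for illustration only.
zip_url() {
  zip_path=$1   # path to the exported archive, e.g. /bar/foo.zip
  entry=$2      # resource inside the archive, e.g. Load_data_warehouse_001.kjb
  case $zip_path in
    /*) ;;                               # already absolute, keep as is
    *)  zip_path=$(pwd)/$zip_path ;;     # make a relative path absolute
  esac
  printf 'zip:file://%s!%s\n' "$zip_path" "$entry"
}

# Usage (same job and parameters as in the example above):
#   sh kitchen.sh -file="$(zip_url /bar/foo.zip Load_data_warehouse_001.kjb)" \
#     -param:DATA_PATH_1="/bar/Source Data" \
#     -param:DATA_PATH_2="/bar"
```

Quoting the whole -file value matters here, both because of the “!” and because paths like “/bar/Source Data” contain spaces.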


The subject of named parameters is worthy of a complete article all by itself. It’s the brainchild of Kettle star Sven Boden. It would take us too far to explain the details, but you can see what parameters are defined for the job like this:


sh kitchen.sh -file='zip:file:///bar/foo.zip!Load_data_warehouse_001.kjb' -listparam

Parameter: DATA_PATH_1=, default=/parking/TDWI/Source Data : Data file path discovered during export
Parameter: DATA_PATH_2=, default=/parking/TDWI : Data file path discovered during export

Because the default values are set you can in fact test the job before you send it over.
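Before relying on those defaults, you may want to check that the default folders actually exist on the machine where you test. Here is a small sketch that parses the parameter listing; it assumes the “Parameter: NAME=VALUE, default=DEFAULT : DESCRIPTION” layout shown above, and both function names are invented for this example.

```shell
# Turn parameter listing lines into "NAME|DEFAULT" pairs (reads stdin).
# Assumes the line layout shown above; other lines are ignored.
list_param_defaults() {
  sed -n 's/^.*Parameter: \([A-Za-z0-9_]*\)=.*, default=\(.*\) : .*$/\1|\2/p'
}

# Report for each "NAME|DEFAULT" pair whether the default folder exists.
check_param_dirs() {
  while IFS='|' read -r name dir; do
    if [ -d "$dir" ]; then
      echo "$name ok: $dir"
    else
      echo "$name missing: $dir"
    fi
  done
}

# Usage:
#   sh kitchen.sh -file='zip:file:///bar/foo.zip!Load_data_warehouse_001.kjb' -listparam \
#     | list_param_defaults | check_param_dirs
```

If a folder is reported as missing, pass the right location with the matching -param option instead of relying on the default.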

What’s next?

Next on the agenda (after the 3.2 release) is to make this function available in the execution dialog so that we can more easily do remote execution. Another interesting execution option is to store the generated zip files in a folder or even in a database so that we can always see exactly what was executed at a given time.

Until next time,
Matt



More... (http://www.ibridge.be/?p=159)