Running an XSLT transformation within a Pentaho MapReduce job



ronans
03-08-2012, 11:16 PM
Hi all
When attempting to run an XSLT transformation within a Pentaho MapReduce job, I noted that the XSLT libraries included in the PHD distribution are Saxon 8, whereas the XSLT libraries installed in PDI are Saxon 9. As a result, transformations that take advantage of XSLT 2.0 features fail when running under MapReduce.

In theory, replacing the saxon8 jar with the corresponding Saxon 9 jars should resolve this, but are there any other dependencies on saxon8 within PHD beyond the XSLT transform step itself?
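
A quick way to confirm which engine a given environment actually picks up is to print the TrAX factory class from inside it; a minimal sketch (the second println assumes a Saxon jar is on the classpath):

    import javax.xml.transform.TransformerFactory;

    public class XsltEngineCheck {
        public static void main(String[] args) {
            // Prints the concrete TrAX implementation that wins the classpath
            // lookup, e.g. net.sf.saxon.TransformerFactoryImpl for Saxon.
            System.out.println(TransformerFactory.newInstance().getClass().getName());
            // If Saxon is present, its own version string pins down 8 vs. 9:
            System.out.println(net.sf.saxon.Version.getProductVersion());
        }
    }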

Any advice or comments welcome

RonanS

jganoff
03-09-2012, 12:01 PM
There should not be any dependencies on Saxon 8 within the Kettle environment required to run Pentaho MapReduce that would not be satisfied by Saxon 9. In fact, this dependency will be updated to Saxon 9 in the next release to match what Kettle ships.

jganoff
03-12-2012, 10:16 AM
I've updated the Saxon version included in the Pentaho Big Data Plugin to 9.1.0.8. Do you also require saxon-dom for your work?

ronans
03-12-2012, 03:37 PM
For scalability and memory-efficiency reasons, StAX or SAX interfaces are preferable, but there may be other users who rely on DOM.
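
To illustrate the difference: with the standard TrAX API the same transform can be driven from a SAXSource, so the caller never builds a DOM tree. A sketch with hypothetical file names:

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import org.xml.sax.InputSource;

    public class SaxTransform {
        public static void main(String[] args) throws Exception {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("transform.xsl"));
            // The SAXSource streams parse events to the engine; the caller
            // never materializes a DOM, which keeps memory flat on large inputs.
            t.transform(new SAXSource(new InputSource("input.xml")),
                        new StreamResult("output.xml"));
        }
    }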

Thanks for asking

RonanS

ronans
03-13-2012, 02:55 PM
Here are some of the other issues I have come across when running the XSLT transformation within PHD on Hadoop. I will open JIRA bugs for these:

1 - if the XSLT file is located on HDFS, the prerelease does not seem to process it correctly even after adding the correct libraries to the path. It is necessary to load both the XSL file and the source data to be transformed into memory and process them from fields within the stream, rather than pointing the XSLT transform directly at the files (see the first sketch after this list)

2 - in the 4.3 prerelease, I get intermittent errors with the XSLT transform failing silently in cases where it did not fail under 4.2. I don't have a reliable repro scenario here, so it may be some form of out-of-memory error

3 - the latest versions of Saxon have removed some of the schema-aware XSLT features from the open source edition that were supported in earlier versions. It would be useful to have alternative XSLT engines and interfaces supported, for example Xalan, Oracle XDK, TrAX, etc. (see the second sketch after this list)

4 - error reporting from the XSL compiler is poor. Often I simply get an error that compilation of the XSL failed, with little if any supporting information beyond the stack trace
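
For #1, the workaround in sketch form: read the stylesheet and source document from HDFS into in-memory fields, then feed both to the transformer from memory rather than by path. This uses the Hadoop FileSystem API with hypothetical locations:

    import java.io.ByteArrayOutputStream;
    import java.io.StringReader;
    import java.io.StringWriter;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsXsltWorkaround {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            String xsl = readToString(fs, new Path("/jobs/transform.xsl"));
            String xml = readToString(fs, new Path("/jobs/input.xml"));

            // Transform entirely from in-memory fields, not file paths.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            System.out.println(out);
        }

        private static String readToString(FileSystem fs, Path p) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            IOUtils.copyBytes(fs.open(p), buf, 4096, true); // closes both streams
            return buf.toString("UTF-8");
        }
    }

For #3, since the transform goes through the standard TrAX lookup, another engine can in principle be selected by pointing the factory property at its implementation class, e.g. Xalan (assuming the Xalan jar is on the classpath; whether the Kettle step honors this is untested):

    public class UseXalan {
        public static void main(String[] args) {
            // Route the standard TrAX lookup to Xalan instead of Saxon.
            System.setProperty("javax.xml.transform.TransformerFactory",
                    "org.apache.xalan.processor.TransformerFactoryImpl");
            System.out.println(javax.xml.transform.TransformerFactory.newInstance());
        }
    }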

RonanS

jganoff
03-13-2012, 10:31 PM
Thanks for the feedback. I'll have a look at the cases once they're logged and see what could be going on.

For #1, could it be a pathing issue? You could use the Distributed Cache to distribute the files around with the job: copy the XSLT to HDFS, then set the mapred.cache.files configuration property as a User Defined property. Our How To on using a Custom Partitioner describes how to use the Distributed Cache to propagate files around so they're local to the executing process: http://wiki.pentaho.com/display/BAD/Using+a+Custom+Partitioner+in+Pentaho+MapReduce

More info on the distributed cache is documented here: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#DistributedCache
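
A rough sketch of wiring that up through the old mapred-era API (hypothetical paths; this is equivalent to setting mapred.cache.files by hand):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheXslt {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            // Ship an HDFS file with the job and expose it under a local symlink;
            // this populates the mapred.cache.files property.
            DistributedCache.addCacheFile(
                    new URI("hdfs:///cache/transform.xsl#transform.xsl"), conf);
            DistributedCache.createSymlink(conf);
            // Tasks can then open ./transform.xsl from their working directory.
        }
    }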

All that being said, you should be able to load the file from HDFS using the appropriate syntax, so this does sound like a bug.