Hitachi Vantara Pentaho Community Forums

Thread: Running an XSLT transformation within a Pentaho MapReduce job

  1. #1
    ronans Guest

    Default Running an XSLT transformation within a Pentaho MapReduce job

    Hi all
    When attempting to run an XSLT transformation within a Pentaho MapReduce job, I noticed that the XSLT libraries included in the PHD distribution are Saxon 8, whereas the XSLT libraries installed in PDI are Saxon 9. As a result, transformations that take advantage of XSLT 2.0 features fail when running under MapReduce.
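
    To confirm which engine is actually picked up at runtime, a quick check along these lines helps (a rough sketch; it assumes a Saxon jar on the classpath that provides net.sf.saxon.Version, as recent releases do):

        import javax.xml.transform.TransformerFactory;

        public class XsltEngineCheck {
            public static void main(String[] args) {
                // Which TransformerFactory implementation wins on this classpath?
                TransformerFactory tf = TransformerFactory.newInstance();
                System.out.println("Factory: " + tf.getClass().getName());
                // Saxon reports its own release string; this is what distinguishes
                // the Saxon 8 jars in PHD from the Saxon 9 jars in PDI.
                System.out.println("Saxon: " + net.sf.saxon.Version.getProductVersion());
            }
        }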

    In theory, replacing the saxon8 jar with the corresponding Saxon 9 jars should resolve this, but are there any other dependencies within PHD on saxon8, other than the implementation of the XSLT transform step?

    Any advice or comments welcome

    RonanS

  2. #2
    Join Date
    Aug 2010
    Posts
    87

    Default

    There should not be any dependencies on Saxon 8 within the Kettle environment required to run Pentaho MapReduce that would not be satisfied by Saxon 9. In fact, this dependency will be updated to Saxon 9 in the next release to match what Kettle ships.

  3. #3
    Join Date
    Aug 2010
    Posts
    87

    Default

    I've updated the Saxon version included in the Pentaho Big Data Plugin to 9.1.0.8. Do you also require saxon-dom for your work?

  4. #4
    ronans Guest

    Default

    For scalability and memory-efficiency reasons, the StAX or SAX interfaces are preferable (see the sketch below), but there may be other users who rely on DOM.
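
    As a rough illustration of why streaming matters (plain JAXP StAX, nothing Pentaho-specific):

        import java.io.FileInputStream;
        import javax.xml.stream.XMLInputFactory;
        import javax.xml.stream.XMLStreamConstants;
        import javax.xml.stream.XMLStreamReader;

        public class StaxCount {
            public static void main(String[] args) throws Exception {
                // StAX pulls events off the stream one at a time, so memory use
                // stays flat regardless of document size; DOM materialises the
                // whole tree before you can touch any of it.
                XMLInputFactory factory = XMLInputFactory.newInstance();
                XMLStreamReader reader =
                        factory.createXMLStreamReader(new FileInputStream(args[0]));
                long elements = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        elements++;
                    }
                }
                System.out.println(elements + " elements");
            }
        }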

    Thanks for asking

    RonanS

  5. #5
    ronans Guest

    Default issues

    Here are some of the other issues I have come across when running the XSLT transformation within PHD on Hadoop. I will open Jira bugs for these:

    1 - If the XSLT file is located on HDFS, the prerelease does not seem to process it correctly, even after adding the correct libraries to the path. It is necessary to load both the XSL file and the source data to be transformed into memory and process them from fields within the stream, rather than pointing the XSLT transform directly at the files (see the first sketch after this list).

    2 - In the 4.3 prerelease, I get intermittent errors with the XSLT transform failing silently in cases where it did not fail under 4.2. I don't have a reliable repro scenario here, so it may be some form of out-of-memory error.

    3 - The latest versions of Saxon have removed some of the schema-aware XSLT features from the open-source edition that were supported in earlier versions. It would be useful to have other XSLT engines and interfaces supported as alternatives, for example Xalan, the Oracle XDK, TrAX, etc.

    4 - Error reporting from the XSL compiler is poor: often I simply get an error that compilation of the XSL failed, with little if any supporting information beyond the stack trace (the second sketch after this list shows one way to get more detail).
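
    For #1, the workaround looks roughly like this (plain JAXP, a sketch only; the wiring of the strings into stream fields is omitted):

        import java.io.StringReader;
        import java.io.StringWriter;
        import javax.xml.transform.Transformer;
        import javax.xml.transform.TransformerFactory;
        import javax.xml.transform.stream.StreamResult;
        import javax.xml.transform.stream.StreamSource;

        public class InMemoryTransform {
            // Both the stylesheet and the document arrive as plain strings
            // (loaded into row fields upstream), so nothing has to resolve
            // a file path against HDFS at transform time.
            public static String transform(String xsl, String xml) throws Exception {
                Transformer t = TransformerFactory.newInstance()
                        .newTransformer(new StreamSource(new StringReader(xsl)));
                StringWriter out = new StringWriter();
                t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
                return out.toString();
            }
        }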
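
    And for #4, attaching an ErrorListener to the factory surfaces per-location diagnostics that the default single "compilation failed" message hides (again a plain JAXP sketch):

        import javax.xml.transform.ErrorListener;
        import javax.xml.transform.TransformerException;
        import javax.xml.transform.TransformerFactory;
        import javax.xml.transform.stream.StreamSource;

        public class VerboseCompile {
            public static void main(String[] args) throws Exception {
                TransformerFactory tf = TransformerFactory.newInstance();
                // Report every warning and error with its stylesheet location
                // instead of a single opaque failure at the end.
                tf.setErrorListener(new ErrorListener() {
                    public void warning(TransformerException e) { report("WARN", e); }
                    public void error(TransformerException e) { report("ERROR", e); }
                    public void fatalError(TransformerException e) throws TransformerException {
                        report("FATAL", e);
                        throw e;
                    }
                    private void report(String level, TransformerException e) {
                        System.err.println(level + " at " + e.getLocationAsString()
                                + ": " + e.getMessage());
                    }
                });
                tf.newTemplates(new StreamSource(args[0]));
            }
        }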

    RonanS

  6. #6
    Join Date
    Aug 2010
    Posts
    87

    Default

    Thanks for the feedback. I'll have a look at the cases once they're logged and see what could be going on.

    For #1, could it be a pathing issue? You could use the distributed cache to distribute the files around with the job: copy the XSLT to HDFS, then set the mapred.cache.files configuration property as a User Defined property. Our How To on using a Custom Partitioner describes how to use the Distributed Cache to propagate files around so they're local to the executing process: http://wiki.pentaho.com/display/BAD/...taho+MapReduce
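
    A minimal Java-side sketch of the same setup (the HDFS path here is just an example):

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.filecache.DistributedCache;

        public class CacheSetup {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Equivalent to setting mapred.cache.files by hand: the framework
                // copies the HDFS file into each node's local working directory
                // before the mappers start.
                DistributedCache.addCacheFile(
                        new URI("hdfs:///user/ronans/transform.xsl"), conf);
                System.out.println(conf.get("mapred.cache.files"));
            }
        }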

    More info on the distributed cache is documented here: http://hadoop.apache.org/common/docs...stributedCache

    All that being said, you should be able to load the file from HDFS using the appropriate syntax, so that sounds like a bug.
