Hitachi Vantara Pentaho Community Forums

Thread: Kettle Transform as Web Service - Faster from a file vs repo?

  1. #1

    Kettle Transform as Web Service - Faster from a file vs repo?

    Has anyone else seen this? We use a Kettle transformation as a web service, and when calling the ExecuteTrans method we see a difference of 2-3 seconds (significant in web service terms) depending on whether the transformation is loaded from the repository or from a file. For example:

    http://servername:9080/pentaho-di/ke.../transformname (from the repository) takes about 2-3 seconds longer than

    http://servername:9080/pentaho-di/ke...nsformname.ktr (from a file)

    All of the time seems to be consumed up front, around the "Dispatching started for transformation" phase, as in this log:
    2014/03/17 21:50:01 - RepositoriesMeta - Reading repositories XML file: /home/pentaho/.kettle/repositories.xml
    2014/03/17 21:50:01 - Creating repository meta store interface
    2014/03/17 21:50:01 - Connected to the enterprise repository
    2014/03/17 21:50:02 - transformname - Dispatching started for transformation [transformname]
    2014/03/17 21:50:03 - Get Variables.0 - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
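To see where the seconds go in an excerpt like the one above, the gap between consecutive log lines can be computed directly. This is only a rough sketch in Python: the function name `phase_gaps` is made up for illustration, it assumes every line starts with a `yyyy/MM/dd HH:mm:ss` timestamp followed by " - ", and because the log has one-second resolution the gaps are approximate. Time spent before the first log line (for example, between the HTTP request arriving and the repository being read) does not show up here at all.

```python
from datetime import datetime

# The log excerpt from the post, copied verbatim.
LOG = """\
2014/03/17 21:50:01 - RepositoriesMeta - Reading repositories XML file: /home/pentaho/.kettle/repositories.xml
2014/03/17 21:50:01 - Creating repository meta store interface
2014/03/17 21:50:01 - Connected to the enterprise repository
2014/03/17 21:50:02 - transformname - Dispatching started for transformation [transformname]
2014/03/17 21:50:03 - Get Variables.0 - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
"""


def phase_gaps(log_text):
    """Return (seconds_since_previous_line, message) for each log line.

    Hypothetical helper, not part of Kettle. Assumes each non-empty line
    begins with a 'yyyy/MM/dd HH:mm:ss' timestamp followed by ' - '.
    """
    gaps = []
    previous = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # Split only at the first ' - ' so the rest of the message stays intact.
        stamp, _, message = line.partition(" - ")
        current = datetime.strptime(stamp.strip(), "%Y/%m/%d %H:%M:%S")
        delta = 0.0 if previous is None else (current - previous).total_seconds()
        gaps.append((delta, message))
        previous = current
    return gaps


if __name__ == "__main__":
    for delta, message in phase_gaps(LOG):
        print(f"+{delta:.0f}s  {message}")
```

Running the same function over a log captured from the file-based call would give a like-for-like comparison of the two paths.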

    What goes on in these up-front steps? Is there any way to optimize them, or is it better to call transformations from files rather than from the repository for web service calls, where every second counts? Thanks!

  2. #2

    Yes, definitely. We stopped using the Enterprise Repository for this very reason: we couldn't sacrifice speed in our jobs. We have a master job that runs nightly and calls numerous sub-jobs, which in turn call transformations. Switching from the Enterprise Repository to a file repository saved us over an hour of processing time each night. The time to load each individual transformation isn't horrible, typically a few seconds as you note. It was the accumulation of those loads over a whole run that was unacceptable. For low-latency web service calls you're in very much the same situation.

    The root cause seems to be that the repository abstraction in Kettle doesn't cache any metadata: neither the repository metadata (it repeatedly reloads repositories.xml unnecessarily) nor the job or transformation metadata. On a file repository, caching would be straightforward to implement by checking the file's mtime and reloading from disk only when it has changed.
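To make the mtime idea concrete, here is a minimal sketch of such a cache. It is written in Python purely for brevity and is not Kettle code or a Kettle API (Kettle itself is Java); the class name `MtimeCache` and the `loader` callback are hypothetical. The assumption is that parsing the file is the expensive part and that a single `stat` call per lookup is cheap by comparison.

```python
import os


class MtimeCache:
    """Cache parsed file contents, reloading only when the file's mtime changes.

    Illustration only: 'loader' is a hypothetical function that takes a path
    and returns the parsed metadata (e.g. a parsed .ktr or repositories.xml).
    """

    def __init__(self, loader):
        self._loader = loader
        self._entries = {}  # path -> (mtime_ns, parsed_value)

    def get(self, path):
        # One stat call per lookup; far cheaper than re-reading and re-parsing.
        mtime = os.stat(path).st_mtime_ns
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]  # file unchanged since last load
        value = self._loader(path)  # first load, or file was modified
        self._entries[path] = (mtime, value)
        return value
```

A caller would construct one cache per loader, for example `MtimeCache(parse_transformation)` with a parsing function of its own, and call `get(path)` on every request. One caveat of this approach: an edit that lands within the filesystem's timestamp resolution of the previous load could be missed, so comparing file size as well is a common extra safeguard.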

    See PDI-8742, which includes my comments on the issue.

    Some repository types are more efficient than others. We found the performance of a basic file repository acceptable for our needs. The Enterprise Repository incurs additional latency, possibly due to the web service layer it calls, but perhaps also due to the security subsystem and/or persistence (JCR). It may be faster in 5.x, which we have not tested yet.
