Hitachi Vantara Pentaho Community Forums
Results 1 to 10 of 10

Thread: DB Connection Pooling

  1. #1
    Join Date
    Jan 2006
    Posts
    25

    Default DB Connection Pooling

    Is the purpose of connection pooling in Kettle

    1. to allow reuse of connections declared at the Job level across all transformations withing a job?
    2. to allow reuse of connections within a transformation?
    3. both of the above?

    From the documentation and forums it looks like #1 is not the intent. In other words I cannot define a single connection with pooling at the Job level and then reuse the pool for transformations within the job.

    Is it even possible to define connection only at the job level without having to enter at each transformation level within a job?

    Can somebody shed some light?

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    not 100% sure but #1 should work, but not as you intend it to... you will not get the same connection in the transformation as the 1 you got in the job above. The pool means that a connection will be reused but that's transparant to you, there's no sharing of active connections. It's a low level JDBC pool.

    You can share connections in 1 transformation between steps (using unique connections) but that ends at the transformation level.

    Or do I interpret your question in another way?

    Regards,
    Sven

  3. #3
    Join Date
    Jan 2006
    Posts
    25

    Default

    Thanks Sven. The first part of your response is what I was looking for. So lets say I have a job that contains 4 transformations. And lets say each of the 4 transformations uses the connection to the same database.

    Can I just define the db connection at the Job level and then reference the same connection within the db related steps of the 4 transformations? If yes then what happens if I want to independently preview a transformation? Would it blow up because there is no connection defined at the transformation level?

  4. #4
    Join Date
    May 2006
    Posts
    4,882

    Default

    And the answer to your last part is no (I wasn't involved in the pooling thing so YMMV)... that's not what pooling is for... Pooling is that you get a connection from the pool and return it when you're finished with it, so that another part of your application can you use the connection with less overhead. But you can't choose which connection you get returned.

    Regards,
    Sven

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Sven, your interpretation is correct. It's just an connection usage optimization. It helps certain people with limited user licenses to put a max number of connections in there and there seems to be a performance advantage for "slow connecting" databases such as Oracle.
    The data pooling concept comes from the server world where an app server can be bombarded with requests and database connections can be opened and closed all the time.

    We do have a "Unique connections" option in the transformation settings that will use a single database connection per database name for a single transformation. It does not work yet across jobs.

    Matt

  6. #6
    Join Date
    Sep 2008
    Posts
    26

    Default

    I think there is an issue with connection pooling in Kettle when using variable substitution for connection parameters. Looks like connections are get from the pool simply by Connection Name.
    I've took a look at class ConnectionPoolUtil(kettle 4.3.0), which as I understand responsible for connection pooling. It use following code to get a connection from pool:
    return DriverManager.getConnection("jdbc:apache:commons:dbcp:"+dbMeta.getName());

    So it ignore the fact that I have dynamic connection definition for each run of the transformation. I'm loading data from several databases and run same set of transformations with different connection information(host, user and password).

  7. #7
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    The issue is with your Kettle job, I think. You should turn off connection pooling, because you can't expect to fetch a preexisting connection from the pool in your case anyway.
    So long, and thanks for all the fish.

  8. #8
    Join Date
    Sep 2008
    Posts
    26

    Default

    I'm not currently using Connection Pooling in Kettle, just was thinking if I can do it and what advantages I may get.
    But if it were allowed to substitute variables in connection name I think that it could be possible to use connection pooling.
    However I understand that allowing variable substitution in Connection Name field can have huge impact, and that need to be analysed carefully before implementing.

  9. #9
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    You talked about variables in connection parameters in your previous posting, and that's what I commented on.
    If you miss variable substitution in connection names, it's something else, but I still don't see how you could benefit from the connection pool, either.
    Anyway, if you come up with a use case, you always can file a JIRA feature request.
    Your previously described use case seems to call for parallel execution, instead of serialization with changing connections.
    So long, and thanks for all the fish.

  10. #10
    Join Date
    Sep 2008
    Posts
    26

    Default

    By "variables in connection parameters" I mean following scenario:

    Suppose you have a job with 2 transformations. Each of transformations get data from same database and processing it(Let's say it's a database of an online store and first transformation loads orders and second one sales data). And transformations run in sequence.

    Then I want to apply same job to other Online Stores which keep data in a different databases but with same schema.
    I've implemented that by creating a top level job, which call my first job for each online store's database, and pass database connection information for the store as a variables(host, user and password). And each of transformations use this variables to connect to appropriate database.
    Nothing runs in parallel here, we process stores one by one, and transformations also run in sequence.

    So in this case we have 2 transformations and each of them will establish a database connection. Creating new connection to database(Oracle) takes time.
    Now suppose that for each store I have to run 20-40 transformations. Each of them will open new connection. Now time for opening connections becomes more important, because we repeat this operation more times.

    So the issue is that I can't use connection pool in this settings. Because it store connections in pool by Connection Name - which will be same for all databases. And we will run into issue that we can't predict which database connection the pool will return. I think it will always return connection to the first database.

    I've suggested to substitute variables in connection name as a solution to this issue.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.