Hitachi Vantara Pentaho Community Forums
Results 1 to 3 of 3

Thread: Hive JDBC Query too slow: too many fetches after query execution: Kettle Xform

  1. #1
    Join Date
    Feb 2014
    Posts
    1

    Default Hive JDBC Query too slow: too many fetches after query execution: Kettle Xform

    Here is how the balances are different between the reports.

    I have setup a kettle tranform with only one step "Table Input" that fires a query on a Hive table.Attach hive job screenshot.
    Name:  KettleJob.jpg
Views: 709
Size:  9.7 KB

    The connection sets up fine (except that it only connects to default database. Known issue, there is a jira on it). Attached connection screenshot.
    Name:  KettleHiveConnection.jpg
Views: 710
Size:  27.1 KB

    It works fine but is very slow. I checked the hiveserver2 logs and looks like the query executes really fast but after that there are a lot fetchresults executed and it takes forever for the data to make it back. Its fine if the resultset is small but for ~300000 rows it takes like 10 mins. We are using Cloudera4.5 (hence the bigdata plugin for kettle cdh42), out network is top of the line. Went I execute the same query on Hue(CDH query portal) on my desktop, it is very fast to return the same resultset, no fetchresults are seen in the logs.

    I checked the Hive-JDBC code for cdh42. Looks like the fetchsize is hardcoded to 50. And I can't figure out how and if I can send it down as a parameter.
    https://github.com/pentaho/hive/blob...Statement.java

    I am conjecturing that fetchsize is the issue but cannot say for sure since I could not figure if I have the right piece of code above and how I can build it. It is not setup like the other pentaho projects in github.
    Can someone please help me out? Everything about the setup works fine and I am getting killed in performance. Thanks!!!

    Hiverserver2 log for reference below:
    5:46:24.270 PM INFO org.apache.hadoop.hive.ql.exec.Task
    2014-02-20 17:46:24,265 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 85.06 sec
    5:46:25.285 PM INFO org.apache.hadoop.hive.ql.exec.Task
    2014-02-20 17:46:25,281 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 85.06 sec
    5:46:26.301 PM INFO org.apache.hadoop.hive.ql.exec.Task
    2014-02-20 17:46:26,296 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 85.06 sec
    5:46:26.303 PM INFO org.apache.hadoop.hive.ql.exec.Task
    MapReduce Total cumulative CPU time: 1 minutes 25 seconds 60 msec
    5:46:26.322 PM INFO org.apache.hadoop.hive.ql.exec.Task
    Ended Job = job_201402072147_0180
    5:46:26.330 PM INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator
    Moving tmp dir: hdfs://nameservice1/tmp/hive-hive/hive_2014-02-20_17-45-52_736_1233364556268491863-2/_tmp.-ext-10001 to: hdfs://nameservice1/tmp/hive-hive/hive_2014-02-20_17-45-52_736_1233364556268491863-2/_tmp.-ext-10001.intermediate
    5:46:26.338 PM INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator
    Moving tmp dir: hdfs://nameservice1/tmp/hive-hive/hive_2014-02-20_17-45-52_736_1233364556268491863-2/_tmp.-ext-10001.intermediate to: hdfs://nameservice1/tmp/hive-hive/hive_2014-02-20_17-45-52_736_1233364556268491863-2/-ext-10001
    5:46:26.346 PM INFO org.apache.hadoop.hive.ql.Driver
    </PERFLOG method=Driver.execute start=1392936353985 end=1392936386346 duration=32361>
    5:46:26.346 PM INFO org.apache.hadoop.hive.ql.Driver
    MapReduce Jobs Launched:
    5:46:26.347 PM INFO org.apache.hadoop.hive.ql.Driver
    Job 0: Map: 5 Cumulative CPU: 85.06 sec HDFS Read: 979418229 HDFS Write: 664486 SUCCESS
    5:46:26.347 PM INFO org.apache.hadoop.hive.ql.Driver
    Total MapReduce CPU Time Spent: 1 minutes 25 seconds 60 msec
    5:46:26.347 PM INFO org.apache.hadoop.hive.ql.Driver
    OK
    5:46:26.348 PM INFO org.apache.hadoop.hive.ql.Driver
    <PERFLOG method=releaseLocks>
    5:46:26.365 PM INFO org.apache.hadoop.hive.ql.Driver
    </PERFLOG method=releaseLocks start=1392936386348 end=1392936386365 duration=17>
    5:46:26.365 PM INFO org.apache.hadoop.hive.ql.Driver
    </PERFLOG method=Driver.run start=1392936353955 end=1392936386365 duration=32410>
    5:46:26.365 PM INFO org.apache.hive.service.cli.CLIService
    SessionHandle [2a370498-ee64-43c2-9add-284f2e61b53b]: executeStatement()
    5:46:26.365 PM INFO org.apache.hive.service.cli.CLIService
    OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=1be1c7de-1521-4c10-b393-f39457fb2603]: getOperationStatus()
    5:46:26.418 PM WARN org.apache.hadoop.hive.conf.HiveConf
    DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
    5:46:26.420 PM INFO org.apache.hive.service.cli.CLIService
    OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=1be1c7de-1521-4c10-b393-f39457fb2603]: getResultSetMetadata()
    5:46:26.498 PM WARN org.apache.hadoop.hive.conf.HiveConf
    DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
    5:46:26.556 PM WARN org.apache.hadoop.hive.conf.HiveConf
    DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
    5:46:26.565 PM INFO org.apache.hadoop.mapred.FileInputFormat
    Total input paths to process : 5
    5:46:26.713 PM WARN org.apache.hadoop.hive.conf.HiveConf
    DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
    5:46:26.780 PM INFO org.apache.hive.service.cli.CLIService
    OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=1be1c7de-1521-4c10-b393-f39457fb2603]: fetchResults()
    5:46:26.920 PM WARN org.apache.hadoop.hive.conf.HiveConf
    DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
    5:46:26.987 PM INFO org.apache.hive.service.cli.CLIService
    OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=1be1c7de-1521-4c10-b393-f39457fb2603]: fetchResults()
    ...
    (Lot more fetchresults)
    ...
    ...

  2. #2
    Join Date
    Sep 2012
    Posts
    71

    Default

    There are a few things going on here. Fetch size for "all" Hive drivers defaults to 50. PDI does a check to see if we're allowed to set the fetch size for the connection, and if so will set it to the larger of 10000 or getMaxRows() from the driver. Hive Driver max rows default to 0 (no limit). PDI sets max rows to your row limit but only AFTER it checks to see if allowed to set the fetch size. However the fetch size check asks if max rows > 0, which is false in this case.

    This appears to be a bug in PDI where we should probably call setMaxRows() before checking/setting fetch size. Would you mind writing up a Jira case with your findings? If not I can do it but I think it is better if reported from customers or the community (especially if you have the Enterprise Edition).

  3. #3
    Join Date
    Jun 2015
    Posts
    15

    Default

    Any update on this defect. Is it solved or is it still in queue as I am also facing the same problem.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.