Hitachi Vantara Pentaho Community Forums

Thread: Hadoop Data Source for Cubes and Reports

  1. #1

    Question Hadoop Data Source for Cubes and Reports

    I am checking out the new ETL release that adds support for Hadoop. My question is: suppose I read my data from Hadoop, clean it, and write it back to Hadoop (instead of MySQL, which I use now). Can I then use Hadoop as a data source for generating cubes and reports?


  2. #2
    Join Date
    Mar 2003


    Hive provides a JDBC driver that sits on top of Hadoop. Just use it as you would any other JDBC driver.
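    To make that concrete, here is a minimal sketch of querying Hive through JDBC. It assumes a Hive server listening on localhost:10000 with the Hive JDBC driver jar on the classpath; the driver class name and URL scheme match the original (pre-HiveServer2) Hive driver, and the table name `logs` is made up for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {

    // Builds a Hive JDBC URL of the form jdbc:hive://host:port/db.
    static String hiveUrl(String host, int port, String db) {
        return "jdbc:hive://" + host + ":" + port + "/" + db;
    }

    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (the jar must be on the classpath).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect exactly as with any other JDBC source.
        // "logs" is a hypothetical table used purely for illustration.
        Connection conn = DriverManager.getConnection(
                hiveUrl("localhost", 10000, "default"), "", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM logs");
        while (rs.next()) {
            System.out.println("row count: " + rs.getLong(1));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```

    The same connection URL and driver class are what you would enter in a Pentaho JDBC connection dialog; just keep in mind that every query is translated into MapReduce jobs, with the latency described below.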

    However, if you want to do data warehousing or let users run reports, then Hive is not for you.

    What Hive is NOT

    Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overhead in job submission and scheduling. As a result, latency for Hive queries is generally very high (minutes), even when the data sets involved are very small (say, a few hundred megabytes). It therefore cannot be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data but proceed much more iteratively, with response times between iterations of less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries. Hive also does not provide any sort of data or query cache to make repeated queries over the same data set faster.

    So in most cases, you will still be better off with a classical relational data warehouse holding your aggregated data, ready for reporting and OLAP.
    Get the latest news and tips and tricks for Pentaho Reporting at the Pentaho Reporting Blog.

  3. #3


    Thanks Taqua.
    Actually, my scenario is that I receive thousands of log files daily, which need to be processed to clean the data and extract useful information, resulting in millions of rows per day. If I use classical relational data warehousing, then after some time I would have billions of rows in the fact table, and I think a classical relational database will fail as the days pass. That is why I am thinking of putting all the data into Hadoop clusters and then generating cubes and reports from there.

    What do you think? In this scenario, which approach would be better?


