Hitachi Vantara Pentaho Community Forums

Thread: Inserting Data into Hive Partitioned Tables

  1. #1

    Default Inserting Data into Hive Partitioned Tables

    Hello all,

    I'm using Pentaho EE 7.1

    I'm trying to insert data into Hive using the Table Output step.
    It works fine when the Hive table is not partitioned, but problems occur when the output Hive table is partitioned.
    I tried enabling the option called "Partition data over tables", but it doesn't work as expected for a partitioned Hive table.


    For instance, I'm using a table like this as the output table:

    CREATE EXTERNAL TABLE messages
    PARTITIONED BY (period string)
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION
    'hdfs://XXXX/project/.../T_messages'
    TBLPROPERTIES ('avro.schema.url'='hdfs://XXXX/project/.../databases/schema/T_messages.avsc')


    Inserting values works fine when I create this table without partitions, but it doesn't work when the Hive table is partitioned.

    What is the correct approach to do this?


    Regards,
    David

  2. #2
    Join Date
    Nov 2009
    Posts
    627

    Default

    Since you are using the EE edition, you can go to Pentaho support.

  3. #3
    Join Date
    Dec 2017
    Posts
    1

    Default

    Thank you for sharing useful information.

  4. #4

    Default

    Has anyone actually made PDI work with a partitioned Hive table?

  5. #5

    Default

    When I used Hive on a daily basis, the ability to insert individual rows into tables was an experimental feature, and it was awfully slow (due to the file-per-inserted-row requirement). Maybe it's changed, maybe it hasn't, but using the Table Output step with Hive is not something that I'd consider to be a good practice.

    That said, in looking at Apache Hive's documentation on the INSERT statement, it appears that they have a separate clause dealing with partitions. PDI by itself will typically just generate "INSERT INTO ... () VALUES ()" statements, but Hive appears to need an additional "PARTITION" clause in the query.
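
    For reference, the static-partition form from that documentation looks roughly like this (the pageviews table and values come from the Hive docs' own example, so treat it as a sketch rather than the exact SQL PDI would need for your table):

    -- Static partition: the partition value is named in the PARTITION clause,
    -- not in the VALUES list.
    INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
    VALUES ('jsmith', 'mail.com', 'sports.com', null);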

    Another example from the same documentation shows the partition values being supplied as the last values in each row, with the PARTITION clause naming the column but giving no value, so Hive identifies the partition automatically (dynamic partitioning):

    INSERT INTO TABLE pageviews PARTITION (datestamp)
    VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');

    Otherwise, the quickest solution I'd imagine would be to dynamically generate the INSERT statements within the transformation (e.g. Java or Javascript steps), and then execute them in an "Execute row SQL script" step.
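
    As a rough sketch of what such a generated statement could look like for the messages table above (the msg_id and msg_body columns are hypothetical, since the Avro schema isn't shown in the thread, and the partition value would come from a field in the PDI row):

    -- Hypothetical per-row statement built in a scripting step and passed,
    -- as a row field, to the "Execute row SQL script" step.
    INSERT INTO TABLE messages PARTITION (period = '201712')
    VALUES ('msg-001', 'some message body');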
