
Moving from Hive 0.12 to Hive 0.14: Installation & ACID


YARN has opened Hadoop up to a whole range of new engines. The most popular integration point with Hadoop has always been SQL. There are various flavors of SQL on Hadoop, but Apache Hive is still the de facto standard.

Last month, the Apache Hive community released Apache Hive 0.14, the result of the first phase of the Stinger.next initiative.

More than 30 developers were involved in this release, and they resolved more than 1,015 issues. Although there are many new features in Hive 0.14, there are a few notable changes we’d like to highlight. For the complete list of features, improvements, and bug fixes, see the Release Notes of Apache Hive.


NOTABLE IMPROVEMENTS IN APACHE HIVE 0.14

  • Transactions with ACID semantics
  • Cost-based optimizer
  • SQL temporary tables
  • Apache Accumulo integration

 

TRANSACTIONS WITH ACID: USE CASES

  • Hive is a SQL query engine on top of Hadoop; so far there was no ACID support in Hive, but Hive 0.14 introduces it.
  • Once an hour, a set of inserts and updates (up to 100k rows) for various dimension tables (e.g. Retail, Healthcare) needs to be processed.
  • Once a day, a small set (up to 200k rows) of records needs to be deleted for regulatory compliance.
  • Streaming inserts arriving from Flume (log processing) or Storm applications into a Hive table, which are available to Hive queries in less than 15 seconds.

You can find more information in Transactions in Hive.

 

COST BASED OPTIMIZER

Cost-based optimization can reduce query latency by reducing the cardinality of intermediate result sets. The cost-based optimizer also improves usability, because it removes the need for query hints to get good query execution plans.
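As a sketch of how this looks in practice: CBO is controlled by `hive.cbo.enable` and depends on table and column statistics. The properties below are real Hive settings, but the table and query are illustrative only (they assume the `patient` table used later in this post).

```sql
-- Enable the cost-based optimizer for this session (off by default in 0.14)
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

-- CBO relies on statistics; gather them first
analyze table patient compute statistics;
analyze table patient compute statistics for columns;

-- Inspect the plan CBO produces
explain select drug, count(*) from patient group by drug;
```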

Optiq currently has over fifty cost-based optimization rules. Some of the prominent ones are listed below:

  • Push Join through Union
  • Push Filter past Table Function
  • Join Reordering
  • Semi Join selection
  • Push Aggregate through Union
  • Pull Aggregate through Union
  • Pull Constants through Aggregate
  • Merge Unions

You can find more information in Cost Based Optimizer in Hive.

 

SQL TEMPORARY TABLES

SQL temporary tables are one of the additions in this version of Hive; they let us experiment with tables without affecting the original ones. A table created as a temporary table is only visible to the current session. Its data is stored in the user’s scratch directory and deleted at the end of the session.

If a temporary table is created with the database/table name of a permanent table which already exists in the database, then within that session any references to that name will resolve to the temporary table rather than to the permanent table. The user will not be able to access the original table within that session without either dropping the temporary table or renaming it to a non-conflicting name. These temporary tables do, however, have some limitations:

  • Partition columns are not supported
  • No support for creation of indexes

For detailed information, have a look at Temporary Tables in Hive.

 

HIVE 0.14 INSTALLATION & ADDING ACID PROPERTIES

Step 1
Create the HDFS root scratch directory for Hive jobs with write-all (777) permission.
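A minimal sketch of this step, assuming the default scratch directory `/tmp/hive` and a running HDFS; adjust the path if `hive.exec.scratchdir` points elsewhere in your setup.

```shell
# Create the Hive scratch directory on HDFS and open it up (777)
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod -R 777 /tmp/hive
```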


Step 2
Add the following ACID properties to hive-site.xml (or set them with the SET command):

<property>
            <name>hive.support.concurrency</name>
            <value>true</value>
</property>
<property>
            <name>hive.enforce.bucketing</name>
            <value>true</value>
</property>
<property>
            <name>hive.exec.dynamic.partition.mode</name>
            <value>nonstrict</value>
</property>
<property>
            <name>hive.txn.manager</name>
            <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
            <name>hive.compactor.initiator.on</name>
            <value>true</value>
</property>
<property>
            <name>hive.compactor.worker.threads</name>
            <value>1</value>
</property>
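As the step title notes, the same properties can instead be set per session from the Hive CLI. The values below simply mirror the hive-site.xml entries above:

```sql
-- Per-session equivalent of the hive-site.xml ACID properties
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;
```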

Step 3
Start your Hadoop Cluster

Step 4
bin/hive


Step 5

When you run show tables, you will face an error communicating with the metastore. This is because the extra metastore tables needed to support ACID are not created automatically, since ACID is not enabled by default in Hive 0.14. Run the transaction schema script against the Hive metastore database:

mysql -u root -p
mysql> use metastore;  (the Hive metastore database)
mysql> source /home/username/hadoop2/Hive-0.14.0/scripts/metastore/upgrade/mysql/hive-txn-schema-0.14.0.mysql.sql;
mysql> show tables;

Hive 0.14 ships with 29 default metastore tables; the script creates 8 additional transaction tables, giving a total of 37 tables.

Step 6
hive> show tables;


Metastore Problem Solved

 

SOME EXAMPLES FOR TRANSACTIONS WITH ACID

ABOUT THE TABLE
Table Name: HiveAcid
Data set Used: Patient data

1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633
3,Zia Underwood,paracetamol,male,980
4,Austin Mayer,paracetamol,female,338
5,Mara Higgins,avil,female,153
6,Sybill Crosby,avil,male,193
7,Tyler Rosales,paracetamol,male,778
8,Ivan Hale,avil,female,454
9,Alika Gilmore,paracetamol,female,833
10,Len Burgess,metacin,male,325

CREATE TABLE
create table HiveAcid(pid INT, pname STRING, drug STRING, gender STRING, tot_amt INT) clustered by (pid) into 3 buckets stored as orc TBLPROPERTIES ('transactional'='true');


LOAD TO TABLE
Load the existing patient table into the ACID table (HiveAcid):
from patient insert into table HiveAcid select pid,pname,drug,gender,tot_amt;


UPDATE
UPDATE HiveAcid SET tot_amt = 300 WHERE pid = 9;


INSERT
insert into table HiveAcid values(11,'datadotz','avil','male',50000);


DELETE
delete from HiveAcid where pid=6;


SOME EXAMPLES FOR SQL TEMPORARY TABLES

CREATE TEMPORARY TABLE tmpTable(pid INT, pname STRING, drug STRING,gender STRING, tot_amt INT);
INSERT OVERWRITE TABLE tmpTable select * from patient;


 

 

Written by SaravanaKumar, Data Engineer @ DataDotz.

DataDotz is a Chennai-based Big Data team primarily focused on consulting and training on technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), Search, and Cloud Computing.