Yarn has allowed all new engines to emanate hadoop. The most popular integration point with hadoop is always SQL. We have various types of SQL on hadoop but Apache Hive still the defacto standard.
Last month, the apache Hive community released apache Hive 0.14, which has the result of the first phase in the Stringer.next initiative.
30 developers involved in this version and resolved more than 1,015 issues. Although there are many new features in Hive 0.14, there are a few notable changes which we’d like to highlight. For the complete list of features, improvements, and bug fixes, see the release notes. Release Notes of Apache Hive
NOTABLE IMPROVEMENTS IN APACHE HIVE 0.14
- Transaction with ACID semantics
- Cost based optimizer
- SQL temporary tables
- Apache Accumulo integration
TRANSACTION WITH ACID USE CASE
- Hive is a SQL query engine on top of hadoop and so far there was no ACID in Hive but now Hive 0.14 has ACID.
- Once an hour, a set of inserts and updates (up to 100k rows) for various dimension tables (e.g. Retail, Healthcare) needs to be processed.
- Once a day a small set (up to 200k rows) of records need to be deleted for regulatory compliance.
- Streaming inserts coming from Flume (log processing) or Storm applications into a Hive table, which are available to be read by Hive queries in less than 15 seconds.
You can find some more additional information in Transactions in Hive
COST BASED OPTIMIZER
Cost-based optimization may reduce query latency by reducing the cardinality of the intermediate result set. The Cost-based Optimizer improves usability because it removes the need for query hints for good query execution plans.
Optiq currently has over fifty cost based optimization rules. Some of the prominent cost based optimization rules are listed below:
- Push Join through Union
- Push Filter past Table Function
- Join Reordering
- Semi Join selection
- Push Aggregate through Union
- Pull Aggregate through Union
- Pull Constants through Aggregate
- Merge Unions
You can find some more additional information in Cost Based Optimizer in Hive
SQL TEMPORARY TABLES
SQL temporary table is the one of the added advantages in this version of Hive which helps us to play with the tables without affecting the original tables. A table that has been created as a temporary table will only be visible to the current session. Data will be stored in the user’s scratch directory, and deleted at the end of the session.
If a temporary table is created with a database/table name of a permanent table which already exists in the database, then within that session any references to that table will resolve to the temporary table, rather than to the permanent table. The user will not be able to access the original table within that session without either dropping the temporary table, or renaming it to a non-conflicting name. But these temporary tables have the some limitations like
- Partition columns are not supported
- No support for creation of indexes
To have detailed information about it have a look into Temporary table in Hive
HIVE-0.14 INSTALLATION & ADD ACID PROPERTIES
HDFS root scratch dir for Hive jobs which gets created with write all (777) permission
Add some ACID properties in Hive-site.xml (OR) USE SET Command
Step – 3
Start your Hadoop Cluster
Step – 4
Step – 5
You will be facing the error communicating with the metastore, when you give show tables. This is because the extra metastore table to support ACID properties won’t be created automatically as this ACID is not default in Hive 0.14.
mysql -u root -p
mysql>use metastore (Hive Metastore)
Already Hive-0.14 has default meta tables 29, Now Creating 8 txn tables. So we will have total of 37 tables
Megastore Problem Solved
SOME EXAMPLES FOR TRANSACTION WITH ACID
ABOUT THE TABLE
Table Name: HiveAcid
Data set Used: Patient data
create table HiveAcid(pid INT, pname STRING, drug STRING,gender STRING, tot_amt INT) clustered by (pid) into 3 buckets stored as orc TBLPROPERTIES (‘transactional’=’true’) ;
LOAD TO TABLE
Load the existing table to Acid table (HiveAcid)
From patient insert into table HiveAcid select pid,pname,drug,gender,tot_amt;
UPDATE HiveAcid SET tot_amt = 300 WHERE pid = 9;
insert into table HiveAcid values(11,’datadotz’,’avil’,’male’,50000);
delete from HiveAcid where pid=6;
SOME EXAMPLE FOR SQL TEMPORARY TABLES
CREATE TEMPORARY TABLE tmpTable(pid INT, pname STRING, drug STRING,gender STRING, tot_amt INT);
INSERT OVERWRITE TABLE tmpTable select * from patient;
Written by SaravanaKumar , Data Engineer @ DataDotz.
DataDotz is a Chennai based BigData Team primarily focussed on consulting and training on technologies such as Apache Hadoop, Apache Spark , NoSQL(HBase, Cassandra, MongoDB), Search and Cloud Computing.