
Moving from Pig 0.12 to Pig 0.14


Apache Hadoop continues to gain new execution engines, with YARN serving as the central architecture that lets them run within the platform.

The Apache community has released Apache Pig 0.14.0, and its headline feature is Pig on Tez. More than 334 JIRA tickets from 35 Pig contributors were resolved in this release.
You can find additional information in the Apache Pig 0.14 release notes.

 

NOTABLE IMPROVEMENTS IN APACHE PIG 0.14.0

  • Pig on Tez
  • ORC Storage
  • Predicate Pushdown
  • Automatic UDF-dependent jars
  • Jar refactoring

PIG ON TEZ

Apache Tez is an alternative execution engine focused on performance: Pig can compile a script into a better execution plan than it can with MapReduce. The result is a consistent performance improvement in both large and small queries.
To run a Pig script in Tez mode, simply add “-x tez” to the Pig command line.
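For example, assuming a script file named wordcount.pig (the file name here is just a placeholder), Tez mode is chosen when Pig is launched:

    pig -x tez wordcount.pig

Running “pig -x tez” with no script file starts the Grunt shell in Tez mode instead.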

There are two known limitations in Tez mode:

  • Illustrate is not yet supported.
  • The Tez UI is not ready yet, but you can see the AM/container logs in the NodeManager web UI.

 

ORC STORAGE

OrcStorage provides a way to read and write ORC files directly from Pig. On the store side, you can specify a number of options that control how the ORC file is written, such as the stripe size or whether to use compression.

Example

OrcStorage:

            A = load '/input.orc' using OrcStorage();
            describe A;
            -- '-c SNAPPY' is one example of a store-side option; it enables Snappy compression
            store A into '/datagen/datagen_10.orc' using OrcStorage('-c SNAPPY');

 

Data types:
Most ORC data types have a one-to-one mapping to Pig data types. The exceptions are:

 

Loader side:

  • ORC STRING/CHAR/VARCHAR all map to Pig chararray
  • ORC BYTE/BINARY both map to Pig bytearray
  • ORC TIMESTAMP/DATE both map to Pig datetime
  • ORC DECIMAL maps to Pig bigdecimal

 

Storer side:

  • Pig chararray maps to ORC STRING
  • Pig datetime maps to ORC TIMESTAMP
  • Pig bigdecimal/biginteger both map to ORC DECIMAL
  • Pig bytearray maps to ORC BINARY

 

PREDICATE PUSHDOWN

With predicate pushdown, Pig can use the statistics stored at the ORC file, stripe, and row-group level to skip reading data that cannot match a filter. In the example below, the filter conditions are pushed down into OrcStorage, so blocks whose statistics rule them out are never read.

Example

    A = load '/input.orc' using OrcStorage();
    B = filter A by $4 > 25 and $0 < 3;
    dump B;

 

AUTOMATIC UDF-DEPENDENT JARS

Some LoadFunc/StoreFunc/EvalFunc implementations depend on external jars at runtime. Previously, the Pig user had to register those jars manually in the Pig script. With Pig 0.14, a UDF can declare its runtime dependencies itself, so users no longer need to register them manually.
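As a minimal sketch of what this can look like (assuming the getShipFiles() hook on EvalFunc is the declaration point; the UDF class and the choice of Joda-Time as the external dependency are made up for illustration):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.joda.time.DateTime;

    // Hypothetical UDF that needs the Joda-Time jar at runtime.
    public class ToIsoDate extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            // The actual work delegates to the external library (Joda-Time here).
            return new DateTime(((Number) input.get(0)).longValue()).toString();
        }

        @Override
        public List<String> getShipFiles() {
            // Tell Pig which local jar(s) the UDF depends on; Pig ships them
            // to the backend automatically, so no REGISTER is needed in the script.
            String jodaJar = DateTime.class.getProtectionDomain()
                    .getCodeSource().getLocation().getPath();
            return Arrays.asList(jodaJar);
        }
    }

With something like this in place, the script simply calls the UDF, and the dependent jar returned by getShipFiles() travels with the job.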

 

JAR REFACTORING

Due to the problems it caused for downstream projects, pig-withouthadoop.jar was never published to Maven.

In this release, pig-withouthadoop.jar and pig-withouthadoop-h2.jar have been removed; instead, Pig ships pig-core.jar and pig-core-h2.jar, along with the dependent jars in the lib directory. There are also lib/h1 and lib/h2 directories, which contain jars applicable only to Hadoop 1 or Hadoop 2. The pig launcher script figures out which version of Hadoop you are using and weaves the right jars into the CLASSPATH.

 

Written by Saravanan, Data Engineer @ DataDotz.

DataDotz is a Chennai-based Big Data team primarily focused on consulting and training in technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), Search and Cloud Computing.