Category Archives: Pig

DataDotz Bigdata Weekly

AMAZON KINESIS VS APACHE KAFKA FOR BIG DATA ANALYSIS
==========

Data processing today is done in the form of pipelines that include steps such as aggregation, sanitization, and filtering, and that finally generate insights by applying statistical models. Amazon Kinesis is a platform for building pipelines for streaming data at the scale of terabytes per hour.

DataDotz Bigdata Weekly

APACHE FLINK
==========

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

Apache Flink is an open source project that is well suited to form the basis of a real-time stream processing pipeline. It offers capabilities that are tailored to the continuous analysis of streaming data.

Apache Log Processing with Apache Pig

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig is an abstraction over MapReduce that uses its own query language, called Pig Latin. It can work with any type of data: structured, semi-structured, and unstructured. Log processing is one of the best use cases for understanding Apache Pig. In this blog we will use Apache Pig to examine downloaded Apache logs.

Moving from Pig 0.12 to Pig 0.14

Apache Hadoop continues to gain new engines that run within the platform, with YARN as its central architecture.

The Apache community has released Apache Pig 0.14.0, and its most important feature is Pig on Tez. More than 334 JIRA tickets from 35 Pig contributors are resolved in this version.
You can find more information in Apache Pig 0.14.

 

NOTABLE IMPROVEMENTS IN APACHE PIG 0.14.0

  • Pig on Tez
  • ORC Storage
  • Predicate Pushdown
  • Automatic shipping of UDF-dependent jars
  • Jar refactoring
