The big data landscape has been growing rapidly, with a lot of effort from open source communities. It has been difficult for big data developers across the globe to keep up with new releases, so we have decided to write a blog post on the subject every month.
AMAZON KINESIS VS APACHE KAFKA FOR BIG DATA ANALYSIS
Data processing today is done in the form of pipelines that include steps such as aggregation, sanitization, and filtering, and that finally generate insights by applying statistical models. Amazon Kinesis is a platform for building pipelines for streaming data at the scale of terabytes per hour.
Build a Real-time Stream Processing Pipeline with Apache Flink on AWS
Apache Flink is an open source project that is well suited to form the basis of such a stream processing pipeline. It offers unique capabilities that are tailored to the continuous analysis of streaming data.
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. It is an abstraction over MapReduce that uses its own query language, called Pig Latin. Pig can work with any type of data, i.e. structured, semi-structured, and unstructured datasets. One of the best use cases for understanding Apache Pig is log processing. In this blog we will use Apache Pig to examine downloaded Apache logs.
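To give a rough feel for the group-and-count style of analysis that log processing involves, here is a minimal sketch in Python (not Pig Latin, and not the script used in the post). It assumes the logs are in Apache Common Log Format; the names `LOG_PATTERN`, `parse_line`, and `count_by`, as well as the sample lines, are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
#   host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()


def count_by(lines, field):
    """Group parsed records by one field and count them, skipping bad lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record[field]] += 1
    return counts


# Hypothetical sample data, including one malformed line.
SAMPLE_LINES = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 209',
    '192.168.0.7 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'not a log line',
]

if __name__ == "__main__":
    print(count_by(SAMPLE_LINES, "status"))
    print(count_by(SAMPLE_LINES, "host"))
```

In Pig, the same idea would typically be expressed as a load step, a grouping step, and a count per group, with the platform distributing the work across the cluster rather than looping in a single process.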
Apache Hadoop continues to gain new engines that run within the platform, with YARN as its central architecture.
The Apache community has released Apache Pig 0.14.0, and its most important feature is Pig on Tez. This release resolves more than 334 JIRA tickets from 35 Pig contributors.
You can find more information in Apache Pig 0.14
NOTABLE IMPROVEMENTS IN APACHE PIG 0.14.0
- Pig on Tez
- ORC Storage
- Predicate Pushdown
- Automatic shipping of UDF-dependent jars
- Jar refactoring