DataDotz BigData Weekly



Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

Apache Flink is an open source project that is well-suited to form the basis of a stream processing pipeline. It offers unique capabilities that are tailored to the continuous analysis of streaming data. However, building and maintaining a pipeline based on Flink often requires considerable expertise, in addition to physical resources and operational effort. This post outlines a reference architecture for a consistent, scalable, and reliable stream processing pipeline that is based on Apache Flink using Amazon EMR, Amazon Kinesis, and Amazon Elasticsearch Service. A GitHub repository provides the artifacts that are required to explore the reference architecture in action.


The Spark HBase Connector (SHC) can be used to write data out to an HBase cluster for further downstream processing. It supports Avro serialization for input and output data and defaults to a custom serialization using a simple native encoding mechanism. When reading input data, SHC pushes down filters to HBase for efficient scans of data. Given the popularity of Phoenix data in HBase, it seems natural to support Phoenix data as input to HBase in addition to Avro data. Also, defaulting to the simple native binary encoding seems susceptible to future changes and is a risk for users who write data from SHC into HBase.

Data Pipelines in Hadoop

As data technology continues to improve, many companies are realizing that Hadoop offers them the ability to create insights that lead to better business decisions. Moving to Hadoop is not without its challenges: there are many options, from tools to approaches, and the choices can have a significant impact on the future success of a business’s strategy. Data management and data pipelining can be particularly difficult. Yet, at the same time, there is a need to move quickly so that the business can benefit as soon as possible.

Making Sense of Stream Processing

The easiest way to explain stream processing is in relation to its predecessor, batch processing. Much data processing in the past was oriented around processing regular, predictable batches of data: the nightly job that, during quiet time, would process the previous day’s transactions; the monthly report that provided summary statistics for dashboards; and so on. Batch processing was straightforward, scalable and predictable, and enterprises tolerated the latency inherent in the model: it could be hours, or even days, before an event was processed and visible in downstream data stores.
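The contrast between the two models can be illustrated with a minimal plain-Python sketch. It is not taken from the linked article; the `batch_totals` function, the `StreamingTotals` class, and the sample events are all hypothetical names chosen for illustration. The batch function only produces a result once the whole day's data is available, while the streaming version updates its result as each event arrives.

```python
from collections import defaultdict


def batch_totals(transactions):
    """Batch style: wait for the complete data set, then compute totals once."""
    totals = defaultdict(float)
    for account, amount in transactions:
        totals[account] += amount
    return dict(totals)


class StreamingTotals:
    """Stream style: keep running state and update it per event."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, account, amount):
        # The updated total is available immediately, not hours later.
        self.totals[account] += amount
        return self.totals[account]


# Hypothetical sample data: (account, amount) pairs.
events = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]

stream = StreamingTotals()
running = [stream.on_event(account, amount) for account, amount in events]
```

Both approaches end with the same totals; the difference is when the answer becomes visible downstream.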

SQL-on-Hadoop: Impala vs Drill

An earlier post covered Oracle’s Analytic Views and how they can be used to provide a simple SQL interface to end users with data stored in a relational database. In today’s post I’m expanding my horizons a little by looking at how to effectively query data in Hadoop using SQL. The SQL-on-Hadoop interface is key for many organizations: it allows querying the Big Data world using existing tools (like OBIEE, Tableau, DVD) and skills (SQL).

Deep Learning Frameworks on CDH and Cloudera Data Science Workbench

Cloudera Data Science Workbench is a comprehensive tool to apply fast and interactive data analysis to evolving models and algorithms as new data and insights present themselves. It can operate either on-premises or across public clouds and is a capability of the CDH platform. Cloudera Data Science Workbench provides a predictable, isolated file system and networking setup via Docker containers for R, Python and Scala users. Users do not have to worry about which libraries are installed on the host or about port contention with other users’ processes on the host, and admins do not have to worry about users adversely impacting the host or other users’ workloads.

Pipeline from PostgreSQL to Kafka

Using PostgreSQL 9.4+’s logical decoding for change data capture (CDC), the team at Simple streams data to Kafka and ultimately to Amazon Redshift. Additionally, applications can do async processing based on data change events by subscribing to the CDC queue in Kafka. The post covers the architecture in depth, how their implementation relates to Bottled Water, and what their plans are for future enhancement.
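To make the idea of processing change events concrete, here is a minimal plain-Python sketch of a consumer that replays CDC events into a local copy of a table. The event shape (`op`, `key`, `row`) and the `apply_change` function are assumptions made for illustration; they are not the format emitted by PostgreSQL logical decoding or used in the linked post, and no Kafka client is involved.

```python
def apply_change(table, event):
    """Apply one change event to an in-memory replica keyed by primary key.

    The event layout is hypothetical: {"op": ..., "key": ..., "row": {...}}.
    """
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Merge changed columns into the existing row (or create it).
        table[key] = {**table.get(key, {}), **event["row"]}
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return table


# Hypothetical stream of change events, in commit order.
changes = [
    {"op": "insert", "key": 1, "row": {"name": "Ann", "balance": 10}},
    {"op": "update", "key": 1, "row": {"balance": 25}},
    {"op": "insert", "key": 2, "row": {"name": "Bo", "balance": 5}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in changes:
    apply_change(replica, event)
```

Because the events are applied in commit order, the replica converges to the same state as the source table without ever querying it directly.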

Real-Time End-to-End Integration with Apache Kafka

Structured Streaming APIs enable building end-to-end streaming applications, called continuous applications, in a consistent, fault-tolerant manner that can handle all of the complexities of writing such applications. They do so without requiring developers to reason about the nitty-gritty details of streaming itself, and by allowing the use of familiar Spark SQL concepts such as DataFrames and Datasets. All of this has led to high interest from use cases wanting to tap into it. From introduction, to ETL, to complex data formats, there has been wide coverage of this topic. Structured Streaming is also integrated with third-party components such as Kafka, HDFS, S3, RDBMS, etc.


Hive LLAP (Low Latency Analytical Processing) is Hive’s new architecture that delivers MPP performance at Hadoop scale through a combination of optimized in-memory caching and persistent query executors that scale elastically within YARN clusters.

Transform Data in StreamSets Data Collector

Over the past few months I have written about the more advanced aspects of data manipulation in StreamSets Data Collector (SDC): writing custom processors, calling Java libraries from JavaScript, Groovy & Python, and even using Java and Scala with the Spark Evaluator. As a developer, it’s always great fun to break out the editor and get to work, but we should be careful not to jump the gun. Just because you can solve a problem with code doesn’t mean you should. Using SDC’s built-in processor stages is not only easier than writing code, it typically results in better performance. In this blog entry, I’ll look at some of these stages, and the problems you can solve with them. The StreamSets Data Collector has a lot of built-in functionality for transforming data. This post looks at transforming JSON data with the Field Pivoter, Field Flattener, Field Renamer, and Field Splitter.
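As a rough illustration of what these four kinds of transformation do to a JSON record, here is a plain-Python sketch. It does not use or imitate the SDC API; the function names, the separator choices, and the sample `order` record are all hypothetical, and the real stages have more options than shown here.

```python
def flatten(record, sep=".", prefix=""):
    """Collapse nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep, name))
        else:
            flat[name] = value
    return flat


def pivot(record, list_field):
    """Emit one record per item of a list field, copying the other fields."""
    rest = {k: v for k, v in record.items() if k != list_field}
    return [{**rest, list_field: item} for item in record[list_field]]


def rename(record, mapping):
    """Rename fields according to an old-name -> new-name mapping."""
    return {mapping.get(k, k): v for k, v in record.items()}


def split(record, field, sep, new_fields):
    """Split a string field on `sep` into the named fields, dropping the original."""
    parts = record[field].split(sep, len(new_fields) - 1)
    out = {k: v for k, v in record.items() if k != field}
    out.update(zip(new_fields, parts))
    return out


# Hypothetical input record.
order = {"id": 7, "customer": {"name": "Ada Lovelace"}, "items": ["pen", "ink"]}

# Pivot the list, flatten the nesting, rename a field, then split it.
rows = [
    split(rename(flatten(r), {"customer.name": "name"}), "name", " ", ["first", "last"])
    for r in pivot(order, "items")
]
```

Chaining small, single-purpose steps like this mirrors how built-in stages are wired together in a pipeline instead of being folded into one custom script.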

HBase Breathes with In-Memory Compaction

The Apache HBase team has introduced a new compacting memstore (nicknamed “accordion”) that optimizes HBase’s memory usage. The compaction algorithm is similar to what is done with HFiles. The first article gives a brief overview of what this means and the kinds of performance improvements you might see. The second looks at the internals of the implementation and dives deep into the technical details.
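The general idea of compacting data while it is still in memory can be sketched in a few lines of Python. This is a toy model and not HBase's implementation: the `CompactingMemStore` class, its `active_limit` parameter, and the assumption that only the newest version of each key is retained are all simplifications made for illustration.

```python
class CompactingMemStore:
    """Toy memstore: an active segment plus a pipeline of immutable segments."""

    def __init__(self, active_limit=3):
        self.active = []      # mutable segment of (key, seq, value) cells
        self.pipeline = []    # immutable in-memory segments
        self.active_limit = active_limit
        self.seq = 0          # increasing sequence number orders the writes

    def put(self, key, value):
        self.seq += 1
        self.active.append((key, self.seq, value))
        if len(self.active) >= self.active_limit:
            self._in_memory_flush()

    def _in_memory_flush(self):
        # Move the active segment into the pipeline instead of writing to disk.
        self.pipeline.append(tuple(self.active))
        self.active = []
        self._compact()

    def _compact(self):
        # Merge pipeline segments, keeping only the newest cell per key
        # (assumption: older versions are not needed).
        latest = {}
        for segment in self.pipeline:
            for key, seq, value in segment:
                if key not in latest or seq > latest[key][0]:
                    latest[key] = (seq, value)
        merged = tuple(sorted((k, s, v) for k, (s, v) in latest.items()))
        self.pipeline = [merged]

    def get(self, key):
        # Newest data first: active segment, then the pipeline.
        for k, _seq, value in reversed(self.active):
            if k == key:
                return value
        for segment in reversed(self.pipeline):
            for k, _seq, value in segment:
                if k == key:
                    return value
        return None

    def cell_count(self):
        return len(self.active) + sum(len(s) for s in self.pipeline)


store = CompactingMemStore(active_limit=3)
for i in range(6):
    store.put("row1", i)      # six overwrites of the same key
store.put("row2", "x")
```

In this toy run, seven writes end up occupying two cells in memory, which hints at why removing redundant versions before flushing can reduce memory pressure and delay flushes to disk.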


The Hortonworks blog has an overview of the SparkR architecture and a walkthrough of using SparkR (from an R programmer’s perspective) to crunch datasets in a SparkR DataFrame. The post includes an example of using `gapply` to apply a user-defined function. R is one of the primary programming languages for data science, with more than 10,000 packages. R is open source software that is widely taught in colleges and universities as part of statistics and computer science curricula. R uses the data frame as its core API, which makes data manipulation convenient. R has a powerful visualization infrastructure, which lets data scientists interpret data efficiently.
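The group-then-apply pattern behind `gapply` can be sketched in plain Python. This is an illustration of the semantics only, not SparkR code and not the example from the linked post: the Python `gapply` function below, the `max_delay` user-defined function, and the `flights` data are hypothetical, and the real SparkR function also takes an output schema and runs the groups in parallel across a cluster.

```python
from collections import defaultdict


def gapply(rows, key_col, func):
    """Group rows by `key_col`, call func(key, group_rows) per group,
    and concatenate the rows each call returns."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_col]].append(row)
    out = []
    for key, group in groups.items():
        out.extend(func(key, group))
    return out


def max_delay(carrier, group):
    """User-defined function: one summary row per group."""
    return [{"carrier": carrier, "max_delay": max(r["delay"] for r in group)}]


# Hypothetical input data.
flights = [
    {"carrier": "AA", "delay": 5},
    {"carrier": "UA", "delay": 12},
    {"carrier": "AA", "delay": 30},
]

result = gapply(flights, "carrier", max_delay)
```

The user-defined function sees one whole group at a time, which is what lets ordinary single-machine code be reused on a distributed data frame.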