DataDotz BigData Weekly

Apache Kafka

Spark Streaming's Kafka integration lets users read messages from one or more Kafka topics. A Kafka topic receives messages across a distributed set of partitions, where they are stored. Each partition keeps the messages it has received in sequential order, and each message is identified by an offset, also known as a position. Developers can use offsets in their applications to control where a Spark Streaming job reads from, but doing so requires offset management.
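To make the idea of offset management concrete, here is a minimal in-memory sketch. It does not use the Kafka or Spark APIs; `PartitionLog`, `OffsetStore`, and `process_batch` are hypothetical names invented for this illustration. It assumes one common approach: the next offset to read is kept in an external store and committed only after a batch has been processed.

```python
from collections import defaultdict


class PartitionLog:
    """In-memory stand-in for one Kafka partition: an append-only list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return the offset it was assigned."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset, max_messages=100):
        """Return (offset, message) pairs starting at `offset`."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]


class OffsetStore:
    """Hypothetical external store holding the next offset to read."""

    def __init__(self):
        self._next = defaultdict(int)  # (topic, partition) -> next offset

    def load(self, topic, partition):
        return self._next[(topic, partition)]

    def commit(self, topic, partition, next_offset):
        self._next[(topic, partition)] = next_offset


def process_batch(log, store, topic, partition, handler, max_messages=100):
    """Read one batch from the stored position, then commit the new position."""
    start = store.load(topic, partition)
    batch = log.read_from(start, max_messages)
    for _, message in batch:
        handler(message)
    if batch:
        # Commit only after every message in the batch was handled.
        store.commit(topic, partition, batch[-1][0] + 1)
    return len(batch)
```

Because the position is committed after processing, a job that fails mid-batch would re-read that batch on restart, so the handler in this sketch may see a message more than once.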

Streaming Analytics Manager in Hortonworks

Hortonworks has an in-depth look at the new Streaming Analytics Manager. The post describes the main components, including service pools and environments, and shows how to build an application using the Stream Builder canvas. There's integration with the Hortonworks Schema Registry to automatically detect the schema from a Kafka topic, and built-in support for common streaming processors like joins, projections, and aggregations.

Use Kafka for streaming ETL

This post provides an overview of how to use Kafka for streaming ETL. The tutorial uses Kafka Connect to extract data from a relational database (including a simple transformation), runs a Kafka Streams application, and then loads the data into another database, once again using Kafka Connect. The post has lots of code (which tends to be mostly configuration) and an overview of what each of these pieces is doing.

Apache HBase

Apache HBase has special support for “Medium Object Storage” (MOB), which stores values separately from their references when a value is larger than a particular size. This post describes an enhancement to the design (support for weekly and monthly partitioning) that solves memory problems on the NameNode caused by the number of MOB files that could be created.
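The effect of coarser partitioning on file counts can be sketched with a little date arithmetic. This is only an illustration, not HBase code: `partition_key` and `count_partitions` are hypothetical helpers, the "daily" baseline is an assumption for comparison, and the sketch assumes a simplified model in which each partition ends up as one file.

```python
from datetime import date, timedelta


def partition_key(day, policy):
    """Map a calendar day to the start date of its partition."""
    if policy == "daily":
        return day
    if policy == "weekly":
        # Monday of the week containing `day`.
        return day - timedelta(days=day.weekday())
    if policy == "monthly":
        return day.replace(day=1)
    raise ValueError(f"unknown policy: {policy}")


def count_partitions(start, days, policy):
    """Count distinct partitions touched by `days` consecutive days."""
    keys = {partition_key(start + timedelta(days=i), policy) for i in range(days)}
    return len(keys)
```

Under these assumptions, the 365 days of 2017 fall into 365 daily partitions, 53 weekly partitions, or 12 monthly partitions, which is the kind of reduction that eases pressure on NameNode memory.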


HDFS is architected to handle datanode and disk failures automatically by taking corrective actions such as moving blocks to other datanodes or disks. Slow datanodes or disks can degrade the overall performance of the cluster because the namenode still assigns tasks to them. Device behavior analytics improves the diagnosability of datanode performance problems by flagging slow datanodes in the cluster. This feature will be available in HDP-2.6.1.

Apache Kafka for Python Programmers

The Confluent blog has an introduction to the librdkafka-based Python APIs for Kafka. In short, the post shows how to use the APIs to produce and consume records from a Kafka cluster and how to set up a local development environment.
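As a rough sketch of what producing and consuming can look like with the `confluent-kafka` Python package, the helpers below wrap its `Producer` and `Consumer` classes. The function names, the `localhost:9092` broker address, and the `max_empty_polls` parameter are assumptions made for this example and are not taken from the post. The sketch also assumes a recent client version and a reachable broker.

```python
def producer_config(bootstrap_servers):
    """Minimal client settings for a producer."""
    return {"bootstrap.servers": bootstrap_servers}


def consumer_config(bootstrap_servers, group_id):
    """Minimal settings for a consumer that starts at the earliest offset."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    }


def produce_messages(topic, values, bootstrap_servers="localhost:9092"):
    """Send each string in `values` to `topic`, then wait for delivery."""
    from confluent_kafka import Producer  # needs the confluent-kafka package

    producer = Producer(producer_config(bootstrap_servers))
    for value in values:
        producer.produce(topic, value=value.encode("utf-8"))
        producer.poll(0)  # serve any pending delivery callbacks
    producer.flush()  # block until outstanding messages are delivered


def consume_messages(topic, group_id, limit,
                     bootstrap_servers="localhost:9092", max_empty_polls=10):
    """Read up to `limit` messages; give up after several empty polls in a row."""
    from confluent_kafka import Consumer  # needs the confluent-kafka package

    consumer = Consumer(consumer_config(bootstrap_servers, group_id))
    consumer.subscribe([topic])
    received, empty = [], 0
    try:
        while len(received) < limit and empty < max_empty_polls:
            msg = consumer.poll(1.0)  # wait up to one second for a message
            if msg is None:
                empty += 1
                continue
            if msg.error():
                raise RuntimeError(str(msg.error()))
            empty = 0
            received.append(msg.value().decode("utf-8"))
    finally:
        consumer.close()  # leave the group cleanly
    return received
```

With a broker running locally, calling `produce_messages("demo", ["hello"])` followed by `consume_messages("demo", "demo-group", 1)` would be one way to try it; the topic and group names here are placeholders.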