DataDotz BigData Weekly

Amazon S3

A data lake is an increasingly popular way to store and analyze data that addresses the challenges of dealing with massive volumes of heterogeneous data. A data lake lets an organization store all of its data, structured and unstructured, in one centralized repository. Because data can be stored as-is, there is no need to convert it to a predefined schema first. Many organizations recognize the benefits of using Amazon S3 as their data lake.


Hortonworks has written about LogAI, their tool for analyzing the logs produced by a run of the HDP test suite. The system uses frequency, co-occurrence, and other correlation models to highlight errors, stack traces, and other notable items, and there's a web UI for exploring the interesting parts of the logs.
Apache Kafka

The StreamSets blog has a good overview of the motivations behind the Confluent Schema Registry, which stores versioned Avro schemas. Using the StreamSets Data Collector, it walks through how schema-aware producers and consumers serialize and deserialize data.
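For context, a producer that registers Avro schemas with the Schema Registry is typically configured along these lines; the serializer class and `schema.registry.url` key come from Confluent's client libraries, while the hosts and ports are placeholders:

```properties
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
# Confluent's Avro serializer registers each record's schema with the registry
# and embeds the returned schema ID in every message it produces.
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=http://localhost:8081
```

A schema-aware consumer mirrors this with the matching Avro deserializer, looking the schema up by the embedded ID.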

This tutorial describes how to use Spark to read data from a CSV file, convert it to a well-defined schema (in this case a Scala case class), and query the data with SparkSQL. There's also sample code to store the data in MapR-DB and read it back out.
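The tutorial's actual code needs a SparkSession and the MapR-DB connector; as a minimal stdlib sketch of the same idea, here is CSV text parsed into a typed class (playing the role of the Scala case class) and then queried. All names below are illustrative, not taken from the tutorial:

```java
import java.util.List;
import java.util.stream.Collectors;

public class CsvToTyped {
    // Stands in for the tutorial's case class: one typed field per CSV column.
    public record Trade(String symbol, double price) {}

    // "Read" the CSV rows and convert each one to the well-defined schema.
    public static List<Trade> parse(List<String> csvLines) {
        return csvLines.stream()
                .map(line -> line.split(","))
                .map(f -> new Trade(f[0], Double.parseDouble(f[1])))
                .collect(Collectors.toList());
    }

    // Stand-in for a SparkSQL query: SELECT * FROM trades WHERE symbol = ?
    public static List<Trade> bySymbol(List<Trade> trades, String symbol) {
        return trades.stream()
                .filter(t -> t.symbol().equals(symbol))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Trade> trades = parse(List.of("AAPL,189.5", "GOOG,141.2", "AAPL,190.1"));
        System.out.println(bySymbol(trades, "AAPL").size()); // prints 2
    }
}
```

In Spark the same shape appears as `spark.read.csv(...).as[Trade]` followed by a SQL query over a temporary view, with the schema enforced by the case class instead of a hand-written parser.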

Continuous integration and continuous delivery (CI/CD) is a practice that enables an organization to iterate rapidly on software changes while maintaining stability, performance, and security. Continuous integration (CI) allows multiple developers to merge code changes into a central repository; each merge typically triggers an automated build that compiles the code and runs unit tests.
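That trigger-build-test loop can be sketched as a minimal CI configuration; the original doesn't name a CI system or build tool, so the GitHub Actions syntax and Maven command below are assumptions for illustration:

```yaml
# .github/workflows/ci.yml — runs on every push or pull request
# against the central repository.
name: ci
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Compile the code and run the unit tests,
      # failing the build on any error.
      - run: mvn --batch-mode verify
```

Continuous delivery extends this pipeline with an additional deployment stage that runs only after the build and tests succeed.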
Kafka Streams Applications

The Kafka Streams API is a Java library included in Apache Kafka for building real-time applications and microservices that process data in Kafka. It supports stateless operations such as filtering (where each message is processed independently) as well as stateful operations such as aggregations, joins, and windowing. Applications built with the Streams API are elastically scalable, distributed, and fault-tolerant.
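The stateless/stateful distinction can be shown with a rough stdlib analogy; this is not the Streams API itself (which needs a Kafka cluster and the kafka-streams dependency), and the record type and method names are invented for the sketch:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StreamOps {
    public record Event(String key, int value) {}

    // Stateless: each record is kept or dropped independently of all others,
    // analogous to KStream#filter.
    public static List<Event> positives(List<Event> events) {
        return events.stream()
                .filter(e -> e.value() > 0)
                .collect(Collectors.toList());
    }

    // Stateful: the running sum per key depends on every record seen so far,
    // analogous to KStream#groupByKey followed by an aggregation.
    public static Map<String, Integer> sumByKey(List<Event> events) {
        return events.stream().collect(
                Collectors.groupingBy(Event::key, Collectors.summingInt(Event::value)));
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("a", 3), new Event("b", -1), new Event("a", 4));
        System.out.println(positives(events).size()); // prints 2
        System.out.println(sumByKey(events).get("a")); // prints 7
    }
}
```

In a real Streams application the stateful side is backed by local state stores that Kafka replicates, which is what makes the aggregation fault-tolerant rather than an in-memory map.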