DataDotz Bigdata Weekly


Apache Spark’s Structured Streaming

This post from Databricks shows how powerful Spark’s Structured Streaming APIs are for windowed aggregations, including support for late data and watermarks. The post describes and visualizes, at a high level, the logic that these APIs abstract. Structured Streaming allows users to express a streaming query the same way as a batch query, and the Spark SQL engine incrementalizes the query and executes it on streaming data. For example, suppose you have a streaming DataFrame of events with signal strength from IoT devices and you want to calculate the running average signal strength for each device.
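To make the idea of an incrementalized query concrete, here is a minimal plain-Python sketch of a running average per device that is updated one micro-batch at a time. It is not Spark code and does not use the Structured Streaming API; the class name `RunningAverage` and the `(device_id, signal)` event shape are assumptions made for illustration only.

```python
from collections import defaultdict


class RunningAverage:
    """Keeps a (sum, count) pair per device so that each new micro-batch
    only updates existing state instead of rescanning all past events."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, batch):
        """Fold a micro-batch of (device_id, signal) events into the state
        and return the current average for every device seen so far."""
        for device_id, signal in batch:
            self._sum[device_id] += signal
            self._count[device_id] += 1
        return self.result()

    def result(self):
        return {d: self._sum[d] / self._count[d] for d in self._count}


# Two micro-batches arriving one after the other (hypothetical data).
state = RunningAverage()
state.update([("dev-1", 10.0), ("dev-2", 4.0)])
averages = state.update([("dev-1", 20.0)])
print(averages)
```

The point of the sketch is that the result after the second batch equals what a batch query over all events would return, while only the small per-device state is kept between batches.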

Detecting fake accounts using Apache Spark

Uber is a big user of Apache Spark, and they recently worked on and deployed a Locality Sensitive Hashing (LSH) implementation for Spark for applications such as detecting fake accounts and payment fraud. The post has an example of using LSH for finding similar articles from the Wikipedia Extraction dataset. The post has quite a bit of example code, performance test results, and a look at next steps.
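For readers new to LSH, the following is a rough plain-Python sketch of one common LSH family, MinHash with banding, applied to sets of words. It is not the Spark implementation described in the post, and whether the post's pipeline uses exactly this scheme is an assumption. The function names, the choice of SHA-1 as the hash, and the parameters (64 hash functions, 16 bands) are illustrative.

```python
import hashlib

NUM_HASHES = 64  # illustrative signature length


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by an integer seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over the set.
    The token collection must be non-empty."""
    tokens = set(tokens)
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Split each signature into bands; items that agree on every row of at
    least one band land in the same bucket and become candidate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for name, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(name)
    pairs = set()
    for names in buckets.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                pairs.add(tuple(sorted((names[i], names[j]))))
    return pairs


# Hypothetical mini-corpus: two near-duplicate texts and one unrelated text.
docs = {
    "doc_a": "apache spark is a fast engine for large scale data processing",
    "doc_b": "apache spark is a fast engine for big scale data processing",
    "doc_c": "the quick brown fox jumps over one lazy dog",
}
signatures = {name: minhash_signature(text.split()) for name, text in docs.items()}
print(estimated_jaccard(signatures["doc_a"], signatures["doc_b"]))
print(candidate_pairs(signatures))
```

Only the candidate pairs need an exact similarity check afterwards, which is what lets this approach avoid comparing every pair of items.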


The Hortonworks blog has an overview of YARN’s support for running Docker containers via its “LinuxContainerExecutor.” There is a DockerLinuxContainerRuntime under development, and an example application (distributed shell) is demonstrated in the article. The post has a list of future improvements, including volume support, service discovery, and image management.

Security features in HDP 2.6

The Hortonworks blog has the first part of a series on new features in the security products (Apache Atlas, Apache Ranger, and Apache Knox) that are part of HDP 2.6. This part focuses on what’s new in Atlas 0.8.0, which is the data governance product. Updates include a new REST API that includes swagger documentation, a revamped search UX, visualization updates, and more.

Reading data securely from Apache Kafka to Apache Spark

The Cloudera Distribution of Apache Kafka 2.0.0 (based on Apache Kafka 0.9.0) introduced a new Kafka consumer API that allowed consumers to read data from a secure Kafka cluster. This allows administrators to lock down their Kafka clusters and require clients to authenticate via Kerberos. It also allows clients to encrypt data over the wire when communicating with Kafka brokers (via SSL/TLS). Subsequently, in the Cloudera Distribution of Apache Kafka 2.1.0, Kafka introduced support for authorization via Apache Sentry. This allows Kafka administrators to lock down certain topics and grant privileges to specific roles and users, leveraging role-based access control.
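As a rough illustration of what a client needs to supply for Kerberos authentication plus TLS encryption, the sketch below builds a dictionary of Kafka consumer properties. The property names are standard Kafka client settings; the helper name `secure_consumer_config`, the broker address, group id, truststore path, and password are hypothetical placeholders, and how the dictionary is handed to Spark depends on the integration in use.

```python
def secure_consumer_config(bootstrap_servers, group_id,
                           truststore_path, truststore_password):
    """Return Kafka consumer properties for a Kerberos + TLS secured cluster.

    All argument values are supplied by the caller; nothing here is read
    from a real cluster.
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # SASL (Kerberos) for authentication, SSL/TLS for encryption.
        "security.protocol": "SASL_SSL",
        # Must match the service principal name used by the brokers;
        # "kafka" is a common choice and is assumed here.
        "sasl.kerberos.service.name": "kafka",
        # Truststore holding the CA certificate that signed the broker certs.
        "ssl.truststore.location": truststore_path,
        "ssl.truststore.password": truststore_password,
    }


# Hypothetical values for illustration only.
config = secure_consumer_config(
    bootstrap_servers="broker1.example.com:9093",
    group_id="spark-secure-consumer",
    truststore_path="/path/to/truststore.jks",
    truststore_password="changeit",
)
for key, value in sorted(config.items()):
    print(f"{key}={value}")
```

In addition to these properties, a Kerberos client typically needs a JAAS configuration and a keytab or ticket cache available to the JVM running the consumer; those are outside the scope of this sketch.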


This post shows how to use Apache Spark Streaming to consume and publish messages with MapR Streams and the Kafka API. Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. MapR Streams is a distributed messaging system for streaming event data at scale. MapR Streams enables producers and consumers to exchange events in real time via the Apache Kafka 0.9 API. MapR Streams integrates with Spark Streaming via the Kafka direct approach.

Data Warehouse Using Amazon Tools

In the healthcare field, data comes in all shapes and sizes. Despite efforts to standardize terminology, some concepts (e.g., blood glucose) are still often depicted in different ways. This post demonstrates how to convert an openly available dataset called MIMIC-III, which consists of de-identified medical data for about 40,000 patients, into an open source data model known as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). It describes the architecture and steps for analyzing data across various disconnected sources of health datasets so you can start applying Big Data methods to health research.