DataDotz BigData Weekly


Server-Side Encryption for Amazon Kinesis Streams

Amazon Kinesis Streams lets you ingest, process, and deliver data in real time from millions of devices or applications. Use cases for Kinesis Streams vary, but a few common ones include IoT data ingestion and analytics, log processing, clickstream analytics, and enterprise data bus architectures. Within milliseconds of data arrival, applications attached to a stream can continuously mine value or deliver data to downstream destinations. Customers scale their streams elastically to match demand and pay incrementally for the resources they need, while taking advantage of a fully managed, serverless streaming data service that allows them to focus on adding value closer to their customers. The post introduces server-side encryption, which uses AWS KMS keys to encrypt stream data at rest.
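Enabling server-side encryption on an existing stream is a single API call. A minimal AWS CLI sketch, assuming a stream named `my-stream` and the AWS-managed key alias `alias/aws/kinesis` (both illustrative):

```shell
# Enable server-side encryption on an existing stream.
aws kinesis start-stream-encryption \
    --stream-name my-stream \
    --encryption-type KMS \
    --key-id alias/aws/kinesis

# Verify: the stream description now reports the encryption settings.
aws kinesis describe-stream --stream-name my-stream \
    --query 'StreamDescription.[EncryptionType,KeyId]'
```

A customer-managed KMS key ARN can be supplied instead of the AWS-managed alias when you need your own key policy.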

Deep Dive into Rescalable State in Apache Flink

Apache Flink is a massively parallel distributed system that allows stateful stream processing at large scale. For scalability, a Flink job is logically decomposed into a graph of operators, and the execution of each operator is physically decomposed into multiple parallel operator instances. Conceptually, each parallel operator instance in Flink is an independent task that can be scheduled on its own machine in a network-connected cluster of shared-nothing machines.
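The rescalable-state mechanism the post deep-dives into rests on key groups: `maxParallelism` fixes the number of key groups, each parallel operator instance owns a contiguous range of groups, and rescaling moves whole groups rather than individual keys. A simplified Python sketch of that assignment scheme (Python's built-in `hash` stands in for Flink's murmur hash of the key's `hashCode`):

```python
# Simplified sketch of Flink's key-group scheme for rescalable keyed state.
# max_parallelism fixes the number of key groups; each parallel operator
# instance owns a contiguous range of groups, so rescaling redistributes
# whole key groups between instances instead of scanning individual keys.

def key_group(key, max_parallelism):
    # Flink murmur-hashes the key's hashCode; plain hash() stands in here.
    return hash(key) % max_parallelism

def operator_index(group, max_parallelism, parallelism):
    # Maps a key group to the parallel operator instance that owns it.
    return group * parallelism // max_parallelism

def assign(keys, max_parallelism, parallelism):
    assignment = {}
    for key in keys:
        g = key_group(key, max_parallelism)
        idx = operator_index(g, max_parallelism, parallelism)
        assignment.setdefault(idx, set()).add(key)
    return assignment

keys = [f"user-{i}" for i in range(100)]
before = assign(keys, max_parallelism=128, parallelism=2)
after = assign(keys, max_parallelism=128, parallelism=4)

# Every key still maps to exactly one instance after rescaling from 2 to 4.
assert sorted(k for s in after.values() for k in s) == sorted(keys)
```

Because the group-to-instance mapping assigns contiguous ranges, an instance restoring after a rescale reads its state as a few sequential ranges from the checkpoint rather than random-accessing per key.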

Streaming Billions of Daily Events Using Kafka

The blog has a post on loading data from Apache Kafka into Google BigQuery. It examines several options, including batch (via Secor) and streaming (via Apache Kafka Streams, Apache Kafka Connect, and Apache Beam). The team collects billions of events a day of many different types: request logs originating in web servers and backend services, change-data-capture logs generated by their databases, and activity events from their users across different platforms.

Powerful _USERs in Apache Hadoop 3.0.0-alpha4

Apache Hadoop 3.0.0-alpha4 was released this week. One of the new features is more powerful support for user configuration via shell variables in the Hadoop shell scripts. This post describes the new features of user restrictions and user switching as well as the return of the start-all script.
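A sketch of what the new `(command)_(subcommand)_USER` variables look like in `hadoop-env.sh` (the `hdfs` and `yarn` account names are illustrative):

```shell
# In etc/hadoop/hadoop-env.sh -- illustrative values.
# Restriction: when set, only this user may run the matching daemon
# subcommand, e.g. "hdfs --daemon start namenode".
export HDFS_NAMENODE_USER=hdfs
export HDFS_DATANODE_USER=hdfs
export YARN_RESOURCEMANAGER_USER=yarn
```

When the start/stop scripts run as root, the same variables drive user switching, launching each daemon as its configured account, which is what makes the returning `start-all.sh` practical on multi-daemon hosts.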

Google Cloud Big Data and Machine Learning

The Google Cloud Big Data blog has a post on common use cases for Cloud Dataflow. While some of the content is Google Cloud-specific, the patterns and the pseudocode presented are largely general purpose, and it's interesting to see how Cloud Dataflow solves various problems. In this open-ended series, the authors describe the most common patterns across their customers, which in combination cover an overwhelming majority of use cases. Each pattern includes a description, example, solution, and pseudocode to make it as actionable as possible in your own environment.

Automatic Statistics Collection for Better Query Performance

Presto, Apache Spark, and Apache Hive can generate more efficient query plans when table statistics are available. For example, Spark will perform broadcast joins only if the table size is available in the table statistics stored in the Hive Metastore. Broadcast joins can have a dramatic impact on the run time of everyday SQL queries where small dimension tables are joined frequently. The BigBench tuning exercise from Intel reported a 4.22x speedup from turning on broadcast joins for specific queries.
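The role statistics play in that decision can be shown with a toy planner sketch. This is illustrative Python, not any engine's actual code; the 10 MB threshold mirrors the default of Spark's `spark.sql.autoBroadcastJoinThreshold`:

```python
# Illustrative sketch: how a cost-based planner picks a join strategy from
# table-size statistics. Without statistics the size is unknown, so the
# planner must fall back to the safe (but slower) shuffle join.

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, Spark's default threshold

def choose_join_strategy(left_bytes, right_bytes):
    """Return 'broadcast' if either side's known size fits under the
    threshold, else 'shuffle'. None means statistics are missing."""
    known = [s for s in (left_bytes, right_bytes) if s is not None]
    if known and min(known) <= BROADCAST_THRESHOLD:
        return "broadcast"
    return "shuffle"

# With statistics, a small dimension table triggers a broadcast join.
assert choose_join_strategy(50 * 2**30, 2 * 2**20) == "broadcast"
# With missing statistics, the planner plays it safe and shuffles.
assert choose_join_strategy(50 * 2**30, None) == "shuffle"
```

This is why automatically collecting statistics matters: the faster plan is only reachable when the sizes are actually recorded in the metastore.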

Offset Management for Apache Kafka with Apache Spark

An ingest pattern commonly adopted by Cloudera customers is Apache Spark Streaming applications that read data from Kafka. Streaming data continuously from Kafka has many benefits, such as the ability to gather insights faster. However, users must manage Kafka offsets carefully in order to recover their streaming applications from failures.
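The core of that offset-management pattern is to commit offsets to an external store only after a batch's output has been written, so a restarted application replays at most the last uncommitted batch. A minimal Python sketch of the idea (not Cloudera's code; a dict stands in for a durable store such as ZooKeeper or HBase, and string partition names stand in for Kafka topic-partitions):

```python
# Commit-after-output offset management: persist the new offset only once
# the batch's results are written downstream, giving at-least-once
# semantics across restarts.

offset_store = {}   # stand-in for a durable store (ZooKeeper, HBase, ...)
output = []         # stand-in for the downstream sink

def process_batch(partition, records, from_offset):
    # 1) Process the batch and write its results downstream first.
    output.extend(records)
    # 2) Only then commit the advanced offset for this partition.
    offset_store[partition] = from_offset + len(records)

def resume_position(partition):
    # On (re)start, read the last committed offset; start at 0 if none.
    return offset_store.get(partition, 0)

process_batch("events-0", ["a", "b", "c"], resume_position("events-0"))
process_batch("events-0", ["d", "e"], resume_position("events-0"))
assert resume_position("events-0") == 5  # a restart resumes at offset 5
```

If the application crashes between steps 1 and 2, the batch is replayed on restart, which is why sinks in this pattern should be idempotent or deduplicating.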