DataDotz Bigdata Weekly


Apache Kafka

It may be counterintuitive at first, but there are compelling reasons to store several different types of events on the same Kafka topic. In particular, when implementing an event sourcing strategy, the order of events is key to correctness, and Kafka preserves order only within a single topic partition. This post lays out that and other use cases, and describes changes to the Confluent Schema Registry client that better support heterogeneous schemas within a topic.
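A minimal Python sketch of the idea, not code from the linked post: events of different types that share a key land in the same partition, so they keep their produced order. The `subject_name` helper mirrors, as we understand them, the three subject name strategies that Confluent's serializers offer. The topic name, event names, and the `crc32`-based partitioner are hypothetical stand-ins (Kafka's default partitioner uses murmur2).

```python
from collections import defaultdict
from zlib import crc32


def subject_name(strategy: str, topic: str, record_name: str) -> str:
    """Schema Registry subject for a record *value* under each naming strategy."""
    if strategy == "TopicNameStrategy":
        # One subject per topic: every value on the topic shares one schema lineage.
        return f"{topic}-value"
    if strategy == "RecordNameStrategy":
        # One subject per record type, regardless of topic.
        return record_name
    if strategy == "TopicRecordNameStrategy":
        # One subject per (topic, record type): several types can share a topic.
        return f"{topic}-{record_name}"
    raise ValueError(f"unknown strategy: {strategy}")


def partition_for(key: str, num_partitions: int) -> int:
    """Illustrative stand-in for a key-hash partitioner (assumption: crc32, not murmur2)."""
    return crc32(key.encode("utf-8")) % num_partitions


def replay(events, num_partitions=3):
    """Append (key, record_name) events to per-partition logs in arrival order."""
    partitions = defaultdict(list)
    for key, record_name in events:
        partitions[partition_for(key, num_partitions)].append((key, record_name))
    return partitions


# Hypothetical event stream for a single "customers" topic.
EVENTS = [
    ("customer-42", "com.example.CustomerCreated"),
    ("customer-7", "com.example.CustomerCreated"),
    ("customer-42", "com.example.AddressChanged"),
    ("customer-42", "com.example.CustomerDeleted"),
]

if __name__ == "__main__":
    for partition, log in sorted(replay(EVENTS).items()):
        print(partition, log)
    for _, record in EVENTS[:1]:
        print(subject_name("TopicRecordNameStrategy", "customers", record))
```

Had the three event types been written to three separate topics, a consumer could see `CustomerDeleted` before `CustomerCreated`, because no ordering holds across topics.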

Apache Zeppelin

In a follow-up to its post on running Spark on Kubernetes, the Banzai team adds instructions for deploying Apache Zeppelin inside a k8s cluster. The team has published an image to Docker Hub and sample configs to GitHub to make the process easy.

Size your Apache Flink cluster

The data Artisans blog has a post with tips for sizing an Apache Flink cluster (or really, a cluster for any distributed computing application) by estimating disk and network throughput. It walks through a practical example, with the related formulas, to make these estimates for a five-node cluster.
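As a rough illustration of this kind of estimate, here is a small Python sketch. The message rate and message size are illustrative assumptions rather than figures from this summary, the function names are hypothetical, and the shuffle estimate assumes records are spread uniformly across nodes by a key hash.

```python
def per_node_ingress_mb_s(msgs_per_sec: float, msg_size_bytes: float, num_nodes: int) -> float:
    """Incoming data each node must read, in MB/s, if the source is split evenly."""
    total_mb_s = msgs_per_sec * msg_size_bytes / 1_000_000
    return total_mb_s / num_nodes


def shuffle_egress_mb_s(per_node_mb_s: float, num_nodes: int) -> float:
    """Data each node sends over the network after a key-based repartition.

    With uniform hashing, 1/n of a node's records stay local and the
    remaining (n - 1)/n are sent to other nodes.
    """
    return per_node_mb_s * (num_nodes - 1) / num_nodes


if __name__ == "__main__":
    # Assumed workload: 1,000,000 messages per second of 2,000 bytes each, on 5 nodes.
    ingress = per_node_ingress_mb_s(1_000_000, 2_000, 5)
    egress = shuffle_egress_mb_s(ingress, 5)
    print(f"ingress per node: {ingress:.0f} MB/s")
    print(f"shuffle egress per node: {egress:.0f} MB/s")
```

Comparing these per-node figures with the network and disk bandwidth of the chosen machines gives a first check on whether five nodes are enough. State size, checkpointing, and sink traffic add to the totals and are left out of this sketch.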

The MapR blog has a two-part post on using Apache Kafka and Apache Spark (streaming and ML APIs) to build a real-time flight delay prediction application. The post includes code on GitHub and an Apache Zeppelin notebook.

In the Service and Role Layouts segment of the series, we take a step higher up the stack and look at the various services and roles that make up your Cloudera Enterprise deployment. CDH offers many capabilities and configurations to meet a variety of demands. To help focus the discussion, we zero in on three different product offerings and how they would be deployed in your enterprise. These layouts will ensure you're meeting high availability and full security demands while setting your cluster up for stability as you continue to scale out, whether on-prem or in the cloud.