DataDotz BigData Weekly

Apache Pulsar
=====

While Apache Pulsar (incubating) shares some similarities with Apache Kafka, its architecture is different: brokers are stateless, and storage is handled by separate bookies (services implemented via Apache BookKeeper). Data is stored as segments, which allows the cluster to scale out without rebalancing. The post describes this architecture and compares it with Kafka's. Also of note: Pulsar provides a Kafka-compatible API that aims for drop-in compatibility.
https://streaml.io/blog/pulsar-segment-based-architecture/
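
To make the "scale out without rebalancing" point concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions: the class `SegmentedLog`, the node names, and the least-loaded placement rule are hypothetical, and the replication of segments across several bookies is ignored. It is not Pulsar's or BookKeeper's API.

```python
from collections import Counter


class SegmentedLog:
    """Toy model of a topic whose log is split into segments spread over storage nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)   # hypothetical storage node names
        self.placement = []        # placement[i] = node holding segment i

    def add_node(self, node):
        # New capacity joins the pool; segments already written stay where they are.
        self.nodes.append(node)

    def roll_segment(self):
        # Start a new segment and place it on the currently least-loaded node.
        load = Counter(self.placement)
        target = min(self.nodes, key=lambda n: load[n])
        self.placement.append(target)
        return target


log = SegmentedLog(["bookie-1", "bookie-2"])
for _ in range(4):
    log.roll_segment()
before = list(log.placement)       # placement of the first four segments

log.add_node("bookie-3")           # scale out
for _ in range(2):
    log.roll_segment()             # new segments land on the new node
```

In this model the first four segments keep their original placement after `bookie-3` joins, and only newly rolled segments are written to it. A design that pins a whole partition to a fixed set of nodes would instead have to copy existing data to make use of the new node.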

Apache Spark
==========

As Kubernetes gains traction for container orchestration, it's natural to try running Spark jobs on it. The first post describes how to do so and covers some shortcomings of the current implementation. The second looks at how to then integrate with Apache Zeppelin, which has a few gotchas.

https://banzaicloud.com/blog/scaling-spark-k8s/

Using Hue to interact with Apache Kylin
==========

Apache Kylin is an OLAP database system for big data. It provides a JDBC driver, which Hue can use to run queries, including on an Amazon EMR (Elastic MapReduce) setup. This post covers the basic steps to get going.

http://gethue.com/using-hue-to-interact-with-apache-kylin/

Apache Kafka
==========

The Confluent blog has recently published several articles about exactly-once semantics in Apache Kafka. The latest post in the series describes how the Kafka Streams API achieves exactly-once processing.
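
The general idea behind exactly-once in a read-process-write loop is that the output and the consumed offset are committed together, so a crash cannot leave one visible without the other. The following is a toy Python simulation of that idea only. The function `process_with_one_crash` and its parameters are hypothetical, and nothing here uses the Kafka Streams API or models Kafka's transaction protocol.

```python
def process_with_one_crash(inputs, atomic, crash_at):
    """Double each input, crashing once while handling the record at offset `crash_at`."""
    output = []        # stands in for the output topic
    committed = 0      # stands in for the committed input offset
    crashed = False
    while committed < len(inputs):
        offset = committed             # after a restart, resume from the committed offset
        result = inputs[offset] * 2
        if atomic:
            # Output and offset become visible together, or not at all.
            if offset == crash_at and not crashed:
                crashed = True         # crash before commit: nothing was published
                continue
            output.append(result)
            committed = offset + 1
        else:
            output.append(result)      # output is visible immediately
            if offset == crash_at and not crashed:
                crashed = True         # crash before the offset commit
                continue               # the same record is processed again
            committed = offset + 1
    return output


at_least_once = process_with_one_crash([1, 2, 3], atomic=False, crash_at=1)
exactly_once = process_with_one_crash([1, 2, 3], atomic=True, crash_at=1)
```

With the non-atomic variant, the record at offset 1 is written twice (`[2, 4, 4, 6]`). With the atomic variant, each result appears once (`[2, 4, 6]`).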

https://www.confluent.io/blog/enabling-exactly-kafka-streams/

Apache Flink
==========

As I usually say, you should always validate a benchmark with your own use case rather than trusting what you see online. This example drives that point home: a small bug in the data generation of a benchmark driver program caused a major slowdown for Apache Flink in a competitor's analysis.

https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime

AWS
==========

This post shows how to track Presto queries on a cluster by implementing an event listener class that logs the contents of queries. The code is available on GitHub, and the article includes instructions for deploying the custom code via Amazon EMR.

https://aws.amazon.com/blogs/big-data/custom-log-presto-query-events-on-amazon-emr-for-auditing-and-performance-insights/