Since Apache Spark 2.0, Structured Streaming has supported joins (inner joins and some types of outer joins) between a streaming and a static DataFrame/Dataset. With the release of Apache Spark 2.3.0, now available in Databricks Runtime 4.0 as part of the Databricks Unified Analytics Platform, stream-stream joins are supported as well. In this post, we explore a canonical case of how to use stream-stream joins, the challenges we resolved, and the types of workloads they enable. Let’s start with the canonical use case for stream-stream joins: ad monetization.
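The key difficulty in a stream-stream join is that a matching row may not have arrived yet, so both sides must buffer state, and a watermark bounds how long that state is kept. Here is a minimal plain-Python sketch of that mechanism for the ad-monetization case (an impressions stream joined to a clicks stream); this is a conceptual illustration, not the Spark API, and the event names and tuple layout are hypothetical:

```python
def stream_join(impressions, clicks, max_delay):
    """Inner-join click events to buffered ad impressions on ad_id,
    evicting impression state older than the watermark (the latest
    event time seen minus max_delay)."""
    buffer = {}    # ad_id -> (impression_time, ad_id)
    results = []
    watermark = 0
    for event_time, kind, ad_id in sorted(impressions + clicks):
        watermark = max(watermark, event_time - max_delay)
        # drop impression state that can no longer match any future click
        buffer = {k: v for k, v in buffer.items() if v[0] >= watermark}
        if kind == "impression":
            buffer[ad_id] = (event_time, ad_id)
        elif ad_id in buffer:  # a click joins a still-buffered impression
            results.append((ad_id, buffer[ad_id][0], event_time))
    return results

impressions = [(1, "impression", "ad1"), (2, "impression", "ad2")]
clicks = [(3, "click", "ad1"), (60, "click", "ad2")]  # ad2's click is too late
print(stream_join(impressions, clicks, max_delay=10))
# -> [('ad1', 1, 3)]
```

The late click on `ad2` finds no match because the watermark has already expired that impression's state, which is exactly the trade-off a watermark buys: bounded memory in exchange for dropping very late data.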
Apache Kafka is a distributed streaming platform, and as well as powering a large number of stream-based mission-critical systems around the world, it has a huge role to play in data integration too. Back in 2016, Neha Narkhede wrote that ETL Is Dead, Long Live Streams, and since then we’ve seen more and more companies adopt Apache Kafka as the backbone of their architectures.
Netflix’s streaming pipeline (called Keystone) processes 2 trillion messages/day at peak! It handles 3 PB/day of incoming data and 7 PB/day of outgoing data! Running entirely on AWS, Keystone is built as a self-serve platform allowing multiple teams to publish, process, and consume events. There are lessons from Netflix’s operational experience that every enterprise can leverage.
Criteo has open sourced a new Apache Cassandra backend for Graphite called BigGraphite. With it, they store over 1 million metrics per second, and they take advantage of Cassandra’s secondary indexes to support the different types of Graphite queries. This presentation goes into further detail on the new backend and also links out to the design document and code on GitHub.
Apache Spark 2.3.0 adds experimental support for continuous mode in stream processing. Compared to micro-batch latencies of 100s of milliseconds, continuous mode uses long-running processes that can take a record end to end in 10s of milliseconds.
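Continuous mode is opted into per query via the trigger. The sketch below shows roughly what that looks like in PySpark; it assumes a running Spark 2.3+ session named `spark` and a Kafka broker, and the broker address and topic name are placeholders:

```python
# Sketch: enabling continuous processing in Spark 2.3 (PySpark).
# Assumes an existing SparkSession `spark`; "host:9092" and "events"
# are hypothetical. Only the trigger line differs from a micro-batch query.
query = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")
    .option("subscribe", "events")
    .load()
    .writeStream
    .format("console")
    .trigger(continuous="1 second")  # continuous mode; the interval is the checkpoint cadence
    .start())
```

Note that in 2.3 continuous mode supports only map-like operations (selections, projections), not aggregations or joins.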
A tutorial on the Confluent Blog covers loading data into Kafka using Kafka Connect for MySQL (via Debezium for change data capture) and from CSV files, querying and joining those two streams, and writing the results out to Amazon S3 using the Kafka Connect S3 sink.
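The final step of such a pipeline is just connector configuration. As a rough sketch, a Kafka Connect S3 sink config looks something like the following; the bucket, region, and topic names are placeholders, and the exact property set depends on your connector version:

```json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "enriched-events",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "flush.size": 1000
  }
}
```

The config is typically POSTed to the Kafka Connect REST API, after which the connector continuously drains the topic into S3 objects, flushing a file per `flush.size` records.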