UDAF in KSQL 5.0
KSQL is the open source streaming SQL engine that enables real-time data processing against Apache Kafka. KSQL makes it easy to read, write and process streaming data in real time, at scale, using SQL-like semantics. KSQL already has plenty of available functions like SUBSTRING, STRINGTOTIMESTAMP or COUNT. Even so, many users need additional functions to process their data streams. KSQL now has an official API for building your own functions. As of the release of Confluent Platform 5.0, KSQL supports creating user-defined scalar functions (UDF) and user-defined aggregate functions (UDAF).
Airflow DAG Tests and Unit Tests
Testing is an integral part of any software system: it builds confidence in the system and increases its reliability. I recently joined Grab, where we use Airflow to create and manage pipelines. But we were facing issues with Airflow. I had a conversation with my engineering manager about how we could make Airflow reliable and testable.
Spark SQL Performance in Video Play Sessions
Play Sessions are the bread and butter of the Data Pipelines engineering team at JW Player. They are an attempt to identify a single ‘unit of work’ of a video viewer by computing transformations and aggregations in Spark SQL to compact the data down into a much more manageable size. These query operations are performed across roughly 100 columns, which makes for a hefty query and is the reason we needed to tune and optimize our Spark SQL job.
The use of data in a startup becomes increasingly important as its number of users grows. In the early days of any B2B startup, you have to meet with every single customer yourself to get them on board, so understanding what will get them using the platform is simply a matter of asking them. However, as a business scales, this approach to making decisions no longer works.
Postgres Databases with Rails
The first step of architecting is always to figure out what we’re optimizing for. We’ve enjoyed great success with Postgres and saw no reason to depart. It has given us high performance and has an incredible array of datatypes and query functions that have been extremely useful.
The most common response was the need for better tools to monitor and manage Kafka in production. Specifically, users wanted better visibility into what is going on in the cluster across the four key entities in Kafka: producers, topics, brokers, and consumers. In fact, because we heard this same response over and over from the users we interviewed, we gave it a name: The Kafka Blindness.