Qubole & Snowflake with Spark
The blog series covers the use cases directly served by the Qubole–Snowflake integration. The first blog discussed how to get started with ML in Apache Spark using data stored in Snowflake. Blog two covered how data engineers can use Qubole to read data in Snowflake for advanced data preparation, such as data wrangling, data augmentation, and advanced ETL to refine existing Snowflake data sets.
Qubole & Snowflake with Machine Learning
Snowflake and Qubole have partnered to bring a new level of integrated product capabilities that make it easier and faster to build and deploy machine learning (ML) and artificial intelligence (AI) models in Apache Spark using data stored in Snowflake and other big data sources. First, we will discuss how to get started with ML in Apache Spark using data stored in Snowflake. Blogs two and three will cover reading and transforming data in Apache Spark, extracting data from other sources, building ML models, and loading the results into Snowflake.
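As a rough sketch of the first step, the Snowflake Connector for Spark lets a Spark session load a Snowflake table as a DataFrame. The account URL, credentials, warehouse, and table name below are all placeholders, and the actual read is shown commented out because it needs a Spark cluster with the connector installed:

```python
# Placeholder connection options for the spark-snowflake connector.
# Every value here is hypothetical; substitute your own account details.
sf_options = {
    "sfURL": "your_account.snowflakecomputing.com",
    "sfUser": "your_user",
    "sfPassword": "your_password",
    "sfDatabase": "YOUR_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "YOUR_WH",
}

# On a cluster with the connector installed, the read would look like:
# df = (spark.read
#           .format("net.snowflake.spark.snowflake")
#           .options(**sf_options)
#           .option("dbtable", "CUSTOMERS")   # hypothetical table name
#           .load())
# df.show()
```

Once loaded, the DataFrame can feed standard Spark ML pipelines like any other data source.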
In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible. Given the scope and pace at which Uber operates, our systems must be fault-tolerant and uncompromising when it comes to failing intelligently. To accomplish this, we leverage Apache Kafka, an open source distributed messaging platform, which has been industry-tested for delivering high performance at scale.
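To make the retry idea concrete, here is a minimal, self-contained sketch of retrying a flaky operation with capped exponential backoff and jitter. This is a generic illustration of the pattern, not Uber's or Kafka's actual retry code; the function and parameter names are invented for the example:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Retry `operation` on exception, doubling the delay each attempt
    (capped at max_delay) and adding jitter. Returns the result, or
    re-raises the last error once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail loudly, don't swallow
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry storms

# Usage: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry_with_backoff(flaky, sleep=lambda _: None)  # no-op sleep for demo
```

Capping the delay and randomizing it keeps a fleet of retrying clients from hammering a recovering dependency in lockstep.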
Apache Kafka Cluster at Goldman Sachs
At QCon New York 2017, Anton Gorshkov presented “When Streams Fail: Kafka off the Shore”. He shared insight into how a platform team at a large financial institution designs and operates shared internal messaging clusters like Apache Kafka, and how they plan for and resolve the inevitable failures that occur. Gorshkov, managing director at Goldman Sachs, began by introducing Goldman Sachs and discussing the stream-processing workloads his division manages.
Apache Flink 1.4.0, released in December 2017, introduced a significant milestone for stream processing with Flink: a new feature called TwoPhaseCommitSinkFunction that extracts the common logic of the two-phase commit protocol and makes it possible to build end-to-end exactly-once applications with Flink and a selection of data sources and sinks, including Apache Kafka versions 0.11 and beyond. It provides a layer of abstraction and requires a user to implement only a handful of methods to achieve end-to-end exactly-once semantics.
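To illustrate what those "handful of methods" do, here is a toy two-phase commit sink in plain Python. It loosely mirrors the method structure of Flink's TwoPhaseCommitSinkFunction (begin transaction, pre-commit, commit, abort) but is an invented, in-memory illustration of the protocol, not Flink's Java API:

```python
class TwoPhaseCommitSinkSketch:
    """Toy two-phase commit sink: records are buffered in a transaction
    and become durable only on commit, so nothing is published twice or
    lost if a failure happens between the two phases."""

    def __init__(self):
        self.committed = []   # stands in for durable output (file, Kafka txn)
        self.pending = []     # pre-committed but not yet published
        self.txn = None       # currently open transaction

    def begin_transaction(self):
        self.txn = []         # e.g. open a temp file or Kafka transaction

    def invoke(self, record):
        self.txn.append(record)   # writes go only into the open transaction

    def pre_commit(self):
        # Phase 1 (on checkpoint barrier): flush so the transaction
        # can be recovered and committed after a failure.
        self.pending = self.txn
        self.txn = None

    def commit(self):
        # Phase 2 (after the checkpoint completes everywhere): publish atomically.
        self.committed.extend(self.pending)
        self.pending = []

    def abort(self):
        self.txn = None       # discard uncommitted writes on failure

# Usage: one checkpoint cycle.
sink = TwoPhaseCommitSinkSketch()
sink.begin_transaction()
sink.invoke("event-1")
sink.invoke("event-2")
sink.pre_commit()   # phase 1
sink.commit()       # phase 2
```

The key design point the real sink function encodes is the same: output is only made visible in `commit`, which runs once the checkpoint has succeeded on every participant.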