The AWS blog has a post from NUVIAD about their big data infrastructure for an online ad system. The full post has a lot of interesting details, but some highlights include that Redshift Spectrum with data in Parquet format is sometimes faster than traditional Redshift, it's quite simple to use AWS Lambda and AWS Glue to convert data from CSV to Parquet, and it's important to sort data within a Parquet file by a commonly used key.
The Databricks blog has a recap, with slides and video, from the Women in Big Data Meetup. There are talks on stereotypes in CS, Spark performance (including future work to get better insights into bottlenecks), and using Spark to build a Deep Learning pipeline.
Spark on AWS Lambda
Qubole has written about a prototype that they've built for running Apache Spark via AWS Lambda. With it, they can scan 1 TB of data for about $1.18 and sort 100 GB of data for about $7.50. Lambda doesn't allow functions to communicate directly with one another, so the Qubole team has written a custom scheduler and state store (described in more detail in the post) for Spark 2.1.0. The code is on GitHub, and there's a roadmap of future work.
Lambda to Kappa dataflow paradigms
A good overview of the evolution of data processing frameworks in the past few years, this post looks at the older (e.g. Storm and Samza) as well as newer (e.g. Beam, Spark, and Flink) stream processing frameworks.
This post argues that the CQRS and Event Sourcing models are in many ways similar to functional programming (immutable data, side-effect-free functions). It compares this model to the traditional model (referred to here as object-oriented) by walking through the way data flows, failure scenarios, latency, and more.
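The functional-programming analogy can be sketched in a few lines: in Event Sourcing, current state is a pure fold over an immutable log of events. This is a hypothetical illustration, not code from the post; the event names and the bank-balance domain are invented for the example:

```python
from functools import reduce

def apply_event(balance, event):
    # A side-effect-free function: it never mutates the log or the
    # input state, it just returns the next state.
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    return balance

# The immutable event log is the source of truth; replaying it from the
# initial state rebuilds the current state deterministically.
events = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
state = reduce(apply_event, events, 0)  # 75
```

Because `apply_event` is pure, recovery after a failure is just a replay of the log, which is one of the parallels the post draws with functional programs.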