DataDotz BigData Weekly


Serverless Delivery with Databricks and AWS CodePipeline

The Databricks interactive workspace serves as an ideal environment for collaborative development and interactive analysis. The platform supports all the features needed to make building a continuous delivery pipeline not only possible but simple. In this blog, we will walk through how to leverage Databricks along with AWS CodePipeline to deliver a full end-to-end pipeline with serverless CI/CD.

Benchmarking Big Data SQL Platforms in the Cloud

We compare Databricks Runtime 3.0 (which includes Apache Spark and our DBIO accelerator module) with vanilla open-source Apache Spark and with Presto, all running in the cloud, using the industry-standard TPC-DS v2.4 benchmark. In addition to the cloud setup, Databricks Runtime is compared at 10TB scale with a recent Cloudera benchmark of Apache Impala on on-premises hardware. Note that only 77 of the 104 TPC-DS queries are reported in the Impala results published by Cloudera.

How BetterCloud Built an Alerting System with Apache Flink

We’ll highlight the work of BetterCloud, who learned that a dynamic alerting tool would only be truly useful to their customers if newly created alerts applied to future events as well as historical events. In this guest post, we’ll talk more about how BetterCloud uses Apache Flink to power its alerting and how they met the challenge of applying newly created alerts to historical event data in an efficient manner.

Real-Time Streaming ETL Pipeline

Developers increasingly prefer a new ETL paradigm built on distributed systems and event-driven applications, in which businesses process data in real time and at scale. There is still a need to “extract,” “transform,” and “load,” but the difference now is that data is treated as a first-class citizen. Businesses no longer want to relegate data to batch processing, which is often done offline, once a day. They have many more data sources, of differing types, and want to do away with messy point-to-point connections.
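To make the contrast with nightly batch jobs concrete, here is a minimal, library-free sketch (plain Python, with hypothetical event fields) of the streaming idea: each ETL stage handles events one at a time as they arrive, rather than accumulating them for an offline batch run.

```python
# Minimal sketch of a streaming ETL pipeline. Each stage is a generator,
# so events flow through extract -> transform -> load one at a time
# instead of being collected for a once-a-day batch job.
# The event fields ("user", "amount") are illustrative, not from any real source.

def extract(source):
    """Pull raw events from a source (here, an in-memory iterable)."""
    for raw in source:
        yield raw

def transform(events):
    """Normalize each event as it arrives."""
    for event in events:
        yield {"user": event["user"].lower(), "amount": round(event["amount"], 2)}

def load(events, sink):
    """Deliver each transformed event to a sink (here, a list)."""
    for event in events:
        sink.append(event)

# Wire the stages together: each event moves end to end as soon as it appears.
raw_events = [{"user": "Alice", "amount": 10.456}, {"user": "BOB", "amount": 3.0}]
sink = []
load(transform(extract(raw_events)), sink)
print(sink)
```

In a real deployment the in-memory iterable and list would be replaced by durable, distributed logs (e.g. Kafka topics), but the stage-per-event flow is the same.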

Apache Kafka and Amazon Kinesis

This post compares and contrasts Apache Kafka and Amazon Kinesis. Since Kinesis is a SaaS product, it compares favorably in terms of operational complexity. With that said, it has more limitations on throughput (but at least it’s predictable), and the two share similar sets of high-level APIs for moving data and doing analysis (e.g. Kafka Connect/Kinesis Firehose and Kafka Streams/Kinesis Analytics). The post also discusses architectural and pricing considerations.

Scaling out StreamSets with Kubernetes

The StreamSets blog has a post on deploying the StreamSets Data Collector (SDC) with Kubernetes. SDC can be deployed as a stateful application or it can rely on a cloud service (Dataflow Performance Manager) for storing state, so the example is a bit more interesting than some of the common Kubernetes tutorials out there.

Apache HBase

The Apache Software Foundation blog has a post describing several HBase application archetypes. It details four different use cases: storage of documents, graphs, queues, and metrics. For each, there are one or two examples of how the schema/structure can be defined, as well as a reference to an application designed to use HBase in this way.
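As a concrete taste of the metrics archetype, here is a small, hypothetical sketch (plain Python, not tied to any HBase client library) of the kind of row-key design such schemas rely on: a salted, reverse-timestamp key, so that recent values for a series sort first in a scan and sequential writes spread across regions. The bucket count and key layout are illustrative assumptions, not HBase defaults.

```python
import hashlib

MAX_TS = 2**63 - 1      # used to reverse timestamps so newer rows sort first
NUM_SALT_BUCKETS = 16   # illustrative; chosen to spread writes across regions

def metric_row_key(metric_name, timestamp_ms):
    """Build a salted, reverse-timestamp row key for a time-series metric.

    - The salt prefix (hash of the metric name) spreads sequential writes
      across regions instead of hot-spotting one region.
    - The reversed, zero-padded timestamp makes newer cells sort first
      lexicographically, so a scan returns the most recent data first.
    """
    salt = int(hashlib.md5(metric_name.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    reverse_ts = MAX_TS - timestamp_ms
    return f"{salt:02d}|{metric_name}|{reverse_ts:019d}"

# Newer timestamps yield lexicographically smaller keys for the same metric,
# so they appear first in an ordered scan:
k_new = metric_row_key("cpu.load", 1_700_000_000_000)
k_old = metric_row_key("cpu.load", 1_600_000_000_000)
print(k_new < k_old)
```

The same trade-off shows up in the other archetypes too: HBase keeps rows sorted by key, so the key layout, not a secondary index, determines which access patterns are cheap.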