DataDotz BigData Weekly


Scheduling Notebooks at Netflix

At Netflix we’ve put substantial effort into adopting notebooks as an integrated development platform. The idea started as a discussion of what development and collaboration interfaces might look like in the future. It evolved into a strategic bet on notebooks, both as an interactive UI and as the unifying foundation of our workflow scheduler. We’ve made significant strides toward this over the past year, and we’re currently migrating all 10,000 of the scheduled jobs running on the Netflix Data Platform to notebook-based execution. When we’re done, more than 150,000 Genie jobs will be running through notebooks on our platform every single day.

Kafka KSQL

KSQL is a SQL engine for Kafka. It allows you to write SQL queries to analyze a stream of data in real time. Since a stream is an unbounded data set, a query with KSQL will keep generating results until you stop it. KSQL is built on top of Kafka Streams. When you submit a query, it is parsed, and a Kafka Streams topology is built and executed. This means that KSQL offers the same concepts as Kafka Streams, but expressed in SQL: streams (KStreams), tables (KTables), joins, windowing functions, etc.
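The idea that a query over an unbounded stream never finishes on its own can be illustrated with a minimal sketch in plain Python. This is not KSQL or Kafka Streams code; the `page_views` source, the event fields, and the `continuous_query` function are hypothetical names chosen for illustration only.

```python
import itertools
from typing import Dict, Iterator

Event = Dict[str, object]


def page_views() -> Iterator[Event]:
    """Hypothetical unbounded source: yields one event per iteration, forever."""
    pages = ["/home", "/cart", "/checkout"]
    for seq in itertools.count():
        yield {"seq": seq, "page": pages[seq % len(pages)]}


def continuous_query(stream: Iterator[Event]) -> Iterator[Event]:
    """Plain-Python analogue of a streaming filter such as
    'select the events whose page is /cart'.
    It emits a result whenever a matching event arrives and never
    terminates by itself, because the input never ends."""
    for event in stream:
        if event["page"] == "/cart":
            yield event


results = continuous_query(page_views())

# The consumer decides when to stop; here we take only the first three results.
first_three = list(itertools.islice(results, 3))
for event in first_three:
    print(event)
```

The generator keeps producing results for as long as the caller keeps asking, which mirrors how a streaming query runs until it is explicitly stopped.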

Jupyter Notebooks and Apache Drill

This blog post walks through the installation and basic usage of the jupyter_drill module for Python, which lets you connect to Apache Drill from a Jupyter Notebook and work with its data using IPython magic functions. If you are looking for the design goals of the project, please see my other blog post, Mining the Data Universe: Sending a Drill to Jupyter, which covers how this module came to be and the design considerations behind it.

Citi Bike Real-Time Utilization Using Kafka Streams

This is where Kafka Streams comes in. Kafka Streams is a set of application APIs (currently in Java and Scala) that seamlessly integrates stateless (stream) and stateful (table) processing. The underlying premise of the design is very interesting. In short, it is based on the fact that a table can be reconstructed from a stream of change data capture (CDC) or transaction log records. If we have a stream of change logs, a table is just a local store that reflects the latest state of each change record.
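The stream-table relationship described above can be sketched in a few lines of plain Python. This is not the Kafka Streams API; the record layout (a key plus a value, with `None` standing for a delete) and the station names are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Assumed record shape: (key, value). A value of None is treated as a delete.
ChangeRecord = Tuple[str, Optional[int]]


def materialize_table(changelog: Iterable[ChangeRecord]) -> Dict[str, int]:
    """Replay a changelog in order and keep only the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # delete record: drop the key if present
        else:
            table[key] = value     # upsert: a later record overwrites an earlier one
    return table


# Hypothetical changelog of available bikes per station.
changelog = [
    ("station-1", 12),
    ("station-2", 5),
    ("station-1", 9),      # update for station-1
    ("station-2", None),   # station-2 removed
    ("station-3", 20),
]

table = materialize_table(changelog)
print(table)  # {'station-1': 9, 'station-3': 20}
```

Replaying the same changelog from the beginning always rebuilds the same table, which is why the log alone is enough to restore local state.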

AWS EC2 instance store vs EBS for MySQL

If you are using large EBS GP2 volumes for MySQL (e.g. 10 TB+) on AWS EC2, you can increase performance and save a significant amount of money by moving to local SSD (NVMe) instance storage. Interested? Then read on for a more detailed examination of how to reduce costs and increase performance with this setup.

Performance Comparison of HDP LLAP, Presto, SparkSQL, Hive on Tez, and Hive on MR3

There is a plethora of benchmark results available on the internet, but we still need new ones. Since all SQL-on-Hadoop systems constantly evolve, the landscape gradually changes and previous benchmark results may already be obsolete. Moreover, the hardware employed in a benchmark may favor certain systems, and a system may not be configured to achieve its best performance. On the other hand, the TPC-DS benchmark remains the de facto standard for measuring the performance of SQL-on-Hadoop systems.