DataDotz BigData Weekly



Apache Spark

Before we dive into the details of running custom versions of Spark, it's important to note that if all you need is a "supported" version of Spark on Google Cloud Dataproc or Spark on Kubernetes, there are much easier options and guides out there for you. Also, as alluded to above, make sure your experiments don't destroy your production data; consider using a sub-account with more restrictive permissions.


This is a great, Azure-focused whirlwind tour of Hadoop (and briefly MapReduce), Pig (on Tez), Storm (with Azure Event Hubs), and Spark. It uses PowerShell and the Azure UI to deploy clusters to crunch data from the Global Database of Events, Language, and Tone (GDELT) dataset.
Data Transformation and Visualization on the YouTube dataset using Spark

This post describes how to use Scala for data prep in Apache Spark. Once that's done, Spark SQL and Apache Zeppelin can be used to query and visualize the results. This type of hybrid solution seems like a great way to make sure you're using the best tool at each step of your analysis.
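The prepare-then-query pattern the post follows can be sketched in a few lines. Since Spark and Zeppelin aren't available here, this minimal example uses Python's built-in sqlite3 as a stand-in for Spark SQL; the table name, columns, and sample rows are illustrative, not from the post:

```python
import sqlite3

# Hypothetical sample of YouTube-style records: (video_id, category, views).
rows = [
    ("a1", "Music", 1200),
    ("b2", "Gaming", 800),
    ("c3", "Music", 400),
]

# Step 1: "data prep" -- load the cleaned records into a queryable table
# (the post does this step in Scala on Spark; sqlite3 stands in here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (video_id TEXT, category TEXT, views INTEGER)")
conn.executemany("INSERT INTO videos VALUES (?, ?, ?)", rows)

# Step 2: query with SQL, as one would with Spark SQL before handing the
# result to a visualization tool like Zeppelin.
result = conn.execute(
    "SELECT category, SUM(views) AS total_views "
    "FROM videos GROUP BY category ORDER BY total_views DESC"
).fetchall()
print(result)  # [('Music', 1600), ('Gaming', 800)]
```

The design point is the same either way: do the row-level cleanup in general-purpose code, then switch to declarative SQL once the data is in tabular shape.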
Apache Flink

TechTarget has coverage of the Flink Forward conference talks by Capital One and Comcast. There are some interesting insights into how the companies are supporting data science and machine learning—e.g. both are using Python to bridge the gap between data science and production systems.
Improving HBase backup efficiency at Pinterest

Pinterest has one of the largest HBase production deployments in the industry. HBase is one of the foundational building blocks of our infrastructure and powers many of our critical services, including our graph database (Zen), our general-purpose key-value store (UMS), our time-series database, and several others. Despite its high availability, we periodically back up our production HBase clusters to S3 for disaster recovery purposes.
From SQL to Streaming SQL

Stream processing takes in events from a stream, analyzes them, and creates new events in new streams. So stream processing first needs an event source: it can be a sensor that pushes events to us, or some code that periodically pulls events from a source. One useful tool at this point is a message queue, which you can think of as a bucket that holds incoming events until you are ready to process them. We will come to this later.
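The source-queue-processor shape described above can be sketched with Python's standard-library queue; the sensor readings and the 30-degree alert threshold are made up for illustration:

```python
from queue import Queue

# The queue plays the "bucket" role described above: it holds incoming
# events until the processor is ready for them.
events = Queue()

# A hypothetical sensor pushing temperature readings into the queue.
for reading in [18.5, 21.0, 35.2, 19.9, 40.1]:
    events.put({"type": "reading", "celsius": reading})

# The stream processor: consume events one by one, analyze each, and
# emit new events (here, alerts) into a new stream.
alerts = []
while not events.empty():
    event = events.get()
    if event["celsius"] > 30.0:
        alerts.append({"type": "alert", "celsius": event["celsius"]})

print(alerts)  # alerts for the 35.2 and 40.1 readings
```

In a real deployment the queue would be something like Kafka and the processor a streaming engine, but the decoupling is the same: the source never waits for the processor, and the processor never polls the source directly.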
Performance Evaluation of Hive-MR3 0.1

Since Hive-MR3 uses MR3 as its execution engine and borrows its runtime environment from Tez, a natural question is whether using MR3 yields any performance improvement in execution time, turnaround time, or overall throughput. While it is difficult to accurately quantify the performance of MR3 relative to Tez as an execution engine, we can compare Hive-MR3 and Hive-on-Tez under identical conditions to see if there is any benefit to using MR3 in place of Tez.