DataDotz BigData Weekly

Data Pipeline Patterns in the Decoupled Processing Era
======================================================

A data pipeline is a sequence of transformations that converts raw data into actionable insights. In the past, processing and storage engines were coupled together; a traditional MPP warehouse, for example, combines both a processing engine and a storage engine. With decoupled processing solutions (such as Spark and Redshift Spectrum) becoming mainstream in both the open-source and AWS big data ecosystems, what are the popular data pipeline patterns? This post describes the data pipeline patterns we have defined in the context of decoupled processing engines.
https://quickbooks-engineering.intuit.com/re-think-your-data-pipelines-in-the-decoupled-era-5b032bc8b779
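
As a rough illustration of the "sequence of transformations" idea, here is a minimal Python sketch. It is not taken from the linked post: the stage names, the record fields, and the `run_pipeline` helper are all hypothetical, and a real decoupled pipeline would run such stages on an engine like Spark against external storage rather than on in-memory lists.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Apply each stage in order, feeding the output of one into the next."""
    return reduce(lambda data, stage: stage(data), stages, list(records))


def drop_incomplete(records: List[Record]) -> List[Record]:
    # Hypothetical cleaning stage: discard rows with no amount.
    return [r for r in records if r.get("amount") is not None]


def to_cents(records: List[Record]) -> List[Record]:
    # Hypothetical normalization stage: add an integer amount in cents.
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in records]


def total_by_customer(records: List[Record]) -> List[Record]:
    # Hypothetical aggregation stage: one summary row per customer.
    totals: Dict[str, int] = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount_cents"]
    return [{"customer": c, "total_cents": t} for c, t in sorted(totals.items())]


raw = [
    {"customer": "a", "amount": 1.50},
    {"customer": "b", "amount": None},
    {"customer": "a", "amount": 2.25},
]
insights = run_pipeline(raw, [drop_incomplete, to_cents, total_by_customer])
print(insights)
```

Because each stage only takes and returns a list of records, stages can be reordered or swapped without touching where the data is stored, which is the property decoupled engines exploit at scale.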

Apache Hadoop 3.1, YARN & HDP 3.0
=================================

GPUs are increasingly becoming a key tool for many big data applications. Deep learning and machine learning, data analytics, genome sequencing, and other fields all have applications that rely on GPUs for tractable performance. In many cases, GPUs can deliver speedups of up to 10x, and in some reported cases, up to 300x.

https://hortonworks.com/blog/gpus-support-in-apache-hadoop-3-1-yarn-hdp-3/

Using StreamSets and MapR
=========================

To use StreamSets with MapR, the mapr-client package needs to be installed on the StreamSets host. Alternatively (emphasized because this is important), you can run a separate CentOS Docker container that has the mapr-client package installed and share /opt/mapr as a Docker volume with the StreamSets container.
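
One possible way to wire up the shared-volume approach is sketched below in Python. It only builds and prints the `docker run` command lines rather than executing them. The volume name `mapr-opt`, the container names, and the client image name `example/centos-mapr-client` are hypothetical placeholders, and the StreamSets image name is an assumption; the linked post may set things up differently.

```python
import shlex
from typing import List

SHARED_VOLUME = "mapr-opt"                        # hypothetical named volume
MAPR_CLIENT_IMAGE = "example/centos-mapr-client"  # hypothetical image with mapr-client installed
STREAMSETS_IMAGE = "streamsets/datacollector"     # assumed StreamSets image name


def docker_run(name: str, image: str, volume: str,
               mount_point: str = "/opt/mapr") -> List[str]:
    """Build a `docker run` command that mounts a named volume at mount_point."""
    return ["docker", "run", "-d", "--name", name,
            "-v", f"{volume}:{mount_point}", image]


# Start the client container first: when an empty named volume is mounted over
# a directory that already has content in the image, Docker copies that content
# into the volume, so /opt/mapr from the client image ends up in the volume.
client_cmd = docker_run("mapr-client", MAPR_CLIENT_IMAGE, SHARED_VOLUME)

# The StreamSets container then mounts the same volume at the same path.
streamsets_cmd = docker_run("streamsets", STREAMSETS_IMAGE, SHARED_VOLUME)

for cmd in (client_cmd, streamsets_cmd):
    print(shlex.join(cmd))
```

The order matters in this sketch: if the StreamSets container created the volume first, the volume would be populated from whatever that image has at /opt/mapr instead of the client installation.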

https://mapr.com/blog/using-streamsets-and-mapr-together-in-docker/

TensorFlow on Spark 2.3
=======================

TensorFlow is Google’s open source software library dedicated to high-performance numerical computation. It takes advantage of GPU, TPU, and CPU power, runs on servers, clusters, and even mobile devices, and is mainly used for machine learning, especially deep learning. It provides support for C++, Python and, more recently, JavaScript through TensorFlow.js. The library is now at version 1.8 and comes with an official high-level wrapper called Keras.

http://www.adaltas.com/en/2018/05/29/spark-tensorflow-2-3/

Data engineering tech at Unruly
===============================

As with many of the products we build, data engineering at Unruly began with a product experiment. As an ad-tech organisation, we wanted to demonstrate the commercial value of predicting whether a user will watch a digital video ad to completion. This provides value by meeting advertisers' objectives (KPIs), saving money, showing the end user more relevant ads, and helping Unruly to optimise ad serving.

https://medium.com/unruly-engineering/data-engineering-tech-at-unruly-bb8b4afa2758

Data Warehousing and Building Analytics
=======================================

Cooper Hewitt, Smithsonian Design Museum (CHSDM) knows which objects are the most popular at any given time, how long visitors spend in its galleries, and many other quantitative facts about visitor behavior that it was previously unable to understand. The museum is beginning to develop a deep understanding of how its visitors use the Pen, and this paper will attempt to explain how it can continue to develop the tools necessary to dig even deeper.

https://medium.com/micah-walter-studio/data-warehousing-and-building-analytics-at-cooper-hewitt-smithsonian-design-museum-159ec772905e