DataDotz BigData Weekly

Spark SQL

Bastian Haase is an alum of the Insight Data Engineering program in Silicon Valley, now working as a Program Director at Insight Data Science for the Data Engineering and the DevOps Engineering programs. In this blog post, he shares his experiences on how to get started working on open source software. He writes: "As one of the first steps of my Insight project, I wanted to compute the intersection of two columns of a Spark SQL DataFrame containing arrays of strings."
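
To make the task concrete, here is a minimal sketch in plain Python of a row-wise intersection of two array-of-strings columns. It is not the code from the blog post: the column names (tags_a, tags_b, common) and the helper functions are hypothetical, and rows are modeled as dictionaries rather than as a real DataFrame. In Spark 2.4 and later, the built-in array_intersect function in pyspark.sql.functions covers this kind of operation directly.

```python
from typing import Dict, List


def array_intersection(left: List[str], right: List[str]) -> List[str]:
    """Return the distinct strings found in both lists, ordered by first appearance in `left`."""
    right_set = set(right)
    seen = set()
    result = []
    for item in left:
        if item in right_set and item not in seen:
            seen.add(item)
            result.append(item)
    return result


def intersect_columns(
    rows: List[Dict[str, List[str]]],
    left_col: str,
    right_col: str,
    out_col: str = "common",  # hypothetical name for the output column
) -> List[Dict[str, List[str]]]:
    """Add `out_col` to every row, holding the intersection of the two array columns."""
    return [
        {**row, out_col: array_intersection(row[left_col], row[right_col])}
        for row in rows
    ]


# Hypothetical sample data standing in for a DataFrame with two array columns.
rows = [
    {"tags_a": ["spark", "sql", "beam"], "tags_b": ["sql", "nifi", "spark"]},
    {"tags_a": ["athena"], "tags_b": ["dynamodb"]},
]
result = intersect_columns(rows, "tags_a", "tags_b")
```

The sketch removes duplicates and keeps the order of the first column; a distributed implementation would apply the same per-row logic to each partition.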

Amazon Athena

At Goodreads, we’re currently in the process of decomposing our monolithic Rails application into microservices. For the vast majority of those services, we’ve decided to use Amazon DynamoDB as the primary data store. We like DynamoDB because it gives us consistent, single-digit-millisecond performance across a variety of our storage and throughput needs.

Machine Learning at Uber

As of 2018, Uber’s ridesharing business operates in more than 600 cities, while the Uber Eats food delivery business has expanded to more than 250 cities worldwide. The gross booking run rate for ridesharing hit $37 billion in 2018. With more than 75 million active riders and 3 million active drivers, Uber’s platform powers 15 million trips every day.

Cloudera Manager

One instance of Cloudera Manager (CM) can manage N clusters. In the current Role Based Access Control (RBAC) model, CM users hold privileges and permissions across everything in CM’s purview (including every cluster that CM manages). For example, the Read-Only user John can perform all the actions of a Read-Only user on every cluster managed by CM. Likewise, the “Cluster Admin” user Chris is a cluster administrator of all the clusters managed by CM.

Apache Beam and Apache NiFi

This story is about transforming XML data into an RDF graph with the help of Apache Beam pipelines run on Google Cloud Platform (GCP) and managed with Apache NiFi. We had the following goal: take data describing commercial companies (collected by the Federal Tax Service of Russia) and turn it into a form that enables querying relations between a company and its parts (owners, management) as well as between different companies.

Gaming Events Data Pipeline with Databricks Delta

The world of mobile gaming is fast paced and requires the ability to scale quickly. With millions of users around the world generating millions of events per second through gameplay, you need to calculate key metrics (score adjustments, in-game purchases, in-game actions, etc.) in real time.