DataDotz Bigdata Weekly

This post describes how a unified analytics platform, such as Databricks, can serve multiple use cases and developer personas. Using the Amazon public product ratings dataset, it shows how an analyst can build reports and a data scientist can build machine learning prediction algorithms using the notebook features. It also describes using the Databricks dbml-local library for model serving and using Spark for stream processing. A few parts of the post are Databricks-specific (such as the notebook API and scheduling), but most of it is broadly applicable.

AWS Lambda

Kinesis Analytics can run SQL queries over streams if the data is in a suitable format with a clear schema. Some data, such as Apache HTTPD access logs and certain compressed data files, isn’t in that format and needs to be translated first. This post shows how to use AWS Lambda to do that transformation, with an example in node.js.
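
The transformation step can be sketched as follows. This is a minimal illustration written in Python rather than the node.js used in the referenced post, and it is not that post's code. It assumes the Lambda preprocessing contract in which each incoming record carries a recordId and base64-encoded data, and each returned record carries the same recordId, a result of Ok, Dropped or ProcessingFailed, and base64-encoded data; check the field names against the current AWS documentation before relying on them. The regular expression and the output field names (host, ident, user and so on) are assumptions chosen for Apache Common Log Format lines.

```python
import base64
import json
import re

# Apache Common Log Format: host ident user [time] "request" status bytes
# (assumed input format; adjust the pattern for combined or custom formats)
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def transform_record(record):
    """Turn one base64-encoded access-log line into a base64-encoded JSON object."""
    raw = base64.b64decode(record["data"]).decode("utf-8").strip()
    match = CLF_PATTERN.match(raw)
    if match is None:
        # Unparseable line: hand the original data back and flag it as failed.
        return {
            "recordId": record["recordId"],
            "result": "ProcessingFailed",
            "data": record["data"],
        }
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the bytes column means no response body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    payload = json.dumps(fields).encode("utf-8")
    return {
        "recordId": record["recordId"],
        "result": "Ok",
        "data": base64.b64encode(payload).decode("utf-8"),
    }


def lambda_handler(event, context):
    """Lambda entry point: transform every record in the incoming batch."""
    return {"records": [transform_record(r) for r in event["records"]]}
```

Once the records are JSON with named fields, the stream has the clear schema that SQL queries need.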

Apache Kafka

The Pinterest Engineering blog has a post on how they’re using Kafka Streams for real-time predictive stream processing. First, the post describes the problem of over-delivery of online ads. Next, it walks through how they use Kafka Streams to measure in-flight spend, which gives their ad inventory system a better heuristic for deciding whether an ad should be shown to end users. The post also includes a few tricks they employed to increase the throughput and efficiency of their application.
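
To make the idea of in-flight spend concrete, here is a small Python sketch of one plausible reading of the heuristic. It is not Pinterest's implementation and does not use Kafka Streams: the class name AdvertiserBudget, its fields, the 10-second window and the per-insertion expected cost are all hypothetical. The assumption is that an ad which has been inserted but not yet charged still counts against the budget for a short window, so the serving decision compares charged spend plus in-flight spend with the budget.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AdvertiserBudget:
    """Hypothetical per-advertiser state for an over-delivery check."""
    daily_budget: float
    charged_spend: float = 0.0
    # (insertion_time_seconds, expected_cost) for ads shown but not yet charged
    in_flight: deque = field(default_factory=deque)

    def record_insertion(self, now, expected_cost):
        """Remember that an ad was just inserted and may still be charged."""
        self.in_flight.append((now, expected_cost))

    def in_flight_spend(self, now, window_seconds):
        """Sum the expected cost of insertions younger than the window."""
        # Insertions arrive in time order, so expired ones are at the left.
        while self.in_flight and now - self.in_flight[0][0] >= window_seconds:
            self.in_flight.popleft()
        return sum(cost for _, cost in self.in_flight)

    def should_show(self, now, window_seconds=10.0):
        """Show another ad only if charged plus in-flight spend is under budget."""
        projected = self.charged_spend + self.in_flight_spend(now, window_seconds)
        return projected < self.daily_budget
```

In a streaming system the same state would be kept per advertiser in a windowed aggregate rather than in a Python object; the sketch only shows the arithmetic behind the decision.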

ADLS Performance – Throughput and Scalability

Azure Data Lake Store (ADLS) is a highly scalable cloud-based data store designed for collecting, storing and analyzing large amounts of data, and it is well suited to enterprise-grade applications. Data can originate from almost any source, such as Internet applications and mobile devices; it is stored securely and durably, while being highly available in any geographic region. ADLS is performance-tuned for big data analytics and can be easily accessed from many components of the Apache Hadoop ecosystem, such as MapReduce, Apache Spark, Apache Impala (incubating) and Apache Hive.

Kibana Dashboard

Do you want to share Kibana dashboards without the risk of someone accidentally deleting or modifying them? Do you want to show off your dashboards without the distraction of unrelated applications and links? In version 6.0 we’re making it easier than ever to set up a restricted-access user with limited visibility into Kibana. It’s already possible to create read-only users, but new in 6.0 is a UI to match, and we’ve made it simple to set up. All you have to do is assign the new, reserved, built-in kibana_dashboard_only_user role, along with the appropriate data access roles, to your user, and they will be in dashboard-only mode when they log in to Kibana.

Analyzing Twitter Data in Apache Kafka through KSQL

KSQL is the open source streaming SQL engine for Apache Kafka. It lets you do sophisticated stream processing on Kafka topics, easily, using a simple and interactive SQL interface. In this short article we’ll see how easy it is to get up and running with a sandbox for exploring it, using everyone’s favourite demo streaming data source: Twitter. We’ll go from ingesting the raw stream of tweets, through to filtering it with predicates in KSQL, to building aggregates such as counting the number of tweets per user per hour.
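
The filter-then-aggregate pipeline described above can be illustrated in plain Python. This sketch is not KSQL and does not connect to Kafka or Twitter; it only mimics the semantics of a predicate filter followed by a count grouped by user over one-hour tumbling windows (in KSQL, a WHERE clause and a GROUP BY with WINDOW TUMBLING (SIZE 1 HOUR)). The tweet field names user, text and created_at_ms and both function names are hypothetical.

```python
from collections import Counter

MS_PER_HOUR = 60 * 60 * 1000


def hour_window_start(ts_millis):
    """Return the start (epoch ms) of the one-hour tumbling window containing ts_millis."""
    return ts_millis - ts_millis % MS_PER_HOUR


def count_tweets_per_user_per_hour(tweets, keyword):
    """Count tweets mentioning keyword, keyed by (user, window start)."""
    counts = Counter()
    for tweet in tweets:
        # Predicate: keep only tweets whose text contains the keyword (case-insensitive).
        if keyword.lower() not in tweet["text"].lower():
            continue
        key = (tweet["user"], hour_window_start(tweet["created_at_ms"]))
        counts[key] += 1
    return counts
```

Unlike this batch-style function, a streaming engine keeps the counts updated continuously as new tweets arrive on the topic.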