DataDotz Bigdata Weekly

AWS
=================

Tableau's Amazon Redshift connector now supports Amazon Redshift Spectrum, which lets you analyze data in external Amazon S3 tables. The feature, the result of joint engineering and testing by the Tableau and AWS teams, was released as part of Tableau 10.3.3 and will be broadly available in Tableau 10.4.1. With this update, you can connect Tableau directly to data in Amazon Redshift and analyze it together with data in Amazon S3, all with drag-and-drop ease.
https://aws.amazon.com/blogs/big-data/

Apache Kafka
==========

This post demonstrates how to use Kafka for a non-trivial stream processing application. Data is pulled from an HTTP endpoint and written to Kafka using the Producer API. From there, a Kafka Streams application performs fraud detection (a stubbed-out method in this example) and computes long-term and 90-day aggregated statistics. Finally, Kafka Connect writes the results to a PostgreSQL database so they can be served through a REST API.
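
The aggregation step can be sketched in plain Python. This is not the Kafka Streams API and not the post's code: the `Event` record, the `RollingStats` class, and the assumption that events arrive in timestamp order are all hypothetical, chosen only to show how a long-term and a trailing 90-day statistic can be kept side by side behind a stubbed fraud check.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(days=90)


@dataclass
class Event:
    # Hypothetical record shape; the post's actual schema is not shown here.
    key: str
    value: float
    timestamp: datetime


def is_fraud(event: Event) -> bool:
    # Stub, mirroring the stubbed-out check in the post: accept everything.
    return False


class RollingStats:
    """Keeps all-time and trailing-90-day count and mean per key.

    Assumes events for a key arrive in timestamp order.
    """

    def __init__(self):
        self.total = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
        self.recent = defaultdict(deque)            # key -> (timestamp, value) pairs

    def add(self, event: Event):
        if is_fraud(event):
            return None  # fraudulent events never reach the aggregates
        count_sum = self.total[event.key]
        count_sum[0] += 1
        count_sum[1] += event.value

        window = self.recent[event.key]
        window.append((event.timestamp, event.value))
        # Evict entries that have fallen out of the 90-day window.
        while window[0][0] <= event.timestamp - WINDOW:
            window.popleft()

        recent_values = [value for _, value in window]
        return {
            "key": event.key,
            "long_term_count": count_sum[0],
            "long_term_mean": count_sum[1] / count_sum[0],
            "recent_count": len(recent_values),
            "recent_mean": sum(recent_values) / len(recent_values),
        }
```

Each call to `add` returns the updated statistics for that key. In the pipeline described above, a row like this is what would end up in the database table behind the REST API.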

https://medium.com/@stephane.maarek/how-to-use-apache-kafka-to-transform-a-batch-pipeline-into-a-real-time-one-831b48a6ad85

Azure for Hadoop
==========

The MSDN blog has a four-part series on learning Azure, aimed at people who are already familiar with Hadoop. It covers the basics: access management, networking, storage, compute, and the Azure Data Services (including HDFS-compatible PaaS offerings).

https://blogs.msdn.microsoft.com/cloud_solution_architect/2017/10/26/just-enough-azure-for-hadoop/

Kafka Streams in AWS
==========

Zalando has a guest post on the Confluent blog about several operational/SRE considerations for running Kafka Streams applications. These include the choice of EBS volume type on AWS (it matters because RocksDB performs many writes, and some volume types have a burst capacity budget that should be monitored), tuning some defaults, monitoring consumer lag and memory usage, and sizing instance memory (with the counter-intuitive finding that a smaller JVM heap works better, because much of the memory is used off-heap).
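
As an illustration of the consumer lag metric mentioned above, here is a minimal Python sketch. It is not taken from the Zalando post. The function names are hypothetical, and the offsets are passed in as plain dictionaries rather than fetched from a broker. Lag per partition is the log end offset minus the committed offset.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return (per-partition lag, total lag).

    Both arguments map a partition id to an offset. A partition with no
    committed offset is treated as starting from offset 0. That is an
    assumption: the real starting point depends on the consumer's
    auto.offset.reset setting.
    """
    per_partition = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero in case the two readings were taken at different times.
        per_partition[partition] = max(end - committed, 0)
    return per_partition, sum(per_partition.values())


def partitions_over_threshold(per_partition, threshold):
    """List the partitions whose lag exceeds an alerting threshold."""
    return sorted(p for p, lag in per_partition.items() if lag > threshold)
```

A monitoring job could call `consumer_lag` periodically and alert when `partitions_over_threshold` returns a non-empty list. Steadily growing lag means the application is falling behind its input topic.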

https://www.confluent.io/blog/running-kafka-streams-applications-aws/

Cloudera Data Science Workbench
==========

Cloudera Data Science Workbench (CDSW) provides data science teams with a self-service platform for quickly developing machine learning workloads in their preferred language, with secure access to enterprise data and simple provisioning of compute. Individuals can request schedulable resources (e.g. compute, memory, GPUs) on a shared cluster that is managed centrally.

http://blog.cloudera.com/blog/2017/10/new-in-cloudera-data-science-workbench-1-2-usage-monitoring-for-administrators/