DataDotz Bigdata Weekly

AWS Glue
==========

AWS CloudTrail delivers log files to a folder hierarchy in an Amazon S3 bucket. To crawl these logs correctly, the post modifies the file contents and folder structure using an S3-triggered Lambda function that stores the transformed files in a single folder in an S3 bucket. Once the files are in a single folder, AWS Glue scans the data, converts it to Apache Parquet format, and catalogs it so it can be queried and visualized with Amazon Athena and Amazon QuickSight.
https://aws.amazon.com/blogs/big-data/streamline-aws-cloudtrail-log-visualization-using-aws-glue-and-amazon-quicksight/
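
As a rough illustration of the Lambda step described above, here is a minimal sketch of an S3-triggered handler that copies newly delivered CloudTrail log objects into a single flat prefix for the Glue crawler. The bucket and prefix names are placeholders, and the content rewriting done in the AWS post is omitted here.

    # Hypothetical sketch: flatten CloudTrail log delivery into one S3 folder.
    # Bucket and prefix names are placeholders, not those used in the AWS post.
    import os
    import boto3

    s3 = boto3.client("s3")
    DEST_BUCKET = os.environ.get("DEST_BUCKET", "my-flattened-cloudtrail-logs")
    DEST_PREFIX = "flattened/"

    def handler(event, context):
        for record in event["Records"]:
            src_bucket = record["s3"]["bucket"]["name"]
            src_key = record["s3"]["object"]["key"]
            # Drop the dated folder hierarchy; keep only the file name so every
            # log file lands in a single folder that the Glue crawler can scan.
            flat_key = DEST_PREFIX + src_key.split("/")[-1]
            s3.copy_object(
                Bucket=DEST_BUCKET,
                Key=flat_key,
                CopySource={"Bucket": src_bucket, "Key": src_key},
            )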

Elasticsearch
==========

IBM has published a new tutorial on using Spark with Elasticsearch to build a recommender system. The code is on GitHub, and the README describes how to get it up and running. It includes an interactive Jupyter notebook for stepping through the process and a list of troubleshooting steps.

https://github.com/IBM/elasticsearch-spark-recommender
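
A recommender like this typically reads ratings out of Elasticsearch through the elasticsearch-hadoop connector. As a minimal, hypothetical pyspark sketch (the index name and connection settings are placeholders, and the connector jar must be on the Spark classpath), loading an index into a DataFrame looks roughly like this:

    # Minimal pyspark sketch: read an Elasticsearch index into a DataFrame via
    # the elasticsearch-hadoop connector. "ratings" is a placeholder index name.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-recommender-sketch").getOrCreate()

    ratings = (
        spark.read.format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost")
        .option("es.port", "9200")
        .load("ratings")
    )
    ratings.show(5)
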
Hortonworks
==========

Hortonworks Data Platform (HDP) and most of its components support Kerberos-based authentication. By default, authentication is disabled for ease of installation, but for production clusters and clusters hosting sensitive data, Hortonworks strongly recommends enabling Kerberos. While configuring and deploying Kerberos-enabled applications can seem like a challenging and time-consuming task, Ambari makes it straightforward for HDP by automating the provisioning of all service principal names (SPNs) and the distribution of keytab files across the cluster through a wizard.

https://hortonworks.com/blog/ambari-kerberos-support-hbase-1/
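
Once the wizard has generated the principals and distributed the keytabs, a service or client authenticates against its SPN with kinit before talking to the cluster. A rough sketch of that step, with a hypothetical keytab path and principal, is below.

    # Rough sketch: obtain a Kerberos ticket from a keytab before using a
    # kerberized HDP service. The keytab path and principal are placeholders;
    # Ambari's Kerberos wizard generates and distributes the real ones.
    import subprocess

    KEYTAB = "/etc/security/keytabs/hbase.service.keytab"    # hypothetical path
    PRINCIPAL = "hbase/node1.example.com@EXAMPLE.COM"         # hypothetical SPN

    subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)
    # With a valid ticket, kerberized clients (hbase shell, hdfs, etc.) can work.
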
KSQL Using Real-time Device Data
==========

This post demonstrates a non-trivial streaming application built with KSQL. The input is a stream of digital sensor data from a gaming steering wheel, which poses some interesting challenges (e.g. an event is generated for the brake pedal only when its position changes, so one must calculate how long the brake has been applied). At the end of the pipeline, the data is visualized on a Grafana dashboard that pulls from InfluxDB.

https://www.rittmanmead.com/blog/2017/11/taking-ksql-for-a-spin-using-real-time-device-data/
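
To make the brake-duration problem concrete outside of KSQL: because an event arrives only when the pedal position changes, the time the brake was held has to be derived from the gap between a press event and the following release event. A hypothetical Python sketch of that logic:

    # Hypothetical illustration of the change-event problem described above:
    # events arrive only when the brake position changes, so the hold time is
    # the gap between a press (position > 0) and the next release (position 0).
    def brake_durations(events):
        """events: iterable of (timestamp_seconds, brake_position) change events."""
        pressed_at = None
        for ts, position in events:
            if position > 0 and pressed_at is None:
                pressed_at = ts                  # released -> pressed
            elif position == 0 and pressed_at is not None:
                yield ts - pressed_at            # pressed -> released: emit hold time
                pressed_at = None

    # Pressed at t=1.0, released at t=3.5 -> one duration of 2.5 seconds.
    print(list(brake_durations([(0.0, 0), (1.0, 80), (2.0, 95), (3.5, 0)])))
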
Cypher
==========

Cypher is a declarative graph query language. The Neo4j team has announced alpha support for Cypher on Apache Spark. There are more details and some examples (including the Zeppelin integration for visualizing a graph) in this post.

https://neo4j.com/blog/cypher-for-apache-spark/