DataDotz Bigdata Weekly

Reading data securely from Apache Kafka
==========

The Cloudera Distribution of Apache Kafka 2.0.0 (based on Apache Kafka 0.9.0) introduced a new Kafka consumer API that allows consumers to read data from a secure Kafka cluster. This lets administrators lock down their Kafka clusters and require clients to authenticate via Kerberos. It also allows clients to encrypt data over the wire when communicating with Kafka brokers (via SSL/TLS). Subsequently, in the Cloudera Distribution of Apache Kafka 2.1.0, Kafka introduced support for authorization via Apache Sentry. This allows Kafka administrators to lock down certain topics and grant privileges to specific roles and users, leveraging role-based access control.
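The post walks through the end-to-end configuration; as a rough illustration (not the post's own code), here is a minimal Scala sketch of a consumer using the new API with Kerberos authentication and TLS encryption. The broker address, truststore path, and topic name are placeholders, and a JAAS configuration pointing at the client's keytab is assumed to be supplied via -Djava.security.auth.login.config.

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    object SecureConsumerSketch extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "broker1.example.com:9093") // hypothetical broker
      props.put("group.id", "secure-reader")
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

      // Authenticate via Kerberos (SASL) and encrypt traffic with TLS.
      props.put("security.protocol", "SASL_SSL")
      props.put("sasl.kerberos.service.name", "kafka")
      props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks") // hypothetical path
      props.put("ssl.truststore.password", "changeit")

      val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(Collections.singletonList("secure-topic")) // hypothetical topic
      val records = consumer.poll(1000) // 0.9.x-style poll with a timeout in ms
      for (r <- records.asScala) println(s"${r.key}: ${r.value}")
      consumer.close()
    }

When the consumer is created by Spark rather than directly, the same security properties are typically passed through the Kafka parameters of the direct stream.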

http://blog.cloudera.com/blog/2017/05/reading-data-securely-from-apache-kafka-to-apache-spark/

HORTONWORKS SHARED SCHEMA REGISTRY
==========

The Hortonworks blog has a post that motivates the need for a shared schema registry, especially for streaming applications. They plan to ship their own schema registry as part of the next HDF release; it will eventually integrate with Apache Atlas and Apache Ranger in addition to Kafka.

https://hortonworks.com/blog/part-2-hdf-blog-series-shared-schema-registry-important/

ROW/COLUMN LEVEL ACCESS CONTROL FOR APACHE SPARK
=======================

The latest version of Hortonworks Data Platform (HDP) introduced a number of significant enhancements. For instance, HDP 2.6.0 now supports both Apache Spark 2.1 and Apache Hive 2.1 (LLAP) as GA. Customers often store their data in Hive and analyze it using both Hive and SparkSQL. An important requirement in this scenario is to apply the same fine-grained access control policy to Hive data, irrespective of whether the data is analyzed through Hive or SparkSQL. This fine-grained access control includes features such as row/column-level access and data masking.
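To make the requirement concrete, here is a minimal, hypothetical sketch of the user-facing behavior (the table and columns are invented, and the policy itself would be defined on the Hive side, typically in Apache Ranger): the same query, issued through SparkSQL, is expected to honor the policy attached to the underlying Hive table.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("fine-grained-access-demo")
      .enableHiveSupport() // query Hive tables so the Hive-side policy can apply
      .getOrCreate()

    // Under a row-level policy, rows this user may not see are filtered out of
    // the result; under a masking policy, a sensitive column such as ssn comes
    // back redacted rather than in the clear.
    spark.sql("SELECT name, ssn FROM customers").show()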

https://hortonworks.com/blog/row-column-level-control-apache-spark/

SUB-SECOND ANALYTICS WITH APACHE HIVE AND DRUID
=======================

This post describes how to build an OLAP table in Druid from data in Apache Hive. It is the second part of a series on integrating Hive with Druid; the first part provides more context on when and why this approach makes sense.
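The core mechanism in the series is a Hive CREATE TABLE AS SELECT statement backed by the Druid storage handler. As a hedged sketch (the host, table, and column names are hypothetical), such a statement can be submitted through HiveServer2's JDBC endpoint like this:

    import java.sql.DriverManager

    object HiveToDruidSketch extends App {
      Class.forName("org.apache.hive.jdbc.HiveDriver")
      val conn = DriverManager.getConnection(
        "jdbc:hive2://hs2.example.com:10000/default", "etl_user", "")
      val stmt = conn.createStatement()
      // Druid requires a timestamp column named __time; the segment granularity
      // property controls how the data is partitioned into Druid segments.
      stmt.execute(
        """CREATE TABLE sales_olap
          |STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
          |TBLPROPERTIES ("druid.datasource" = "sales",
          |               "druid.segment.granularity" = "MONTH")
          |AS
          |SELECT CAST(sale_ts AS TIMESTAMP) AS `__time`,
          |       store_id, product_id, revenue
          |FROM sales""".stripMargin)
      conn.close()
    }

Once the table is created, aggregate queries against it are answered from Druid's pre-built indexes, which is where the sub-second latencies come from.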

https://hortonworks.com/blog/sub-second-analytics-hive-druid/

DISTRIBUTED DEEP LEARNING ON THE MAPR CONVERGED DATA PLATFORM
=======================

The MapR Converged Data Platform provides a state-of-the-art distributed file system. With the MapR File System (MapR-FS), customers can put their deep learning development, training, and deployment closer to their data. Because MapR-DB and MapR Streams are also tied closely to the file system, a deep learning application developed on MapR can extend the MapR Persistent Application Client Container (PACC) to harness the distributed key-value storage of MapR-DB and the streaming technology of MapR Streams for different use cases.

https://mapr.com/blog/distributed-deep-learning-mapr/

sparklyr: R Interface for Apache Spark
=======================

sparklyr is an R package that provides an interface to Apache Spark. Its advantages: you can connect to Spark from R, the package provides a complete dplyr back end, you can filter and aggregate Spark datasets and then bring them into R for analysis and visualization, and you can use Spark's distributed machine learning library from R.

https://acadgild.com/blog/sparklyr-r-interface-for-apache-spark/