DataDotz BigData Weekly

Replication Guide On HDFS and Amazon Web Services
=======================

Hortonworks’ Data Lifecycle Manager (DLM), an extensible service built on the Hortonworks DataPlane Service (DPS), provides a complete solution for replicating HDFS and Hive data, metadata, and security policies between on-premises clusters and Amazon S3. This data movement lets data science and ML workloads run models in Amazon SageMaker and bring the resulting data back on premises. The post walks through the three steps for replicating from HDFS to the AWS cloud.
https://hortonworks.com/blog/a-step-by-step-replication-guide-between-on-prem-hdfs-and-amazon-web-services/

Azure Source
=====================

This is an update to the Azure Sphere Operating System, Azure Sphere Security Service, and Visual Studio development environment. This release includes substantial investments in our security infrastructure and our connectivity solutions, and it incorporates some of your feedback. Azure Sphere, which is in public preview, is a secured, high-level application platform with built-in communication and security features for internet-connected devices.

https://azure.microsoft.com/en-us/blog/azure-source-volume-58/

Kafka Connect Deep Dive Converters and Serialization
==================================================================

Kafka Connect is part of Apache Kafka, providing streaming integration between data stores and Kafka. For data engineers, it requires only JSON configuration files to use. There are already connectors for common (and not-so-common) data stores, including JDBC, Elasticsearch, IBM MQ, S3, and BigQuery, to name but a few. For developers, Kafka Connect has a rich API against which additional connectors can be developed if required. It also has a REST API for configuring and managing connectors.

https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
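A minimal sketch of what those JSON configuration files look like, built here as a Python dict. The connector name, database URL, and column names are illustrative placeholders; the JDBC connector class and the converter settings follow the common Confluent/Kafka class names, but check them against your installed connector versions.

```python
import json

# Sketch of a Kafka Connect connector definition (a JDBC source in this
# example). All connection details below are placeholders, not a real setup.
connector = {
    "name": "jdbc-source-example",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/mydb",  # placeholder
        "mode": "incrementing",
        "incrementing.column.name": "id",   # placeholder key column
        "topic.prefix": "pg-",
        # Converters control how Connect serializes records to Kafka --
        # the subject of the linked deep-dive post.
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
}

# This JSON body is what you would POST to Connect's REST API, e.g.
#   POST http://localhost:8083/connectors
payload = json.dumps(connector, indent=2)
print(payload)
```

In practice you submit this payload with any HTTP client; Connect validates it and starts the connector tasks.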

Druid enables analytics at Airbnb
=======================================

Airbnb serves millions of guests and hosts in our community. Every second, their activities on Airbnb.com, such as searching, booking, and messaging, generate a huge amount of data we anonymize and use to improve the community’s experience on our platform. The Data Platform Team at Airbnb strives to leverage this data to improve our customers’ experiences and optimize Airbnb’s business.

https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c

Higher-Order Functions for Complex Data Types in Apache Spark 2.4
==========================================

Apache Spark 2.4 introduces 29 new built-in functions for manipulating complex types (for example, the array type), including higher-order functions. Before Spark 2.4, there were two typical solutions for manipulating complex types directly: 1) exploding the nested structure into individual rows, applying some function, and then re-creating the structure, or 2) building a user-defined function (UDF).

https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html
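To make the idea concrete: `transform` is one of the Spark 2.4 higher-order functions, and the Spark SQL in the comment below is real 2.4 syntax. The Python function that follows is only a stand-in that mimics its semantics, so the behavior is visible without a Spark cluster.

```python
# Spark 2.4 lets you map over an array column in place, e.g.:
#
#   SELECT transform(values, x -> x + 1) AS incremented
#   FROM   (SELECT array(1, 2, 3) AS values)
#
# Pre-2.4 you would explode the array into rows and re-collect it, or
# write a UDF -- both costlier than the built-in higher-order function.

def transform(values, fn):
    """Pure-Python stand-in for Spark SQL's transform() higher-order function."""
    return [fn(x) for x in values]

incremented = transform([1, 2, 3], lambda x: x + 1)
print(incremented)  # [2, 3, 4]
```

The built-in avoids the serialization overhead a Python UDF would incur, since it runs inside Spark's SQL engine.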

Real-Time Analytics with Pulsar Functions
=======================

For many event-driven applications, how quickly data can be processed, understood, and reacted to is paramount. In analytics and data processing for those scenarios, computing a precise value may be time-consuming or impractically resource-intensive. In those cases, having an approximate answer within a given time frame is far more valuable than waiting to calculate an exact one.

https://streaml.io/blog/eda-real-time-analytics-with-pulsar-functions
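A toy illustration of the trade-off the post describes, not Pulsar-specific: an approximate answer computed from a small sample lands close to the exact answer at a fraction of the cost of scanning every event. The event stream and sample size here are made up for the sketch.

```python
import random

random.seed(42)
# Stand-in for an event stream: 100k synthetic measurements.
events = [random.randint(1, 100) for _ in range(100_000)]

exact_mean = sum(events) / len(events)   # full scan: exact, but touches everything
sample = random.sample(events, 1_000)    # small random sample: cheap
approx_mean = sum(sample) / len(sample)  # approximate answer, available much sooner

# The sampling error shrinks like 1/sqrt(sample size), so the estimate
# is typically within about one unit of the true mean here.
print(f"exact={exact_mean:.2f} approx={approx_mean:.2f}")
```

Sketch-based algorithms (HyperLogLog, count-min sketch, and the like) apply the same principle with bounded memory over unbounded streams.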