DataDotz BigData Weekly

Posted in Hive, Kafka, MongoDB, Spark

Qubole & Snowflake with Spark

Snowflake and Qubole have partnered to bring a new level of integrated product capabilities that make it easier and faster to build and deploy machine learning (ML) and artificial intelligence (AI) models in Apache Spark using data stored in Snowflake and big data sources. This second post of three covers how to perform advanced data preparation with Apache Spark to create refined data sets and write the results to Snowflake, thereby enabling new analytic use cases.
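The last step mentioned here, writing a refined Spark DataFrame to Snowflake, might look roughly like the sketch below. It assumes the Snowflake Connector for Spark is available to the Spark session. The helper name `write_to_snowflake`, the connection values, and the table name are hypothetical placeholders, not taken from the Qubole post.

```python
# Data source name registered by the Snowflake Connector for Spark.
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"


def write_to_snowflake(df, sf_options, table, mode="overwrite"):
    """Write a Spark DataFrame to a Snowflake table.

    df         -- a pyspark.sql.DataFrame holding the refined data set
    sf_options -- connector options such as sfURL, sfUser, sfPassword,
                  sfDatabase, sfSchema and sfWarehouse (placeholders)
    table      -- hypothetical name of the target table
    mode       -- Spark save mode, for example "overwrite" or "append"
    """
    (
        df.write.format(SNOWFLAKE_SOURCE)
        .options(**sf_options)
        .option("dbtable", table)
        .mode(mode)
        .save()
    )
```

A call such as `write_to_snowflake(refined_df, sf_options, "REFINED_EVENTS")` would replace the contents of the hypothetical `REFINED_EVENTS` table. Passing `mode="append"` adds rows instead.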


MongoDB 4.0 will add support for multi-document transactions, making it the only database to combine the speed, flexibility, and power of the document model with ACID data integrity guarantees. Through snapshot isolation, transactions provide a globally consistent view of data, and enforce all-or-nothing execution to maintain data integrity.
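As an illustration of all-or-nothing execution, here is a minimal sketch of a two-document update inside one transaction. It assumes a PyMongo release with transaction support (3.7 or later) and a MongoDB 4.0 replica set. The `transfer` helper and the database, collection, and field names are hypothetical.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two account documents atomically.

    client -- a pymongo.MongoClient connected to a MongoDB 4.0+ replica set.
    The "bank" database, "accounts" collection and "balance" field are
    hypothetical names used only for this sketch.
    """
    accounts = client["bank"]["accounts"]
    with client.start_session() as session:
        # The transaction commits if this block exits normally and aborts
        # if an exception is raised, so both updates apply or neither does.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
            )
```

Each operation must receive the `session` argument. An operation issued without it runs outside the transaction.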


Earlier parts of this series covered Apache Kafka 1.0 and the HDF integrations around it, including Apache NiFi’s Kafka processors, Apache Ambari for provisioning, management, and monitoring, and Apache Ranger for access control policies and auditing. This fourth part of the series discusses the innovations added to Hortonworks Streaming Analytics Manager (SAM), specifically the tooling that lets developers test streaming analytics apps.

KSQL Real Time Streaming ETL

With Kafka’s Connect and Streams APIs, as well as KSQL, we have the tools available to make Streaming ETL a reality. By streaming events from the source system as they are created, using Kafka’s Connect API, data is available for driving applications throughout the business in real time.


Quite some time ago, with the GA release of Elasticsearch 5.0, we released an API that allows you to shrink an index into a new index with fewer shards than the original. The reasoning behind adding this functionality was to provide a tool to keep the number of shards in a cluster in check. Indices are commonly created with a large number of shards to maximise indexing throughput, but later, once these indices are rolled over into a daily or hourly index, the number of shards should be reduced again to maximise resource utilization of the cluster.
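One constraint worth keeping in mind is that the target shard count must be a factor of the source shard count. The sketch below lists the valid targets and builds the shrink request without sending it. The helper names and index names are hypothetical, and the request would still need to be issued with an HTTP client of your choice.

```python
def valid_shrink_targets(source_shards):
    """Return the shard counts a source index can be shrunk to.

    The target's primary shard count must be a factor of the source's,
    and a shrink produces fewer shards than the original index.
    """
    return [n for n in range(1, source_shards) if source_shards % n == 0]


def shrink_request(source_index, target_index, target_shards):
    """Build the method, path and body for a shrink API call (not sent)."""
    body = {"settings": {"index.number_of_shards": target_shards}}
    return "POST", f"/{source_index}/_shrink/{target_index}", body
```

For example, `valid_shrink_targets(8)` returns `[1, 2, 4]`, so an index with eight primary shards can be shrunk to four, two, or one. Before shrinking, the source index also has to be made read-only, and a copy of every shard must reside on the same node.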

ADLS integration in Cloudera

The Data Catalog search was introduced in Cloudera 5.11, and its usability keeps improving. It is now available in the top bar of the interface and offers free-text search of SQL tables, columns, tags, and saved queries. This is particularly useful for quickly looking up a table among thousands or for finding existing queries that already analyze a certain dataset.