DataDotz BigData Weekly


MapR has a great technical and architectural comparison of MapR-DB with Apache HBase and Apache Cassandra. The article spends some time describing the trade-offs of the Log Structured Merge (LSM) trees that power HBase and Cassandra, including read amplification and (asynchronous) write amplification. MapR-DB leverages the random read/write semantics supported by its file system to implement a hybrid LSM/B-tree indexing strategy. Overall, there are lots of interesting details in the post.
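To make the amplification trade-off concrete, here is a minimal toy sketch of an LSM-style store. It is not how HBase, Cassandra, or MapR-DB implement storage; the `ToyLSM` class and all of its names are hypothetical and exist only for illustration. It assumes a tiny in-memory "memtable" that is flushed into immutable sorted runs, so a read may have to probe several runs (read amplification), and compaction rewrites entries that were already written once (write amplification).

```python
import bisect


class ToyLSM:
    """Toy LSM-style store: a memtable plus immutable sorted runs (illustrative only)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.runs = []            # newest run first; each run is a sorted list of (key, value)
        self.entries_written = 0  # entries written to runs by flushes and compactions

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Turn the memtable into an immutable sorted run.
        run = sorted(self.memtable.items())
        self.runs.insert(0, run)
        self.entries_written += len(run)
        self.memtable = {}

    def get(self, key):
        """Return (value, runs_probed); runs_probed is the read amplification."""
        if key in self.memtable:
            return self.memtable[key], 0
        probed = 0
        for run in self.runs:  # newest to oldest
            probed += 1
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1], probed
        return None, probed

    def compact(self):
        # Merge all runs into one; every surviving entry is written again.
        merged = {}
        for run in reversed(self.runs):  # oldest first so newer values win
            merged.update(run)
        run = sorted(merged.items())
        self.runs = [run]
        self.entries_written += len(run)


store = ToyLSM(memtable_limit=2)
for i, key in enumerate(["a", "b", "c", "d", "e", "f"]):
    store.put(key, i)

print(store.get("a"))         # (0, 3): the oldest key is found only in the third run
print(store.entries_written)  # 6 entries flushed so far
store.compact()
print(store.get("a"))         # (0, 1): one run to probe after compaction
print(store.entries_written)  # 12: compaction rewrote all 6 entries
```

In this toy, compaction buys cheaper reads at the price of writing every entry a second time, which is the trade-off the article discusses at much greater depth.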


These are the first two posts in a series on how Hortonworks builds and tests its distribution. Validating a single change set is a large undertaking, as there can be as many as 30 downstream applications that need to be tested. Running all of those unit tests serially takes 6 hours, so the Hortonworks team is using YARN to run them in a distributed manner (including running YARN in YARN, which they call "yinception").

Apache Kafka

The Confluent blog has a post describing how to use Kafka to power a machine learning application. Kafka is used to store feature data, model parameters, training data, and more. The pieces of the model building and evaluation pipeline are built with Kafka Connect, KSQL, and Kafka Streams.

Hortonworks has posted a benchmark comparing performance of Hive on HDP 2.5 vs. 2.6. The new version is faster due to improvements in the optimizer, vectorization, and more. The post also includes a contentious comparison to Impala (see the comments for more details). As always, be sure to consider your use case rather than relying on benchmarks from a vendor.

Cloudera Altus on Microsoft Azure

Cloudera Altus is a platform-as-a-service (PaaS) offering that enables users to analyze and process data at scale on public cloud infrastructure. Altus was designed from the outset to support multiple clouds, in both its back-end architecture and its front-end workflows. With the announcement of Microsoft Azure support, Altus will be able to run data engineering workloads in Microsoft Azure through the same Altus API and CLI interfaces.

TLS for the Elastic Stack

Transport Layer Security (TLS) can be deployed across the entire Elastic Stack, allowing for encrypted communications so you can rest easy at night knowing that the data transmitted over your networks is secured. It may not seem all that necessary, but consider the near-impossible task of making sure that no developer ever logs sensitive data into the logs that you are shipping to a central location. Most people take sensitive data to mean passwords, customers' personal information, and the like. However, this definition of sensitive data is far too narrow for the era of cyber security that we live in.