DataDotz BigData Weekly

Confluent released the first preview version of Confluent Platform
This first preview release introduces new capabilities for KSQL (streaming SQL for Apache Kafka®) and Confluent Control Center. Control Center now provides a UI for KSQL along with broker configuration, topic inspection, and consumer lag views. Confluent has also made several improvements to the KSQL REST API. Additional KSQL features include flexible timestamp handling and non-windowed aggregate functions (SUM, COUNT) on tables. The preview release also includes protection on both tables and streams.

MapR – Tutorial on MapR Data Fabric for Kubernetes
MapR has published a three-part series on how MapR Data Fabric for Kubernetes gives customers a secure, recoverable, and highly available way to persistently access data, beginning with files, no matter what form it is persisted in.

Google launches Cloud Composer – a workflow automation tool
Google has launched Cloud Composer, a new workflow automation tool based on the Apache Airflow project. Airflow, a workflow orchestration tool, was open sourced by Airbnb and is written in Python. Open source community members published many Google Cloud operators for Airflow last year. Cloud Composer is fully managed, integrated with GCP, and supports multi-cloud workflows.
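To make "workflow orchestration" concrete, here is a minimal sketch of the core idea: tasks declare their upstream dependencies, and each task runs only after those have finished. It uses only the Python standard library and is not Airflow's or Cloud Composer's API. The function name run_workflow and the extract/transform/load task names are hypothetical, chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run each task after all of its upstream tasks; return the execution order.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    (Both structures are assumptions made for this sketch.)
    """
    # Include tasks that have no dependencies so they are scheduled too.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    # static_order() yields every task after its upstream tasks and
    # raises graphlib.CycleError if the dependencies form a cycle.
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order


if __name__ == "__main__":
    log = []
    tasks = {
        "extract": lambda: log.append("extract"),
        "transform": lambda: log.append("transform"),
        "load": lambda: log.append("load"),
    }
    deps = {"transform": {"extract"}, "load": {"transform"}}
    print(run_workflow(tasks, deps))
```

A real orchestrator adds scheduling, retries, and monitoring on top of this dependency ordering; the sketch shows only the ordering step.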

Evolution of Couchbase at LinkedIn
LinkedIn has written a post on the evolution of Couchbase in its architecture. The post recounts how the company moved from Memcached to Couchbase for its caching solution, and how it created a dedicated team to work on Couchbase.

Confluent ties Kafka with Kubernetes
Kubernetes has been gaining a lot of momentum, especially for distributed clusters. Confluent has announced Confluent Operator, its solution for provisioning and managing Kafka on Kubernetes. It offers automated cluster management, a complete streaming platform, and flexibility in deployment.

Databricks – Benchmarking Apache Spark on Single Node
Databricks has published a blog post comparing the performance of PySpark against pandas on a single node, using the store_sales table from TPC-DS. In the benchmark, PySpark outperforms pandas at every data size tested, and pandas ran out of memory as the data size grew. The post also outlines Pandas UDFs (vectorized UDFs) in Apache Spark, which improve the performance and usability of user-defined functions in Python.

Databricks – Optimized Autoscaling
Apache Spark provides a mechanism to dynamically allocate resources for an application. It requires an external shuffle service on each node so that shuffle files continue to be served after executors are removed. This works well in a static on-premise environment but does not take advantage of the elasticity of the cloud. Databricks uses detailed executor statistics, along with the location of intermediate files within the cluster, to provide better elastic scaling of resources.
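For reference, the stock Apache Spark mechanism described above is switched on through a handful of configuration properties. The sketch below collects them in a Python dictionary and renders them as spark-submit arguments. The property names are standard Spark settings, but the executor limits and timeout are illustrative assumptions, and the helper to_spark_submit_args is hypothetical. This shows open source dynamic allocation, not Databricks' optimized autoscaling.

```python
# Standard Spark properties for dynamic allocation.
# The numeric values below are illustrative assumptions, not recommendations.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    # External shuffle service keeps serving shuffle files after executors go away.
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}


def to_spark_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

The same properties can instead be placed in spark-defaults.conf; passing them on the command line keeps the setting local to one application.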

Apache HBase 2.0, Hadoop’s NoSQL database, is available for download.