DataDotz BigData Weekly

Getting Your Feet Wet with Stream Processing
======================================

When you create a stream processing application with Kafka’s Streams API, you define a Topology using either the StreamsBuilder DSL or the low-level Processor API. Normally the topology runs via the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing, though, connecting to a running Kafka cluster and making sure state is cleaned up between tests adds a lot of complexity and time.
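The article itself covers Kafka Streams’ Java test utilities; as a language-neutral sketch of the same idea — exercising stream-processing logic in memory rather than against a live broker — the transformation can be kept as a plain function and fed records directly. All names below are illustrative, not Kafka APIs:

```python
# Illustrative sketch (not Kafka Streams code): the stream transformation is
# a plain function, so it can be exercised without any broker or cluster.

def uppercase_values(records):
    """The 'topology': a stateless value transformation on (key, value) pairs."""
    return [(key, value.upper()) for key, value in records]

# In production these records would come from a Kafka topic; in a test they
# are just literals, so there is no cluster setup or teardown to manage.
test_input = [("user-1", "hello"), ("user-2", "world")]
output = uppercase_values(test_input)
print(output)  # [('user-1', 'HELLO'), ('user-2', 'WORLD')]
```

This is the motivation behind tools like Kafka’s topology test driver: the faster the feedback loop runs without external infrastructure, the cheaper it is to test edge cases.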
https://www.confluent.io/blog/stream-processing-part-2-testing-your-streaming-application

Hive vs Impala Schema Loading Case: Reading Parquet Files
==================================================

A common scenario in big data: raw data is processed in Spark and then needs to be made available to the analytics team. A standard solution is to have the Spark application write the processed data as Parquet files in HDFS and then point a Hive or Impala table at that data, so the analytics team can run SQL-like queries against it.

https://medium.com/@kartik.gupta_56068/hive-vs-impala-schema-loading-case-reading-parquet-files-acd0280c2cb3

Building A Scalable Interactive Analytics Backend
=========================================

According to a study by Gartner, diverse organizations perform 12% better than non-diverse ones, with more innovation and better financial returns. Eightfold.ai offers a Talent Diversity solution that lets customers track and analyze their diversity goals and check for any existing bias in the hiring process across steps like recruiter screening, hiring manager screening, and onsite interviews.

https://medium.com/@eightfold/building-a-scalable-interactive-analytics-backend-aebeb79ee0c8

Elasticsearch Distributed Consistency Principles Analysis (3) — Data
========================================================

The previous two articles described the composition of ES clusters, the master election algorithm, and the master’s meta update process, and analyzed the consistency issues of election and meta update. This article analyzes the data flow in ES, including its write process, the PacificA algorithm model, SequenceNumber, and Checkpoint, and compares the ES implementation with the standard PacificA algorithm.
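The SequenceNumber/Checkpoint bookkeeping the article analyzes can be sketched in a few lines. This is a simplified illustration, not ES code (function names are mine): each shard copy acknowledges operations by sequence number, a copy’s local checkpoint is the highest number below which everything has been processed, and the primary’s global checkpoint is the minimum local checkpoint across the in-sync copies.

```python
# Simplified sketch of Elasticsearch-style checkpoint bookkeeping.

def local_checkpoint(acked_seq_nos):
    """Highest n such that every sequence number 0..n has been processed."""
    acked = set(acked_seq_nos)
    n = -1
    while n + 1 in acked:
        n += 1
    return n

def global_checkpoint(in_sync_copies):
    """Minimum local checkpoint across the in-sync shard copies."""
    return min(local_checkpoint(ops) for ops in in_sync_copies.values())

replicas = {
    "primary":   [0, 1, 2, 3],
    "replica-1": [0, 1, 2],
    "replica-2": [0, 1, 3],   # seq_no 2 still in flight
}
print(global_checkpoint(replicas))  # 1
```

Everything at or below the global checkpoint is durable on all in-sync copies, which is what makes it a safe recovery point after a failover.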

https://medium.com/@Alibaba_Cloud/elasticsearch-distributed-consistency-principles-analysis-3-data-a98cc436bc6b

Amazon Managed Streaming For Kafka (MSK) With Apache Spark On Qubole
===============================================================

AWS recently announced Managed Streaming for Kafka (MSK) at re:Invent 2018. Apache Kafka is one of the most popular open-source distributed streaming platforms, providing high-throughput, low-latency handling of real-time data streams. MSK lets developers spin up Kafka as a managed service and offload the operational overhead to AWS.
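From Spark’s side, an MSK cluster looks like any other Kafka endpoint. The connection sketch below is configuration only, not run here: the broker address and topic name are placeholders (MSK hands you a bootstrap-broker list when the cluster is created), and the spark-sql-kafka connector package must be on the classpath.

```python
from pyspark.sql import SparkSession

# Configuration sketch: reading an MSK topic with Spark Structured Streaming.
spark = SparkSession.builder.appName("msk-reader").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers",
                  "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092")  # placeholder
          .option("subscribe", "events")           # placeholder topic name
          .option("startingOffsets", "latest")
          .load())

# Kafka records arrive as binary key/value columns; cast before use.
query = (events.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .start())
```

Because only the bootstrap servers change, moving an existing Spark streaming job from a self-managed Kafka cluster to MSK is largely a configuration swap.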

https://www.qubole.com/blog/amazon-managed-streaming-for-kafka/

Deploy production-grade Spark to Kubernetes in minutes
===============================================

In December 2018 we released the public beta of Pipeline and introduced a piece of Banzai Cloud terminology: spotguides. We have already gone deep into what spotguides are and how they supercharge Kubernetes deployments of application frameworks (automated deployments, preconfigured GitHub repositories, CI/CD, job-specific automated cluster sizing, Vault-based secret management, etc.). This post focuses on one specific spotguide: Spark with HistoryServer.

https://banzaicloud.com/blog/spotguides-spark/