Category Archives: Uncategorized

DataDotz BigData Weekly

Scalability Improvement of Apache Impala 2.12.0 in CDH 5.15.0
=======================================================

Apache Impala is a massively parallel SQL execution engine that lets users run complex queries on large data sets with interactive response times. An Impala cluster usually comprises tens to hundreds of nodes, with an Impala daemon (Impalad) running on each node.

DataDotz BigData Weekly

Using Docker and PySpark
=======================

PySpark can be a bit difficult to get up and running on your machine. Docker is a quick and easy way to get a Spark environment working locally, and it is how I run PySpark on my machine. I’ll start by giving an introduction to Docker.

DataDotz BigData Weekly

Getting Your Feet Wet with Stream Processing
======================================

When you create a stream processing application with Kafka’s Streams API, you create a Topology either using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start().

DataDotz BigData Weekly

HyperLogLog in Presto: A significantly faster way to handle cardinality estimation
=============================================================

Computing the count of distinct elements in massive data sets is often necessary but computationally intensive. Say you need to determine the number of distinct people visiting Facebook in the past week using a single machine. Doing this with a traditional SQL query on a data set as massive as the ones we use at Facebook would take days and terabytes of memory.

DataDotz BigData Weekly

Best Practices for Securing Amazon EMR
============================================

Amazon EMR is a managed Hadoop framework that you use to process vast amounts of data. One of the reasons that customers choose Amazon EMR is its security features. For example, customers in regulated industries such as financial services (FINRA, for instance) and healthcare choose Amazon EMR as part of their data strategy.

DataDotz BigData Weekly

Data Management Strategies for Computer Vision
============================================

Kafka is a messaging system. Let us understand more about messaging systems and the problems they solve. Take the currently popular microservice architecture as an example. Let’s assume that there are three terminal-oriented.
