DataDotz BigData Weekly

DataDotz Bigdata Weekly

This entry was posted in Uncategorized on by .   0 Comment[s]

HyperLogLog in Presto: A significantly faster way to handle cardinality estimation

Computing the count of distinct elements in massive data sets is often necessary but computationally intensive. Say you need to determine the number of distinct people visiting Facebook in the past week using a single machine. Doing this with a traditional SQL query on a data set as massive as the ones we use at Facebook would take days and terabytes of memory.To speed up these queries, we implemented an algorithm called HyperLogLog (HLL) in Presto, a distributed SQL query engine.We have seen great improvements, with some queries being run within minutes, including those used to analyze thousands of A/B tests.

Kafka Performance Tuning — Ways for Kafka Optimization

There are few configuration parameters to be considered while we talk about Kafka Performance tuning. Hence, to improve performance, the most important configurations are the one, which controls the disk flush rate.Also, we can divide these configurations on the component basis. So, let’s talk about Producer first.
New Features of Kafka 2.1

Kafka 2.1 is now available with Java 11! Java 11 was created in September 2018 and we get all the benefits from it, such as the Improved SSL and TLS performance (the improvements come from Java 9). According to one of the main Kafka committer, it is 2.5 times faster than Java 8.

Introducing Hive-Kafka integration for real-time Kafka SQL queries

Stream processing engines/libraries like Kafka Streams provide a programmatic stream processing access pattern to Kafka. Application developers love this access pattern but when you talk to BI developers, their analytics requirements are quite different which are focused on use cases around ad hoc analytics, data exploration and trend discovery. BI persona requirements for Kafka access include:
Accelerating Hive Queries with Parquet Vectorization

Apache Hive is a widely adopted data warehouse engine that runs on Apache Hadoop. Features that improve Hive performance can significantly improve the overall utilization of resources on the cluster. Hive processes data using a chain of operators within the Hive execution engine. These operators are scheduled in the various tasks (for example, MapTask, ReduceTask, or SparkTask) of the query execution plan. Traditionally, these operators are designed to process one row at a time.
Apache Spark — Tips and Tricks for better performance

Apache Spark is quickly gaining steam both in the headlines and real-world adoption. Top use cases are Streaming Data, Machine Learning, Interactive Analysis and more. Many known companies uses it like Uber, Pinterest and more. So after working with Spark for more then 3 years in production, I’m happy to share my tips and tricks for better performance.