Scalability Improvement of Apache Impala 2.12.0 in CDH 5.15.0
Apache Impala is a massively-parallel SQL execution engine, allowing users to run complex queries on large data sets with interactive query response times. An Impala cluster is usually comprised of tens to hundreds of nodes, with an Impala daemon (Impalad) running on each node. According to wikipedia “Docker is a computer program that performs operating-systCommunication between the Impala daemons happens through remote procedure calls (RPCs) using the Apache Thrift library. When processing queries,
Uber’s GPU-Powered Open Source, Real-time Analytics Engine
At Uber, real-time analytics allow us to attain business insights and operational efficiency, enabling us to make data-driven decisions to improve experiences on the Uber platform. For example, our operations team relies on data to monitor the market health and spot potential issues on our platform; software powered by machine learning models leverages data to predict rider supply and driver demand; and data scientists use data to improve machine learning models for better forecasting.
Spark Surprises for the Uninitiated
He started by adding a monotonically increasing ID column to the DataFrame. Spark has a built-in function for this, monotonically_increasing_id — you can find how to use it in the docs.His idea was pretty simple: once creating a new column with this increasing ID, he would select a subset of the initial DataFrame and then do an anti-join with the initial one to find the complement1.
40x faster hash joiner with vectorized execution
For the past four months, I’ve been working with the incredible SQL Execution team at Cockroach Labs as a backend engineering intern to develop the first prototype of a batched, column-at-a-time execution engine.
Finding Kafka’s throughput limit in Dropbox infrastructure
Apache Kafka is a popular solution for distributed streaming and queuing for large amounts of data. It is widely adopted in the technology industry, and Dropbox is no exception. Kafka plays an important role in the data fabric of many of our critical distributed systems: data analytics, machine learning, monitoring, search, and stream processing (Cape), to name a few.
Google BigQuery’s Python SDK: Creating Tables Programmatically
GCP is on the rise, and it’s getting harder and harder to have conversations around data without addressing the 500-pound gorilla in the room: Google BigQuery. With most enterprises comfortably settled into their Apache-based Big Data stacks, BigQuery rattles the cages of convention for many. Luckily, Hackers And Slackers is no such enterprise. Thus, we aren’t afraid to ask the Big question: how much easier would life be with BigQuery.