DataDotz Bigdata Weekly


Data ingestion into Splunk

Splunk and Amazon Web Services (AWS) jointly announced that Amazon Kinesis Data Firehose now supports Splunk Enterprise and Splunk Cloud as a delivery destination. This native integration between Splunk Enterprise, Splunk Cloud, and Amazon Kinesis Data Firehose is designed to make AWS data ingestion setup seamless, while offering a secure and fault-tolerant delivery mechanism. The stated aim is to enable customers to monitor and analyze machine data from any source and use it to deliver operational intelligence and optimize IT, security, and business performance.


Apache Impala

Impala can now take advantage of column statistics when scanning data stored in Parquet files. This post describes how it uses the min and max values, as well as information stored in dictionaries, to skip entire blocks of data during a query. There are a few considerations when loading your data, which the post also describes.
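The skipping idea can be illustrated with a minimal sketch. This is not Impala's implementation: the `RowGroup` class, its fields, and the sample values are hypothetical stand-ins for the per-column metadata a Parquet file carries, and only an equality predicate is handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class RowGroup:
    """Hypothetical stand-in for one block's metadata for a single column."""
    name: str
    min_value: int
    max_value: int
    # Distinct values in the block, or None when no dictionary is available.
    dictionary: Optional[Set[int]] = None


def can_skip(group: RowGroup, wanted: int) -> bool:
    """Return True when no row in the block can satisfy `column = wanted`."""
    # Min/max check: the value lies outside the block's value range.
    if wanted < group.min_value or wanted > group.max_value:
        return True
    # Dictionary check: the value is in range but absent from the dictionary.
    if group.dictionary is not None and wanted not in group.dictionary:
        return True
    return False


def groups_to_scan(groups: List[RowGroup], wanted: int) -> List[str]:
    """Names of the blocks that still have to be read for the predicate."""
    return [g.name for g in groups if not can_skip(g, wanted)]


groups = [
    RowGroup("rg0", 1, 100, {1, 50, 100}),
    RowGroup("rg1", 90, 200, {90, 150, 200}),
    RowGroup("rg2", 300, 400),  # no dictionary, so only min/max applies
]

print(groups_to_scan(groups, 150))  # ['rg1']
print(groups_to_scan(groups, 95))   # []
```

In this sketch, the min/max check only helps when block ranges do not overlap much, which is presumably why the way data is loaded matters.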
Apache Kafka

This post describes the role that a streaming system, like Apache Kafka, can play in a microservices architecture. It argues that a streaming system can resolve some of the problems that arise from the large amounts of data and the interconnectivity in a microservices architecture.
YARN Capacity Scheduler

The Hortonworks blog has a thorough overview of the YARN capacity scheduler. It describes hierarchical queues, several queue archetypes (including ad-hoc, batch, exploration, and always on), CPU scheduling, preemption, and more.
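As a rough illustration of hierarchical queues, the sketch below computes each queue's share of the whole cluster from per-level percentages. The queue names and numbers are hypothetical and not taken from the post. It assumes the percentage mode, in which each queue's capacity (set through `yarn.scheduler.capacity.<queue-path>.capacity`) is a share of its parent and sibling capacities add up to 100.

```python
import math
from typing import Dict

# Hypothetical queue tree: path -> capacity as a percentage of the parent queue.
QUEUE_CAPACITY: Dict[str, float] = {
    "root": 100.0,
    "root.batch": 50.0,
    "root.adhoc": 50.0,
    "root.batch.etl": 75.0,
    "root.batch.reports": 25.0,
}


def absolute_capacity(queue: str, capacities: Dict[str, float]) -> float:
    """Percentage of the whole cluster guaranteed to `queue`."""
    share = 1.0
    parts = queue.split(".")
    # Multiply the fractions along the path from root down to the queue.
    for depth in range(1, len(parts) + 1):
        path = ".".join(parts[:depth])
        share *= capacities[path] / 100.0
    return share * 100.0


def validate_children(capacities: Dict[str, float]) -> None:
    """Raise ValueError if the children of any queue do not sum to 100."""
    totals: Dict[str, float] = {}
    for path, capacity in capacities.items():
        if "." in path:
            parent = path.rsplit(".", 1)[0]
            totals[parent] = totals.get(parent, 0.0) + capacity
    for parent, total in totals.items():
        if not math.isclose(total, 100.0):
            raise ValueError(f"children of {parent} sum to {total}, not 100")


validate_children(QUEUE_CAPACITY)
for name in sorted(QUEUE_CAPACITY):
    print(name, absolute_capacity(name, QUEUE_CAPACITY))
```

With these made-up numbers, `root.batch.etl` ends up with 75 percent of the 50 percent given to `root.batch`, or 37.5 percent of the cluster.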
Hive Data Retrieval using Spark, StreamSets and Predera

This guest post on the StreamSets blog shows how Predera uses Hive, Spark, and StreamSets for its data pipeline. The walkthrough includes example commands from Hive and screenshots from StreamSets.