DataDotz BigData Weekly


Data processing today is done in the form of pipelines that include steps such as aggregation, sanitization, and filtering, and that finally generate insights by applying statistical models. Amazon Kinesis is a platform for building pipelines for streaming data at the scale of terabytes per hour. Parts of the Kinesis platform compete directly with the Apache Kafka project for big data analysis. The platform is divided into three separate products: Firehose, Streams, and Analytics.
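To make the idea of a staged pipeline concrete, here is a minimal sketch in plain Python. It is not Kinesis code: the function names, the record shape, and the value range are all assumptions chosen for illustration. Each stage is a generator, so records flow through sanitization and filtering one at a time before being aggregated.

```python
from collections import defaultdict
from statistics import mean


def sanitize(records):
    """Drop malformed records and normalise field types (hypothetical record shape)."""
    for rec in records:
        try:
            yield {
                "sensor": str(rec["sensor"]).strip().lower(),
                "value": float(rec["value"]),
            }
        except (KeyError, TypeError, ValueError):
            continue  # skip records that cannot be repaired


def keep_in_range(records, low, high):
    """Filter out readings outside an assumed plausible range."""
    for rec in records:
        if low <= rec["value"] <= high:
            yield rec


def aggregate(records):
    """Compute the mean value per sensor."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["sensor"]].append(rec["value"])
    return {sensor: mean(values) for sensor, values in grouped.items()}


def run_pipeline(raw):
    # Stages are chained lazily; only aggregate() materialises the data.
    return aggregate(keep_in_range(sanitize(raw), low=-50.0, high=150.0))


sample = [
    {"sensor": " A ", "value": "20.0"},
    {"sensor": "a", "value": 22.0},
    {"sensor": "b", "value": "oops"},  # dropped by sanitize
    {"sensor": "b", "value": 999},     # dropped by the range filter
    {"sensor": "b", "value": 10},
]
result = run_pipeline(sample)
print(result)
```

A streaming platform runs the same kind of stages continuously over unbounded input, rather than once over a list as this sketch does.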

Kafka Connect vs StreamSets

StreamSets is a general-purpose dataflow management system. Kafka Connect was designed specifically for Apache Kafka: one endpoint of every Kafka connector is always Kafka, and the other endpoint is another data system. Both Kafka Connect and StreamSets Data Collector are open source, Apache-licensed tools that can help you get event streams in and out of Apache Kafka or MapR Streams and build data pipelines. Each has advantages and disadvantages.
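The "one endpoint is always Kafka" design shows up directly in a connector's configuration. The sketch below builds the JSON body that a Kafka Connect worker accepts on its REST endpoint for creating connectors, using the file source connector that ships with Apache Kafka. The connector name, file path, and topic are placeholder values, and the helper function is hypothetical.

```python
import json


def file_source_connector(name, path, topic):
    """Build the JSON body for creating a file source connector (hypothetical helper)."""
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": path,    # the non-Kafka endpoint: a local file to read from
            "topic": topic,  # the Kafka endpoint: the topic that receives each line
        },
    }


# Placeholder values, chosen only for illustration.
payload = file_source_connector("demo-file-source", "/tmp/events.txt", "events")
print(json.dumps(payload, indent=2))
```

A sink connector looks the same in reverse: it names the Kafka topics to read from and the external system to write to.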

Running Streaming Jobs Once a Day for 10x Cost Savings

Spark’s Structured Streaming has a “Processing Time” trigger that attempts to process new data at regular intervals (like cron). For a cluster that is elastic in size, this can save money by only bringing up the necessary resources when the trigger fires. With that said, jobs can still be stateful, and Structured Streaming has a few other features (such as bookkeeping of failures and table-level atomicity) that make it more attractive than a normal batch operation.
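The bookkeeping that makes a scheduled streaming job attractive can be illustrated with a toy model. The following is plain Python, not Spark code, and every name in it is hypothetical. Each run reads a checkpoint, processes only the records that arrived since the previous run, and then records the new offset with an atomic file rename.

```python
import json
import os
import tempfile


def run_once(source, checkpoint_path, sink):
    """Process records that arrived since the last run, then stop (toy model)."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            offset = json.load(fh)["offset"]

    batch = source[offset:]
    sink.extend(record.upper() for record in batch)  # stand-in for real processing

    # Write the new offset to a temporary file and rename it, so a crash
    # never leaves a half-written checkpoint behind.
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump({"offset": offset + len(batch)}, fh)
    os.replace(tmp_path, checkpoint_path)
    return len(batch)


workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "checkpoint.json")
source, sink = ["a", "b"], []

first = run_once(source, checkpoint, sink)   # processes "a" and "b"
source.append("c")                           # new data arrives before the next run
second = run_once(source, checkpoint, sink)  # processes only "c"
print(first, second, sink)
```

This sketch simplifies one important point: the sink write and the checkpoint update are separate steps here, so a crash between them would reprocess the batch on the next run. A real engine has to coordinate the two.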

HDFS Maintenance State

System maintenance operations such as updating operating systems and applying security patches or hotfixes are routine in any data center. DataNodes undergoing such maintenance can go offline for anywhere from a few minutes to several hours. By design, Apache Hadoop HDFS can handle DataNodes going down. However, uncoordinated maintenance on several DataNodes at the same time could lead to temporary data availability issues.
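The availability risk can be checked with simple set arithmetic: a block becomes temporarily unreadable if every DataNode holding one of its replicas is offline at the same time. The sketch below is a standalone illustration, not an HDFS API. The block and node names are made up, and the replica map is assumed to be known in advance.

```python
def blocks_at_risk(block_replicas, maintenance_nodes):
    """Return blocks whose replicas all live on nodes scheduled for maintenance."""
    offline = set(maintenance_nodes)
    return sorted(
        block
        for block, nodes in block_replicas.items()
        if set(nodes) <= offline  # every replica would be offline
    )


# Hypothetical replica placement with a replication factor of 3.
replicas = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
    "blk_3": ["dn1", "dn4", "dn5"],
}

uncoordinated = blocks_at_risk(replicas, ["dn1", "dn2", "dn3"])
coordinated = blocks_at_risk(replicas, ["dn1"])
print(uncoordinated, coordinated)
```

Taking three nodes down together leaves one block with no live replica, while taking down a single node leaves none.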

Building a Data Exploration Tool with React

Elasticsearch’s built-in visualization tool, Kibana, is robust and the appropriate tool in many cases. However, it is geared specifically towards log exploration and time-series data, and we felt that its steep learning curve would impede adoption among data scientists accustomed to writing SQL. The solution was to create something that would replicate some of Kibana’s essential functionality while hiding Elasticsearch’s complexity behind SQL labels and terminology in the UI.
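One way to picture "SQL labels over Elasticsearch" is a small translation layer that turns WHERE-style conditions picked in the UI into an Elasticsearch search body. This is a sketch of the general idea only, not the tool described above, and it is written in Python rather than the tool's own React code. The function name and the field names are hypothetical, and only equality and range comparisons are handled.

```python
def to_es_query(where, limit=10):
    """Translate (column, operator, value) triples into an Elasticsearch search body."""
    range_ops = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}
    filters = []
    for column, op, value in where:
        if op == "=":
            filters.append({"term": {column: value}})
        elif op in range_ops:
            filters.append({"range": {column: {range_ops[op]: value}}})
        else:
            raise ValueError(f"unsupported operator: {op}")
    # All conditions are combined with AND by the bool/filter clause.
    return {"size": limit, "query": {"bool": {"filter": filters}}}


# Roughly: SELECT * FROM logs WHERE status = 'error' AND latency_ms >= 500 LIMIT 5
body = to_es_query([("status", "=", "error"), ("latency_ms", ">=", 500)], limit=5)
print(body)
```

The user only ever sees the SQL-style columns and operators, and the query DSL stays an implementation detail.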

Nested Data Using Higher Order Functions in SQL on Databricks

Nested data types offer Databricks customers and Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps, and structures inside columns, which can help you model your data in a more natural way. While this feature is certainly useful, it can be cumbersome to manipulate data inside the complex objects because SQL (and Spark) do not have primitives for working with such data. In addition, the existing workarounds are time-consuming, non-performant, and non-trivial.
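The difference between the usual workaround and a higher-order function can be emulated in plain Python. This is an illustration of the semantics only, not Spark code; the row layout and function names are assumptions. The first function flattens each array into separate rows, applies the operation, and regroups by key. The second applies the operation directly to the elements inside each array, which is what a SQL higher-order function such as `TRANSFORM(values, value -> value + 1)` expresses.

```python
from collections import defaultdict

# Hypothetical table with an array column.
rows = [
    {"key": 1, "values": [1, 2, 3]},
    {"key": 2, "values": [10, 20]},
]


def explode_and_collect(rows, fn):
    """Workaround: flatten arrays into (key, value) rows, apply fn, regroup by key."""
    exploded = [(row["key"], v) for row in rows for v in row["values"]]
    regrouped = defaultdict(list)
    for key, value in exploded:
        regrouped[key].append(fn(value))
    return [{"key": k, "values": vs} for k, vs in regrouped.items()]


def transform(rows, fn):
    """Higher-order style: apply fn to each element inside the array, row by row."""
    return [
        {"key": row["key"], "values": [fn(v) for v in row["values"]]}
        for row in rows
    ]


def add_one(value):
    return value + 1


via_explode = explode_and_collect(rows, add_one)
via_transform = transform(rows, add_one)
print(via_explode)
print(via_transform)
```

Both give the same answer on this input, but the flatten-and-regroup version needs a unique key and silently drops rows whose array is empty, which hints at why the workaround is called non-trivial.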