Category Archives: DataDotz Weekly

DataDotz Bigdata Weekly

Apache Flink
==========

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

Apache Flink is an open-source project that is well suited to form the basis of a real-time stream processing pipeline. It offers unique capabilities tailored to the continuous analysis of streaming data.
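As a rough illustration of the kind of job such a pipeline runs, here is a minimal sketch using Flink's DataStream API to count words over tumbling processing-time windows; the socket source, port, and window size are placeholder assumptions, not details from the article.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; a production pipeline would read from Kinesis, Kafka, etc.
        env.socketTextStream("localhost", 9999)
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                }
            })
            .keyBy(value -> value.f0)                                     // group by word
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))  // 10-second windows
            .sum(1)                                                       // sum the counts
            .print();

        env.execute("Streaming word count (sketch)");
    }
}
```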

Read More

DataDotz Bigdata Weekly

Apache Spark
==========

Using Apache Spark for large-scale language model training

Facebook has written about its experience converting its n-gram language model training pipeline from Apache Hive to Apache Spark. The post describes the Hive-based solution, the Spark-based solution, and the scalability challenges encountered along the way.
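The post itself is narrative, but the core of an n-gram counting job expressed directly in Spark looks roughly like the sketch below; the input path, the trigram size, and the output location are illustrative assumptions, not Facebook's actual pipeline.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class NGramCounts {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ngram-counts");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input: one sentence per line.
            JavaRDD<String> sentences = sc.textFile("hdfs:///corpus/sentences.txt");

            JavaPairRDD<String, Long> counts = sentences
                .flatMap(line -> {
                    // Emit every trigram (n = 3) in the sentence.
                    String[] tokens = line.toLowerCase().split("\\s+");
                    List<String> ngrams = new ArrayList<>();
                    for (int i = 0; i + 3 <= tokens.length; i++) {
                        ngrams.add(tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2]);
                    }
                    return ngrams.iterator();
                })
                .mapToPair(ngram -> new Tuple2<>(ngram, 1L))
                .reduceByKey(Long::sum);

            counts.saveAsTextFile("hdfs:///corpus/trigram-counts");
        }
    }
}
```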

Read More

DataDotz Bigdata Weekly

StreamSets Data Collector
==========

Replicating Relational Databases with StreamSets Data Collector

StreamSets Data Collector has long supported reading data from and writing data to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a separate pipeline per table.
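The multi-table support described in the article is configured inside Data Collector itself, but the underlying idea (enumerating every table through JDBC metadata rather than wiring one pipeline per table) can be sketched in plain Java; the connection URL and credentials below are placeholders.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ListTables {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings.
        String url = "jdbc:mysql://localhost:3306/inventory";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            DatabaseMetaData meta = conn.getMetaData();
            // List every base table in the database; a multi-table ingest job
            // would then read each one rather than requiring a separate,
            // hand-built pipeline per table.
            try (ResultSet tables = meta.getTables(null, null, "%", new String[] {"TABLE"})) {
                while (tables.next()) {
                    System.out.println(tables.getString("TABLE_NAME"));
                }
            }
        }
    }
}
```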

Read More

DataDotz Bigdata Weekly

Apache Apex
==========

SQL on Apache Apex

Big Data has an interesting history. In the past few years, massive amounts of data have been generated for processing and analytics, and enterprises have struggled to process ever-increasing volumes of data. The traditional answer was to scale up, but scaling up was costly and resulted in vendor lock-in, so enterprises started to scale out instead. Enter the Big Data ecosystem, with projects such as Hadoop, YARN, and Spark that largely satisfied those processing needs.

Read More

DataDotz Bigdata Weekly

Apache Spark
==========

ETL with Apache Spark

A common design pattern emerges when teams begin to stitch together existing systems and an enterprise data hub (EDH) cluster: file dumps, typically in a format such as CSV, are regularly uploaded to the EDH, where they are unpacked, transformed into an optimal query format, and tucked away in HDFS for use by various EDH components. When these file dumps are large or arrive very frequently, these simple steps can significantly slow down an ingest pipeline.
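A minimal version of that unpack-and-convert step, sketched with Spark SQL, might look like the following; the file locations, schema inference, and the partition column are assumptions for illustration only.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CsvToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-to-parquet")
                .getOrCreate();

        // Read the raw CSV dump as it lands in the cluster (placeholder path).
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///landing/events/*.csv");

        // Rewrite it in a columnar, query-friendly format,
        // partitioned by an assumed "event_date" column.
        raw.write()
                .mode(SaveMode.Overwrite)
                .partitionBy("event_date")
                .parquet("hdfs:///warehouse/events_parquet");

        spark.stop();
    }
}
```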

Read More

DataDotz Bigdata Weekly

Apache Oozie
==========

Use the New Apache Oozie Database Migration Tool

The Apache Oozie server is a stateless web application by design, with all information about running and completed workflows, coordinator jobs, and bundle jobs stored in a relational database. Prior to Cloudera Manager 5.4, Oozie was configured by default to use the embedded Apache Derby database for this purpose.

Read More

DataDotz Bigdata Weekly

Apache Flink
==========

Apache Flink on Amazon EMR

Apache Flink is a parallel data processing engine that customers are using to build real-time, big data applications. Flink enables you to perform transformations on many different data sources, such as Amazon Kinesis Streams or the Apache Cassandra database, and it provides both batch and streaming APIs. Flink also has some SQL support for these stream and batch datasets. Most of Flink's API actions closely resemble the transformations on distributed object collections found in Apache Hadoop or Apache Spark.
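Since the excerpt calls out Amazon Kinesis Streams as a source, here is a hedged sketch of wiring Flink's Kinesis connector into a DataStream job; the stream name, region, and initial position are assumptions, and the connector artifact must match the Flink version shipped with your EMR release.

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KinesisSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");          // assumed region
        props.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST"); // start at the tip

        // "clickstream" is a placeholder Kinesis stream name; credentials are
        // picked up from the default AWS provider chain on the EMR nodes.
        DataStream<String> events = env.addSource(
                new FlinkKinesisConsumer<>("clickstream", new SimpleStringSchema(), props));

        events.print();
        env.execute("Kinesis source (sketch)");
    }
}
```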

Read More

DataDotz Bigdata Weekly

Apache Hadoop
==========

HDFS Intra-DataNode Disk Balancer in Apache Hadoop

The Apache Hadoop community developed several offline scripts to alleviate the data imbalance issue. However, because these scripts live outside the HDFS codebase, they require that the DataNode be taken offline before data can be moved between disks. As a result, HDFS-1312 also introduces an online disk balancer that is designed to re-balance the volumes on a running DataNode based on various metrics.

Read More

DataDotz Bigdata Weekly

Apache Flink
==========

Getting Started with Apache Flink on MapR Converged Data Platform

Flink is an open-source platform for distributed stream and batch data processing, with several APIs for building stream-oriented applications. It is very common for Flink applications to use Apache Kafka for data input and output. MapR Streams is a distributed messaging system for streaming event data at scale; it is integrated into the MapR Converged Data Platform and is based on the Apache Kafka API (0.9.0).
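Because MapR Streams exposes the Kafka 0.9 API, a Flink job can consume it through the standard Kafka 0.9 connector; the sketch below assumes a hypothetical topic and broker address and is not taken from the MapR tutorial itself.

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker list
        props.setProperty("group.id", "flink-demo");              // placeholder consumer group

        // "events" is a placeholder topic; with MapR Streams the topic name
        // would follow its "/stream:topic" naming convention instead.
        DataStream<String> messages = env.addSource(
                new FlinkKafkaConsumer09<>("events", new SimpleStringSchema(), props));

        messages.print();
        env.execute("Kafka 0.9 source (sketch)");
    }
}
```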

Read More