Category Archives: Spark

datadotzweekly

DataDotz Bigdata Weekly

APACHE FLINK
==========

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

Apache Flink is an open source project that is well suited to form the basis of a real-time stream processing pipeline. It offers unique capabilities that are tailored to the continuous analysis of streaming data.


DataDotz Bigdata Weekly

APACHE SPARK
==========

Using Apache Spark for large-scale language model training

Facebook has written about their experience converting their n-gram language model training pipeline from Apache Hive to Apache Spark. The post describes their Hive-based solution, their Spark-based solution, and the scalability challenges...


DataDotz Bigdata Weekly

Replicating Relational Databases with StreamSets Data Collector
==========

Relational Databases with StreamSets Data Collector

StreamSets Data Collector has long supported both reading and writing data from and to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table.

Apache Spark

Apache Spark (Databricks) breaks previous sort record

Is MapReduce coming to an end?
Databricks recently published benchmark results for sorting 100 TB of data on AWS EC2 machines. The results show that Spark is a general-purpose distributed processing framework suited to both in-memory and on-disk workloads. The benchmark used a recent version of Apache Spark (Spark 1.1). A figure in the post depicts the results.
