Category Archives: Spark


DataDotz Bigdata Weekly

Replicating Relational Databases with StreamSets Data Collector

Relational Databases with StreamSets

StreamSets Data Collector has long supported reading from and writing to relational databases via Java Database Connectivity (JDBC). Configuring a pipeline to read data from an individual table was straightforward, but ingesting records from an entire database was cumbersome because it required one pipeline per table.

Apache Spark

Apache Spark (Databricks) breaks previous sort record

Is MapReduce coming to an end?
Databricks recently published its benchmark results for sorting 100 TB of data on AWS EC2 machines. The results show that Spark is a general-purpose distributed processing framework suited to both in-memory and on-disk workloads. The benchmark used a recent version of Apache Spark (Spark 1.1). The picture below depicts the results.
