

DataDotz Bigdata Weekly

Apache Spark
==========

Using Apache Spark for large-scale language model training

Facebook has written about their experience converting their n-gram language model training pipeline from Apache Hive to Apache Spark. The post describes their Hive-based solution, their Spark-based solution, and the scalability challenges they encountered along the way.
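To make the workload concrete, here is a minimal sketch of n-gram counting on Spark. This is illustrative only, not Facebook's production pipeline; the HDFS input and output paths are hypothetical, and the corpus is assumed to contain one sentence per line.

```scala
import org.apache.spark.sql.SparkSession

// Minimal n-gram counting sketch (illustrative; not Facebook's pipeline).
object NGramCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ngram-counts").getOrCreate()
    val n = 3 // trigrams

    val counts = spark.sparkContext
      .textFile("hdfs:///corpus/sentences.txt")            // hypothetical input path
      .map(_.toLowerCase.split("\\s+"))                    // tokenize each sentence
      .flatMap(_.sliding(n).filter(_.length == n)          // emit each n-token window,
                 .map(_.mkString(" ")))                    // skipping short sentences
      .map(gram => (gram, 1L))
      .reduceByKey(_ + _)                                  // count occurrences per n-gram

    counts.saveAsTextFile("hdfs:///corpus/ngram-counts")   // hypothetical output path
    spark.stop()
  }
}
```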


DataDotz Bigdata Weekly

Replicating Relational Databases with StreamSets Data Collector
==========


StreamSets Data Collector has long supported reading data from and writing data to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines that read from individual tables, ingesting records from an entire database was cumbersome, requiring a separate pipeline per table.
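StreamSets pipelines are configured in its UI rather than written as code, so as a stand-in, here is the equivalent "one reader per table" pattern expressed with Spark's JDBC source. It shows why whole-database ingest was cumbersome: every table needs its own explicitly configured reader. The connection URL, table name, credentials, and landing path are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of per-table JDBC ingest; multiply this by every table in the
// database to see why a pipeline-per-table approach does not scale.
object SingleTableIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db.example.com:3306/shop") // hypothetical URL
      .option("dbtable", "orders")                            // one table per reader
      .option("user", "etl")
      .option("password", sys.env("DB_PASSWORD"))             // read secret from env
      .load()

    orders.write.parquet("hdfs:///raw/orders")                // hypothetical landing path
    spark.stop()
  }
}
```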


DataDotz Bigdata Weekly

Apache Spark
==========

ETL with Apache Spark
A common design pattern often emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to EDH, where they are then unpacked, transformed into optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or happen very often, these simple steps can significantly slow down an ingest pipeline. Continue reading
