DataDotz BigData Weekly

Containerized applications on Apache Hadoop
=====

The YARN community is constantly looking at ways to enable new use cases and improve existing capabilities. With the emergence of lightweight containerization technologies such as Docker, the benefit of bringing this capability to YARN was clear. Hortonworks, a leading Hadoop distribution vendor, has written a blog post on running containerized applications on Apache Hadoop.
https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/

Confluent: Visualizations on Apache Kafka using KSQL
==========

KSQL is a game-changer not only for application developers, but also for non-technical business users. How? The SQL interface opens up Kafka data to SQL-based analytics platforms. Business analysts who are accustomed to non-coding, drag-and-drop interfaces can now apply their analytics skills to Kafka.

https://www.confluent.io/blog/visualizations-on-apache-kafka-made-easy-with-ksql/

Serverless MapReduce Framework
==========

The concept of MapReduce is incredibly powerful, but the amount of boilerplate needed to write even a simple Hadoop job in Java is, in the author's opinion, rather off-putting. Hadoop and Spark also require at least some infrastructure knowledge. Services like EMR and Dataproc make this easier, but at a hefty cost. The author's answer is corral, a framework for writing arbitrary MapReduce applications that can be executed in AWS Lambda.
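
For readers new to the model, here is a minimal in-process sketch of the map, shuffle, and reduce phases, using word count as the example. It is not corral's API (corral is a separate framework, and the function names below are hypothetical); it only illustrates the concept the post builds on.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence inside a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real framework the map and reduce calls run on many workers and the shuffle moves data between them; only the two user-supplied functions stay this small.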

https://benjamincongdon.me/blog/2018/05/02/Introducing-Corral-A-Serverless-MapReduce-Framework/

Netflix optimizes Apache Flink for their use cases
==========

Datanami has coverage of a talk from Flink Forward on Netflix’s move to Apache Flink. There are some impressive numbers (e.g. 12PB of data per day) as well as several AWS-specific optimizations that the Netflix team implemented (such as randomness in output file names to prevent S3 hot-spotting).
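
The file-name randomization idea can be sketched as follows. This is a hedged illustration, not Netflix's implementation: the function name, the prefix length, and the choice to put the random part at the front of the object key are all assumptions made for the example.

```python
import secrets


def randomized_key(base_dir: str, filename: str, prefix_len: int = 4) -> str:
    """Return an object key that starts with a short random hex string.

    Assumption: placing the random part first spreads writes across
    key ranges instead of piling them under one shared path prefix.
    """
    # token_hex(n) yields 2*n hex characters; trim to the requested length.
    prefix = secrets.token_hex(prefix_len)[:prefix_len]
    return f"{prefix}/{base_dir.strip('/')}/{filename}"
```

A call such as `randomized_key("events/2018-04-30", "part-0001.parquet")` returns a different key on each run, so readers need a manifest or listing step to find the files again. That trade-off is part of the design choice.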

https://www.datanami.com/2018/04/30/how-netflix-optimized-flink-for-massive-scale-on-aws/

Google acquires Cask Data
==========

Cask Data is a startup that specializes in building solutions for running big data analytics services on Hadoop. The company's own product is the Cask Data Application Platform (CDAP). Cask has announced that it is joining Google Cloud.

http://blog.cask.co/2018/05/cask-is-joining-google-cloud/
https://seekingalpha.com/news/3357546-google-acquires-big-data-hadoop-company

Azure Storage with HDInsight clusters
==========

Cloud-based big data deployments have been gaining momentum over the last couple of years. Most cloud vendors, such as AWS, GCP, and Azure, offer services built around Hadoop. Microsoft has written an article on the HDInsight storage architecture.

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage

Hulu: Migrating Hadoop clusters from one data center to another
==========

Hulu, which provides a subscription video-on-demand service, recently migrated its Hadoop clusters from one data center to another. In this tech blog post, the team describes the challenges of the migration and the solutions they considered. It is definitely an interesting read.

https://medium.com/hulu-tech-blog/migrating-hulus-hadoop-clusters-to-a-new-data-center-part-one-extending-our-hadoop-instance-b88c4bda61bc

Azure Cosmos DB: A Technical Overview
==========

Microsoft has written an article on its database service, Azure Cosmos DB, a globally distributed, horizontally partitioned, multi-model database service. The article covers the design goals and other important aspects of the database, such as horizontal partitioning, latency, and availability.
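
As a generic illustration of horizontal partitioning, the sketch below maps a partition key to one of a fixed number of partitions with a stable hash. It is not Cosmos DB's actual placement scheme; the function name and the use of SHA-256 are assumptions chosen for the example.

```python
import hashlib


def partition_for(partition_key: str, partition_count: int) -> int:
    """Map a partition key to a partition index using a stable hash."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then wrap to the range.
    return int.from_bytes(digest[:8], "big") % partition_count
```

Because the hash is deterministic, every item with the same key lands on the same partition, which is what lets reads and writes for one key be served by a single node.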

https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/