DataDotz BigData Weekly

Using Docker and Pyspark
=======================

PySpark can be a bit difficult to get up and running on your machine. Docker is a quick and easy way to get a Spark environment working on your local machine, and it is how I run PySpark on mine. I’ll start with an introduction to Docker. According to Wikipedia, “Docker is a computer program that performs operating-system-level virtualization, also known as ‘containerization’ ”. To greatly simplify, Docker creates a walled-off Linux operating system, called a container, that runs software on top of your machine’s OS.
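
As a rough sketch of the kind of setup the article describes, assume you have started a container from a PySpark-capable image such as jupyter/pyspark-notebook (the image choice is an assumption here, not necessarily the article’s exact setup); a minimal check that Spark works inside it might look like this:

```python
# Minimal PySpark smoke test to run inside the container.
# The jupyter/pyspark-notebook image is an assumed example, not
# necessarily the one the article uses.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                 # use all cores available to the container
    .appName("docker-pyspark-check")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()                               # prints the two rows if Spark is healthy

spark.stop()
```
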
https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867

Deploying Logstash pipelines to Kubernetes
======================================

Towards the end of 2018 I started to wrap up the things I’d been learning and decided to put some structure into my learning for 2019. 2018 had been an interesting year: I’d moved jobs three times and felt like my learning was all over the place. One day I was learning Scala and the next I was learning Hadoop. Looking back, I felt like I didn’t gain much ground.

https://towardsdatascience.com/the-basics-of-deploying-logstash-pipelines-to-kubernetes-94a470ad34d9

Spark Streaming or Kafka Streams or Alpakka Kafka?
==================================================

Recently we needed to choose a stream processing framework for processing CDC events on Kafka. The CDC events were produced by a legacy system, and the resulting state would be persisted in a Neo4j graph database. We had to choose between Spark Streaming, Kafka Streams and Alpakka Kafka.
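
To make one of the three candidates concrete, here is a minimal sketch of consuming a CDC topic with Spark Structured Streaming; the topic name, broker address and console sink below are illustrative assumptions, and the article’s real pipeline wrote to Neo4j:

```python
# Sketch: consume CDC events from Kafka with Spark Structured Streaming.
# Requires the spark-sql-kafka connector on the classpath; "cdc-events"
# and the broker address are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-consumer-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "cdc-events")
    .load()
)

# Kafka delivers binary key/value columns; cast them before processing.
query = (
    events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")      # stand-in sink; the article's pipeline targets Neo4j
    .outputMode("append")
    .start()
)

query.awaitTermination()
```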

https://medium.com/@unmeshvjoshi/choosing-a-stream-processing-framework-spark-streaming-or-kafka-streams-or-alpakka-kafka-fa0422229b25

Joy and Pain of using Google BigTable
=====================================

Last year, I wrote about Ravelin’s use of, and displeasure with, DynamoDB. After some time battling that database, we decided to put it aside and pick up a new battle: Google Bigtable. We have now had a year and a half of using Bigtable and have learned a lot along the way.

https://syslog.ravelin.com/the-joy-and-pain-of-using-google-bigtable-4210604c75be

Optimising Spark RDD Pipelines
==============================

Every day, in THRON, we collect and process millions of events regarding user-content interaction. We do this to enrich user and content datasets: we analyse the time series, extract behaviour patterns and ultimately infer user interests and content characteristics from them. This fuels lots of different cool benefits such as recommendations, digital content ROI calculation, predictions and many more.
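
As a toy illustration of the kind of RDD pipeline being optimised (the event shape and names here are invented, not THRON’s actual schema), counting interactions per user might look like:

```python
# Toy RDD pipeline: count interactions per user.
# The (user_id, content_id) event shape is an invented example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-pipeline-sketch").getOrCreate()
sc = spark.sparkContext

events = sc.parallelize([
    ("user1", "video42"),
    ("user2", "doc7"),
    ("user1", "doc7"),
])

interactions_per_user = (
    events
    .map(lambda event: (event[0], 1))   # key each event by user
    .reduceByKey(lambda a, b: a + b)    # combines per-partition before the shuffle
)

print(sorted(interactions_per_user.collect()))  # [('user1', 2), ('user2', 1)]
spark.stop()
```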

https://medium.com/thron-tech/optimising-spark-rdd-pipelines-679b41362a8a

Serverless Data Lake on AWS
===========================

In this post, we talk about designing a cloud-native data warehouse as a replacement for our legacy data warehouse built on a relational database. At the beginning of the design process, the simplest solution appeared to be a straightforward lift-and-shift migration from one relational database to another.

https://aws.amazon.com/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws/