Using Docker and Pyspark
PySpark can be a bit difficult to get up and running on your machine. Docker is a quick and easy way to get a Spark environment working on your local machine, and it's how I run PySpark on mine. I'll start by giving an introduction to Docker. According to Wikipedia, "Docker is a computer program that performs operating-system-level virtualization, also known as 'containerization'". To greatly simplify, Docker creates a walled-off Linux operating system, called a container, that runs software on top of your machine's OS.
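As a minimal sketch of what this looks like in practice, the command below pulls a community image that bundles Spark and Jupyter (the `jupyter/pyspark-notebook` image is one common choice, not necessarily the one the author uses) and starts it as a container:

```shell
# Pull and run a PySpark-ready container, exposing Jupyter on port 8888.
# jupyter/pyspark-notebook is an assumed example image; any image that
# bundles Spark and Python would work the same way.
docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
```

Once the container is running, the Jupyter URL printed in the logs gives you a notebook environment where `pyspark` is already importable, with no Spark installation on the host itself.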
Deploying Logstash pipelines to Kubernetes
Towards the end of 2018, I started to wrap up the things I'd been learning and decided to put some structure into my learning for 2019. 2018 had been an interesting year: I'd moved jobs three times and felt like my learning was all over the place. One day I was learning Scala and the next I was learning Hadoop. Looking back, I felt like I didn't gain much ground.
Spark Streaming or Kafka Streams or Alpakka Kafka?
Recently we needed to choose a stream processing framework for processing CDC events on Kafka. The CDC events were produced by a legacy system, and the resulting state would be persisted in a Neo4j graph database. We had to choose between Spark Streaming, Kafka Streams, and Alpakka Kafka.
Joy and Pain of using Google BigTable
Last year, I wrote about Ravelin's use of, and displeasure with, DynamoDB. After some time battling that database, we decided to put it aside and pick a new battle: Google Bigtable. We have now had a year and a half of using Bigtable and have learned a lot along the way.
Optimising Spark RDD Pipelines
Every day at THRON, we collect and process millions of events describing user–content interaction. We do so to enrich user and content datasets: we analyse the time series, extract behaviour patterns, and ultimately infer user interests and content characteristics from them. This fuels lots of different cool benefits, such as recommendations, digital content ROI calculation, predictions, and many more.
Serverless Data Lake on AWS
In this post, we talk about designing a cloud-native data warehouse as a replacement for our legacy data warehouse built on a relational database. At the beginning of the design process, the simplest solution appeared to be a straightforward lift-and-shift migration from one relational database to another.