DataDotz Bigdata Weekly


Apache Flink

Fraudulent transactions cost the banking industry large amounts of money every year. One of our bank’s most important goals is to protect both the bank and our customers from fraudulent transactions, in which a bad actor misuses the banking system for personal gain. Most banks have rule-based alerting in place to detect potential fraud. Rule-based alerting works fine if you already know what you’re looking for, but fraudsters are becoming more sophisticated.
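
To make the idea of rule-based alerting concrete, here is a minimal sketch in Python. The rule names, thresholds, and transaction fields are all hypothetical and are not taken from the original post. It only illustrates why fixed rules catch known patterns and nothing else.

```python
# Hypothetical rules and thresholds, for illustration only.
# Each rule is a predicate over a transaction represented as a dict.
RULES = {
    "large_amount": lambda tx: tx["amount"] > 10_000,
    "foreign_country": lambda tx: tx["country"] != tx["home_country"],
    "night_time": lambda tx: tx["hour"] < 5,
}


def triggered_rules(tx, rules=RULES):
    """Return the names of all rules that fire for one transaction."""
    return [name for name, rule in rules.items() if rule(tx)]


tx = {"amount": 15_000, "country": "NL", "home_country": "NL", "hour": 14}
print(triggered_rules(tx))  # prints ['large_amount']
```

A fraud pattern that none of the predicates describes passes through unnoticed, which is the limitation the paragraph above points at.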


QSight’s teams of security and network professionals use techniques such as Big Data, Machine Learning and Artificial Intelligence to detect cyber threats early, neutralize them and manage IT efficiently. The company has embarked on a Big Data journey to help solve the business challenges presented by the growth and diversity of data, and by the speed with which data needs to be processed. QSight’s previous methods allowed it to identify threats, but the company had no way of determining the potential impact of security-related events.

Apache Kafka

Building a machine learning model that adapts in real time to new information has long been an end goal of many ML pipelines. Kafka Streams makes this relatively easy by using the same code for offline and online training. This post walks through building out a real-time evaluation and training pipeline with flight arrival data as an example.
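
The "same code for offline and online training" idea can be sketched without Kafka at all. The Python example below is an assumption-laden illustration, not the post's pipeline: the class `RunningMeanDelay`, the `(carrier, delay)` record shape, and the sample values are hypothetical, and a plain generator stands in for a stream.

```python
class RunningMeanDelay:
    """Toy model: running mean of arrival delay (minutes) per carrier."""

    def __init__(self):
        self.count = {}
        self.mean = {}

    def update(self, carrier, delay_minutes):
        # Incremental mean update, so one record at a time is enough.
        n = self.count.get(carrier, 0) + 1
        mean = self.mean.get(carrier, 0.0)
        self.count[carrier] = n
        self.mean[carrier] = mean + (delay_minutes - mean) / n

    def predict(self, carrier):
        return self.mean.get(carrier, 0.0)


def train(model, records):
    """Works on any iterable: a historical list or a live stream."""
    for carrier, delay in records:
        model.update(carrier, delay)
    return model


# Offline: replay historical records (hypothetical values).
history = [("AA", 10.0), ("AA", 20.0), ("DL", 5.0)]
model = train(RunningMeanDelay(), history)


# Online: the same train() consumes records as they arrive.
def live_stream():
    yield ("AA", 30.0)


train(model, live_stream())
print(model.predict("AA"))  # prints 20.0
```

Because `train` only needs an iterable, the update logic is written once and reused for both the historical replay and the live records.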

Streaming SQL for Apache Kafka

In all our examples, Kafka has been used just for data transportation, with any necessary transformation happening in the target data store (such as BigQuery), in languages like Python and engines like Spark Streaming, or directly in a querying tool like Presto. KSQL enables something really effective: reading, writing and transforming data in real time and at scale using SQL, a language already familiar to most people working in the data space.

If you’re using PySpark, you’ve probably wanted to combine it with Pandas or other Python libraries. This post describes why this is a bit of a challenge and provides some code to convert data between numpy types and PySpark-compatible types (and vice versa) for implementing custom user-defined functions. It’s an in-depth article that also explains some of the internals of PySpark.
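
As a rough illustration of the numpy-to-plain-Python direction, here is a minimal sketch. It is not the code from the linked post. The helper names `to_native` and `returns_native` are hypothetical, and the check is duck-typed on `tolist()` (which numpy scalars and arrays both provide) so the sketch itself has no hard numpy dependency.

```python
import functools


def to_native(value):
    """Convert numpy scalars or arrays to built-in Python objects.

    numpy scalars and ndarrays both expose tolist(), which returns
    plain Python types (int, float, list, ...). Anything without
    tolist() is returned unchanged.
    """
    if hasattr(value, "tolist"):
        return value.tolist()
    return value


def returns_native(func):
    """Decorator: pass a function's return value through to_native()."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return to_native(func(*args, **kwargs))

    return wrapper
```

With numpy installed, `to_native(np.float64(1.5))` gives a built-in `float`. A function decorated with `returns_native` could then be wrapped as a PySpark user-defined function, under the assumption that its declared return type matches the converted value.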

The Matrix is a set of over 27 software components that need to work together as part of any big data infrastructure. The automation suite used to perform functional validation of these components consists of over 30,000 tests, which are divided into 250+ logical groups (splits). The splits are executed concurrently on an internal container cloud. Each split consists of a set of tests that verify entire features using actual instances of the services involved; no mocks are used in these functional tests. Each split spins up a test cluster, deploys and configures a set of services, and executes the tests using a test automation framework.
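
The split-and-run-concurrently pattern described above can be sketched in a few lines of Python. This is a simplified stand-in, not the actual framework: the function names and the `(feature, test_name)` input shape are hypothetical, a thread pool replaces the container cloud, and `run_split` is a placeholder where a real split would spin up a cluster and deploy services.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_splits(tests):
    """Group (feature, test_name) pairs into one split per feature."""
    splits = defaultdict(list)
    for feature, name in tests:
        splits[feature].append(name)
    return dict(splits)


def run_split(item):
    """Placeholder for one split: cluster setup, deployment, test run."""
    feature, names = item
    # A real implementation would execute the tests here; this sketch
    # simply marks every test as passed.
    return feature, {name: "passed" for name in names}


def run_all(tests, max_workers=4):
    """Run every split concurrently and collect results per feature."""
    splits = make_splits(tests)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_split, splits.items()))


tests = [("hdfs", "test_write"), ("hive", "test_query"), ("hdfs", "test_read")]
print(run_all(tests))
```

Grouping by feature keeps each split self-contained, so splits can run in any order or in parallel without sharing a cluster.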