DataDotz Bigdata Weekly


Using Apache Spark for large-scale language model training

Facebook has written about their experience converting their n-gram language model training pipeline from Apache Hive to Apache Spark. The post describes their Hive-based solution, their Spark-based solution, and the scalability challenges. It also covers the overall differences between the two implementations (e.g. the flexibility of the Spark DSL vs. HiveQL) and shares some performance numbers.
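For readers unfamiliar with n-gram models, the core counting step can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not Facebook's Hive or Spark pipeline; the function names and the toy sentence are assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous window of n tokens in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def conditional_prob(bigrams, unigrams, prev, word):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    prev_count = unigrams[(prev,)]
    if prev_count == 0:
        return 0.0
    return bigrams[(prev, word)] / prev_count


# Toy corpus (hypothetical); a real pipeline would count over billions of tokens.
tokens = "the cat sat on the mat".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
print(conditional_prob(bigrams, unigrams, "the", "cat"))
```

At scale, the counting step is the part that has to be distributed, which is the kind of work a Hive or Spark job takes on.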


The Hortonworks blog has a thorough overview of Apache Ranger’s feature set, including how it provides attribute-based access control, its policy engine framework, its Key Management Service (which can integrate with a Hardware Security Module), dynamic column masking capabilities for Apache Hive, central auditing, and more.


Analyze Security, Compliance, and Operational Activity Using AWS

Most AWS services integrate with CloudTrail for auditing. Once you start adding a few services, this can generate an awful lot of data that is overwhelming to consume. The new Amazon Athena is a useful tool for analyzing that data, given that it doesn’t require any additional infrastructure. The AWS big data blog has a tutorial with several example queries to get started analyzing the data.

Kafka Connect for FTP data

In this article we implement custom file transformers to efficiently load files over FTP and, using Kafka Connect, convert them to meaningful events in Avro format. Depending on data subscriptions, we might get access to FTP locations with files updated daily, weekly, or monthly. File structures might be positional, CSV, JSON, XML, or even binary. In IoT use cases we might need to flatten multiple events arriving in a single line, or apply other transformations, before allowing the data onto the Kafka highway as a stream of meaningful messages. Kafka Connect distributed workers can provide a reliable and straightforward way of ingesting data over FTP.
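The flattening step can be illustrated with a small plain-Python sketch. It is not the Kafka Connect API (connectors and their transformers run on the JVM); it only shows the idea of turning one multi-event line into separate messages. The line layout `device_id|timestamp;value|timestamp;value` and the field names are hypothetical.

```python
import json


def flatten_line(line):
    """Split one multi-event line into one dict per event.

    Assumed (hypothetical) layout: device_id|ts;value|ts;value|...
    """
    device_id, *readings = line.strip().split("|")
    events = []
    for reading in readings:
        timestamp, value = reading.split(";")
        events.append(
            {"device_id": device_id, "timestamp": timestamp, "value": float(value)}
        )
    return events


# One line carrying two readings from the same device.
line = "sensor-7|2017-03-01T00:00:00Z;21.5|2017-03-01T00:01:00Z;21.7"
events = flatten_line(line)
for event in events:
    # Each dict would become its own message on the topic.
    print(json.dumps(event))
```

In a real connector, the same logic would sit inside the custom file transformer, with each resulting record serialized to Avro rather than JSON.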

Streaming databases in realtime with MySQL, Debezium, and Kafka

WePay has an article about their change data capture solution for MySQL, which uses Debezium to stream data to Kafka. WePay is on the Google Cloud Platform, so the MySQL instances run in Google Cloud SQL, and from Kafka the data is loaded into BigQuery. The post goes into the finer operational details, including how to add a new database to Debezium/Kafka, how they make use of the global transaction IDs added in MySQL 5.6, and what the streaming data that comes out of Debezium looks like.
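To give a feel for change data capture output, here is a minimal sketch of replaying Debezium-style change events against an in-memory table. Debezium events carry `before`, `after`, and `op` fields (`c` for create, `u` for update, `d` for delete, `r` for snapshot read); the envelope below is simplified, omitting fields such as `source` and `ts_ms`, and the table contents and the `apply_change` helper are hypothetical.

```python
def apply_change(table, event, key="id"):
    """Apply one simplified Debezium-style change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, or update: the new row state is in "after".
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # Delete: only the old row state is available, in "before".
        table.pop(event["before"][key], None)
    return table


# Hypothetical events for a small users table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "b@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "c@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "c@example.com"}, "after": None},
]

table = {}
for event in events:
    apply_change(table, event)
print(table)
```

A downstream loader can follow the same pattern: because each event carries the full row state, the consumer does not need to query the source database.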

Amazon RDS as Your Hive Metastore

As the central source of truth for metadata about your data, it’s quite important for the Hive Metastore to be up to date. Previously, this could be a challenge in a cloud environment in which there are multiple transient clusters that come and go unpredictably. Recently, Hadoop and CDH added support for a persistent Hive Metastore that lives independent of any one cluster. This post has some basics of configuring the metastore and the list of gotchas/assumptions to keep in mind.

Complex Data Formats with Structured Streaming in Apache Spark

This post describes APIs and tools for working with nested data in Spark and Spark SQL. It covers things like how to extract fields out of a nested struct and how to convert a JSON string to a struct on which normal operations can be performed. An earlier post in the series showed how easy it is to write an end-to-end streaming ETL pipeline using Structured Streaming that converts JSON CloudTrail logs into a Parquet table, and highlighted that one of the major challenges in building such pipelines is reading and transforming data from various sources and complex formats. This post examines that problem in further detail and shows how Apache Spark SQL’s built-in functions can be used for these data transformations.
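The two operations mentioned here, turning a JSON string into a structured value and pulling a field out of a nested struct, can be sketched in plain Python. This is a conceptual analogue only, not Spark code: in Spark SQL the corresponding work is done by built-in functions such as `from_json` and by dotted column paths. The `get_path` helper and the sample record are assumptions for illustration.

```python
import json


def get_path(record, path):
    """Follow a dotted path such as 'userIdentity.type' through nested dicts.

    Returns None when any step of the path is missing.
    """
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


# Hypothetical CloudTrail-shaped record arriving as a JSON string.
raw = '{"eventName": "ConsoleLogin", "userIdentity": {"type": "IAMUser", "userName": "alice"}}'

record = json.loads(raw)  # JSON string -> nested structure
print(get_path(record, "userIdentity.userName"))
print(get_path(record, "userIdentity.missing"))
```

Returning `None` for a missing path is a choice made for this sketch, so that one malformed record does not stop the rest of a batch.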

Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS

You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. They use different names for variables that mean the same thing and the same names for variables that mean different things. They use different units of measurement and different categories. Some have more variables than others. And they all have data quality issues (for example, badly formed dates and times, invalid geographic coordinates, and so on).
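A minimal sketch of what harmonization means in practice: map differing variable names onto one shared name, convert units, and flag records that fail a basic quality check. The alias table, the Fahrenheit-to-Celsius rule, and the field names are all hypothetical; this is plain Python and does not represent the AWS services used in the original post.

```python
# Hypothetical mapping from source column names to one shared name.
ALIASES = {
    "temp_f": "temperature_c",
    "temperature": "temperature_c",
    "lat": "latitude",
}

# Hypothetical unit conversions, keyed by source column name.
CONVERTERS = {
    "temp_f": lambda fahrenheit: (fahrenheit - 32) * 5 / 9,
}


def harmonize(record):
    """Rename columns to the shared names and convert units where needed."""
    out = {}
    for name, value in record.items():
        target = ALIASES.get(name, name)
        convert = CONVERTERS.get(name)
        out[target] = convert(value) if convert else value
    return out


def is_valid(record):
    """Basic quality check: latitude, when present, must lie in [-90, 90]."""
    latitude = record.get("latitude")
    return latitude is None or -90 <= latitude <= 90


row = harmonize({"temp_f": 212.0, "lat": 40.7})
print(row, is_valid(row))
```

Keeping the mappings in data rather than in code makes it easier to add a new dataset: only the alias and converter tables change.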


Modern information security encompasses broader data sets than in the past, in order to create context and generate a complete picture of network data, user behaviour patterns, and business data, all combined so that a trendline of normal operations can be created. It is impractical for security personnel to manually piece together all the relevant security data to detect threats, so modern cybersecurity solutions need to lean on automating the manual processing of very large sample sets, which big data and machine learning make possible. And as hackers constantly evolve their game, security teams need to adapt at the same rate to efficiently detect and interpret the signs of the most relevant threats that require further investigation, and to respond quickly to evolving threats.