Replicating Relational Databases with StreamSets Data Collector
StreamSets Data Collector has long supported reading from and writing to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table. A recent StreamSets Data Collector (SDC) release introduces the JDBC Multitable Consumer, a new pipeline origin that can read data from multiple tables through a single database connection. In this blog entry, I’ll explain how the JDBC Multitable Consumer can implement a typical use case: replicating an entire relational database into Hadoop.
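To make the idea concrete, here is a minimal sketch of reading every table through one connection while tracking a per-table offset. It is not how SDC implements the origin. It uses Python's built-in sqlite3 module in place of JDBC, and the helper names (`list_tables`, `read_new_rows`, `replicate_once`), the `id` offset column, and the sample tables are all assumptions made for illustration.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all user tables visible on this connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    return [r[0] for r in rows]


def read_new_rows(conn, table, offset_column, last_offset, batch_size=1000):
    """Read rows whose offset column is greater than the last value seen."""
    # Identifiers cannot be bound as parameters, so they are quoted inline.
    query = (
        f'SELECT * FROM "{table}" WHERE "{offset_column}" > ? '
        f'ORDER BY "{offset_column}" LIMIT ?'
    )
    cur = conn.execute(query, (last_offset, batch_size))
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur]


def replicate_once(conn, offsets, offset_column="id"):
    """One pass over every table, reusing the same connection throughout."""
    batches = {}
    for table in list_tables(conn):
        rows = read_new_rows(conn, table, offset_column, offsets.get(table, 0))
        if rows:
            offsets[table] = rows[-1][offset_column]  # remember progress
        batches[table] = rows
    return batches


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
        INSERT INTO customers (name) VALUES ('Ada'), ('Grace');
        INSERT INTO orders (total) VALUES (9.5);
    """)
    offsets = {}
    first = replicate_once(conn, offsets)   # initial load of both tables
    conn.execute("INSERT INTO orders (total) VALUES (20.0)")
    second = replicate_once(conn, offsets)  # picks up only the new order
    return first, second, offsets
```

Calling `demo()` runs an initial load followed by an incremental pass. A real replication job would write each batch to its destination rather than return it.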
CERN has published a performance comparison of Apache Avro, Apache Parquet, Apache HBase, and Apache Kudu for querying and analyzing the ATLAS Event Index of collisions at the Large Hadron Collider. The post covers space utilization, ingestion rate, random lookup latency, and data scan rates, and offers a number of lessons learned. If you’re considering a similar use case or any of these systems, this post provides a lot to chew on.
Monitoring HBase with Prometheus
This post describes how to hook up HBase metrics to Prometheus, the open-source monitoring system with Grafana integration. Metrics are exported by way of the Prometheus JMX exporter, which runs as a Java agent and is configured via a simple YAML file. HBase is a column-oriented DBMS providing fast random access. It comes with a management UI showing table details, but I wanted a better understanding of the internals of HBase. In this blog post I will show how to get started with Prometheus’ JMX exporter, exporting the HBase metrics and visualizing them in Grafana.
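Once the exporter is running, a quick sanity check is to confirm that its endpoint returns samples in the Prometheus text exposition format. The sketch below parses such text into a dictionary. The metric names in `SAMPLE` are hypothetical, not taken from the post, and the parser assumes lines carry no trailing timestamp.

```python
# Hypothetical exporter output; real HBase metric names depend on the
# JMX exporter configuration and are not taken from the original post.
SAMPLE = """\
# HELP hbase_regionserver_read_requests Hypothetical read request counter
# TYPE hbase_regionserver_read_requests untyped
hbase_regionserver_read_requests{host="rs1"} 1234.0
hbase_regionserver_store_file_count 7
"""


def parse_exposition(text):
    """Map each 'name{labels}' series to its float value.

    Assumes every sample line ends with the value (no timestamp field).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def hbase_series(samples, prefix="hbase_"):
    """Keep only the series whose metric name starts with the given prefix."""
    return {k: v for k, v in samples.items() if k.startswith(prefix)}
```

In practice the text would come from an HTTP GET against whatever port the agent was configured to listen on, rather than from a literal string.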
Auto Scaling in Qubole with AWS Elastic Block Storage
Qubole continues to innovate in the Hadoop-in-the-cloud space. This time, they’ve added the ability to dynamically grow the size of HDFS without adding more nodes by utilizing EBS volumes and the Linux Logical Volume Manager. If you’re running HDFS in the cloud, replicating this setup is likely a good way to keep costs down on storage-limited workloads.
SQL on Apache Apex
The stream processing system Apache Apex recently added support for SQL by integrating with Apache Calcite. This post describes the integration, explains how to use it, and provides an example code snippet that runs a basic SQL statement.
Secure Amazon EMR with Encryption
The Amazon Big Data blog has a tutorial describing how to configure an Amazon EMR cluster for encryption in transit (to/from S3 and during the MapReduce shuffle) and at rest (in S3 and on local disk). Much of the work involves configuring encryption keys, which is done using the AWS Key Management Service.
The Cloudera blog has a walkthrough demonstrating the integration between Apache Kudu and Apache Spark. There are a number of code snippets (written in Scala) demonstrating the DataFrame integration (which includes support for inserts, upserts, and updates), the native Kudu RDD, and more. Of note, the integration includes support for Kudu’s predicate pushdown via the DataFrame APIs.
Migrate External Table Definitions from a Hive Metastore to Amazon Athena
For customers who use Hive external tables on Amazon EMR, or any flavor of Hadoop, a key challenge is how to effectively migrate an existing Hive metastore to Amazon Athena, an interactive query service that directly analyzes data stored in Amazon S3. With Athena, there are no clusters to manage and tune, and no infrastructure to set up or manage. Customers pay only for the queries they run. In this post, I discuss an approach to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to run scripts.
The MapR blog has a two-part series on integrating complex event processing into a streaming architecture using the open-source Drools engine. As an example use case, the tutorial has a script for generating synthetic sensor data related to road traffic. Data is ingested using StreamSets and can be visualized using Kibana.