Category Archives: Hbase

DataDotz Bigdata Weekly

Serverless Delivery with Databricks and AWS CodePipeline
=====================================

The Databricks interactive workspace serves as an ideal environment for collaborative development and interactive analysis. The platform supports all the features needed to make creating a continuous delivery pipeline not only possible but simple.

Server-Side Encryption for Amazon Kinesis Streams
==========

Amazon Kinesis Streams lets you ingest, process, and deliver data in real time from millions of devices or applications. Use cases for Kinesis Streams vary, but a few common ones include IoT data ingestion and analytics, log processing, clickstream analytics, and enterprise data bus architectures. Within milliseconds of data arrival, applications attached to a stream are continuously mining value or delivering data to downstream destinations.

Apache Kafka
==========

Spark Streaming integration with Kafka allows users to read messages from a single Kafka topic or multiple Kafka topics. A Kafka topic receives messages across a distributed set of partitions, where they are stored. Each partition keeps the messages it has received in sequential order, and each message is identified by an offset, also known as a position.
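
The partition and offset model described above can be illustrated with a small in-memory sketch. This is a toy model only, not the Kafka or Spark Streaming API: the `ToyTopic` class and its `append` and `read_from` methods are hypothetical names made up for this example.

```python
from typing import List, Tuple


class ToyTopic:
    """Hypothetical in-memory stand-in for a topic; not the Kafka API."""

    def __init__(self, num_partitions: int) -> None:
        # Each partition is an append-only list; a message's index is its offset.
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def append(self, partition: int, message: str) -> int:
        """Store a message in one partition and return the offset it received."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def read_from(self, partition: int, offset: int) -> List[Tuple[int, str]]:
        """Return (offset, message) pairs starting at the given offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        log = self.partitions[partition]
        return [(i, log[i]) for i in range(offset, len(log))]


topic = ToyTopic(num_partitions=2)
topic.append(0, "first")   # offset 0 in partition 0
topic.append(0, "second")  # offset 1 in partition 0
topic.append(1, "other")   # offset 0 in partition 1; offsets are per partition
print(topic.read_from(0, 1))
```

A reader that remembers the last offset it processed for each partition can resume from that position, which is the role offsets play when a streaming job reads from a topic.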

Apache Flink
==========

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

Apache Flink is an open source project that is well suited to form the basis of a real-time stream processing pipeline. It offers unique capabilities that are tailored to the continuous analysis of streaming data.

Apache Spark
==========

Using Apache Spark for large-scale language model training

Facebook has written about its experience converting its n-gram language model training pipeline from Apache Hive to Apache Spark. The post describes the Hive-based solution, the Spark-based solution, and the scalability challenges.

Replicating Relational Databases with StreamSets Data Collector
==========

Relational Databases with StreamSets

StreamSets Data Collector has long supported both reading from and writing to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table.


Moving from HBase 0.94 to HBase 0.98

Version differences between HBase 0.94 and HBase 0.96:

HBase 0.96 was more than a year in the making. Some of the major improvements in this version are:

  • Improved stability: Varying the node count, data sizing, duration, and other settings turned up more bugs in scan and fetch operations. These were fixed by introducing table locks for cross-cluster alterations and cross-row transaction support.
