Qubole & Snowflake with Spark
The blog series covers the use cases directly served by the Qubole–Snowflake integration. The first blog discussed how to get started with ML in Apache Spark using data stored in Snowflake. Continue reading
A data lake is an increasingly popular way to store and analyze data that addresses the challenges of dealing with massive volumes of heterogeneous data. A data lake allows organizations to store all their data, structured and unstructured, in one centralized repository. Continue reading
Using Amazon S3 with Cloudera BDR
More of you are moving to public cloud services for backup and disaster recovery purposes, and Cloudera has been enhancing the capabilities of Cloudera Manager and CDH to help you do that. Specifically, Cloudera Backup and Disaster Recovery (BDR) now supports backup to and restore from Amazon S3. BDR lets you replicate HDFS data from your on-premises cluster to or from Amazon S3 with full fidelity (all file and directory metadata is replicated along with the data). Continue reading
Serverless Delivery with Databricks and AWS CodePipeline
The Databricks interactive workspace serves as an ideal environment for collaborative development and interactive analysis. The platform supports all the features needed to make creating a continuous delivery pipeline not only possible but simple. Continue reading
Apache NiFi Installation
* Apache NiFi is a software project from the Apache Software Foundation which enables the automation of data flow between systems.
* It can be described as data logistics.
* Similar to how parcel services move and track packages, Apache NiFi helps move and track data.
Server-Side Encryption for Amazon Kinesis Streams
Amazon Kinesis Streams lets you ingest, process, and deliver data in real time from millions of devices or applications. Use cases for Kinesis Streams vary, but a few common ones include IoT data ingestion and analytics, log processing, clickstream analytics, and enterprise data bus architectures. Within milliseconds of data arrival, applications attached to a stream are continuously mining value or delivering data to downstream destinations. Continue reading
Spark Streaming integration with Kafka allows users to read messages from a single Kafka topic or multiple Kafka topics. A Kafka topic receives messages across a distributed set of partitions where they are stored. Each partition keeps the messages it has received in sequential order, and each message is identified by an offset, also known as a position. Continue reading
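The partition and offset model described above can be illustrated with a toy in-memory sketch. This is not the Kafka or Spark Streaming API; the `ToyTopic` class and its method names are hypothetical and exist only to show that offsets are assigned per partition, in arrival order, starting from zero.

```python
class ToyTopic:
    """In-memory stand-in for a topic (hypothetical, not the Kafka client API)."""

    def __init__(self, num_partitions):
        # Each partition is an append-only list of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Store a message in one partition and return its offset."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # offset = position within this partition

    def read_from(self, partition, offset):
        """Return all messages in a partition starting at the given offset."""
        return self.partitions[partition][offset:]


topic = ToyTopic(num_partitions=2)
first = topic.append(0, "a")   # first message in partition 0
second = topic.append(0, "b")  # second message in partition 0
other = topic.append(1, "c")   # offsets restart in every partition
```

A consumer that remembers the last offset it processed in each partition can resume with `read_from(partition, last_offset + 1)`, which is the idea behind tracking positions per partition.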
Amazon Kinesis vs Apache Kafka for Big Data Analysis
Data processing today is done in the form of pipelines, which include steps such as aggregation, sanitization, and filtering, and which finally generate insights by applying statistical models. Amazon Kinesis is a platform for building pipelines for streaming data at the scale of terabytes per hour. Continue reading
Reading data securely from Apache Kafka
The Cloudera Distribution of Apache Kafka 2.0.0 (based on Apache Kafka 0.9.0) introduced a new Kafka consumer API that allowed consumers to read data from a secure Kafka cluster. This allows administrators to lock down their Kafka clusters and require clients to authenticate via Kerberos. Continue reading
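As a rough illustration of what a Kerberos-authenticated consumer needs, the sketch below assembles the consumer properties typically involved. The property keys are standard Kafka client settings; the broker address, group name, and the service name `kafka` are assumptions for the example, and the helper functions are hypothetical. The JAAS login configuration is supplied separately to the JVM (through the `java.security.auth.login.config` system property) and is not shown here.

```python
def kerberos_consumer_properties(bootstrap_servers, group_id, use_tls=False):
    """Build consumer properties for a Kerberos-secured cluster (illustrative helper)."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # SASL carries the Kerberos authentication; add TLS for encryption in transit.
        "security.protocol": "SASL_SSL" if use_tls else "SASL_PLAINTEXT",
        # Assumed: the brokers run under the Kerberos service principal name "kafka".
        "sasl.kerberos.service.name": "kafka",
    }


def to_properties_text(props):
    """Render the settings in Java .properties format, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


# Hypothetical broker address and consumer group, for illustration only.
props = kerberos_consumer_properties("broker1.example.com:9092", "demo-group")
text = to_properties_text(props)
```

The rendered text can be saved as a properties file and passed to a consumer; the exact values depend on how the cluster's listeners and principals are configured.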
Copyright © 2014. DataDotz All rights reserved.