
DataDotz Bigdata Weekly


Using Amazon S3 with Cloudera BDR
=================

More of you are moving to public cloud services for backup and disaster recovery, and Cloudera has been enhancing Cloudera Manager and CDH to help you do that. Specifically, Cloudera Backup and Disaster Recovery (BDR) now supports backup to and restore from Amazon S3. BDR lets you replicate HDFS data between your on-premises cluster and Amazon S3 with full fidelity (all file and directory metadata is replicated along with the data). When replicating Hive data, BDR also replicates the metadata of all entities (e.g. databases and tables) along with statistics (e.g. Impala statistics). This feature supports many different use cases.
https://blog.cloudera.com/blog/2017/08/use-amazon-s3-with-cloudera-bdr/

Apache Spark-based analysis and Databricks notebooks impact genomic research
==========

We are in the midst of a digital revolution in which consumers and businesses demand that decisions be based on evidence collected from data. The resulting datafication of almost everything produces datasets that are growing not only vertically, by capturing more events, but also horizontally, by capturing more information about each event.

https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html

Artificial or Augmented Intelligence
==========

Current AI technology is powerful enough to identify a kitten in a picture, or to identify an object lying on the roadside and then determine the chances of that object moving onto the road. In fact, autonomous vehicles are a great example of AI technologies in action. In the future, AI will be able to identify the most important elements in huge chunks of data and then take advantageous actions. So here’s the question that most of us ask: How will AI show up in our lives? For the foreseeable future, AI will augment our capabilities, allowing us to do more, with better accuracy, in less time.

https://mapr.com/blog/ai-or-augmented-intelligence/

Real-Time Anomaly Detection Streaming Microservices with H2O and MapR
==========

In this blog series, we cover the architecture of a real-time predictive maintenance system. In Part 1: Architecture, we cover the use case and general architecture of our solution. In Part 2: Modeling, we cover the modeling with H2O and show the code used to train the final model. In this post, we cover the production deployment.

https://mapr.com/blog/real-time-anomaly-detection-3/

Apache Kafka
==========

Apache Kafka’s new support for exactly-once semantics and transactions enables some interesting new use cases. The latest post in a series on using Kafka to enable event-based services looks at how these new features can simplify event-based systems. The built-in failure and retry handling provides a new level of abstraction that lets developers focus on the core business logic of the application.

https://www.confluent.io/blog/chain-services-exactly-guarantees/

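
As a rough illustration of the idea (not code from the linked post), a consume-transform-produce step using transactions might look like the sketch below. It assumes client objects that expose the transactional methods of the confluent-kafka Python client (`begin_transaction`, `send_offsets_to_transaction`, `commit_transaction`, `abort_transaction`); the function name `process_batch`, the output topic, and the `transform` callback are hypothetical.

```python
def process_batch(consumer, producer, out_topic, transform,
                  batch_size=100, timeout=1.0):
    """Read up to batch_size messages, write the transformed results, and
    commit the consumed offsets in the same transaction as the output.

    consumer/producer are assumed to behave like confluent_kafka.Consumer
    and a confluent_kafka.Producer on which init_transactions() has
    already been called. Returns the number of messages processed.
    """
    messages = consumer.consume(num_messages=batch_size, timeout=timeout)
    # Skip messages that carry an error instead of a payload.
    messages = [m for m in messages if m.error() is None]
    if not messages:
        return 0

    producer.begin_transaction()
    try:
        for msg in messages:
            producer.produce(out_topic, key=msg.key(),
                             value=transform(msg.value()))
        # Input offsets become part of the transaction, so the output
        # records and the consumer's progress commit (or abort) together.
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata(),
        )
        producer.commit_transaction()
    except Exception:
        # Simplification: a real service would distinguish retriable,
        # abortable, and fatal errors before deciding what to do.
        producer.abort_transaction()
        raise
    return len(messages)
```

With a real cluster, the producer would typically be configured with a `transactional.id`, and the consumer with `enable.auto.commit` set to false and `isolation.level` set to `read_committed`, so that downstream readers only see committed output.
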
Run Common Data Science Packages on Anaconda and Oozie with Amazon EMR
==========

Amazon EMR allows data scientists to spin up complex cluster configurations easily, and to be up and running with complex queries in a matter of minutes. Data scientists often use scheduling applications such as Oozie to run jobs overnight. However, Oozie can be difficult to configure when you are trying to use popular Python packages (such as “pandas,” “numpy,” and “statsmodels”), which are not included by default. One popular platform that contains these packages (and more) is Anaconda. This post focuses on setting up Anaconda on EMR, with the intent of using its packages with Oozie.

https://aws.amazon.com/blogs/big-data/run-common-data-science-packages-on-anaconda-and-oozie-with-amazon-emr/