DataDotz BigData Weekly


Hadoop – Distributions Technical News




Hortonworks DataFlow 1.1 Released


The second release of Hortonworks DataFlow (HDF) is now available. Hortonworks DataFlow is a data-source-agnostic, real-time data collection and dataflow management platform designed to meet the practical challenges of collecting and moving data securely and efficiently.


HDF 1.1 builds on the strength of the initial GA version of HDF released in September 2015. HDF 1.1 supports additional security models, improves the user experience, and increases user options for accessing and delivering data.

Highlights of HDF Release 1.1 include:

  • Enhanced Security
  • Improved user experience
  • New data sources and destinations


Hadoop Summit Dublin — Community Choice Winners Announced


Hadoop Summit Dublin takes place on 13-14 April 2016.


Unlike other conferences, Hadoop Summit is driven for the community by the community, and this year’s speaker submissions have been open for public viewing. The top vote-getting sessions are automatically selected for the conference. The competition was strong, the content was amazing, and with over 13,000 votes tallied, we are happy to announce that the results are in!

Without further ado, the Community Choice Winners for Hadoop Summit 2016 are…


Apache Committer Insights : How To: A beginners guide to becoming an Apache Contributor


Applications of Hadoop and the Data-Driven Business : Crime Prediction using Hadoop framework


Data Science Applications for Hadoop : Machine Learning in Big Data – Look Forward or Be Left Behind


Hadoop and the Internet of Things : Hadoop Everywhere: Geo-Distributed Storage for Big Data


Hadoop Application Development: Dev Languages, Scripting, SQL and NoSQL : Cooperative data exploration with IPython notebook


Hadoop Governance, Security, Deployment and Operations : Advanced execution visualization of Spark jobs


The Future of Apache Hadoop : Overview of Apache Flink: the 4G of Big Data Analytics Frameworks



Leveraging Big Data for Security Analytics


Hortonworks partnered with ManTech and B23 to foster a vibrant open community to accelerate the development of OpenSOC. In December we additionally partnered with Rackspace Managed Security and submitted OpenSOC to the Apache Incubator as a podling under the name of Apache Metron. A decision to rename the project was made to represent the new direction and the new community. Now the process of graduating Metron to a top-level project (TLP) has begun. Given our proven commitment to the Apache Software Foundation process, we feel uniquely qualified to bring this important technology and its capabilities to the broader open source community.


Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat-intelligence information to security telemetry within a single platform.

Metron can be divided into 4 areas:

  1. A mechanism to capture, store, and normalize any type of security telemetry at extremely high rates.
  2. Real-time processing and application of enrichments.
  3. Efficient information storage.
  4. An interface that gives a security investigator a centralized view of data and alerts passed through the system.





New in Cloudera Enterprise 5.5: Improvements to HUE for Automatic HA Setup and More

Cloudera Enterprise 5.5 improves the life of the admin through a deeper integration between HUE and Cloudera Manager, as well as a rebase on HUE 3.9.


Cloudera Enterprise 5.5 contains a number of improvements related to HUE (the open source GUI that makes Apache Hadoop easier to use), including easier setup for HUE HA, built-in activity monitoring for improved stability, and better security and reporting via Cloudera Navigator and Apache Sentry (incubating). In this post, we’ll offer an overview of some of these improvements.


Automatic HA and Load Balancing

With this new release, you can now add a built-in load balancer via just a few clicks in Cloudera Manager, whereas in the past, setting up a load balancer involved some extra steps. These steps are still valid and recommended if you use “raw” CDH; however, if you use Cloudera Manager, this new load balancer provides a few advantages:


  • Automatic failover to an available healthy HUE in case of crash, network, or host glitch
  • Transparent serving of the static files for much better request performance and more responsiveness (cuts down average number of Web requests per page from 60 to 5; that’s a lot of savings with many concurrent users!)


Announcing RecordService Beta 2: Brings Column-level Security to Apache Spark and MapReduce


With this new beta release, column-level privileges set via Apache Sentry (incubating) are now enforced on Spark/MapReduce jobs.


Cloudera is excited to announce the availability of the second beta release for RecordService. This release is based on CDH 5.5 and provides some new features, including:


  • Support for Sentry column-level security. Previously, column-level access control required the use of views; now, permissions can be set on individual columns in a table. This new feature simplifies administration as views no longer need to be created and specified in jobs.
  • Multiple planners, enabling high availability
  • Spark 1.5 compatibility

In this post, we’ll walk you through two examples: how to replace an existing MapReduce/Spark job with RecordService, and how to use the column-level security feature with RecordService.



DistCp Performance Improvements in Apache Hadoop


Recent improvements to Apache Hadoop’s native backup utility, which are now shipping in CDH, make that process much faster.


DistCp is a popular tool in Apache Hadoop for periodically backing up data across and within clusters. (Each run of DistCp in the backup process is referred to as a backup cycle.) It has grown in popularity despite relatively slow performance.


In this post, we’ll provide a quick introduction to DistCp. Then we’ll explain how HDFS-7535 improves DistCp performance by utilizing HDFS snapshots to avoid copying renamed files. Finally, we’ll describe how HDFS-8828 (shipping in CDH 5.5) improves performance further on top of HDFS-7535.
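The idea of using a snapshot diff to avoid re-copying renamed files can be illustrated with a small sketch. This is not DistCp's code or HDFS's snapshot diff report format: the tuple layout of the diff entries and the function name `plan_incremental_sync` are hypothetical, chosen only to show that renames and deletes can be applied as cheap metadata operations on the target, leaving only created or modified files to be copied.

```python
def plan_incremental_sync(diff_entries):
    """Split a snapshot diff into target-side metadata operations and files to copy.

    Hypothetical entry format (an assumption, not the HDFS report format):
      ("rename", old_path, new_path), ("delete", path),
      ("create", path), ("modify", path)
    """
    renames, deletes, to_copy = [], [], []
    for entry in diff_entries:
        kind = entry[0]
        if kind == "rename":
            # A rename needs no data transfer: the target already holds the bytes.
            renames.append((entry[1], entry[2]))
        elif kind == "delete":
            deletes.append(entry[1])
        elif kind in ("create", "modify"):
            # Only new or changed files have to be copied across.
            to_copy.append(entry[1])
        else:
            raise ValueError(f"unknown diff entry type: {kind}")
    return {
        "rename_on_target": renames,
        "delete_on_target": deletes,
        "copy": to_copy,
    }
```

With a diff that contains one rename of a large directory and one new file, the plan copies only the new file; the renamed directory is handled by a rename on the target.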



How DistCp Works (Default)

DistCp uses MapReduce jobs to copy files between clusters, or within the same cluster in parallel. It involves two steps:


  • Building the list of files to copy (known as the copy list)
  • Running a MapReduce job to copy files, with the copy list as input
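The two steps above can be sketched on a local filesystem. This is only an analogy under stated assumptions: a thread pool stands in for the MapReduce job, local directories stand in for HDFS clusters, and the function names (`build_copy_list`, `copy_one`, `run_copy`) are made up for illustration.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def build_copy_list(src_root):
    """Step 1: walk the source tree and list every file, relative to src_root."""
    copy_list = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            copy_list.append(os.path.relpath(full_path, src_root))
    return sorted(copy_list)


def copy_one(src_root, dst_root, rel_path):
    """Copy a single file, creating parent directories on the target as needed."""
    target = os.path.join(dst_root, rel_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.copyfile(os.path.join(src_root, rel_path), target)
    return rel_path


def run_copy(src_root, dst_root, workers=4):
    """Step 2: copy every entry of the copy list in parallel (thread pool stand-in)."""
    copy_list = build_copy_list(src_root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        copied = list(pool.map(lambda p: copy_one(src_root, dst_root, p), copy_list))
    return copied
```

Note that step 1 lists every file under the source on each run, whether or not it changed, which hints at why building the copy list can dominate a backup cycle on large trees.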






Top 10 Big Data Trends in 2016 for Financial Services


2015 was a groundbreaking year for banking and financial markets firms, as they continued to learn how big data can help transform their processes and organizations. Now, with an eye towards what lies ahead for 2016, we see that financial services organizations are still at various stages of their activity with big data in terms of how they’re changing their environments to leverage the benefits it can offer. Banks are continuing to make progress on drafting big data strategies, onboarding providers and executing against initial and subsequent use cases.



Julia – a Fresh Approach to Numerical Computing and Data Science


The Julia programming language was created in 2009 by Jeff Bezanson, Stefan Karpinski, and Viral B Shah. It was broadly announced in 2012 and has had a growing community of contributors and users ever since. There are over 700 packages in the official registry and the base language has had over 400 contributors.


Julia aims to address the “two language problem” that is all too common in technical computing. Interactive data exploration and algorithm prototyping are incredibly convenient and productive in high-level dynamic languages like Python, Matlab, or R, but scaling initial prototypes to handle larger data sets quickly runs into performance limitations in those environments. The conventional approach has been to rely on compiled C (or C++, or Fortran) extensions to optimize performance-critical calculations.


MapR Streams Under the Hood – Whiteboard Walkthrough


When MapR launched MapR Streams, a lot of people who were familiar with Kafka asked us how we achieved some of the differentiators that we claimed. During this Whiteboard Walkthrough, I wanted to go through the different things that we achieved and some basics about how we pulled them off.


Let’s start with the list. First, we have a single cluster for multiple data services: files, tables, and streams. Second, we are able to do event persistence, meaning once an event is published into the system, it’s available forever for any analytics after the fact. You never have to worry about aging it out. We also support millions of producers and consumers (true IoT scale) and an arbitrary global topology with global failover. Let’s jump into how we do some of these things.


College Football Playoff – Did the committee get it right?


The first recorded argument about the superiority of football teams probably occurred ten minutes after the discovery of pigskin. Before the current college playoff system was created, these discussions were largely perfunctory. But now there is more at stake than ever, because admittance to the playoffs is by invitation only, and the bowl selection committee calls the shots – their deliberations are essentially just extensions of these arguments. Did they get it right? Which schools were left out? We can apply machine learning, remove the bias and address these questions.



The PageRank Algorithm


To evaluate the quality of the committee’s ranks, a famous machine-learning algorithm was applied. PageRank is the name of the method used in the early days of Google to rank Internet search results.


Google doesn’t use PageRank anymore, but there is no shortage of on-line documentation on this algorithm. The Wikipedia page has the basic details, and there are numerous applications to business and science scenarios. Most machine learning packages, such as Apache Spark, have implementations of PageRank.
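As a rough illustration of how PageRank could rank teams, here is a minimal power-iteration sketch. The source does not say how the game graph was built; treating each game as a link from the loser to the winner, the damping factor of 0.85, and the function name `pagerank` are all assumptions made for this example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Return a {node: score} dict computed by plain power iteration.

    edges is a list of (source, destination) pairs. In the football analogy
    assumed here, each pair is (loser, winner).
    """
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical results: B and C lost to A, and C also lost to B.
games = [("B", "A"), ("C", "A"), ("C", "B")]
ranking = sorted(pagerank(games).items(), key=lambda item: item[1], reverse=True)
```

In this toy schedule the unbeaten team A ends up with the highest score and the winless team C with the lowest. A team's score depends not only on how many games it won but also on the scores of the teams it beat.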


Hadoop Proves a Win for Morgan Stanley


Big data and Hadoop-based approaches are now widely recognized but are still considered by many to be new technologies. The potential benefit of these approaches is already clear, but are they able to deliver practical value now?


To answer such a question it’s always useful to hear real world experiences that take us beyond the theoretical. That’s what came to light in an interview with Erwan Le Doeuff, VP of Information Technology, Risk and Security for Morgan Stanley at a recent Big Data Everywhere conference in New York City. And the answer was a resounding “yes”.



General Discussion



LendUp Expands Access to Credit with Databricks


Databricks has announced a new deployment in the financial technology sector with LendUp, a company that builds technology to expand access to credit. LendUp uses Databricks to develop innovative machine learning models that touch all aspects of its lending business. Specifically, it uses Databricks to perform feature engineering at scale and quickly iterate through the model building process. Faster iterations lead to more accurate models, the ability to offer credit to more of the tens of millions of Americans who need it, and the ability to establish new products more easily.




The Best of Databricks Blog: Most Read Posts of 2015


Databricks developers are prolific blog authors when they are not writing code for the Databricks platform or Apache Spark. As 2015 draws to a close, we did a quick tally of page views across all the blog posts published during this year to understand what topics attracted the most interest amongst our readers.

The result indicates that people are most interested in announcements of new Spark features and practical guides on tuning Spark. While blog posts published earlier in the year have an advantage, there is a clear winner by a wide margin. So here we give you a countdown to the most popular Databricks blog posts of 2015.


Guest Blog: Streamliner – An Open Source Spark Streaming Application


Spark Streaming tackles several challenges associated with real-time data streams including the need for real-time Extract, Transform, and Load (ETL). This has led to the rapid adoption of Spark Streaming in the industry. In the 2015 Spark survey, 48% of Spark users indicated they were using streaming. For production deployments, there was a 56% rise in usage of Spark Streaming from 2014 to 2015.


More importantly, it is becoming a flexible, robust, scalable platform for building end-to-end real-time pipeline solutions. At MemSQL, we observed these trends and built Streamliner – an open source, one-click solution for building real-time data pipelines using Spark Streaming and MemSQL. With Streamliner, you can stream data from real-time data sources (e.g. Apache Kafka), perform data transformations with Spark Streaming, and ultimately load data into MemSQL for persistence and application serving.



Flink 2015: A year in review, and a lookout to 2016


With 2015 ending, we thought that this would be a good time to reflect on the amazing work done by the Flink community over this past year, and how much this community has grown.


Overall, we have seen Flink grow in terms of functionality from an engine to one of the most complete open-source stream processing frameworks available. The community grew from a relatively small and geographically focused team to a truly global one, and one of the largest big data communities in the Apache Software Foundation.


How Apache Flink enables new streaming applications


Stream data processing is booming in popularity, as it promises better insights from fresher data, as well as a radically simplified pipeline from data ingestion to analytics. Data production in the real world has always been a continuous process (for example, web server logs, user activity in mobile applications, database transactions, or sensor readings). As has been noted by others, until now, most pieces of the data infrastructure stack were built with the underlying assumption that data is finite and static. To bridge this fundamental gap between continuous data production and the limitations of older “batch” systems, companies have been introducing complex and fragile end-to-end pipelines. Modern data streaming technology alleviates the need for complex solutions by modeling and processing data in the form that it is produced, a stream of real-world events.


How to Build a Scalable ETL Pipeline with Kafka Connect


Kafka Connect is designed to make it easier to build large scale, real-time data pipelines by standardizing how you move data into and out of Kafka. You can use Kafka connectors to read from or write to external systems, manage data flow, and scale the system—all without writing new code. Kafka Connect manages all the common problems in connecting with other systems (scalability, fault tolerance, configuration, and management), allowing each connector to focus only on how to best copy data between its target system and Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch or into batch systems such as Hadoop for offline analysis.


Confluent Platform now ships with Kafka Connect and includes three connectors: one for moving files, a JDBC connector for SQL databases, and an HDFS connector for Hadoop (including Hive). Both the JDBC and HDFS connectors offer useful features for you to easily build ETL pipelines.




SaravanaKumar, Data Engineer @ DataDotz.

DataDotz is a Chennai-based big data team primarily focused on consulting and training on technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), search, and cloud computing. Saravanan can be reached via his LinkedIn profile(