highres_14391580

Sharding in MongoDB

This entry was posted in Big Data, Blog, MongoDB on by .   0 Comment[s]

MongoDB is one of several database types to arise in mid 2000’s under the banner of NoSQL. Rather than using the rows and columns as in the relational databases, MongoDB is built over architecture of collections and documents. Documents contain sets of Key-Value pairs and are the basic unit of data in mongoDB. Collections contain sets of documents and functions as the equivalent of the relational database tables.

Database systems with huge amount of the data sets and high throughput applications normally are a threat to the single server.  They can exceed the storage capacity of a single machine or the high query rates can add on more loads to the CPU capacity of the server leading to exhaust. To overcome these issues, there are two basic approaches: Vertical Scaling and Horizontal Scaling.

Vertical Scaling is adding upon more CPU and storage resources to increase the capacity. But this vertical concept also gives up some limitations like, more expensive to achieve high performance systems, and cloud-based providers allow users to provision only smaller instances. Whereas Horizontal Scaling by contrast, divides the huge data sets and distributes over multiple machines. Each individual server is a shard or an independent database and collectively the shards make up a single database.

 

SHARDING IN MONGODB

Sharding in mongoDB is implemented through three components.

  1. Config Server Query Router                                3.  Shard Server

 

Config Server:

Config Servers are used to store the metadata that maps the requested chunks into shards. Each production sharding implementation must contain exactly three configuration servers. This is to ensure the high availability and redundancy.

 

Query Routers:

Query routers are the machines that your application actually communicates with, to do the read and write operations. These machines are responsible for communicating to config servers to route the applications to shards. Each query router runs the “mongos” command.

 

Shard Server:

Shard Servers are responsible for the actual data storage operations. It’s a mongodb instance that holds a subset of a collection’s data. A shard can be a single mongodb instance or replica sets to ensure that data will be accessible in the event that the primary shard server goes down.

 

Mongodb Sharding

 

TYPES OF SHARDING

                To Shard a collection, we need to select a shard key. A shard key is either an indexed field or an indexed compound field that exist in every document of the collection. MongoDB divides the shard key values into chunks and then distributes the chunks evenly across the shards. There are two types of sharding in mongoDB.

  1. Range based Sharding                         2. Hash Based Sharding

 

Range Based Sharding:

For range based sharding, mongoDB divides the data set into ranges determined by the shard key values. To understand it in a clear way, assume a numeric line that goes from zero to infinity. Each value of the shard key falls somewhere else in the line. Now the mongoDB partitions this line into smaller non overlapping ranges called as chunks, which is the range of values between some minimum values to some maximum value.

Mongodb- Range Based

 

Hash Based Sharding:

For Hash based sharding, MongoDB computes a hash of a field’s value, and then uses these hashes to create chunks. Because of this, two documents with “close” shard keys are unlikely to be a part of the same chunk, thus ensuring a more random distribution of a collection in the cluster.

Mongodb- Hash Based

PERFORMANCE DIFFERENCE BETWEEN RANGE AND HASH BASED SHARDING:

Range based sharding supports more efficient range queries. Like if a range query has been given the query router can easily determine which chunks overlap that range and route the query to those shards only. However range based sharding may result in the uneven distribution of data, which may result in an unbalanced cluster. Whereas Hash based sharding on contrast ensures the even distribution of data. But as it is random distribution, it will not able to target a few shards as it needs to query every shard in order to return a result.

 

Written by Alex, Data Engineer @ DataDotz.

DataDotz is a Chennai based BigData Team primarily focussed on consulting and training on technologies such as Apache Hadoop, Apache Spark , NoSQL(HBase, Cassandra, MongoDB), Search and Cloud Computing.