MongoDB is a popular document database: easy to use, easy to scale, feature-rich, and high-performing. It provides its own solutions for high availability and partitioning, namely replica sets and sharding. We will focus on these two features below.


1. Replica Sets

Some people say that a MongoDB replica set needs at least three nodes, but this claim is debatable: a replica set can have as few as one node, and at most 12 nodes before 3.0 or 50 nodes starting with 3.0. However, with only one or two nodes, MongoDB cannot deliver the real benefits of a replica set, so three or more nodes are generally recommended.


First, let’s look at the various roles in a MongoDB replica set (a sample configuration touching several of these roles follows the list).

  • Primary: the primary server, which handles client requests, typically both reads and writes; a replica set has only one primary at a time.

  • Secondary: a secondary server; there can be several, each holding a copy of the primary's data. If the primary fails, one of the secondaries can be promoted to the new primary. Secondaries can also serve read-only traffic.

  • Hidden: typically used only for backups; it does not serve client read requests.

  • Secondary-only: a member that can never become primary and only serves as a secondary copy, which prevents low-spec nodes from being elected primary.

  • Delayed: a member with slaveDelay set; it does not serve client requests and usually should also be hidden.

  • Non-voting: a secondary with no vote in elections, purely a data-backup node.

  • Arbiter: an arbiter node; it stores no data and only participates in voting, and can never become primary.
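
To make these roles concrete, here is a minimal sketch of initiating a replica set whose members cover several of them. All host names are placeholders, and the option names (priority, hidden, slaveDelay, votes, arbiterOnly) follow the pre-5.0 replica set configuration document (slaveDelay was later renamed).

```python
from pymongo import MongoClient

# Connect to one of the not-yet-initiated members.
client = MongoClient("mongodb://node1:27017")

config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "node1:27017"},                                 # can become primary
        {"_id": 1, "host": "node2:27017"},                                 # ordinary secondary
        {"_id": 2, "host": "node3:27017", "priority": 0},                  # secondary-only
        {"_id": 3, "host": "node4:27017", "priority": 0, "hidden": True},  # hidden backup
        {"_id": 4, "host": "node5:27017", "priority": 0, "hidden": True,
         "slaveDelay": 3600},                                              # delayed by 1 hour
        {"_id": 5, "host": "node6:27017", "priority": 0, "votes": 0},      # non-voting
        {"_id": 6, "host": "node7:27017", "arbiterOnly": True},            # arbiter, no data
    ],
}
client.admin.command("replSetInitiate", config)
```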


Next, let’s think about how a MongoDB replica set synchronizes data. We know how Oracle Data Guard synchronizes and how MySQL master-slave replication works: both ship logs to the standby and then apply them there. A MongoDB replica set works in essentially the same way, which brings us to the core structure that synchronization depends on: the oplog. The oplog is much like MySQL’s binlog; it records every operation performed on the primary, and the secondaries copy the oplog and apply the operations to synchronize data. The oplog has a fixed size: by default 5% of free disk space is allocated (on 64-bit systems), and you can also set the size with the --oplogSize option. Unlike MySQL’s binlogs, the oplog is recycled in place, and unlike Oracle there are no multiple redo log groups and no archived logs; the oplog is a single fixed-size log that is reused over and over. If a secondary falls so far behind the primary that its position in the oplog has been overwritten, the only remedy is to start a full resync.
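
As a quick way to see how much history the oplog currently holds, the sketch below queries the capped collection local.oplog.rs directly, which is roughly what rs.printReplicationInfo() reports in the shell. The host name is a placeholder.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://node1:27017", replicaSet="rs0")
oplog = client.local["oplog.rs"]

# Oldest and newest entries in natural (insertion) order.
first = oplog.find().sort("$natural", 1).limit(1).next()
last = oplog.find().sort("$natural", -1).limit(1).next()

# "ts" is a BSON Timestamp; .time is seconds since the epoch.
window_hours = (last["ts"].time - first["ts"].time) / 3600.0
print("oplog window: %.1f hours" % window_hours)
```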


You might also ask whether MongoDB replica sets replicate in real time; this is really a question of consistency. MySQL’s semi-synchronous replication can guarantee strong consistency, and so can the maximum protection mode of Oracle Data Guard. MongoDB can make writes safer through the getLastError command (write concern), but replication is not a transactional operation, so strong consistency across the replica set cannot be achieved.
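
For illustration, here is a minimal sketch of the driver-level successor to getLastError: a write concern that waits for acknowledgement from a majority of members. The connection string and collection names are placeholders.

```python
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://node1:27017,node2:27017/?replicaSet=rs0")

# w="majority": acknowledged only after a majority of data-bearing members
# have the write; j=True also waits for the journal on those members.
orders = client.mydb.get_collection(
    "orders", write_concern=WriteConcern(w="majority", j=True)
)
orders.insert_one({"order_id": 1, "status": "new"})
```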


A secondary in a MongoDB replica set typically lags the primary by milliseconds, but the lag can grow much larger under heavy load, configuration errors, network failures, and so on.



A replica set gives us failover, switchover, and read/write separation. You may be curious how MongoDB elects a new primary during failover and how it prevents split brain; we will get to that shortly. By default, a MongoDB replica set directs all read and write pressure to the primary, but we can call setSlaveOk to push read pressure onto the secondaries. The MongoDB drivers also provide five read preferences (a driver-side example follows the list):


  • primary: the default; all reads go to the primary node.

  • primaryPreferred: reads go to the primary node, and fall back to a secondary only when the primary is unavailable.

  • secondary: reads go only to secondary nodes. The catch is that data on a secondary may be “older” than data on the primary.

  • secondaryPreferred: reads go to secondary nodes first, and fall back to the primary when no secondary is available.

  • nearest: reads go to whichever node, primary or secondary, has the lowest network latency.
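
A minimal sketch of choosing one of these read preferences from the driver side; host and database names are placeholders.

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0"
)

# Send reads to a secondary when one is available, otherwise to the primary.
db = client.get_database("mydb",
                         read_preference=ReadPreference.SECONDARY_PREFERRED)
print(db.orders.count_documents({}))
```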


Let’s take a look at how a MongoDB replica set holds an election. An election can simply be understood as the process of choosing a suitable node in the cluster and promoting it to primary. Like many NoSQL databases, MongoDB replica sets use the Bully algorithm, which is described in the Wikipedia documentation.


The general idea is that any member of the cluster can declare itself the primary and notify the other members, and the node that the other members accept becomes the primary. MongoDB replica sets have the concept of a “majority”: elections must follow the majority rule, a node can become primary only when it has the support of a majority, and the number of surviving members in the replica set must be at least that majority.



Starting with MongoDB 3.0, the limit on replica set members jumped from 12 to 50, but the number of voting members is still capped at 7.


MongoDB holds an election under the following conditions (a stepDown example follows the list):


  • Initializing the replica set;

  • A secondary cannot reach the primary (the primary may be down, or the network may be at fault);

  • The primary is asked to step down (replSetStepDown, 60 seconds by default).
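
A minimal sketch of the third trigger: asking the current primary to step down, which forces the remaining members to elect a new primary. The host name is a placeholder, and directConnection requires a reasonably recent driver.

```python
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

client = MongoClient("mongodb://node1:27017", directConnection=True)
try:
    # The node gives up the primary role and will not stand for election
    # again for 60 seconds.
    client.admin.command("replSetStepDown", 60)
except ConnectionFailure:
    # Stepping down closes client connections on some versions; the driver
    # surfaces this as a network error even though the stepDown succeeded.
    pass
```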


Here’s how the election works:


  • Get the timestamp of the last operation on each node. Every mongod keeps an oplog of its local operations, which makes it easy to compare against the primary, check whether data is in sync, and also helps with error recovery.

  • If a majority of the servers in the cluster are down, all remaining nodes stay in the secondary state and no election takes place.

  • If the last synchronization of the primary, or of all the secondaries in the cluster, looks too old, the election stops and waits for manual intervention.

  • If none of the above applies, the node with the most recent last-operation timestamp is elected primary.


When designing a MongoDB replica set architecture, some people mistakenly assume that the number of members must be odd. Is it actually a problem if the number of members in a replica set is even?



As the figure above clearly shows, within a single data center it makes no difference whether the number of replica set members is even or odd. However, with two data centers holding the same number of members each, if the heartbeat between the two data centers is interrupted, the whole cluster fails to elect a primary; this is split brain in a MongoDB replica set. So how do we prevent split brain? From an architectural perspective, we recommend the following:



On the left, the “majority” of members sit in one data center.

Requirement: The Primary of the replica set is always in the Primary data center

Disadvantages: No Primary node is available if the Primary data center fails


On the right, the two data centers have the same number of members, and a replica member (or an arbiter) is placed in a third location to break the tie.

Requirement: cross-data-center disaster recovery

Disadvantages: requires an additional, third data center


So the advice to use an odd number of replica set members really applies to multi-data-center deployments.


In addition, when designing a MongoDB replica set we also need to consider overload, which leads to poor database performance. Be sure to measure the read traffic and fully account for the possibility that a read or write node goes down.



MongoDB replica sets also involve the concepts of synchronization, heartbeat, and rollback, which I’ve briefly summarized below.


Synchronization

Initial sync: a full synchronization is performed from another member of the replica set. Triggering conditions:


  • When the Secondary node is added for the first time;

  • When a secondary falls behind by more than the oplog size;

  • When the rollback fails.


Steady-state sync: incremental synchronization after the initial sync.

Note: the sync source is not necessarily the primary node. MongoDB selects a sync source based on ping time, and it will only choose a member whose data is newer than its own.
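
A minimal sketch of checking each member's current sync source with the replSetGetStatus command; the host name is a placeholder, and the field name varies by server version (syncingTo on older releases, syncSourceHost on newer ones).

```python
from pymongo import MongoClient

client = MongoClient("mongodb://node1:27017", replicaSet="rs0")
status = client.admin.command("replSetGetStatus")

for member in status["members"]:
    source = member.get("syncSourceHost") or member.get("syncingTo") or "-"
    print(member["name"], member["stateStr"], "sync source:", source)
```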


Heartbeat


  • Which node is the primary? Which node is down? Which node can serve as a sync source? Heartbeats answer all of these questions;

  • Every node sends a heartbeat request to every other node every 2 seconds and maintains its own view of the set's state based on the results;

  • The primary uses heartbeats to check whether it can still reach a “majority”; if it cannot, it steps down to secondary.


Rollback



The secondaries did not have time to replicate a write before the primary went down, so the newly elected primary does not have that write. When the old primary recovers and rejoins as a secondary, the write has to be rolled back so that synchronization can continue. If the data to roll back exceeds 300 MB, or the operations to roll back span more than 30 minutes, the rollback fails and a full resync has to be performed.



2. Sharding


Sharding is, in essence, the splitting of data: dividing it across multiple nodes, also known as horizontal partitioning. MongoDB supports automatic sharding, and whatever the pros and cons of auto-sharding, MongoDB still prides itself on this feature.


MongoDB sharding applies to the following scenarios:


  • A single server can no longer bear the pressure, whether that pressure is load, frequent writes, or throughput;

  • The server's disk space is insufficient;

  • You want to increase the available memory so that more data can be served from memory.



As shown in the figure above, MongoDB sharding has three components as follows:


Mongos: the entry point for requests to the cluster. It acts as a router, forwarding each data request to the appropriate shard server. A production environment requires multiple mongos instances.

Config server: stores the metadata of the cluster and its shards. Mongos loads the configuration from the config servers at startup, and if the configuration changes later, all mongos instances are notified to update their state. Production environments require multiple config servers, as well as regular backups.

Shard server: holds the actual shard data. In a production environment each shard should be a replica set.


Here’s a simple diagram of the sharding process:



Before sharding, a collection can be thought of as a single chunk containing all of its documents.



After a shard key is chosen, the collection is split into multiple chunks. The first and last chunks are bounded by $minKey and $maxKey, which stand for negative and positive infinity respectively. These are used internally by MongoDB sharding; it is enough for us simply to know about them.



The resulting chunks are then distributed evenly across the nodes.


Some of you might be wondering how a chunk gets split. Again, I drew four diagrams to explain it.



  • Mongos tracks how much data is written to each chunk and, when a threshold is reached, checks whether the chunk needs to be split;

  • If a split is needed, mongos updates the chunk metadata on the config server;

  • The config server creates the new chunk and modifies the range of the old chunk (the split point);

  • After the split, mongos resets its tracking counter for the original chunk and starts tracking the new chunk.


Note: a chunk is a logical concept. A chunk is not a physically stored page or file; it exists only as metadata on the config servers. In other words, splitting a chunk only modifies metadata and does not move any data.
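
To see that chunks really are just metadata, the sketch below reads the chunk documents on the config database through mongos. Host, database, and collection names are placeholders; note that on very recent versions config.chunks is keyed by collection UUID rather than by "ns".

```python
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos1:27017")

# Each document describes one chunk: its owning shard and its key range.
for chunk in mongos.config.chunks.find({"ns": "mydb.users"}):
    print(chunk["shard"], chunk["min"], chunk["max"])
```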


Chunk splitting has its own pitfalls, such as failing to find a split point and ending up with an ever-growing chunk, or an unreachable config server causing a split storm.



Poor shard key choice: mongos notices that a chunk has reached the threshold and asks the shard to split it, but the shard cannot find a split point, so the chunk keeps growing.


Chain reaction: an unreasonable shard key -> oversized chunks (that cannot be split) -> chunks that cannot be moved -> unbalanced data distribution -> unbalanced writes -> an even more unbalanced data distribution

Prevention: choose the shard key correctly


Config server unreachable: mongos cannot reach the config server during the split, so the metadata cannot be updated. Mongos then cycles between attempting the split and failing, which hurts the performance of both mongos and the shard involved. This pattern of repeatedly issuing split requests that never succeed is called a split storm.


Prevention:

1) Ensure that the config servers are available;

2) Restart mongos to reset its write counters.


Having said all that, we still have not covered how sharding is set up. There are two approaches:


  • Building a sharded cluster from scratch: typically when a new application launches and sharding is chosen during architecture design;

  • Converting a replica set to a sharded cluster: after the service has been running for a while, a single replica set can no longer keep up and must be converted to a sharded cluster.


There is not much to say about the first approach, except that choosing the right shard key is especially important. Converting a replica set to a sharded cluster works as follows (a sketch of the corresponding commands appears after the list):


  1. Deploy the config servers and mongos;

  2. Connect to mongos and add the existing replica set to the cluster; it becomes the first shard;

  3. Deploy additional replica sets and add them to the cluster as well;

  4. Change the client configuration so that all access goes through mongos;

  5. Choose a shard key and enable sharding.
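
A minimal sketch of steps 2 to 5 as admin commands issued through mongos; host names, replica set names, and the mydb.users namespace are placeholders.

```python
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos1:27017")

# Steps 2 and 3: register the original replica set and a new one as shards.
mongos.admin.command("addShard", "rs0/node1:27017,node2:27017,node3:27017")
mongos.admin.command("addShard", "rs1/node4:27017,node5:27017,node6:27017")

# Step 5: the shard key of an existing collection must be indexed first.
mongos.mydb.users.create_index("user_id")
mongos.admin.command("enableSharding", "mydb")
mongos.admin.command("shardCollection", "mydb.users", key={"user_id": 1})
```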


Note: to shard an existing collection, make sure there is an index on the shard key; if there is none, create it first.


A very important component in MongoDB sharding is the balancer, a role actually played by mongos. The balancer is responsible for chunk migration: it periodically checks whether chunks are balanced across the shards and, if they are not, starts migrating chunks. Chunk migration does not affect application access; before the migration completes, reads and writes still go to the old chunk. Once the metadata update is complete, any mongos that tries to access the data at the old location gets an error, but the client never sees it: mongos silently handles the error and retries the operation on the new shard.


One mistake you might make here is assuming that balancing is based on data size. Remember: the balance between shards is measured by the number of chunks, not by the size of the data.


Chunk migration can still hurt performance in some scenarios, for example with a hotspot shard key: since all new chunks are created on the hotspot shard, the system has to keep up with the data continuously written to it. When a new shard is added to the cluster, the balancer also triggers a series of migrations.


For a long time we may not have known how chunk migration is actually done; the simplified process is as follows:


  1. The balancer process sends a moveChunk command to the source shard;

  2. The source shard starts moving the chunk; during the move, all operations on that chunk are still routed to the source shard;

  3. The target shard builds all the indexes that exist on the source shard, unless it already has them;

  4. The target shard starts requesting the documents in the chunk and receives a copy of the data;

  5. After receiving the last document, the target shard synchronizes any changes that were made while the chunk was being moved;

  6. Once fully synchronized, the target shard updates the metadata on the config server (the chunk's new location);

  7. After the metadata is updated and it is confirmed that there are no open cursors on the chunk, the source shard deletes its copy of the data.


The sharding architecture and process are covered above, but in fact the most important part of MongoDB sharding is choosing the shard key correctly. What is a shard key? One or two fields of the collection are selected to split the data on; this key is the shard key. The shard key should be chosen at the very start of sharding, because it is very hard to change once the cluster is running.


How do we choose a shard key? First, let's analyze shard keys from the perspective of data distribution. The common distribution patterns are as follows:


  • Ascending shard keys: keys that grow steadily over time, such as a date or an ObjectId.


(Note: MongoDB itself has no auto-incrementing primary key.)

Symptom: new data is concentrated on a single shard

Disadvantages: MongoDB is kept busy balancing data


  • Randomly distributed shard keys: keys with no ordering in the data set, such as usernames, MD5 hashes, email addresses, or UUIDs


Symptom: all shards grow at roughly the same rate, which reduces the number of migrations

Disadvantages: randomly accessing a data set larger than the available memory is inefficient


  • Location-based shard keys: “location” here is an abstract notion, such as an IP address, latitude and longitude, or a postal address.


Symptom: documents whose key values are close together are stored in the same range-based chunk.


We can also choose an appropriate shard key according to the application type. The strategies are as follows:


  • Hashed shard keys: data is distributed randomly (see the example after this list).


Suitable for: applications that want fast data loading, use an ascending key in a large number of queries, and yet want written data to be distributed randomly

Disadvantages: range queries on a hashed key cannot be targeted to specific shards


Note: a hashed shard key cannot be unique, cannot be an array field, and floating-point values are truncated to whole numbers before hashing.


  • GridFS hashed keys: GridFS is well suited to sharding because it can hold large amounts of file data

  • Traffic-steering strategy: a higher-performance server in the cluster (for example, one with SSDs) is combined with a shard tag plus an ascending shard key so that it takes on more of the load


Disadvantages: if the traffic exceeds the capacity of that powerful server, it is not easy to rebalance the load onto the other servers
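
Here is the hashed-shard-key example promised above: a minimal sketch of sharding a collection on a hashed key through mongos. The mydb.events namespace and the user_id field are placeholders.

```python
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos1:27017")

mongos.admin.command("enableSharding", "mydb")
# "hashed" makes MongoDB distribute documents by the hash of user_id, so
# writes spread evenly across shards at the cost of range queries on user_id.
mongos.admin.command("shardCollection", "mydb.events",
                     key={"user_id": "hashed"})
```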


OK, that covers shard keys. To sum up: if you use MongoDB sharding, choose the right shard key from the very beginning based on your application type, so that sharding performs at its best and your application performs well too.


We are nearing the end of the sharding discussion; finally, let's look at auto-sharding versus manual sharding.


Auto-sharding

Although chunk migration is officially said to have little impact on reads and writes, it can push hot data out of memory along the way, which increases I/O pressure. It is therefore worth considering disabling automatic balancing during normal hours and running it only during low-traffic periods.
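
A minimal sketch of confining the balancer to a low-traffic window by updating the balancer settings document on the config database through mongos, the same document the documented shell commands edit. The host name and the 23:00-06:00 window are placeholders.

```python
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos1:27017")

# Only allow chunk migrations between 23:00 and 06:00 (mongos local time).
mongos.config.settings.update_one(
    {"_id": "balancer"},
    {"$set": {"activeWindow": {"start": "23:00", "stop": "06:00"}}},
    upsert=True,
)

# To disable balancing entirely instead:
# mongos.config.settings.update_one(
#     {"_id": "balancer"}, {"$set": {"stopped": True}}, upsert=True)
```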


Also, moveChunk in mongos is single-threaded: a single mongos can migrate only one chunk at a time.


Pre-splitting

Also called manual sharding; automatic balancing has to be turned off in advance. In this scenario we need to understand our own data distribution thoroughly and pre-partition the data, that is, split the shard key space into appropriately sized ranges for each shard, and then place them with manual moveChunk operations.
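
A minimal sketch of pre-splitting and then manually placing chunks through mongos; the mydb.users namespace, the split points, and the shard name rs1 are placeholders chosen purely for illustration.

```python
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos1:27017")

# Carve the shard key space into ranges at chosen split points.
for point in (1000000, 2000000, 3000000):
    mongos.admin.command("split", "mydb.users", middle={"user_id": point})

# Manually place the chunk containing user_id 1500000 on a specific shard.
mongos.admin.command("moveChunk", "mydb.users",
                     find={"user_id": 1500000}, to="rs1")
```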


Automatic sharding sounds like the ideal choice, but in real-world applications it still has plenty of pitfalls. Only if we keep stepping into them and filling them, and have enough capability and experience to control every detail, should we go with automatic sharding without hesitation. Many companies still avoid that road and choose manual sharding instead, mainly because manual sharding is much easier to keep under control.