Replica set: a MongoDB replica set is a group of mongod processes that maintain the same data set. Replica sets are the basis of production deployments, providing data redundancy and high availability.

So why use a replica set?

  • Because copies of the data are kept on different servers, a production deployment gets data redundancy and reliability, and a single point of failure does not cause data loss.
  • Read operations can be served from different replicas, spreading load across the system and improving overall read capacity.

1. Replica set architecture principles

A replica set contains multiple data-bearing nodes and, optionally, one quorum node. Among the data-bearing nodes, exactly one is the primary; the others are secondaries. Members communicate with each other through a heartbeat mechanism. When the primary does not communicate with the other members for longer than the configured electionTimeoutMillis period (10 seconds by default), an eligible secondary calls for an election to nominate itself as the new primary, and the cluster attempts to elect a new primary and resume normal operation.

The primary node:

Oplog: the oplog keeps a rolling record of all operations that modify the data stored in the databases. MongoDB applies a database operation on the primary and then records it in the primary's oplog; the secondaries then copy these oplog entries from their sync source in an asynchronous process and apply the operations. Every operation in the oplog is idempotent. All replica set members keep a copy of the oplog in the local.oplog.rs collection, which allows them to maintain the current state of the database.
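For illustration, here is a quick way to peek at the oplog from the mongo shell; local.oplog.rs is a capped collection that every data-bearing member maintains:

// Switch to the "local" database, where each member stores its copy of the oplog.
use local

// Show the three most recent oplog entries, newest first.
// "ts" is the operation timestamp, "op" the operation type (i = insert, u = update, d = delete),
// and "ns" the namespace (database.collection) the operation applied to.
db.oplog.rs.find().sort({ $natural: -1 }).limit(3).pretty()

// Report the configured oplog size and the time window it currently covers.
rs.printReplicationInfo()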

Secondary nodes: a secondary replicates the primary's oplog and applies the recorded operations to its own data set. If the primary goes down, a new primary is elected from among the eligible secondaries. Secondaries can also be given specific roles through configuration, for example:

  • Set a member's priority to 0 to prevent it from becoming the Primary node in an election.
  • Hide a member from client applications so reads never reach it, letting it serve workloads that must be kept separate from normal traffic.
  • Keep a delayed member as a running "historical" snapshot of the data, to recover from errors such as an inadvertently dropped database.

Quorum (arbiter) node: a quorum node does not hold a copy of the data set. Its purpose is to help maintain a quorum in the replica set by responding to heartbeat and election requests from the other members. Because it stores no data, a quorum node is a cheaper way to provide an extra vote than a full data-bearing member. If your replica set has an even number of data-bearing members, add a quorum node so that a majority can be reached in primary elections. A quorum node always has exactly one vote, which lets the replica set keep an odd number of voting members without the overhead of an additional member replicating the data.
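For illustration, adding an arbiter to an existing set might look like the following from a mongo shell connected to the primary (the hostname is a placeholder):

// Add an arbiter that was started with, e.g., mongod --replSet rs0 --port 27017
rs.addArb("arbiter.example.net:27017")

// Verify that the new member reports the ARBITER state.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr);
});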

Heartbeat: members of a replica set send heartbeat (ping) messages to one another every 2 seconds to monitor each other's state; missed heartbeats are what trigger the failover election described above.

Data synchronization: to maintain an up-to-date copy of the shared data set, the secondaries of a replica set sync, or replicate, data from other members. MongoDB uses two forms of data synchronization: an initial sync to populate new members with the full data set, and continuous replication to apply subsequent data changes across the set.

Initial Sync:

  • Clone all databases except the local database. To clone, the mongod scans every collection in each source database and inserts all documents into its own copy of that collection. Initial sync builds all collection indexes as the documents are copied; in earlier MongoDB versions, only the _id index was built at this stage.

  • Initial sync pulls newly added oplog records that arrive while the data is being copied. Make sure the target member has enough disk space in the local database to temporarily store these oplog records for the duration of this copy phase.

  • Apply all changes to the data set. Using the oplog from the sync source, the mongod updates its data set to reflect the current state of the replica set. After the initial sync, the member transitions from STARTUP2 to SECONDARY (see the sketch below).
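A minimal way to watch these states from the shell (helper names as in recent MongoDB shells; older shells call the second helper rs.printSlaveReplicationInfo()):

// List each member's replication state; a member performing an initial sync
// reports STARTUP2 and switches to SECONDARY once the sync completes.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  (health: " + m.health + ")");
});

// Show how far each secondary's oplog lags behind the primary.
rs.printSecondaryReplicationInfo()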

2. Standard replica set architecture

The standard replica set architecture consists of three servers: either three data nodes (one primary, two secondaries) or two data nodes (one primary, one secondary) plus one quorum node. The two layouts, followed by a minimal initiation sketch, are as follows:

Three data nodes:

  • A primary node;
  • Two secondary nodes, either of which can be elected primary if the current primary goes down.

Two data nodes and one quorum node:

  • A primary node;
  • A secondary node that can be elected as the primary node;
  • A quorum node with voting rights only.
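A minimal initiation sketch for the second layout (primary, secondary, arbiter), with placeholder hostnames:

rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "mongo1.example.net:27017" },                    // data node, electable
        { _id: 1, host: "mongo2.example.net:27017" },                    // data node, electable
        { _id: 2, host: "mongo3.example.net:27017", arbiterOnly: true }  // quorum node, votes only
    ]
})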

3. Node types

Priority 0 node

A priority 0 member cannot become primary and cannot trigger an election. Setting a secondary to priority 0 to keep it from becoming primary is particularly useful in multi-data-center deployments. In most cases you do not need to set members to priority 0; however, in replica sets with varied hardware or geographic distribution, a priority 0 secondary ensures that only certain members can become primary, which you can configure based on the actual network quality between sites, the risk of network partitions, and so on.

For example, one data center may host the primary and a secondary, while a second data center hosts a priority 0 secondary that cannot become primary. A configuration sketch follows:
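A minimal sketch, assuming the member at index 2 is the one in the second data center:

cfg = rs.conf()
cfg.members[2].priority = 0   // the member in the second data center can no longer become primary
rs.reconfig(cfg)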

Hidden node

  • A hidden member maintains a copy of the primary's data set but is invisible to client applications. Hidden members suit workloads whose usage patterns differ from those of the other replica set members.
  • A hidden member must always have priority 0 and therefore can never become the primary. Hidden members may still vote in elections.
  • Because a hidden member never receives requests from applications, it can be dedicated to reporting or backups (a configuration sketch follows this list).
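A sketch of turning an existing secondary into a hidden member (the member index is a placeholder):

cfg = rs.conf()
cfg.members[2].priority = 0   // a hidden member must have priority 0
cfg.members[2].hidden = true  // invisible to clients; use it for reporting or backups
rs.reconfig(cfg)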

Delayed node

Because a delayed member is a "rolling backup" or running "historical" snapshot of the data set, it helps you recover from various human errors, such as a failed application upgrade or an operator mistake like a dropped database or collection. A delayed member must be a priority 0 member and should also be hidden, so it can never become primary and is never queried by clients. The length of the delay:

  • Must be equal to or greater than your expected maintenance window.
  • Must be smaller than the time window covered by the oplog (a configuration sketch follows this list).
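A sketch of configuring a one-hour delayed member (in MongoDB 5.0+ the field is named secondaryDelaySecs instead of slaveDelay; the member index is a placeholder):

cfg = rs.conf()
cfg.members[2].priority = 0        // a delayed member must not be electable
cfg.members[2].hidden = true       // and is normally hidden from clients
cfg.members[2].slaveDelay = 3600   // apply operations one hour behind the primary (seconds)
rs.reconfig(cfg)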

Voting and non-voting nodes

Whether a replica set member has voting rights is determined by its members[n].votes setting: votes: 1 makes it a voting member, votes: 0 makes it a non-voting member. A non-voting member must have priority 0, and a member with priority greater than 0 cannot have votes set to 0. Although non-voting members do not vote in elections, they hold a copy of the replica set data and can serve read operations from client applications.

In addition, a replica set can contain at most 50 members but only 7 voting members, so non-voting members are what allow a replica set to grow beyond 7 members. A voting member can cast a vote only when it is in one of the following states:

  • PRIMARY
  • SECONDARY
  • STARTUP2
  • RECOVERING
  • ARBITER
  • ROLLBACK

The configuration of such a non-voting, priority 0 member looks like this:

{
   "_id" : <num>,
   "host" : <hostname:port>,
   "arbiterOnly" : false,
   "buildIndexes" : true,
   "hidden" : false,
   "priority" : 0,
   "tags" : {
},
   "slaveDelay" : NumberLong(0),
   "votes" : 0
}
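To make an existing member non-voting, a reconfiguration along these lines would apply the relevant fields (the member index is a placeholder):

cfg = rs.conf()
cfg.members[4].votes = 0      // remove the member's vote
cfg.members[4].priority = 0   // a non-voting member must also have priority 0
rs.reconfig(cfg)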

4. Deployment structure

Maximum number of voting members: a replica set can contain up to 50 members, but only 7 of them can be voting members. If the replica set already has seven voting members, any additional members must be non-voting.

Deploy an odd number of voting members: make sure the set has an odd number of voting members; if you end up with an even number, add a quorum node so that the total becomes odd. Quorum nodes do not store a copy of the data and need few resources, so you can run them on an application server or another shared host. Fault tolerance: the fault tolerance of a replica set is the number of members that can become unavailable while still leaving enough members to elect a primary. It depends on the replica set size, as shown in the following table:

Number of members    Majority required to elect a new primary    Fault tolerance
3                    2                                            1
4                    3                                            1
5                    3                                            2
6                    4                                            2

It follows that adding an even number of members to a replica set does not always increase fault tolerance. In those cases, however, the extra member can still provide value for specialized functions such as backups or reporting if it is configured as a hidden or delayed secondary.

Increase read capacity: in a deployment with very heavy read traffic, you can improve read throughput by distributing reads across the secondaries. As the deployment grows, add or move members to standby data centers to improve redundancy and availability.

Advantages of replica sets distributed in two or more data centers:

  • If one of the data centers fails, the data can still be read.
  • If the data center holding a minority of the members fails, the replica set can still serve both writes and reads. However, if the data center holding the majority of members fails, the replica set becomes read-only.

Deploy data nodes in different regions (with a standby data center): to protect your data if a data center fails, keep at least one member in a standby data center. If possible, use an odd number of data centers and distribute the members so that, even if one data center is lost, the remaining members can still form a majority, elect a primary, and serve copies of the data. To ensure that nodes in the active data center are elected primary before nodes in the standby data center, set members[n].priority of the standby members lower than that of the active members. A minimal sketch follows:
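A minimal sketch, assuming members 0 and 1 sit in the active data center and member 2 in the standby one:

cfg = rs.conf()
cfg.members[0].priority = 2    // active data center: preferred primaries
cfg.members[1].priority = 2
cfg.members[2].priority = 0.5  // standby data center: electable only as a last resort
rs.reconfig(cfg)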

For a replica set of three members, reasonable distributions and their behavior are as follows:

  • Two data centers: two members in data center 1 and one member in data center 2. If one of the members is an arbiter, place the arbiter in data center 1 together with a data-bearing member.
  • If data center 1 goes down, the replica set becomes read-only.
  • If data center 2 goes down, the replica set remains writable because the members in data center 1 can still elect a primary.
  • Three data centers: one member in data center 1, one in data center 2, and one in data center 3.
  • If any single data center fails, the replica set remains writable because the remaining members can still hold an election.

For a replica set of five members, reasonable distributions and their behavior are as follows:

  • Two data centers: three members in data center 1 and two members in data center 2.
  • If data center 1 goes down, the replica set becomes read-only.
  • If data center 2 goes down, the replica set remains writable because the members in data center 1 can still form a majority.
  • Three data centers: two members in data center 1, two in data center 2, and one in data center 3.
  • If any single data center fails, the replica set remains writable because the remaining members can still hold an election.

Elections: the cluster can elect a new primary on its own, which is what gives it high availability. The factors and conditions that affect an election are as follows:

  • Election protocol: the replication protocol version (protocolVersion) used by the replica set.
  • Heartbeat mechanism: by default, replica set members send heartbeat messages to each other every 2 seconds. If no heartbeat is received from a member within 10 seconds, that member is considered down and unreachable. If that member is the Primary, an electable Secondary initiates a new primary election (a timeout-tuning sketch follows this list).
  • Node priority: members tend to vote for the member with the highest priority. A member with priority 0 cannot become Primary and will not call an election. When the Primary discovers a Secondary with a higher priority whose data lags behind by no more than 10 seconds, the Primary steps down voluntarily so that the higher-priority Secondary has a chance to become Primary.
  • Loss of a data center: in a distributed replica set, losing a data center may prevent the remaining data centers or members from electing a primary. If possible, distribute members across data centers to maximize the chance that, even if one data center is lost, one of the remaining members can still become the new primary.
  • Network partition: a Primary can only remain primary while it can reach a majority of the voting members. If the Primary loses contact with the majority, it steps down to Secondary. When a network partition occurs, multiple primaries may briefly exist, so drivers should write with a "majority" write concern; then, even if there are several primaries, only one of them can successfully acknowledge a majority write.
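If the default 10-second failover detection is too aggressive or too slow for your network, the election timeout can be tuned through the replica set configuration, for example:

// Heartbeats are sent every 2 seconds; settings.electionTimeoutMillis (default 10000)
// controls how long members wait before calling an election.
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 12000   // tolerates longer network hiccups, slower failover
rs.reconfig(cfg)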

5. Write concern and Read Preference

5.1 Write concern

Write Concern describes how many data-bearing members (that is, primary and secondary members, not arbiters) must acknowledge a write operation before the operation returns success. A member can acknowledge a write only after it has received and successfully applied it. For replica sets, the default w: 1 write concern requires only the primary to acknowledge the write before confirmation is returned. You can specify an integer greater than 1 to require acknowledgement from the primary plus as many secondaries as needed to reach that number, up to the total number of data-bearing members in the replica set.

The client issues a write operation with a write concern and waits until the primary has received acknowledgements from the required number of members. For a write concern greater than 1, or w: "majority", the primary returns the confirmation to the client only after the required number of secondaries have acknowledged the write. For w: 1, the primary can reply as soon as it has applied the write locally, since it alone satisfies the requested write concern.

A write operation that times out on its write concern only indicates that the required number of replica set members did not acknowledge the write within the wTimeout period; it does not necessarily mean the primary failed to apply the write.

Verifying writes: add the writeConcern option to the insert() method, requesting "majority" acknowledgement with a 5-second timeout so that the operation does not block indefinitely, as follows:

db.products.insert(
   { item: "envelopes", qty: 100, type: "Clasp" },
   { writeConcern: { w: "majority", wtimeout: 5000 } }
)

For example, in a replica set of three members, this operation requires acknowledgement from two of the three members. If the replica set is later scaled to include two additional voting members, the same operation requires acknowledgement from three of the five members. If the primary does not return the write concern acknowledgement within the wtimeout limit, the write operation fails with a write concern error.

Changing the default write concern: you can modify the replica set's default write concern through the settings.getLastErrorDefaults field of the replica set configuration. For example, to make write operations wait by default for acknowledgement from a majority of the voting members before returning:

cfg = rs.conf()
cfg.settings.getLastErrorDefaults = { w: "majority", wtimeout: 5000 }
rs.reconfig(cfg)

5.2 Read Preference

Read Preference controls how MongoDB routes read operations to the members of a replica set. By default, an application directs its reads to the primary member (read preference mode "primary"), but clients can specify a read preference to send read operations to secondaries. The read preference modes are as follows:

  • Primary: the default; all read requests go to the Primary
  • PrimaryPreferred: prefer the Primary; if the Primary is unreachable, read from a Secondary
  • Secondary: all read requests go to Secondaries
  • SecondaryPreferred: prefer Secondaries; only when no Secondary is reachable, read from the Primary
  • Nearest: send the read request to the nearest reachable member (lowest ping latency)

The following are common use cases for the read preference modes:

  • Provides local reads for geographically distributed applications.
  • If you have application servers installed in multiple data centers, consider using geographically dispersed replica sets and non-primary or nearest read preferences. This allows the client to read from the lowest-latency member instead of always reading from the primary member.
  • Maintain availability during failover.
  • Use primaryPreferred if you want the application to read from the primary under normal conditions but accept potentially stale reads from secondaries when the primary is unavailable. This gives the application a "read-only mode" during failover (a sketch follows this list).
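For illustration, a read preference can be set per query in the mongo shell or as a default in the connection string (the collection name is reused from the earlier write concern example):

// Per-query: read from a secondary if one is available, otherwise from the primary.
db.products.find({ type: "Clasp" }).readPref("secondaryPreferred")

// Per-connection: set a default read preference in the connection string, e.g.
// mongodb://mongo1.example.net,mongo2.example.net/?replicaSet=rs0&readPreference=nearest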


Finally, feel free to follow the official account so we can learn together; useful content is shared every day, and learning videos are available on request!