Contents

    • Preface
    • Introduction to High Availability
    • Client Retry
    • Distro, the Consistency Protocol
    • Local Cache File Failover Mechanism
    • Heartbeat Synchronization Service
    • Cluster Deployment Mode High Availability
      • Number of Nodes
      • Multiple Availability Zone Deployment
      • Deployment Patterns
    • High Availability Best Practices for MSE Nacos
    • Conclusion

Introduction: The high availability of Nacos I describe today is the series of measures Nacos has taken to improve system stability. High availability exists not only on the server side but also on the client side, together with some availability-related features. Assembled together, these points make up the high availability of Nacos.

Preface

Service registration and discovery is an enduring topic. ZooKeeper, the default registry in Dubbo's early open-source days, was the first to enter people's sight, and for a long time people equated "registry" with ZooKeeper. ZooKeeper's designers probably never anticipated how much impact the product would have on microservices. It was not until Spring Cloud became popular and brought its own Eureka into view that people realized there were other registry options. Later, Alibaba, active in open source, also turned its attention to the registry, and Nacos emerged.

Kirito's thoughts on choosing a registry: Once there was no choice; now I just want to choose a good registry. It should preferably be open source: open, transparent, and under one's own control. Beyond being open source, it should have an active community, ensuring that features evolve to meet growing business needs, that problems get fixed, and that functionality is robust. In addition to service registration and push, it should provide the capabilities a complete microservice system requires. Most importantly, it should be stable, ideally backed by real production use at large companies to prove it can stand the test of practice. Of course, cloud-native and security features also matter.

Kirito may seem to be asking too much of a registry, but comparison is inevitable when the various registries are laid out before users. As mentioned above, features, maturity, availability, user experience, cloud-native capabilities, and security are all points of comparison. Today's article focuses on the availability of Nacos, and hopefully it will give you a deeper understanding of Nacos.

Introduction to High Availability

When we talk about high availability, what are we talking about?

  • The system availability reaches 99.99%
  • In a distributed system, the failure of some nodes does not affect the overall system running
  • Multiple servers are deployed in a cluster

These can all be considered high availability. The Nacos high availability I introduce today is the series of measures Nacos takes to improve system stability. It exists not only on the server side but also on the client side, together with some availability-related features. Assembled together, these points make up the high availability of Nacos.

Client Retry

To unify the terminology: there are generally three roles in a microservice architecture: Consumer, Provider, and Registry. In today's topic the Registry is nacos-server, while both Consumer and Provider are nacos-client.

In a production environment, we usually need to set up a Nacos cluster, and in Dubbo we also need to configure the cluster address explicitly:

<dubbo:registry protocol="nacos" address="192.168.0.1:8848,192.168.0.2:8848,192.168.0.3:8848" />

When one of the machines goes down, there is a retry mechanism on the client so as not to affect the overall operation.

The logic is simple: get the list of addresses and try each one until the request succeeds.
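That logic can be sketched as follows. This is a minimal illustration, not the actual nacos-client implementation; the function names and the shuffle step are assumptions:

```python
import random

def request_with_failover(servers, do_request):
    """Try each server in the address list until one call succeeds.

    `servers` is the cluster address list; `do_request` performs the
    actual call and raises ConnectionError on failure. Illustrative
    sketch, not the real nacos-client code.
    """
    candidates = list(servers)
    random.shuffle(candidates)  # spread load across nodes
    last_error = None
    for server in candidates:
        try:
            return do_request(server)
        except ConnectionError as e:
            last_error = e  # this node may be down; try the next one
    raise RuntimeError(f"all servers failed: {last_error}")
```

As long as at least one node in the list is reachable, the request eventually succeeds and the caller never notices the failed nodes.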

This availability guarantee exists on the nacos-client side.

Distro, the Consistency Protocol

This section does not walk through the implementation of a consistency protocol, but rather focuses on its features related to high availability. Some articles describe the consistency model of Nacos as AP + CP, which is misleading. In fact, Nacos does not let a single service choose between two consistency models, nor does it support switching between them. Before introducing the consistency model, we need to understand two concepts in Nacos: ephemeral services and persistent services.

  • Ephemeral services: an ephemeral service is removed from the instance list after failing health checks; this is commonly used in service registration and discovery scenarios.
  • Persistent services: a persistent service is only marked unhealthy if it fails health checks; this is commonly used in DNS scenarios.

Ephemeral services use Distro, a protocol Nacos designed specifically for service registration and discovery scenarios, and their consistency model is AP; persistent services use the Raft protocol, and their consistency model is CP. So rather than saying Nacos is AP + CP, it is better to qualify the statement with the service type or usage scenario.

What does the Distro protocol have to do with high availability? In the previous section we mentioned that when a nacos-server node goes down, the client retries, but this only helps if nacos-server can still function with one node missing. Nacos is a stateful application, unlike a typical stateless web application: it is not the case that any single surviving node can serve requests. Whether it can depends on the design of the consistency protocol. The Distro protocol works as follows:

  • On startup, a Nacos node synchronizes all data from the other remote nodes.
  • Every Nacos node is a peer and can process write requests, synchronizing new data to the other nodes.
  • Each node is responsible for only part of the data, and periodically sends checksums of its data to the other nodes to maintain consistency.

As shown in the figure above, each node is responsible for writes for a portion of the services, but every node can receive a write request, so there are two cases:

  • When a node receives a write for a service it is responsible for, it writes the data directly.
  • When a node receives a write for a service it is not responsible for, it routes the request to the responsible node in the cluster to complete the write.
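The two cases can be sketched as follows. The hash-mod distribution here is a simplified stand-in for Distro's actual responsibility mapping, and all names are illustrative:

```python
import zlib

def responsible_node(service_name, nodes):
    """Map a service name to the node responsible for its writes.

    Simplified stand-in for Distro's responsibility hash: each node
    owns the services whose hash falls into its slot.
    """
    return nodes[zlib.crc32(service_name.encode()) % len(nodes)]

def handle_write(local_node, service_name, nodes, write_local, forward):
    """Either write locally or route the request to the owning node."""
    owner = responsible_node(service_name, nodes)
    if owner == local_node:
        write_local(service_name)     # case 1: this node owns the service
    else:
        forward(owner, service_name)  # case 2: route to the responsible node
```

Because the mapping is a pure function of the service name and the node list, every node agrees on who owns what, and when the node list shrinks after a failure, ownership is recomputed over the surviving nodes.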

Read operations do not require routing because each node in the cluster synchronizes the service status and each node has a copy of the latest service data.

When a node goes down, responsibility for writes to its services is transferred to other nodes, ensuring the overall availability of the Nacos cluster.

A more complicated case is when a node does not go down, but a network partition occurs, as shown below:

This situation hurts availability: from the client's perspective, a service may appear and disappear intermittently.

In summary, the Distro consistency protocol ensures that, in most cases, the failure of one machine in the cluster does not compromise overall availability. This availability guarantee exists on the nacos-server side.

Local Cache File Failover Mechanism

The worst-case registry failure is the entire server cluster going down, and even then Nacos has a high-availability mechanism as a backstop.

A classic Dubbo interview question: if the Nacos registry goes down while a Dubbo application is running, are RPC calls affected? The answer is no: Dubbo keeps a copy of the address list in memory. On the one hand this is for performance, since reading the registry on every RPC call would be prohibitive; on the other hand, when the registry goes down, the in-memory copy keeps the application working. This also ensures availability (though perhaps Dubbo's designers did not have that in mind).

What if, on top of that, I raise the stakes: if the Nacos registry stays down and the Dubbo application is restarted, are RPC calls affected? If you understand Nacos's failover mechanism, you should reach the same answer as before: no.

Nacos has a local file caching mechanism. After receiving a service push from nacos-server, nacos-client keeps a copy in memory and writes a snapshot to disk. The default snapshot path is {USER_HOME}/nacos/naming/.

This file serves two purposes: first, it can be used to check whether the server is pushing services normally; second, when the client cannot fetch data from the server at startup, it falls back to loading data from this local file.

This behavior is controlled by the namingLoadCacheAtStart=true parameter. Dubbo 2.7.4 and later support passing this Nacos parameter through the registry address:

dubbo.registry.address=nacos://127.0.0.1:8848?namingLoadCacheAtStart=true

In production, it is recommended to enable this parameter to avoid service unavailability when the registry goes down. In service registration and discovery scenarios, where availability and consistency trade off against each other, availability is preferred most of the time.

Careful readers will also notice that besides the cache files, {USER_HOME}/nacos/naming/{namespace} contains a failover folder, which stores files in the same format as the snapshots. This is another failover mechanism of Nacos: while a snapshot reflects a service at some historical point in time, the files under failover can be modified by hand to cope with extreme scenarios.

This availability guarantee exists on the nacos-client side.

Heartbeat Synchronization Service

Heartbeats are widely used in distributed communication to confirm liveness. A heartbeat request is normally designed differently from an ordinary request: it is kept as lean as possible to avoid performance overhead from periodic probing. In Nacos, however, for the sake of availability, a heartbeat carries the full service information. Compared with sending only a liveness probe, this lowers throughput but improves availability. Consider the following two scenarios:

  • All nacos-server nodes go down and all service data is lost. Even after nacos-server restarts it cannot recover the services on its own, but heartbeats carrying the full service information can re-register the services on the next beat, preserving availability.
  • A nacos-server network partition occurs. Because a heartbeat can create a service, basic availability is preserved even under extreme network failures.
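The idea behind both scenarios can be sketched as follows: because each beat carries the full instance description, a server that lost its data can rebuild the instance from any single beat. The field names here are illustrative, not the exact Nacos wire format:

```python
def build_heartbeat(service_name, ip, port, metadata):
    """Client side: a beat that carries the full instance description,
    not just a liveness ping. Illustrative field names."""
    return {
        "serviceName": service_name,
        "ip": ip,
        "port": port,
        "metadata": metadata,  # full info travels with every beat
    }

def on_heartbeat(registry, beat):
    """Server side: if the instance is unknown (e.g. all data was lost),
    re-create it from the beat itself."""
    key = (beat["serviceName"], beat["ip"], beat["port"])
    if key not in registry:   # data lost or never seen: register from the beat
        registry[key] = beat
    return registry[key]
```

A leaner heartbeat that carried only an instance ID could confirm liveness, but could never rebuild a registry whose data was wiped; that is the availability trade-off described above.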

The following is a test of heartbeat-based service recovery, using a Nacos cluster provided by Alibaba Cloud MSE:

Call the OpenAPI to delete each service in turn:

curl -X DELETE "mse-xxx-p.nacos-ans.mse.aliyuncs.com:8848/nacos/v1/ns/service?serviceName=providers:com.alibaba.edas.boot.EchoService:1.0.0:DUBBO&groupName=DEFAULT_GROUP"

Refreshing about 5 seconds later, the service has been registered again, which matches our expectation that heartbeats re-register the service.

Cluster Deployment Mode High Availability

The final high availability features of Nacos are derived from its deployment architecture.

Number of Nodes

We know we cannot run Nacos in standalone mode in production, so the first question is: how many machines should be deployed? As mentioned earlier, Nacos uses two consistency protocols: Distro and Raft. Distro has no split-brain problem, so in theory two or more nodes suffice; Raft's voting mechanism calls for 2n+1 nodes. Overall, 3 nodes is the minimum, followed by 5, 7, or even 9 nodes for higher throughput and availability.
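The Raft arithmetic behind the 3/5/7/9 recommendation is simple majority math:

```python
def raft_quorum(n):
    """Smallest number of votes that forms a majority among n nodes."""
    return n // 2 + 1

def tolerated_failures(n):
    """Node failures a Raft cluster of n nodes can survive
    while still being able to elect a leader."""
    return (n - 1) // 2
```

Note that an even node count adds no fault tolerance: a 4-node cluster tolerates the same single failure as a 3-node cluster, which is why odd sizes are recommended.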

Multiple Availability Zone Deployment

When choosing the Nacos nodes that make up a cluster, two factors should be considered as far as possible:

  • Network latency between nodes should not be high; otherwise data synchronization will suffer.
  • To avoid a single point of failure, the nodes should be spread across different equipment rooms and availability zones.

Taking Alibaba Cloud ECS as an example, selecting different availability zones within the same region is a good practice.

Deployment Patterns

There are broadly two modes: deployment on K8s and deployment on ECS.

The advantage of ECS deployment is simplicity: buy three machines and build a cluster. If you are familiar with Nacos cluster deployment, this is not difficult, but it does not solve the operations problem: if a Nacos node hits an OOM or a disk failure, it is hard to remove it quickly, and self-healing operations are impossible.

The advantage of K8s deployment lies in its strong cloud-native operations capability: it can recover automatically after a node crash, keeping Nacos running smoothly. As mentioned earlier, Nacos is a stateful application, unlike a stateless web application, so when deploying on K8s, components such as StatefulSet and Operator are usually used to deploy and operate Nacos clusters.

High Availability Best Practices for MSE Nacos

Alibaba Cloud's Microservices Engine (MSE) provides hosted Nacos clusters and implements high availability at the cluster deployment level.

  • When a multi-node cluster is created, the nodes are assigned to different availability zones by default. This is transparent to users, who only need to care about Nacos functionality while MSE takes care of availability.
  • MSE deploys Nacos on K8s. Historically, some nodes have crashed due to user misuse of Nacos, but thanks to K8s's self-healing, crashed nodes are pulled back up so quickly that users may not even notice they went down.

Let’s simulate a node down scenario to see how K8s can self-recover.

A three-node Nacos cluster:

Run kubectl delete pod mse-7654c960-1605278296312-reg-center-0-2 to simulate a partial node failure.

About 2 minutes later, the node is restored and the roles have changed: the leader has switched from the killed node 2 to node 1.

Conclusion

This article has summarized, from multiple perspectives, how Nacos ensures high availability. High availability is not achieved merely by deploying more server nodes; it is the combination of client behavior, server deployment mode, and usage scenarios.

Especially in service registration and discovery scenarios, Nacos puts a great deal of effort into availability, offering guarantees that ZooKeeper does not necessarily provide. When it comes to registry selection, availability assurance is where Nacos clearly excels.

Contents of this series

  • An overview of the Spring Cloud microservices series of articles, updated continuously
