Technology selection

The company’s RPC framework is Dubbo, and the service discovery component paired with it has always been ZooKeeper, which worked without problems for a long time. The reason to consider replacing ZooKeeper is not a performance bottleneck, but the evolution toward cloud native.

The Cloud Native Computing Foundation (CNCF) defines cloud native as:

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.

If you want to go cloud native, the key step is the mesh. The mesh under the most discussion at present is the service mesh, which can be described in one sentence:

Service Mesh is the TCP protocol of the microservices era.

The most obvious advantage of a service mesh over today’s microservices is that it pushes infrastructure down out of the application, separating service governance from business development.

A simple example: to do service governance well in a microservice architecture, you have to introduce a large number of third-party components for rate limiting, circuit breaking, monitoring, service discovery, load balancing, distributed tracing, and so on. These components most likely enter the business code as JAR dependencies, and every development language must maintain its own set of them. This coupling of business code and infrastructure makes service governance extremely difficult.

A Service Mesh handles communication between services and usually consists of a set of mesh proxies that are transparent to the application. Everything in service governance (including, but not limited to, the rate limiting, circuit breaking, monitoring, and service discovery described above) can sink into the mesh proxy.

With all that said, what does it have to do with replacing ZooKeeper?

There are three major components in Dubbo’s design: the Provider, the Consumer, and the Registry.

After a Provider starts, it registers its IP, port, service name, method names, and other information with the Registry. When a Consumer initiates a call, it looks up the service name and method name in the Registry, finds the corresponding IP and port, and makes the remote call.
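
To make this concrete, here is a minimal sketch of a provider registering itself through Dubbo’s API (Dubbo 2.7 style; the GreetingService interface and the addresses are illustrative, not from the article). Note that the Registry is addressed by a URL, which is what makes swapping the registry implementation underneath feasible:

```java
import org.apache.dubbo.config.ApplicationConfig;
import org.apache.dubbo.config.RegistryConfig;
import org.apache.dubbo.config.ServiceConfig;

public class ProviderBootstrap {

    // Illustrative service contract.
    public interface GreetingService {
        String greet(String name);
    }

    public static void main(String[] args) {
        ServiceConfig<GreetingService> service = new ServiceConfig<>();
        service.setApplication(new ApplicationConfig("demo-provider"));
        // The registry is just a URL: switching from ZooKeeper to Nacos
        // means changing "zookeeper://127.0.0.1:2181" to "nacos://127.0.0.1:8848".
        service.setRegistry(new RegistryConfig("zookeeper://127.0.0.1:2181"));
        service.setInterface(GreetingService.class);
        service.setRef(name -> "hello, " + name);
        service.export(); // registers IP, port, service and method info with the Registry
    }
}
```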

The service registration and discovery mechanism in the cloud native world is very different from Dubbo’s. Cloud native is built on container orchestration, and mainstream Kubernetes service registration and discovery is DNS-based: you resolve a domain name to find the corresponding IP.

This makes migrating Dubbo services into a cloud native system difficult. So is there a component compatible with both styles of service registration and discovery? After some research: Nacos.

First, Nacos, like ZooKeeper, provides traditional microservice registration and discovery. Second, Nacos also provides DNS-F (DNS Filter), a plugin based on CoreDNS. DNS-F acts as a proxy that intercepts DNS requests on the host: if the service can be found in Nacos, the registered IP is returned directly; otherwise the lookup continues to normal DNS.

With that in place, our Service Mesh architecture looks like this:

The access policies inside and outside the mesh are as follows:

  • Dubbo outside the mesh -> Dubbo inside the mesh: registry
  • Dubbo outside the mesh -> Dubbo outside the mesh: registry
  • Dubbo inside the mesh -> Dubbo outside the mesh: domain name => DNS-F => registry
  • Dubbo inside the mesh -> Dubbo inside the mesh: domain name => DNS-F => DNS
  • Heterogeneous languages (such as PHP and Node) can invoke services directly using the service name as a domain name; DNS-F intercepts the request and resolves it to the correct IP, and the load balancing policy can be adjusted

The high availability and write scalability of the Nacos AP model, its CMDB integration, and other features were also considered.

The migration plan

For a smooth migration from ZooKeeper to Nacos, there are two options:

  • Change Dubbo applications to dual registration (registering with both ZooKeeper and Nacos at the same time), and switch over to Nacos once all applications have been modified
  • Use a migration tool to synchronize all services registered in ZooKeeper to Nacos, then modify applications gradually; the new features of Nacos can be enjoyed without waiting for the migration to complete

Option 1 is simple to implement, but the modification cost is high, and some old, unmaintained services are hard to migrate. At the company level, PHP, Node, and other services also rely on ZooKeeper, so migrating everything at once is impossible.

Option 2 requires a capable migration tool, which adds technical complexity, but Nacos provides one: NacosSync.

We naturally chose option 2, with a small optimization. To reduce migration risk, we used Dubbo’s excellent extensibility to build a custom dynamic registry. On service startup, the dynamic registry reads configuration from the config center and chooses to register with ZooKeeper or Nacos (or both), and chooses either Nacos or ZooKeeper for consumption; a consumer can only use one registry.

The default is to register with both registries at the same time and consume from ZooKeeper. Once the JAR package is introduced, applications are unaware of the switchover; only the configuration needs to change at switchover time.
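
The sketch below shows the idea, assuming the flags are fetched from a config center at startup; the names and wiring are illustrative, and the real version plugs into Dubbo through its SPI (RegistryFactory) rather than manual construction:

```java
import java.util.List;

import org.apache.dubbo.common.URL;
import org.apache.dubbo.registry.NotifyListener;
import org.apache.dubbo.registry.Registry;

// Delegates to ZooKeeper and/or Nacos registries based on configuration.
public class DynamicRegistry implements Registry {

    private final Registry zookeeper;       // delegate built from zookeeper://...
    private final Registry nacos;           // delegate built from nacos://...
    private final boolean registerZk;       // e.g. config center key: register.zookeeper
    private final boolean registerNacos;    // e.g. config center key: register.nacos
    private final boolean consumeFromNacos; // a consumer uses exactly one registry

    public DynamicRegistry(Registry zookeeper, Registry nacos, boolean registerZk,
                           boolean registerNacos, boolean consumeFromNacos) {
        this.zookeeper = zookeeper;
        this.nacos = nacos;
        this.registerZk = registerZk;
        this.registerNacos = registerNacos;
        this.consumeFromNacos = consumeFromNacos;
    }

    @Override
    public void register(URL url) {
        // Default behavior: dual registration, so either side can be switched later.
        if (registerZk) zookeeper.register(url);
        if (registerNacos) nacos.register(url);
    }

    @Override
    public void unregister(URL url) {
        if (registerZk) zookeeper.unregister(url);
        if (registerNacos) nacos.unregister(url);
    }

    @Override
    public void subscribe(URL url, NotifyListener listener) {
        consumerSide().subscribe(url, listener); // consumption is single-sourced
    }

    @Override
    public void unsubscribe(URL url, NotifyListener listener) {
        consumerSide().unsubscribe(url, listener);
    }

    @Override
    public List<URL> lookup(URL url) {
        return consumerSide().lookup(url);
    }

    private Registry consumerSide() {
        return consumeFromNacos ? nacos : zookeeper;
    }

    @Override public URL getUrl() { return consumerSide().getUrl(); }
    @Override public boolean isAvailable() { return consumerSide().isAvailable(); }
    @Override public void destroy() { zookeeper.destroy(); nacos.destroy(); }
}
```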

Migration tool optimization

The principle of NacosSync is very simple. To synchronize data from ZooKeeper to Nacos, NacosSync starts as a ZooKeeper client, pulls down all services in ZooKeeper, parses them into the Nacos service format, and registers them with Nacos. At the same time, it watches each service node and updates the data in Nacos whenever something changes.
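
Roughly, the mechanism looks like this sketch against the Curator and Nacos client APIs; the path layout and URL parsing are simplified assumptions, not NacosSync’s actual code:

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.CuratorCache;
import org.apache.curator.framework.recipes.cache.CuratorCacheListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkToNacosSync {

    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        NamingService nacos = NamingFactory.createNamingService("127.0.0.1:8848");

        // Dubbo providers live under /dubbo/<interface>/providers/<url-encoded URL>;
        // CuratorCache watches the subtree recursively.
        CuratorCache cache = CuratorCache.build(zk, "/dubbo");
        cache.listenable().addListener(CuratorCacheListener.builder()
                .forCreates(node -> sync(nacos, node.getPath(), true))
                .forDeletes(node -> sync(nacos, node.getPath(), false))
                .build());
        cache.start();
        Thread.currentThread().join();
    }

    private static void sync(NamingService nacos, String path, boolean up) {
        if (!path.contains("/providers/")) return; // mirror provider nodes only
        try {
            // e.g. dubbo%3A%2F%2F10.0.0.1%3A20880%2Fcom.xx.yy.zz%3Fmethods%3D...
            URI url = URI.create(URLDecoder.decode(
                    path.substring(path.lastIndexOf('/') + 1), StandardCharsets.UTF_8));
            String service = "providers:" + url.getPath().substring(1);
            if (up) nacos.registerInstance(service, url.getHost(), url.getPort());
            else nacos.deregisterInstance(service, url.getHost(), url.getPort());
        } catch (Exception e) {
            e.printStackTrace(); // real code must retry and alert here
        }
    }
}
```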

Unidirectional synchronization policy

NacosSync supports bidirectional synchronization between ZooKeeper and Nacos, but we felt bidirectional sync was risky. After all, Nacos was new to us and its stability unproven; if bad data in Nacos were synchronized back to ZooKeeper, it would cause a production incident. We therefore adopted a conservative one-way synchronization strategy from ZooKeeper to Nacos.

High availability

As a migration tool that stays online for a long time, NacosSync must guarantee its own stability and high availability. Imagine the migration tool going down: every synchronized service would drop out of Nacos, which would be a devastating blow. NacosSync sinks its data store into a database; the component itself is stateless and can be deployed on multiple nodes, preventing a single point of failure. But this brings another problem: deploying N nodes puts N times the pressure on the Nacos server, because each service is registered N times and, on every modification, updated N times. The optimization for this is described later.

Full synchronization support

NacosSync does not support full synchronization; services can only be configured one by one. With 3k+ services, configuring them manually is out of the question. Developing a full-sync configuration was easy, and very useful.

ZooKeeper events processed out of order

NacosSync watches the ZooKeeper nodes and synchronizes the changed data to Nacos whenever a node changes.

However, during testing we found that when a Dubbo service goes offline without a graceful shutdown (for example, the process is killed with kill -9), ZooKeeper evicts the node within seconds to minutes (depending on configuration). If the service re-registers in the meantime, the remove event for the old node may arrive after the add event for the new node, which leads to a serious problem: the newly registered service is deregistered by the stale remove event.

The solution is relatively simple: the node information Dubbo registers contains a millisecond timestamp. Each time an event is processed, the timestamp is compared: if it is greater than or equal to the current value, the event is considered valid; otherwise it is treated as a stale event and discarded.
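
A minimal sketch of this guard, assuming instances are keyed by ip:port and the timestamp comes from the Dubbo URL’s timestamp parameter (names illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

public class StaleEventFilter {

    // Highest registration timestamp processed so far, per instance key ("ip:port").
    private final ConcurrentHashMap<String, Long> lastSeen = new ConcurrentHashMap<>();

    /** Returns true if the event should be applied, false if it is stale. */
    public boolean accept(String instanceKey, long eventTimestampMillis) {
        // Atomically keep the max; the event is valid when its timestamp is
        // greater than or equal to the current value, otherwise it is discarded.
        long winner = lastSeen.merge(instanceKey, eventTimestampMillis, Math::max);
        return winner == eventTimestampMillis;
    }
}
```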

With this check in place, the problem never recurred.

Active heartbeat detection

Since two NacosSync instances are deployed, if one of them hits an unknown exception while a service goes offline, the service stays registered in Nacos even though it is already unavailable, and calls to it then fail.

To avoid this, NacosSync adds an active check on machine ports: at intervals it attempts to establish a connection to every machine; if the connection fails, it checks whether the node still exists in ZooKeeper, and only if it does not exist is the machine removed.

Why not delete an instance directly when the heartbeat check fails? Sometimes the server refuses the connection or times out while the service is still online, which is why ZooKeeper is consulted. As for why we don’t just scan ZooKeeper directly: we were worried about ZooKeeper’s performance, and a botched scan would be a major incident.
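
A sketch of the check, with the ZooKeeper lookup and the Nacos deregistration left as placeholders:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ActiveChecker {

    // Runs periodically against every known instance.
    void checkInstance(String ip, int port) {
        if (portReachable(ip, port, 1000)) return;
        // The connect failed, but that alone is not proof of death: the server
        // may be refusing connections or timing out while still serving.
        // ZooKeeper is the source of truth, so only deregister if the node
        // is gone there too.
        if (!existsInZooKeeper(ip, port)) {
            deregisterFromNacos(ip, port);
        }
    }

    static boolean portReachable(String ip, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(ip, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    boolean existsInZooKeeper(String ip, int port) { /* placeholder */ return true; }

    void deregisterFromNacos(String ip, int port) { /* placeholder */ }
}
```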

Nacos optimization

With the migration tool optimized, we started synchronizing all online services to Nacos.

At first we built a three-node Nacos cluster. Because of the large number of services, the CPU of the Nacos machines stayed very high for long stretches, above 50%. Here is a set of data for reference:

  • Services: 3k+
  • Service instances: 30k+
  • NacosSync nodes: 2
  • Nacos nodes: 3 (50% to 80% CPU usage)

Improving monitoring

How to optimize? First, find the performance bottleneck. Nacos ships with Spring Boot-based monitoring, but it was very weak and lacked the data we wanted, so we improved monitoring on both the client and server sides. Here are the monitoring metrics I consider important:

  • Nacos server: CPU usage, number of services, number of instances, requests received (broken down by API), request latency (broken down by API), heartbeat rate, push latency (built in), push volume (built in)
  • Nacos client: request volume (broken down by API), request latency (broken down by API), heartbeat rate

Heartbeat optimization

With the monitoring in place, the bottleneck was visible at a glance: there were far too many heartbeat requests; 99% of all requests were heartbeats.

This stems from how Nacos and Dubbo are designed. Dubbo registers at the service dimension, so one IP registers many service instances, while the Nacos heartbeat is at the instance dimension, defaulting to one heartbeat per instance every 5 seconds.

With close to 40k instances, that is 40k heartbeat requests every 5 seconds, or 8k QPS; with two NacosSync instances the heartbeats double to 16k/s. These are HTTP requests, and on top of them come the internal data synchronization tasks between nodes, so it would be hard for the CPU not to be high.

So we came up with a series of ways to optimize:

Adjusting the heartbeat interval

We adjusted the heartbeat interval to double the default, i.e. 10 seconds, and at the same time adjusted the no-heartbeat eviction timeout (from 30s to 60s), sacrificing some timeliness in detecting offline instances.
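
In the Nacos 1.x client these times can be overridden per instance via reserved metadata keys; a sketch of what that looks like (whether the team configured this per instance or on the server side is not stated in the article):

```java
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.PreservedMetadataKeys;
import com.alibaba.nacos.api.naming.pojo.Instance;

public class SlowerHeartbeatRegistration {

    public static void main(String[] args) throws Exception {
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

        Instance instance = new Instance();
        instance.setIp("10.0.0.1");
        instance.setPort(20880);
        // All values are in milliseconds.
        instance.getMetadata().put(PreservedMetadataKeys.HEART_BEAT_INTERVAL, "10000"); // 5s -> 10s
        instance.getMetadata().put(PreservedMetadataKeys.IP_DELETE_TIMEOUT, "60000");   // evict: 30s -> 60s

        naming.registerInstance("providers:com.xx.yy.zz", instance);
    }
}
```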

Scaling out

We scaled the Nacos cluster from 3 servers to 5. It helped, but not noticeably.

Reducing heartbeats

Since services are migrated gradually, if a migrated service sends its own heartbeat while the two NacosSync instances also send heartbeats for it, that service generates three times the heartbeat requests. It also introduces a risk: if any one of the three fails to remove the instance after the service goes offline, the service stays online.

So in the dynamic registry mentioned above, we add a metadata entry withNacos=true for services registered with Nacos, and modify the logic of NacosSync to ignore ZooKeeper-synchronized services carrying withNacos=true. This way a migrated service heartbeats only for itself, reducing heartbeat requests.
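
Conceptually, the NacosSync side of this is a single check on the Dubbo provider URL; a sketch (the parameter name is from the article, the rest is illustrative):

```java
import org.apache.dubbo.common.URL;

public class WithNacosFilter {

    /** Returns true if NacosSync should mirror this ZooKeeper provider URL to Nacos. */
    static boolean shouldSync(URL providerUrl) {
        // The dynamic registry adds withNacos=true when it registers the service
        // with Nacos directly; NacosSync must then leave that service alone,
        // otherwise the instance gets triple heartbeats (itself + two syncs).
        return !providerUrl.getParameter("withNacos", false);
    }
}
```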

Merging heartbeats

A large number of services are registered through NacosSync; by the earlier calculation it sends about 8k heartbeats per second. If these heartbeats could be merged, the network cost of heartbeats would drop dramatically, and batch processing on the server side would also speed things up.

Before implementing heartbeat merging, it is necessary to understand the Distro protocol used in Nacos AP mode; for that, see “Introduction to the Nacos Consistency Protocol Distro”.

A brief summary of the processing path of a single heartbeat:

The client randomly selects a Nacos node and sends the heartbeat for a given service instance. Since each Nacos server node is responsible for only part of the services, it checks on receiving the request whether it is responsible for that service: if so, it processes it; if not, it forwards it to the responsible node.

Routing a single heartbeat is easy. For merged heartbeats, the server must partition the received batch by responsible node: after partitioning, it processes only its own services and forwards the rest, in batches, to the other nodes. Care must be taken to distinguish requests arriving from clients from those forwarded by peer servers.

On the sending side, heartbeats are buffered and flushed in batches: if the buffer is too small, heartbeats may be lost; if it is too large, it consumes too much memory.
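
A sketch of the server-side partitioning, with a simple hash standing in for Nacos’s actual Distro ownership mapping and all names illustrative; the forwarded flag is what distinguishes client-originated batches from batches relayed by a peer:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchBeatHandler {

    private final String self;
    private final List<String> servers; // healthy Nacos nodes, consistently ordered

    public BatchBeatHandler(String self, List<String> servers) {
        this.self = self;
        this.servers = servers;
    }

    // Distro-style ownership: each node is responsible for a slice of the services.
    String responsibleNode(String serviceName) {
        return servers.get(Math.floorMod(serviceName.hashCode(), servers.size()));
    }

    void handle(List<Beat> beats, boolean forwarded) {
        Map<String, List<Beat>> byOwner = new HashMap<>();
        for (Beat beat : beats) {
            byOwner.computeIfAbsent(responsibleNode(beat.serviceName), k -> new ArrayList<>())
                   .add(beat);
        }
        for (Map.Entry<String, List<Beat>> entry : byOwner.entrySet()) {
            if (entry.getKey().equals(self)) {
                entry.getValue().forEach(this::renewInstance); // our own services
            } else if (!forwarded) {
                // Only client-originated batches are re-forwarded; a batch that
                // already came from a peer must not bounce around the cluster.
                forwardBatch(entry.getKey(), entry.getValue());
            }
        }
    }

    void renewInstance(Beat beat) { /* refresh the instance's health TTL */ }

    void forwardBatch(String node, List<Beat> beats) { /* batched call to the peer */ }

    public static class Beat {
        final String serviceName;
        final String ip;
        final int port;

        public Beat(String serviceName, String ip, int port) {
            this.serviceName = serviceName;
            this.ip = ip;
            this.port = port;
        }
    }
}
```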

The effect of the merge was immediate: not only did server CPU drop from 50%+ to below 10%, NacosSync’s CPU consumption also fell by half.

Only this graph was saved; CPU was around 20% at the time because of a bug, and after the fix it stayed within 10%.

Long-lived connections

At this point the heartbeat problem was only half solved, because batched heartbeats pay off most when, as in NacosSync, one sender carries a large number of services. An ordinary server with only about 10 instances can save at most a few heartbeat requests per second, so the effect there would be limited. Analyzing where each request’s time went, HTTP connection setup and the web filter chain looked like the main per-request costs, so I wanted to see the effect of switching to a long-lived connection.

To verify the guess quickly, we changed only the heartbeat interface, which has by far the highest request volume; that alone addresses 80% of the problem.

For the long-lived connection we originally considered two options: Netty and gRPC. For quick verification we settled on gRPC. Besides, the DNS-F mentioned above is itself a Nacos client implemented in Go, which supports gRPC natively, so we implemented a version with gRPC without hesitation.

A configuration option selects between native heartbeats, batched heartbeats, and gRPC heartbeats.

One problem encountered was the Distro protocol: a single heartbeat is forwarded to the responsible node through Nacos’s built-in internal relay. Implementing Distro the same way over long-lived connections would be more complicated, requiring long-lived connections to be maintained inside the cluster. Instead, we considered writing the logic into the client, Redis-style: when a node is not responsible for a service, it redirects the request, and the client re-initiates the request to the responsible node.

At the start, the client randomly selects a node to send to. When it receives a redirect, it caches the target node for that service, and clears the cache the next time it encounters a redirect or an error. This keeps node selection correct, the logic is not very complicated, and it was relatively simple to implement in DNS-F as well.
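
A sketch of that client-side behavior, with the transport abstracted behind send() and all names illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

public class RedirectAwareClient {

    private final List<String> nodes;                                    // all Nacos nodes
    private final Map<String, String> owner = new ConcurrentHashMap<>(); // service -> cached responsible node

    public RedirectAwareClient(List<String> nodes) {
        this.nodes = nodes;
    }

    public Reply heartbeat(String service, byte[] beat) {
        String node = owner.getOrDefault(service,
                nodes.get(ThreadLocalRandom.current().nextInt(nodes.size())));
        Reply reply = send(node, service, beat);
        if (reply.redirectTo != null) {
            owner.put(service, reply.redirectTo);          // remember the responsible node
            reply = send(reply.redirectTo, service, beat); // re-send once
        }
        if (reply.error) {
            owner.remove(service); // forget the cached owner on any error
        }
        return reply;
    }

    // Transport stub; the real implementation was a gRPC call.
    private Reply send(String node, String service, byte[] beat) {
        return new Reply();
    }

    public static class Reply {
        String redirectTo; // set when the receiving node does not own the service
        boolean error;
    }
}
```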

The final measurement: with NacosSync fully on gRPC heartbeats, CPU was slightly higher than with batched heartbeats, but not by much. And these are single heartbeat sends, which is good enough to show that even after services are fully migrated (each instance sending its own heartbeats), efficiency comes close to batched sending.

Long-lived connections for key interfaces

Having tasted the sweetness of long-lived connections for heartbeats, we moved several other important interfaces, such as service registration and service pulling, to long-lived connections, and adapted DNS-F to them as well. The effect was very good, meeting our performance expectations for Nacos.

Graceful online/offline

Nacos offers a graceful offline interface for taking a service offline, but it works at the instance dimension. That is unfriendly to an internal release system, which does not know which services run on a machine. So we provided an IP-dimension offline interface, which can also be understood as a batch offline interface. Its implementation resembles the batch heartbeat interface, and it likewise needs careful handling of the Distro protocol.
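
For illustration only, here is what IP-dimension offlining amounts to, expressed against the public Nacos 1.x client API; the article’s version was a dedicated batch interface on the server with Distro-aware routing, not a client-side loop like this:

```java
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

public class OfflineByIp {

    public static void main(String[] args) throws Exception {
        String ip = args[0]; // the machine being taken out of rotation
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

        // Walk every service and drop all instances that live on this IP.
        for (String service : naming.getServicesOfServer(1, Integer.MAX_VALUE).getData()) {
            for (Instance instance : naming.getAllInstances(service)) {
                if (ip.equals(instance.getIp())) {
                    naming.deregisterInstance(service, instance.getIp(), instance.getPort());
                }
            }
        }
    }
}
```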

DNS-F improvements

Long-lived connections

This has been mentioned above and will not be repeated.

Invalid Dubbo service domain names

The services Dubbo registers in Nacos have names of the form:

providers:com.xx.yy.zz

Normally we would use this service name as a domain name to make the Dubbo call. However, a colon is not a valid character in a domain name, and resolving this name directly fails. So we changed the DNS-F code so that calls use

providers.com.xx.yy.zz

DNS-F internally replaces the providers. prefix with providers: before the lookup, so the change is minimal.
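
DNS-F itself is written in Go, but the rewrite is a trivial prefix mapping; in Java it would look like this sketch (prefixes per the article):

```java
public class DnsNameMapper {

    /** Maps the DNS-safe name back to the Dubbo service name before the Nacos lookup. */
    static String toDubboServiceName(String domain) {
        // "providers.com.xx.yy.zz" -> "providers:com.xx.yy.zz"
        return domain.startsWith("providers.")
                ? "providers:" + domain.substring("providers.".length())
                : domain;
    }
}
```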

High availability

Because DNS-F itself runs on each machine as an agent process, high availability is ensured in two ways:

  • Monitor the DNS-F process and pull it back up promptly if it stops
  • Deploy a centralized DNS-F cluster: if the local DNS-F is unavailable, DNS queries go through the DNS-F cluster first and then fall back to the normal DNS cluster

Finally

As a relatively new open source component, Nacos is bound to present all kinds of problems in use. This article has focused on the more important pitfalls the author hit while migrating from ZooKeeper to Nacos; I hope it helps. Of course, many more details could not be included for reasons of space.