Brief introduction: 10x performance improvement, higher SLA assurance, and 20% off resource packs for new users during a flash sale.

The Microservice Engine (MSE) Professional edition has been released with support for Nacos 2.0. Compared to the Basic edition, the Professional edition offers a higher SLA guarantee (99.95% availability), ten times the performance, and further enhanced configuration capabilities. New users can get 20% off their first purchase.

Since the release of Nacos 1.0, Nacos has been rapidly adopted by thousands of businesses and has built a strong ecosystem. However, as usage deepened, some performance problems were gradually exposed, so we began designing the next-generation product, Nacos 2.0. After half a year of work it is now complete, and measured performance has improved by 10 times. I believe it can meet the performance requirements of all users. On behalf of the community, I would like to introduce this cross-generational product to you.

Nacos profile

Nacos is a dynamic service discovery, configuration management, and service management platform that makes it easier to build cloud-native applications. It was incubated at Alibaba and grew up through the Double Eleven events of the past decade, accumulating the core strengths of simplicity, reliability, and excellent performance.

Nacos 2.0 architecture

The new 2.0 architecture not only improves performance by a factor of 10, but also introduces a layered abstraction of the kernel and a plug-in extension mechanism.

The Nacos 2.0 architecture layers are shown below. The major changes compared to Nacos 1.x are:

  • The communication layer is unified on the gRPC protocol, and flow control and load balancing are improved on both the client and the server, raising overall throughput.
  • The storage and consistency model is fully abstracted and layered, the architecture is simpler and cleaner, the code is more robust, and the performance is more powerful.
  • Extensible interfaces are designed to improve integration capabilities, such as allowing users to extend their own security mechanisms.

Nacos 2.0 service discovery: upgraded consistency model

Under the Nacos 2.0 architecture, a client initiates service registration or subscription requests over gRPC. The server uses a Client object to record which services each client publishes and subscribes to over its gRPC connection, and synchronizes these Client objects between server nodes. However, what users actually work with is the mapping from a service to its client instances, that is, which instances belong to a service. Therefore, the 2.0 server builds indexes and metadata to quickly generate Service information similar to 1.x, and pushes the Service data over the gRPC stream.
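To make the index-building idea concrete, here is a minimal, self-contained sketch (not Nacos source code; all class and method names are illustrative) of how a server can derive the familiar "service -> instances" view from per-connection Client records, and why a dropped connection naturally removes everything that client published:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: derive a service -> instances index from Client records.
public class ClientServiceIndexer {

    // One record per gRPC connection: which instances this client publishes.
    static class Client {
        final String connectionId;
        final Map<String, String> publishedInstances = new HashMap<>(); // service -> "ip:port"
        Client(String connectionId) { this.connectionId = connectionId; }
        void publish(String service, String instance) { publishedInstances.put(service, instance); }
    }

    // Inverted index rebuilt from Client records: service -> set of instances.
    private final Map<String, Set<String>> serviceIndex = new ConcurrentHashMap<>();

    void index(Client client) {
        for (Map.Entry<String, String> e : client.publishedInstances.entrySet()) {
            serviceIndex.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).add(e.getValue());
        }
    }

    // Called when a connection drops: remove everything that client published.
    void remove(Client client) {
        for (Map.Entry<String, String> e : client.publishedInstances.entrySet()) {
            Set<String> instances = serviceIndex.get(e.getKey());
            if (instances != null) {
                instances.remove(e.getValue());
                if (instances.isEmpty()) serviceIndex.remove(e.getKey());
            }
        }
    }

    Set<String> instancesOf(String service) {
        return serviceIndex.getOrDefault(service, Collections.emptySet());
    }

    public static void main(String[] args) {
        ClientServiceIndexer indexer = new ClientServiceIndexer();
        Client a = new Client("conn-1");
        a.publish("order-service", "10.0.0.1:8080");
        Client b = new Client("conn-2");
        b.publish("order-service", "10.0.0.2:8080");
        indexer.index(a);
        indexer.index(b);
        System.out.println(indexer.instancesOf("order-service")); // [10.0.0.1:8080, 10.0.0.2:8080]
        indexer.remove(a); // connection closed -> its instances disappear
        System.out.println(indexer.instancesOf("order-service")); // [10.0.0.2:8080]
    }
}
```

Because the connection itself is the unit of liveness, tearing down one gRPC connection cleans up all of that client's registrations in one step, with no per-instance heartbeat bookkeeping.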

Nacos 2.0 configuration management: upgraded communication mechanism

Configuration management previously used HTTP/1.1 keep-alive connections, sending a heartbeat every 30 seconds to simulate a long connection. The protocol was hard to understand, consumed a lot of memory, and push performance was weak. Nacos 2.0 solves these problems with gRPC and significantly reduces memory consumption.
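A back-of-the-envelope model shows why the change matters. Under 30-second long polling, every client costs the server one full request/response cycle every 30 seconds whether or not anything changed; over a persistent gRPC stream, the server only sends a frame when a configuration actually changes. The numbers below (6,000 clients, an assumed 10 changes per hour) are illustrative assumptions, not measured Nacos data:

```java
// Back-of-the-envelope model, for illustration only.
public class PushCostModel {
    // Requests the server must process per hour under 30 s long polling:
    // one request/response per client every 30 seconds, change or no change.
    static long longPollingRequestsPerHour(long clients) {
        return clients * (3600 / 30);
    }

    // Frames pushed per hour over persistent streams:
    // one push per actual config change per subscriber.
    static long streamPushesPerHour(long clients, long changesPerHour) {
        return clients * changesPerHour;
    }

    public static void main(String[] args) {
        long clients = 6000;
        System.out.println(longPollingRequestsPerHour(clients)); // 720000
        System.out.println(streamPushesPerHour(clients, 10));    // 60000
    }
}
```

The gap widens further in practice, since each long-polling cycle also pays HTTP parsing and context-switching costs that a single held gRPC stream avoids.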

Nacos 2.0 architecture advantages

Nacos 2.0 significantly reduces resource consumption, improves throughput, optimizes client-server interaction, and is more user-friendly. Although observability decreases slightly, the overall cost-effectiveness is very high.

Nacos 2.0 performance improvements

Since Nacos consists of two modules, service discovery and configuration management, whose business models differ slightly, we introduce the specific stress-test metrics for each separately.

Performance improvements in Nacos 2.0 service discovery

In service discovery scenarios, we focus on the number of clients, service instances, and service subscribers, and on the server's push and steady-state performance at large scale. We are also concerned with system performance when a large number of services go online and offline.

Capacity and steady-state testing

This scenario focuses on system performance as the service scale and client instance scale increase.

As you can see, version 2.0.0 provides stable support at a scale of 100,000 clients, with very low CPU consumption once steady state is reached. Although some pushes time out during the initial mass-registration phase, due to the instantaneous burst of registrations and pushes, they succeed after retries without affecting data consistency.

In the 1.x version, at 100,000 and 50,000 clients the server is stuck in full GC, pushes fail completely, and the cluster is unavailable. At 20,000 clients the server runs normally, but slow heartbeat processing prevents it from reaching steady state: a large number of services are repeatedly removed and re-registered, so CPU stays high. At 12,000 clients, 1.x runs stably, but its steady-state CPU consumption is more than three times that of 2.0 serving a far larger client scale.

Frequent-change testing

This scenario focuses on the throughput and failure rate of each version when services are released at large scale and pushes happen frequently.

Under frequent changes, both 2.0 and 1.x remain stable after reaching steady state. Notably, 2.0 no longer suffers instantaneous push storms, so its push failure rate drops to 0, while the instability of 1.x's UDP push causes a very small fraction of pushes to time out and require retries.

Nacos 2.0 configuration management performance improvements

Because configuration is a write-few, read-many scenario, the bottleneck is mainly the number of listening clients a single server can hold and the throughput of configuration pushes and fetches. Therefore, the configuration-management stress tests focus on single-server connection capacity and large-scale push comparisons.

Nacos 2.0 connection capacity test

This scenario focuses on the system pressure at different client scales.

Nacos 2.0 can support up to 42,000 configuration client connections on a single server. During the connection-setup phase, the large volume of subscription requests keeps CPU consumption high, but after reaching steady state CPU consumption drops to almost nothing.

On Nacos 1.x, with 6,000 clients, steady-state CPU stays high and GC is frequent. The main cause is long polling, which keeps the connection alive by holding requests: the server must send a response every 30 seconds, after which the client reconnects and issues a new request. This requires heavy context switching and holding all pending requests and responses in memory. At 12,000 clients, 1.x cannot reach steady state and therefore cannot support that many clients.

Nacos 2.0 frequent-push test

This scenario focuses on the system performance at different push scales.

In the frequent-change scenario, both versions hold 6,000 client connections. The performance cost of version 2.0 is clearly much lower than that of 1.x; in the 3,000 TPS push scenario, the improvement is about threefold.

Nacos 2.0 performance conclusions

For service discovery scenarios, Nacos 2.0 runs stably at a scale of 100,000 clients; compared with the 12,000-client scale of Nacos 1.x, this is roughly a 10x improvement.

For configuration management scenarios, a single Nacos 2.0 server supports up to 42,000 client connections, 7 times more than Nacos 1.x. Push performance is also significantly better than in 1.x.

Nacos Ecology and 2.x follow-up planning

Over its three years of development, Nacos has come to support almost all open-source RPC frameworks and microservice ecosystems, and has led the development of the cloud-native microservice ecosystem.

Nacos is a core component of the entire microservice ecosystem. It can interoperate seamlessly with the Kubernetes service discovery system, delivering Nacos services to Sidecars by communicating with Istio over the MCP/XDS protocols. It can also federate with CoreDNS to expose Nacos services to downstream callers via domain names.

Nacos has been integrated with various microservice RPC frameworks for service discovery. In addition, it can work with the high-availability framework Sentinel to control and distribute various governance rules.

Using an RPC framework alone is sometimes not simple enough, because some RPC frameworks, such as gRPC and Thrift, require you to start the server yourself and tell the client which IP to call. They therefore need to be integrated with application frameworks such as SCA or Dapr. Alternatively, traffic can be managed with an Envoy Sidecar, so the application's RPC layer does not need to know the service's IP list.

Finally, Nacos can also integrate with various microservice gateways to implement access-layer routing and microservice invocation.

Nacos ecosystem practice at Alibaba

At present, Nacos has completed a "trinity" of in-house development, open source, and commercialization. Alibaba's internal businesses such as DingTalk, Kaola, Ele.me, and Youku have all adopted the Nacos service in the cloud product MSE and seamlessly integrated with Alibaba's cloud-native technology stack. Let's take DingTalk as a brief example.

Nacos runs on the Microservice Engine MSE (a fully managed Nacos cluster), which handles maintenance and multi-cluster management. The business's Dubbo3 or HSF services register themselves with the Nacos cluster through Dubbo3 at startup, and Nacos then synchronizes the service information to Istio and the Ingress-Envoy gateway via the MCP protocol.

When northbound user traffic enters the group's VPC network, the Ingress-Tengine gateway provides unified access, resolving domain names and routing requests to different data centers and units. The gateway is based on Nginx Core 1.18.0 and supports Dubbo, DTLSv1 and DTLSv1.2, and Prometheus, improving the completeness, security, and observability of the Alibaba Cloud microservice ecosystem.

After passing through the unified access-layer gateway, the user request is forwarded to the corresponding microservice through the Ingress-Envoy microservice gateway. If a service in another network domain needs to be called, the Ingress-Envoy gateway routes the traffic to the corresponding VPC network, bridging services across different security, network, and business domains.

Microservices call each other either through an Envoy Sidecar or through traditional self-subscription. Through these mutual invocations, the user request is completed and the result is returned to the user.

Planning for Nacos 2.x

Nacos 2.x will build on 2.0's performance fixes, implementing new features through plug-ins and reworking a large number of old features, making Nacos more convenient and easier to extend.

Conclusion

Nacos 2.0, as a cross-generational release, completely solves the performance problems of Nacos 1.x, improving performance by a factor of 10. Its abstraction and layering make the architecture simpler, and its plug-in mechanism makes it easier to extend, allowing Nacos to support more scenarios and integrate with a broader ecosystem. We believe that in subsequent iterations, Nacos 2.x will become even easier to use, solve more microservice problems, and explore service mesh integration more deeply.

Copyright Notice: The content of this article is voluntarily contributed by real-name registered users of Alibaba Cloud, and the copyright belongs to the original author. The Alibaba Cloud developer community does not own the copyright and does not bear the corresponding legal responsibility. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find any suspected plagiarized content in this community, please fill in the infringement complaint form to report it. Once verified, the community will immediately delete the suspected infringing content.