Author: Guo Yun (head of the Yanxuan middle-platform technology team, head of containerization at NetEase)

He joined NetEase in 2008 and has long worked in front-line research, development, and engineering management. He specializes in technical architecture and technical system construction, and has participated in or led the development of core products such as NetEase Blog, NetEase Mailbox, and NetEase Money. He now leads the Yanxuan middle-platform technology team, responsible for Yanxuan's containerization and Service Mesh evolution.

Background

Yanxuan's exploration and practice of Service Mesh can be roughly divided into the following stages:

The first stage: Exploratory period (late 2015 ~ April 2016)

From its internal incubation at NetEase at the end of 2015 to its official launch in April 2016, NetEase Yanxuan's technical team was very small, about 10 people. The core business used a monolithic architecture and depended on a small number of basic services, such as push, file storage, and the message center.

During this period, if we widen the view to the NetEase Mail business unit that incubated Yanxuan, the mainstream architecture at the time was service-oriented architecture (SOA), but implementations were not uniform: some services were provided through a centralized ESB, others through the decentralized Spring Cloud framework.

Whether it was the ESB or a distributed service framework represented by Spring Cloud, there were typical problems to solve. As a phenomenon-level e-commerce product, Yanxuan's foreseeable business complexity forced us to take this key technology selection seriously; a careless choice could bring very large technical debt.

During infrastructure selection, we mainly considered three dimensions:

  • Service governance: RPC framework vs. service governance platform

    Should service governance capabilities be provided through an RPC framework or through a platform? Integrating governance capabilities into the business via a framework was still mainstream at the time, but practice inside and outside the company showed that this approach leaves many problems to be solved. Framework upgrades are especially painful: business teams and the middleware team have different priorities, so upgrades are often hard and time-consuming to push through, which on one hand disrupts the business's iteration rhythm and can even affect service quality, and on the other severely constrains the pace of middleware evolution.

  • Multi-language support: Java-only vs. multi-language stacks

    Should service governance capability building consider non-Java technology stacks? Yanxuan's core business uses the Java stack, but there are still many non-Java systems: the recommendation service uses Python, the access service uses C++, and there are a large number of Node.js applications. The Java ecosystem is comparatively much richer; building consistent governance capabilities for every language stack would require significant R&D investment, and without that investment, the governance weaknesses of those stacks would likely become weaknesses of the entire system.

  • Open source vs. DIY

    Whichever infrastructure you adopt, you need to answer two questions: 1) build from scratch, or extend a mature open source project? 2) If building from scratch, are we reinventing the wheel, and what extra value can the community bring to the company?

The second stage: small-scale trial period (April 2016 to early 2017)

In July 2016, we released the first-generation Service Mesh architecture and piloted it in NetEase Mailbox, NetEase Money, and some NetEase Yanxuan businesses, achieving good results and accumulating valuable operations experience. Meanwhile, the management and control platform basically took shape.

The third stage: full rollout period

With the continuous growth of Yanxuan's business scale and increasing business complexity, the team also grew rapidly: from a dozen people at the start to 50 in 2016, and quickly past 200 in 2017.

Since the beginning of 2017, Yanxuan's first-generation Service Mesh architecture was gradually rolled out and eventually fully deployed. In 2019, the NetEase Qingzhou microservice platform, built on the container cloud, gradually matured, and Yanxuan officially launched its cloud strategy; the Service Mesh architecture, as the core technology of the cloud application system, entered a comprehensive upgrade stage.

Today's sharing covers three parts:

  • Yanxuan's Service Mesh architecture evolution

  • Implementation of hybrid cloud architecture

  • Planning and Outlook

Yanxuan's Service Mesh evolution

Yanxuan's first-generation Service Mesh architecture

Yanxuan's first-generation Service Mesh architecture was built by extending Consul and Nginx:

  • Consul

    A service networking solution that provides basic service governance capabilities such as service discovery, service registration, and service routing

  • Nginx

    A high-performance reverse proxy server with load balancing, rate limiting, and fault tolerance features, and good extensibility

The built-in features of Consul and Nginx basically meet our service governance requirements. Therefore, our main work was to integrate Consul and Nginx into a local proxy (code name: cNginx) and to develop a management and control platform to surface these capabilities.
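
To make the cNginx idea concrete, here is a minimal sketch of the kind of glue such an integration performs: reading Consul's health-checked service catalog and rendering the healthy instances into an Nginx upstream block. All names (`render_upstream`, the instance dict shape) are illustrative assumptions, not Yanxuan's actual implementation.

```python
def render_upstream(service_name, instances):
    """Render an Nginx upstream block from Consul-style health-check results.

    instances: list of dicts with 'address', 'port', and Consul check 'status'.
    """
    # Only instances whose Consul health check is "passing" are routable
    healthy = [i for i in instances if i.get("status") == "passing"]
    if not healthy:
        raise ValueError(f"no healthy instances for {service_name}")
    lines = [f"upstream {service_name} {{"]
    for inst in healthy:
        # max_fails/fail_timeout give Nginx basic passive failover on top
        # of Consul's active health checking
        lines.append(
            f"    server {inst['address']}:{inst['port']} "
            f"max_fails=3 fail_timeout=10s;"
        )
    lines.append("}")
    return "\n".join(lines)

instances = [
    {"address": "10.0.0.1", "port": 8080, "status": "passing"},
    {"address": "10.0.0.2", "port": 8080, "status": "critical"},  # filtered out
]
print(render_upstream("push-service", instances))
```

In a real deployment this rendering would run continuously (e.g., driven by Consul's blocking queries) followed by an Nginx reload, which is what makes the proxy track service registration and deregistration.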

Take a look at the overall structure:

  • Data plane

    cNginx and the Consul client together constitute our Sidecar, deployed in Client Sidecar mode

  • Control plane

    On the control plane, we provide service registration/discovery, invocation control, and governance control

Service governance capability

From a functional perspective, the Service Mesh architecture provides basic service governance capabilities such as service registration/discovery, health checking, routing control, load balancing, failover, caller-side rate limiting, timeout control, and retry. Other service governance capabilities such as access control, resource isolation, monitoring, and fault diagnosis are handled through middleware or logging platforms (as shown in the figure).

The way these service governance capabilities were built up also reflects the core idea of Yanxuan's technology platform construction at this stage: go from point to area, run in small steps, and continuously fill out the capability matrix.

Architectural benefits of Service Mesh for Yanxuan

So what architectural benefits did the practice and implementation of the Service Mesh architecture actually bring? I believe this is the question everyone cares about.

First, it overcame Yanxuan's historical baggage: the Service Mesh architecture lets existing services gain service governance capabilities without modification.

After Yanxuan launched in 2016, its business and team grew very fast while its technical infrastructure clearly lagged behind, which caused a situation where:

  • Because the Yanxuan technology teams were not fully integrated, there were obvious differences in technology stack choices

  • At the same time, each team's understanding of service governance was inconsistent, which on one hand led to uneven service quality, and on the other led to some reinvented wheels, invisibly increasing the cost of horizontal collaboration between teams

As an infrastructure layer, Service Mesh handles and manages communication between services. Its non-invasive nature means neither the initial rollout nor subsequent upgrades require service modification, greatly reducing adoption resistance and freeing up the R&D teams' productivity.

Second, it greatly reduces the R&D investment and evolution cost of middleware, as well as the coupling cost between business and middleware.

With the Service Mesh architecture Yanxuan adopted, many service governance capabilities that used to rely on traditional middleware (such as RPC frameworks) are decoupled from services and pushed down into the Sidecar, making middleware much "lighter".

As these capabilities sink down, the amount and weight of middleware the business needs to depend on is greatly reduced:

  • For the platform technology team, it greatly reduces the R&D investment and evolution cost of middleware

  • For the business development team, there is no longer a need to invest much energy in learning and using middleware, reducing the coupling cost between business and middleware

Third, infrastructure and business architecture can evolve independently

A common headache for the platform technology team is driving continuous middleware evolution: even a small iteration that the platform team considers fully tested takes great effort and energy to push business teams to upgrade, consuming large amounts of development and testing resources. This mismatch between investment and evolutionary value makes middleware evolve slowly, with poor results and an ever-heavier historical burden.

The Service Mesh architecture solves this pain point by decoupling the application layer from the infrastructure layer, bringing huge engineering value:

  • It enables the business development team to focus on the business domain and business architecture itself

  • Because the Service Mesh is naturally isolated from applications, the value of its evolution is easier to quantify, allowing the infrastructure to evolve faster and more effectively

Finally, the Service Mesh architecture provides service governance capabilities for multi-language stacks.

Before the emergence of Service Mesh, using the same language stack had obvious collaboration advantages, which made development teams cautious when choosing a stack, sometimes choosing against the applicable scenario. For example, if the initial team chose Java, PHP, or Golang at the start, most subsequent projects would adopt the same language. But each programming language has its own strengths and applicable scenarios, and as business scale grows, scenarios diversify, or multiple teams' businesses merge, problems of multi-language collaboration and governance appear.

The Service Mesh architecture naturally solves the multi-language stack problem, making it easier to exploit the strengths of non-Java stacks, especially emerging languages, without magnifying the weaknesses of their technology ecosystems.

The demand for continuous evolution

While Yanxuan's first-generation Service Mesh architecture brought significant engineering value and architectural benefits, it is still not perfect and requires continuous evolution.

On the one hand, we need richer and higher-quality service governance capabilities, such as:

  • Enhanced traffic management capabilities, such as traffic coloring and traffic control

  • Decoupling more governance features (such as rate limiting, circuit breaking, and fault injection) from the business architecture

  • Support for more protocols

  • Enhanced control plane capabilities

On the other hand, we also need to support a full cloud strategy for applications and hybrid or multi-cloud architectures.

Industry technology evolution — The emergence of a general-purpose Service Mesh

While Yanxuan was implementing its Service Mesh architecture, we noticed the emergence of general-purpose Service Mesh riding the wave of cloud native and microservices.

The concept of Service Mesh first appeared publicly on September 29, 2016, thanks to Buoyant, the company behind Linkerd: its CEO William Morgan proposed and defined Service Mesh, and contributed Linkerd, the first open source Service Mesh project, to the CNCF. Since then, several open source projects have emerged, notably Lyft's Envoy and Nginx's nginMesh, with Envoy joining the CNCF in September 2017.

Early Service Meshes focused on data plane capabilities and had simple control planes. Compared with middleware that had been developed for many years, Service Mesh had no significant functional or performance advantages (it even had performance disadvantages), so it did not attract much attention in the industry. All this changed with the advent of Istio, which brought an unprecedented control plane to Service Mesh and rapidly became the de facto standard; Linkerd, Envoy, and nginMesh have all embraced Istio. Our Qingzhou microservices team also quickly followed Istio and Envoy as one of the early participants in the community.

Cloud native Service Mesh framework — Istio

Istio, developed by Google, IBM, and Lyft, shares the same lineage as Kubernetes and is deeply integrated with it:

  • Kubernetes provides deployment, upgrade, and limited operational traffic management capabilities

  • Istio complements Kubernetes' weaknesses in microservice governance (e.g., rate limiting, circuit breaking, degradation, traffic splitting)

  • Istio runs as a Sidecar in the Pod; it is automatically injected and takes over traffic, so the deployment process is transparent to services

Istio provides a complete Service Mesh solution:

  • Data plane

    • The data plane supports multiple protocols (such as HTTP/1.x, HTTP/2, gRPC), controls all incoming and outgoing traffic of the service, implements policies formulated by the control plane, and reports telemetry data

    • Istio's default Sidecar is Envoy, a high-performance L4/L7 proxy written in C++

    • It has strong traffic management, governance, and extension capabilities

  • Control plane

    • Pilot: Provides service discovery and abstraction capabilities and is responsible for configuration transformation and distribution (such as dynamic routing)

    • Mixer: access control, telemetry collection, etc.

    • Citadel: delivers and manages security certificates and keys.

    • Galley: provides configuration verification
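
As a concrete illustration of the "dynamic routing" that Pilot translates and distributes to Envoy, the following sketch evaluates a simplified, VirtualService-like weighted routing rule. The rule shape and names are deliberately simplified illustrations, not Istio's actual configuration model.

```python
import random

# Simplified rule: split traffic for one host between two destination subsets
route_rule = {
    "host": "recommend-svc",
    "splits": [
        {"subset": "v1", "weight": 90},
        {"subset": "v2", "weight": 10},
    ],
}

def pick_subset(rule, rand=random.random):
    """Choose a destination subset by cumulative weight."""
    total = sum(s["weight"] for s in rule["splits"])
    point = rand() * total
    acc = 0
    for split in rule["splits"]:
        acc += split["weight"]
        if point < acc:
            return split["subset"]
    return rule["splits"][-1]["subset"]  # guard against float rounding

counts = {"v1": 0, "v2": 0}
for _ in range(10000):
    counts[pick_subset(route_rule)] += 1
print(counts)  # roughly a 9:1 split
```

In Istio the equivalent rule is declared as a VirtualService CRD; Pilot compiles it into Envoy route configuration and pushes it via xDS, so the proxy performs this weighted choice on every request without the application knowing.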

Next, we will look at the Istio-based Service Mesh solution from both a functional and a performance perspective.

Functional Perspective — Service governance capabilities (based on Istio+Envoy)

From a functional perspective, compared with Yanxuan's first-generation Service Mesh architecture, traffic management capabilities (such as traffic coloring, routing control, and traffic splitting) are significantly enhanced; governance control capabilities are also richer, providing circuit breaking and degradation, resource isolation, and fault injection; and there are more options for access control.

Performance Perspective — cNginx vs Envoy (before optimization)

During the rollout of the Service Mesh architecture, the biggest concern was performance. Although Service Mesh solves many infrastructure pain points, it adds one or two extra hops compared with a direct remote call, which intuition says will add extra latency.

According to our load test data, with an 8C16G host (Yanxuan's standard application server spec, shared with cNginx), at 40 concurrency and 1600 RPS, cNginx adds 0.4 ms of latency over direct connection, while Envoy (community version, before optimization) in Client Sidecar mode adds 0.6 ms over direct connection.

Both cNginx and Envoy in Client mode have a small, acceptable performance impact. Moreover, traditional middleware with service governance capabilities (such as Spring Cloud or Dubbo) also brings its own performance and resource overhead, so the actual impact is smaller still. (According to performance data shared by Ant and Kujiale comparing Sidecar mode with SDK mode, average latency in Ant's scenario increased by about 0.2 ms, while latency in Kujiale's scenario even decreased.)

Performance Perspective — cNginx vs Envoy (Optimized)

Since the Sidecar in a Service Mesh runs in a separate process from the application, the optimization paths for the data plane are richer and optimization is more sustainable; at the same time, with fewer interfering factors, the optimization results are more convincing.

Our Qingzhou microservices team has done some preliminary optimizations to the container network and Envoy:

  • Adopted an SR-IOV container network

  • Envoy: ported the Connection load balancer feature from version 1.13 back to version 1.10.x

According to our load test data:

  • At low concurrency (<64) and 1000 RPS, the optimized Envoy with Client Sidecar enabled on the container network performed better than direct connection on the virtual machine network, adding 0.2~0.6 ms of overhead over direct connection on the container network

  • At high concurrency (>=64) and 1000 RPS, the optimized Client Sidecar on the container network performed far better than cNginx on the virtual machine network, and almost as well as direct connection on the virtual machine network; however, latency increased by about 1~5 ms compared with direct connection on the container network

Given that most applications run at concurrency below 40, this is quite good performance, and it gives us great confidence in upgrading the Service Mesh architecture.

Current evolution direction

Therefore, the current evolution of the Service Mesh architecture is based on the Istio+Envoy solution:

  • The data plane uses Envoy as the proxy component

  • The control plane takes Pilot as the core component

  • The platform is opened up and extended through Kubernetes CRDs and the Mesh Configuration Protocol (MCP, a standard gRPC protocol)

  • High availability design mainly relies on Kubernetes and Istio mechanisms

Implementation of hybrid cloud architecture

In 2019, Yanxuan officially launched its cloud strategy, explicitly using containerization and Service Mesh to move the Yanxuan application system to the cloud. Since Yanxuan's virtual machine clusters already use the Service Mesh architecture, during the migration we fully realized the engineering value of decoupling the infrastructure layer from the business. At present, more than 90% of Yanxuan's B-end business has completed container transformation and the Service Mesh architecture upgrade.

Yanxuan's roadmap to the cloud

Yanxuan's cloud roadmap has three main phases:

  • IDC (private cloud) period: application systems are deployed in VM clusters

  • Hybrid cloud period: some application systems are deployed in container environments and some in VM environments, with the same service deployed across multiple operating environments; this is the stage Yanxuan is in at present

  • Cloud/multi-cloud period: application systems are fully on the cloud, deployed in multiple container environments, or even across multiple cloud providers

Key steps in the rollout

According to our practice, implementing a hybrid cloud architecture requires handling four key steps.

The first is to firmly embrace cloud native

  • Moving applications fully to cloud native to maximize the advantages of the cloud has become the general trend; neither enterprises nor individuals should ignore it

  • Cloud native technologies represented by containers, Service Mesh, microservices, and Serverless complement each other

    • Containerization is an important cornerstone of cloud native, the best carrier for microservices, and the foundation for an efficient Service Mesh rollout

    • The Service Mesh architecture eliminates the differences between VMs and containers in traffic control and supports hybrid cloud or multi-cloud modes; it is a key step for hybrid cloud or multi-cloud architectures to land efficiently

The second is to build a good service governance platform

  • The Service governance platform seamlessly connects the Service governance capabilities of the virtual machine environment and container environment, integrates the control capabilities of the Service Mesh, and maximizes the advantages of the Service Mesh

  • The traffic control and route control capabilities provided by the service governance platform can transparently control the service form of application systems in the hybrid cloud architecture, so that application systems can smoothly migrate from private cloud to hybrid cloud or multi-cloud

  • The service governance platform integrates monitoring and alarm events across the hybrid cloud architecture, so the availability of the Service Mesh can be monitored in real time and its operation and maintenance improved

The third is to build a unified deployment platform

  • A unified deployment platform eliminates the differences between virtual machine and container operating environments at the deployment level, shielding business development teams from the underlying complexity of hybrid cloud architectures

  • A unified deployment platform can automatically inject the Sidecar of the Service Mesh, eliminating the need for infrastructure awareness

  • A unified deployment platform can integrate the control plane capabilities of the Service Mesh to smooth the deployment process in a hybrid cloud architecture and provide gray traffic diversion after deployment, accelerating the migration of application systems to the cloud
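
The automatic Sidecar injection mentioned above boils down to a small decision the deployment platform (in Kubernetes, a mutating admission webhook) makes per workload. The following is a minimal sketch of that decision; the label and annotation keys (`mesh-enabled`, `sidecar.inject`) are illustrative assumptions, not Istio's actual keys.

```python
def should_inject(namespace_labels, pod_annotations):
    """Decide whether to inject the Sidecar for a workload.

    Policy sketched here: a workload-level opt-out annotation wins;
    otherwise injection follows the namespace-level opt-in label.
    """
    if pod_annotations.get("sidecar.inject") == "false":  # explicit opt-out
        return False
    return namespace_labels.get("mesh-enabled") == "true"  # namespace opt-in

# Namespace opted in, pod says nothing -> inject
assert should_inject({"mesh-enabled": "true"}, {}) is True
# Pod-level opt-out beats the namespace setting
assert should_inject({"mesh-enabled": "true"}, {"sidecar.inject": "false"}) is False
# Namespace not opted in -> no injection
assert should_inject({}, {}) is False
```

Centralizing this decision in the platform is what makes the infrastructure invisible to business teams: the same deployment manifest works in VM and container environments, with the mesh added (or not) by policy.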

The last is to do gray traffic diversion well

Gray traffic diversion includes diversion between service calls and diversion of external (user) traffic. Through gray traffic diversion, the Service Mesh can migrate smoothly from a private cloud architecture to a hybrid cloud architecture, and this smooth-migration capability is also key to landing Service Mesh in a hybrid cloud architecture.
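
A minimal sketch of what gray diversion between environments looks like during migration: requests explicitly colored with a gray tag go to the container (cloud) deployment, plus a configurable percentage of untagged traffic for gradual ramp-up. The header name `x-gray` and function names are illustrative assumptions.

```python
import random

def choose_environment(headers, gray_percent, rand=random.random):
    """Return 'container' or 'vm' for one request.

    headers: request headers dict; gray_percent: 0-100 ramp-up percentage.
    """
    if headers.get("x-gray") == "true":      # explicitly colored traffic
        return "container"
    if rand() * 100 < gray_percent:          # gradual percentage ramp-up
        return "container"
    return "vm"

# Colored requests always go to the new environment, even at 0% ramp-up
assert choose_environment({"x-gray": "true"}, 0) == "container"
# Untagged traffic follows the ramp-up percentage
assert choose_environment({}, 0, rand=lambda: 0.5) == "vm"
assert choose_environment({}, 100, rand=lambda: 0.999) == "container"
```

Raising `gray_percent` step by step (and rolling it back on alarms) is what makes the migration state transparent to callers: they keep addressing the same service name while the mesh shifts where it actually runs.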

Smooth migration

Here’s how we can smooth the transition from private cloud to hybrid cloud:

  • Through the edge gateway, each LDC (an LDC is a logical unit of applications, data, and network) is connected to the others

    • The edge gateway simplifies the migration process by shielding the different infrastructure of each LDC

    • Edge gateway can also be used for traffic authentication and plays an important role in cross-LDC access scenarios in hybrid cloud architectures

  • Fallback routing design

    • Fallback routing provides a high-availability solution that allows LDCs in the same environment to back each other up without traffic escaping the current AZ

  • Access control: in the process of moving from a private cloud to a hybrid cloud architecture, smooth migration of access control is a difficult problem

    • IP address pooling: applies to basic services that manage permissions by IP address, such as databases

    • The capabilities of Service Mesh enable service-level permission management and control

  • Provide gray traffic diversion capability so that the migration status of infrastructure and business is transparent to callers; this process must handle both internal and external traffic
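
The fallback routing idea above can be sketched as: prefer a healthy LDC in the caller's own AZ, and only fall back to a same-environment backup LDC when no local instances are available. The data shapes and names here are assumptions for illustration, not Yanxuan's actual routing tables.

```python
def route(service, caller_az, ldcs):
    """Pick a destination LDC for one call.

    ldcs: list of {'name', 'az', 'env', 'healthy_instances'} dicts.
    """
    # Preferred path: a healthy LDC in the caller's own AZ (no traffic escape)
    local = [l for l in ldcs if l["az"] == caller_az and l["healthy_instances"] > 0]
    if local:
        return local[0]["name"]
    # Fallback: any healthy LDC in the same environment backs up the local one
    env = next((l["env"] for l in ldcs if l["az"] == caller_az), None)
    backups = [l for l in ldcs if l["env"] == env and l["healthy_instances"] > 0]
    if backups:
        return backups[0]["name"]
    raise RuntimeError(f"no healthy LDC for {service}")

ldcs = [
    {"name": "ldc-a", "az": "az1", "env": "container", "healthy_instances": 0},
    {"name": "ldc-b", "az": "az2", "env": "container", "healthy_instances": 3},
]
print(route("order-svc", "az1", ldcs))  # prints "ldc-b": falls back cross-AZ
```

The key property is the ordering: traffic only crosses AZs when the local LDC is actually unhealthy, which is what keeps cross-LDC traffic (and the edge gateway's authentication path) off the hot path in normal operation.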

API gateway

Gray diversion of external traffic requires API gateway support, and the API gateway in a hybrid cloud architecture needs to integrate, on the control plane, the gateway management and control capabilities of both the VM environment and the container environment.

  • The whole is based on Envoy+Pilot scheme

  • The data plane

    • The container environment uses Envoy as the proxy component

    • The virtual machine environment uses Kong as the proxy component

  • The control surface takes Pilot as the core component

  • The platform is opened up and extended through Kubernetes CRDs and the Mesh Configuration Protocol (MCP, a standard gRPC protocol)

As shown in the figure; the specific technical details are not expanded here.

Quality assurance system

As infrastructure, the delivery quality of the Service Mesh is also very important.

To improve the delivery quality and operations quality of the Service Mesh architecture, the following aspects need to be covered:

  • Establish a CI/CD process; improve unit tests and integration tests

  • Improve automated performance benchmark testing and continuously track performance data

  • Improve monitoring alarms so that the health of the infrastructure is monitored

  • Improve the version upgrade mechanism

    • Support Envoy hot updates

    • Provide a gray release mechanism, so that both services and traffic can be rolled out gradually

    • Provide a multi-level environment for infrastructure drills, testing, graying, and release specifications

  • Introduce the business regression validation process

Pitfalls we hit

Of course, the rollout of the Service Mesh architecture was not all smooth sailing; there were difficulties to overcome, such as:

  • Envoy had a crash bug in the build we used

    • After Istio Pilot was updated to include the access log configuration distribution feature, Envoy would enter problematic assert logic under pressure or when a client disconnected, causing Envoy to crash and the caller to see 502 errors

    • The community has cleaned up this problematic assertion logic in a newer release (github.com/envoyproxy/… ); also, the Envoy compile option should use -opt (the default is -dbg)

  • Mixer performance trap

    • Mixer has performance issues: for example, with Mixer's policy enforcement enabled, every call makes Envoy synchronously call Mixer for a policy check, causing rapid performance degradation; the community is aware of this problem and is working to optimize it

    • As an alternative to Mixer policy enforcement, Istio's RBAC can fulfill some of the same functions; for example, we implemented service whitelisting through RBAC

Planning and Outlook

There is still much room for the Service Mesh architecture to develop. Yanxuan and the Qingzhou microservices team will continue to explore and evolve along two dimensions: performance and functionality.

Direction of performance optimization

In terms of performance, there are two main research directions:

  • Scheme 1: use eBPF/XDP (sockops) to optimize the SVC <-> Envoy path; a conservative forecast puts the latency improvement at 10-20%. Envoy is deployed per Pod, in line with the community; this is Yanxuan's current deployment model.

  • Scheme 2: adopt the DPDK+F-Stack user-mode protocol stack to optimize the Envoy <-> Envoy path, improving latency performance by 0.8-1x. Envoy is deployed per node; functional and operational limitations are still being evaluated.

Conclusion: Sidecar mode is optimized with Scheme 1; gateway mode with Scheme 2.

Service governance platform: upgrading Yanxuan's service governance capabilities

In terms of functionality, we mainly provide richer, higher-quality service governance capabilities through the Qingzhou microservice governance platform.

  • Enhanced call control and governance control capabilities

    • For example, providing rate limiting, circuit breaking, and fault injection as platform capabilities, reducing the learning cost for business R&D teams

  • Provide platformized access control capabilities, so that access control is operated not as a technical requirement but as a productized service

  • Provide refined operations and maintenance capabilities

Conclusion

Today's sharing first introduced the evolution of the Service Mesh architecture at Yanxuan, then covered the key role Service Mesh played in landing Yanxuan's hybrid cloud architecture, the problems we encountered, and our experience, and finally summarized the two efforts currently under way: continuous optimization of Service Mesh performance and continuous enhancement of the service governance platform's capabilities.

Yanxuan's practice shows that the Service Mesh architecture is mature enough for large-scale adoption. We hope our work provides a useful reference for the community.

The NetEase technology enthusiasts team is still recruiting! NetEase Yanxuan: because of love, so we choose. We look forward to like-minded people joining us; Java development resumes can be sent to [email protected]