Maslow / Wang Yusong (NetEase Yanxuan technical team)

Like the Yanxuan business itself, the underlying IT system architecture that carries it must live, breathe, grow, and evolve; otherwise an outdated, rigid architecture will bring our organization and business to a standstill, unable to face new opportunities and challenges. The industry's server-side architecture has evolved over the years from monolithic modular architecture, to SOA, to microservices, and finally to Service Mesh. The evolution of Yanxuan's server-side architecture is a microcosm of this journey, and in some respects we are even ahead of the industry.



Architecture maturity

While there is no universal architecture maturity model for the industry to measure against, there is little doubt that architecture maturity increases as you move from monolithic modularity, to SOA, to microservices, and finally to Service Mesh.

Each level leaves a set of issues for the next level to address, and each level has its own complexity and its own priorities.

  • Monolithic modular architecture

    Emphasizes dividing business logic into modules with high cohesion and low coupling. Inter-module communication occurs through in-process method calls.

  • Service-oriented Architecture (SOA)

    Emphasizes the reuse of business logic at application granularity, gathering and normalizing scattered business logic; it can be understood as the horizontal separation of business logic. Communication between services takes place through an enterprise service bus, in which a small amount of business logic inevitably becomes coupled.

  • Microservices Architecture

    Emphasizes the decoupling of business logic at application granularity, which can be understood as the vertical separation of business logic into highly cohesive, loosely coupled services. Communication between services takes place through RPC.

  • Service Mesh

    Emphasizes the layering and decoupling of business logic from service governance logic, adding another level of separation that creates a clear boundary between business logic and the underlying implementation. In a Service Mesh architecture, communication between services is brokered through the mesh.

    All of these architectural patterns emphasize decoupling and reuse, and Service Mesh is the most thorough: it emphasizes not only the decoupling and reuse of business logic, but also the decoupling and reuse of the infrastructure.

What is a Service Mesh

Let's look at the definition of Service Mesh given by William Morgan, CEO of Buoyant, the company behind Linkerd, the pioneer of Service Mesh.

The Service Mesh is an infrastructure layer that handles communication between services. Cloud-native applications have complex service topologies, and the service mesh is responsible for the reliable delivery of requests through them. In practice, a service mesh is typically implemented as a set of lightweight network proxies that are deployed alongside the application and are transparent to it.

This is an early definition of Service Mesh; today it covers a much larger and broader scope. In a microservice architecture, everything service-related that can be managed and governed, apart from the service's own business logic, can be understood as falling within the scope of the Service Mesh.

So Service Mesh is ultimately a service governance platform, one that covers essentially every aspect of service governance.

The same capabilities can also be achieved with an all-in-one framework suite like Spring Cloud, but then governance is coupled with business logic, and deployment, operation, and maintenance are coupled to the operation of the microservices themselves. A bugfix in the RPC framework, for example, triggers a lengthy upgrade-and-release cycle across all microservices, with a huge amount of repetitive development, testing, regression, and release work for business developers.

Service Mesh greatly improves R&D efficiency by pushing service governance logic that is irrelevant to business logic down into the infrastructure, separating the concerns of business developers from those of the underlying platform developers.

Business developers are more concerned with understanding and modeling the business; their core value lies in implementing it. Service governance platforms such as Spring Cloud or Service Mesh are tools and means to them, not ends, and the time and effort spent learning and operating these tools has a poor return.

Infrastructure developers can likewise focus on their own technical domain: they do not need to understand and learn the business in order to roll out technical upgrades, they do not need to spend large amounts of time and energy coordinating and pushing through business changes, and they do not need to share the risk and pressure of those changes.

Generation 1 – Yanxuan's Service Mesh based on Consul + Nginx

When the Consul + Nginx architecture was first conceived, the term Service Mesh was not yet in use; without realizing it, we were doing the things that would make us pioneers of Service Mesh.

The idea was simple: provide non-invasive, Consul-based service routing for cross-language business applications, and use Nginx's native features to provide load balancing and cluster fault tolerance with high performance.

The main work was in extending Consul and Nginx to bring the capabilities of the two components together to solve the problems we encountered.

Overall architecture of Yanxuan's Service Mesh

  • The data plane

    Consul Client + cNginx constitute our sidecar. Here, the sidecar mode is a one-way, client-side sidecar mode; at the time, every problem on our list could be fully solved in this mode.

    Service governance capabilities on the data plane are implemented primarily on top of Nginx:

    • Load balancing

    • Timeout governance

    • Retry

    • Failover

  • Control plane

    The control plane, which is our Consul Admin management platform, has three modules that implement service governance capabilities within their respective responsibility boundaries.

    • Traffic scheduling

      Traffic can be scheduled, routed, and weighted, and swimlane (lane-isolated) traffic can be kept separate.

    • Service registration/discovery

      Bring service instances online, take them offline, delete them, and run health checks.

    • Rate limiting & circuit breaking

      Outbound service calls can be rate-limited, or circuit-broken directly at the caller.
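As an illustration only (the real implementation lives in cNginx's Nginx modules and the Consul Admin platform, not in application code), the retry/failover and caller-side rate-limiting behaviour described above can be sketched as follows:

```python
import time

def call_with_failover(instances, request, max_retries=2):
    """Retry/failover sketch: try the request against upstream
    instances, moving to the next instance whenever a call fails.
    `instances` is a list of callables standing in for servers."""
    last_error = None
    for attempt in range(max_retries + 1):
        upstream = instances[attempt % len(instances)]
        try:
            return upstream(request)
        except Exception as err:  # call failed: fail over to the next instance
            last_error = err
    raise last_error

class TokenBucket:
    """Caller-side rate limiter sketch (token bucket): requests are
    allowed while tokens remain; tokens refill at a fixed rate."""
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # over the limit: reject at the caller
```

In this model, a direct circuit breaker at the caller is simply the degenerate case of a rate limit of zero: every outbound call is rejected before it leaves the client.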

Architecture benefits

By integrating Consul and Nginx, we built simple service governance capabilities and implemented a simplified version of Service Mesh.

cNginx is transparent to business applications and is uniformly developed, deployed, operated, and maintained by the infrastructure team. The service governance functions cNginx provides do not need to be rebuilt by each business team or brought in from third parties, so teams can spend more time on their own business logic.

At the same time, we built a unified communication layer across our cross-language applications, ensuring consistent communication policies.

On the cost side, we also saved the development investment in framework middleware (such as an RPC framework) and reduced the extra cost of coupling middleware with business code.

Second generation – Istio-based Service Mesh

With the rapid growth of the business, the simple service governance capabilities implemented by cNginx could no longer meet our needs.

The maturing of cloud infrastructure has made cloud-native architecture increasingly universal, and Yanxuan is likewise pursuing a container cloud strategy. cNginx cannot integrate effectively with cloud infrastructure such as Kubernetes and Docker, so we had to choose a new Service Mesh platform to replace it.

Without a doubt, our final choice was Istio, the current de facto standard for Service Mesh.

Istio overall architecture

The Istio architecture is likewise divided into two parts: a data plane and a control plane.

  • The data plane

    The data plane handles all incoming and outgoing traffic for a service and implements the service governance logic. It also enforces the policies set by the control plane and reports telemetry data back to it.

    The default sidecar on the Istio data plane is Envoy, a high-performance L4/L7 network proxy.

  • Control plane

    The control plane is composed of Pilot, Mixer, Citadel, and Galley.

    • Pilot

      Provides service discovery, dynamic traffic routing, and resiliency between services (timeouts, retries, rate limiting, and circuit breaking).

    • Mixer

      Responsible for ACLs, policy enforcement, and black/white lists, and for collecting service telemetry.

    • Citadel

      Provides security certificate issuance and management capabilities.

    • Galley

      Galley provides unified configuration validation capabilities.

Some important decisions

1. Client Sidecar mode

Istio itself supports three sidecar modes: Client Sidecar, Server Sidecar, and Both Sidecar. After weighing the trade-offs, we decided to use the Client Sidecar mode for the time being, enabling the sidecar only for service callers. The reasons are as follows.

  • Continuity with the existing model

    Our first-generation Service Mesh, cNginx, is also a client-sidecar model. Istio, the second generation, is new and cutting-edge; to reduce the complexity of understanding and using it, and to keep the migration seamless for services, we are taking a conservative approach to sidecar-mode changes for now.

  • Performance considerations

    According to our load test results, Istio's client-sidecar mode performs slightly worse than cNginx, and the Both Sidecar mode loses more performance because of the extra network hop. To keep the business side from perceiving large changes in responsiveness and throughput, we chose the client-sidecar mode for the time being.

  • Differences in governance capability

    Our existing infrastructure can make up for the difference in governance capability between the Client and Both modes. For service observability, for example, we already have good APM and logging platforms, so from a governance standpoint we do not care whether the sidecar is enabled on the server side.

2. Use as needed

Istio is like a Swiss Army knife. It offers so many features that you can get lost when you first use it.

Yanxuan's use of Istio has always followed the principle of adopting features as needed: at first we required only the functional equivalent of cNginx, then gradually covered more.

Features with performance issues in the current Istio implementation are firmly set aside. For example, Mixer's policy enforcement degrades performance rapidly, because Envoy calls Mixer synchronously for a policy check on every request. The community is aware of this and working on it, and when it is fixed we will consider enabling the feature.

As an alternative to Mixer policy enforcement, Istio's RBAC can cover some functions, such as service whitelists, while avoiding Mixer's performance pitfalls.
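The reason RBAC avoids the pitfall is that the decision is evaluated locally in the proxy rather than via a synchronous round trip to a central policy service. A minimal sketch of that idea, with hypothetical service names (real enforcement lives in the Envoy RBAC filter configured through Istio, not in business code):

```python
# Hypothetical whitelist table: callee -> set of allowed callers.
SERVICE_WHITELIST = {
    "order-service": {"cart-service", "payment-service"},
}

def allow_call(callee, caller):
    """Local, per-proxy check: the decision uses data already pushed
    to the proxy, so there is no per-request call to a remote policy
    service (the Mixer performance pitfall described above)."""
    allowed = SERVICE_WHITELIST.get(callee)
    return allowed is not None and caller in allowed
```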

3. Integration with Kubernetes

Service Mesh itself is platform-independent, but integrating it with Kubernetes has many advantages and has gradually become the industry's mainstream standard (i.e., cloud-native architecture), which Yanxuan has also adopted.

  • Sidecars are injected automatically and take over traffic automatically
  • Consistent service discovery, unified on Kubernetes data
  • Governance rules are expressed as Kubernetes CRDs and need no dedicated management service
  • Service discovery requires no active registration; a workload declared as a Kubernetes Service is discovered automatically

Transitional architecture scheme

During the transition period, two Service Meshes coexist: cNginx (outside the cloud) and Istio (inside the cloud). To keep services unaware of the split, we designed a transitional architecture.

  • For callers inside the cloud, if the service provider has no in-cloud deployment, traffic is automatically routed to the out-of-cloud provider.
  • For callers outside the cloud, the in-cloud providers are uniformly abstracted into one logical service instance of the out-of-cloud service, and the traffic flowing into the cloud is controlled by adjusting the weight of this logical instance in cNginx.
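The weighting logic for the out-of-cloud direction can be sketched as below. The names and weights are illustrative only; the real weights live in cNginx configuration, not in application code:

```python
import random

def pick_instance(instances, rng=random.random):
    """Weighted selection among service instances. The in-cloud
    providers are modelled as one logical instance whose weight is
    tuned to shift traffic gradually during migration."""
    total = sum(weight for _, weight in instances)
    point = rng() * total
    for name, weight in instances:
        point -= weight
        if point < 0:
            return name
    return instances[-1][0]  # guard against floating-point edge cases

# Illustrative: 80% of calls stay with the real out-of-cloud
# instances, 20% shift to the logical in-cloud instance.
routes = [("out-of-cloud", 80), ("in-cloud-logical", 20)]
```

Raising the weight of the logical in-cloud instance step by step moves traffic into the cloud without the calling services noticing any change.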

The performance trap

One of the most misunderstood aspects of Service Mesh is its performance. Many people assume that adding one or two hops to the call path must make the mesh perform badly. From our practice and load testing, mesh performance is not that scary, as long as we make the right choices and tailor the deployment.

  • 1600 RPS + 40 concurrency (host configuration: 8C16G)

At 1600 RPS with 40 concurrent connections (note: 1600 RPS is more than enough for Yanxuan's services), the RT overhead of Istio (client mode) is about 0.6 ms, while the RT overhead of cNginx is around 0.4 ms. The performance difference between Istio and cNginx is so small that they are practically equivalent.

In addition, frameworks such as Spring Cloud or Dubbo themselves introduce resource and performance overhead, so replacing them with a Service Mesh merely transfers the performance cost, and the net impact is relatively small.

The Istio used by Yanxuan is the build maintained by NetEase's Hangzhou Research Institute, whose engineers are further optimizing Istio's performance. Preliminary test results after optimization are as follows.

  • Option 1: eBPF/XDP (sockops), optimizing the SVC <-> Envoy path; latency improved by 10-20%. Envoy is deployed per pod, in line with the community model, and this is Yanxuan's current deployment scheme.
  • Option 2: DPDK + F-Stack user-space protocol stack, optimizing the Envoy <-> Envoy path; latency improved by 0.8-1x. Envoy is deployed per node; functional and operational limitations are still being evaluated.

Of course, performance remains a core concern, so we build load testing into our normal process and run a performance regression for every change.


From cNginx to Istio, our service governance capabilities and governance system have been updated and modernized.

The benefits of an architecture are ultimately reflected in R&D efficiency. Through the continuous practice and implementation of the Service Mesh architecture, we have decoupled business logic from infrastructure logic, further liberating productivity and driving better, faster business development.

The NetEase technology-lover team is still recruiting! NetEase Yanxuan: choosing out of love. We look forward to like-minded people joining us; Java developer resumes can be sent to [email protected]