Preface

Almost everyone is talking about Service Mesh; nobody really knows how to put Service Mesh into production; everyone assumes that everyone else is doing a lot of Service Mesh; so everybody claims to be doing Service Mesh.

The above is only a joke, but it reflects some reality: Service Mesh is a design idea and a concept, not a specific architecture or implementation. Although the Istio+Envoy combination seems to have become the de facto standard, a look around shows that the ideal is rich while the reality is thin. The result is a blossoming of Service Mesh implementations in all kinds of forms.

Ant Financial’s Service Mesh is one of the flowers mentioned above. We have passed the exploratory phase and entered full production use: during last year’s Singles’ Day (Double 11), it carried the core transaction and payment links and was verified in production across hundreds of thousands of containers. However, there are still many different voices in the industry regarding Service Mesh, ranging from enthusiastic support to confusion and skepticism about its value, architecture, and performance. So what is our attitude? Where does Ant Financial’s Service Mesh go after the in-depth practice of Double 11? Is the Service Mesh architecture the end state?

This article describes Ant Financial’s planning for, and continued evolution of, Service Mesh after Double 11 in 2019, drawing on Ant Financial’s real-world scenarios and thinking.

Review of Ant Financial Service Mesh practice

The figure above shows the architecture Ant Financial ran in production for Singles’ Day 2019. MOSN (github.com/mosn), the cloud native network proxy developed in-house by Ant Financial, is the data plane and carries the east-west traffic of the Mesh architecture. For the control plane, we took a pragmatic approach and settled on a solution that is feasible at the current stage: the Service Mesh architecture is implemented on top of the traditional service discovery system.

Here is a summary of the rollout in numbers. While meeting business requirements, we achieved genuinely low intrusion into the business, extremely low resource consumption, and rapid iteration. Both the business and the basic technology teams enjoy the dividends brought by the cloud native Mesh.

Service Mesh has a long way to go

According to InfoQ’s Software Architecture and Design Trends Report of April 2020, Service Mesh is in its Early Adoption phase and remains a hot topic in the cloud native community. This article does not discuss whether to adopt Service Mesh, its usage scenarios, or its rationale. For those topics, please refer to our earlier articles, which contain much of Ant Financial’s thinking on Service Mesh.

For us, having chosen this road after careful thought and put it through in-depth practice during last year’s Double 11, the question is how to play the middle game. Besides landing pragmatically, we also need to look up at the stars and understand the gaps between where we stand and the “poetry and distant places” we are heading for:

Not fully cloud native

As mentioned earlier, the Service Mesh we put into production still relies on the traditional service system. Although the overall architecture is based on K8s, we did not adopt the community control plane. These were deliberate choices, but as the overall architecture evolves, not being fully cloud native is bound to become the biggest obstacle to our continuing to enjoy cloud native dividends.

Insufficient platform capability

Service Mesh is positioned to decouple infrastructure from business. So far, however, both the community’s Istio+Envoy combination and Ant Financial’s practice of traditional microservices plus MOSN focus on managing east-west traffic, which is still a long way from that ideal. A lot of infrastructure logic is still embedded in business systems as SDKs, and we still have to face the impact of infrastructure upgrades on the business.

Incomplete coverage of edge traffic

Cloud native adoption inside the data center keeps deepening, but at the network boundary between the data center and the edge, layer 7 (application layer) traffic still has no unified, global system. Lacking one, the border gateways and the Mesh network have had to develop separately, each with its own independent traffic scheduling system and its own security and trust system.

Low degree of ecosystem integration

The traditional service system has developed over many years and accumulated a great deal of valuable assets. Service Mesh is the newcomer, and integration is needed in both directions: Service Mesh needs the integration and support of the traditional service system so that existing services can migrate to the Mesh, and traditional service components need to integrate with the Mesh in order to remain competitive.

Performance

Performance is a perennial issue, and many voices question the performance of the Mesh architecture, citing the Mixer component of the control plane, the extra network hops and the extra encoding and decoding introduced by the Sidecar, and so on. However, we can see the community working on these issues, for example by refactoring the Mixer architecture and by introducing eBPF to speed up traffic hijacking.

To sum up, we have a long way to go with Service Mesh.

Carrying Service Mesh through to the end

This year our goal is to cover all major businesses with Mesh, which will be very challenging:

  • Financial-grade security and trust requirements: we need full-link encryption and service authentication;
  • Unifying the Sidecar and the Ingress Web Server;
  • Putting a cloud native control plane into production;
  • Transparent hijacking capability;
  • Carrying more middleware capabilities as they sink into the Mesh.

Based on the above analysis of the existing problems and on Ant Financial’s own business needs, we can prescribe the right remedy for each case. We group the above problems into three categories and address each with a dedicated effort:

  • Open source ecosystem building, to address ecosystem integration;
  • Evolution of cloud native standards, to address the problem of not being fully cloud native;
  • Enhancement of basic core capabilities, to address insufficient platform capability, incomplete scenario coverage, and performance.

Open source ecosystem building

Let’s review the first thing we did after Singles’ Day: at the ninth Service Mesh Meetup hosted by Ant Financial on December 28, 2019, we announced that MOSN had completed incubation in SOFAStack and started operating independently, so as to seek partners for cooperation and co-construction with a more open attitude:

We think the future belongs more to those who leave the cathedral and embrace the bazaar. (The Cathedral and the Bazaar)

Along with announcing independent operation, we took a series of measures:

  • Independent project domain name: mosn.io
  • Project address: github.com/mosn/mosn
  • Community organization: MOSN Community Organization
  • Project governance rules: PMC, Committer election and promotion mechanisms, etc.

Since then, we have continued to do a lot of work in the open source community, including the creation of topic-specific working groups such as the Istio WG and the Dubbo WG.

At the same time, we have sought a lot of external cooperation. More than half of our contributors come from outside Ant Financial, and we have accepted our first external Committers, including one from BOSS Zhipin. For ecosystem integration, we have cooperated in depth with the SkyWalking, Sentinel and Dubbo-go communities.

SkyWalking

Call dependencies and call status between services are important indicators in microservice management, and SkyWalking is an excellent APM tool in this field. MOSN has worked with the SkyWalking community to integrate the two systems in depth. Currently, the integration supports:

  • Call link topology display;
  • QPS monitoring;
  • Fine-grained RT display;

In May, SkyWalking 8.0 received a complete upgrade, with a new probe protocol and new analysis logic that make probes more aware of one another and better able to monitor Service Mesh. SkyWalking will also open up the Metrics system that previously existed only in its kernel, so that Metrics from commonly used systems such as Prometheus, Spring Cloud Sleuth and Zabbix can be brought into it for analysis. In addition, the SkyWalking and MOSN communities will continue to work together on tracing for Dubbo and SOFARPC, and on link tracing in Sidecar mode.

For more details, see: skywalking.apache.org/zh/blog/202…

Sentinel

Sentinel is a lightweight, microservice-oriented flow control framework open-sourced by Alibaba. It protects the stability of services along multiple dimensions, such as flow control, circuit breaking and degradation, and system load protection. MOSN currently has only simple rate limiting capabilities, so we worked with the Sentinel community to integrate a variety of rate limiting capabilities into MOSN. This further improves MOSN’s traffic management and significantly reduces the cost of onboarding and configuring rate limiting for a service.

In terms of long-term planning, as mentioned later, we will use this as a starting point to propose a new, unified, UDPA-based rate limiting standard.

Dubbo

Our support for Dubbo is based on the following background:

  • Dubbo is a service implementation framework, while Service Mesh is an architectural concept. Dubbo also needs to enjoy the dividends brought by Service Mesh. Enterprises need to adapt and extend it, and the Dubbo community has such user needs as well;
  • Many users and enterprises cannot become cloud native in a single step and need to land gradually;
  • Current open source solutions do not support Dubbo service discovery.

Previously, MOSN’s XProtocol architecture supported the Dubbo protocol, but we had not delivered a complete service system based on Dubbo. This time, we designed two schemes to meet users’ demand for Dubbo, which together form a dual-mode microservice architecture. On the left is the traditional Dubbo registry, integrated with the Dubbo-go SDK for traditional meshing:

  • MOSN provides Subscribe, Unsubscribe, Publish and Unpublish HTTP services;
  • The SDK sends requests to these services provided by MOSN and lets MOSN interact with the real registry on its behalf;
  • MOSN connects directly to the registry via Dubbo-go.
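
The interaction in the list above can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not MOSN's actual implementation: the endpoint paths (/sub, /unsub, /pub, /unpub), the JSON body with a single "service" field, and the in-memory Registry type standing in for the real registry client are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// Registry is a hypothetical in-memory stand-in for the real registry
// client that the sidecar would own.
type Registry struct {
	mu         sync.Mutex
	published  map[string]bool
	subscribed map[string]bool
}

func NewRegistry() *Registry {
	return &Registry{published: map[string]bool{}, subscribed: map[string]bool{}}
}

// Published reports whether the service is currently published.
func (r *Registry) Published(service string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.published[service]
}

// Subscribed reports whether the service is currently subscribed.
func (r *Registry) Subscribed(service string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.subscribed[service]
}

// serviceRequest is the assumed request body sent by the SDK.
type serviceRequest struct {
	Service string `json:"service"`
}

// handler decodes the request body and applies one registry operation.
func (r *Registry) handler(apply func(service string)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		var body serviceRequest
		if err := json.NewDecoder(req.Body).Decode(&body); err != nil || body.Service == "" {
			http.Error(w, "service is required", http.StatusBadRequest)
			return
		}
		r.mu.Lock()
		apply(body.Service)
		r.mu.Unlock()
		w.WriteHeader(http.StatusOK)
	})
}

// NewSidecarMux exposes the four operations over HTTP (paths are assumptions).
func NewSidecarMux(r *Registry) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/sub", r.handler(func(s string) { r.subscribed[s] = true }))
	mux.Handle("/unsub", r.handler(func(s string) { delete(r.subscribed, s) }))
	mux.Handle("/pub", r.handler(func(s string) { r.published[s] = true }))
	mux.Handle("/unpub", r.handler(func(s string) { delete(r.published, s) }))
	return mux
}

// Call plays the role of the SDK: it posts a service name to the sidecar
// and returns the HTTP status code.
func Call(h http.Handler, path, service string) int {
	body := strings.NewReader(fmt.Sprintf(`{"service":%q}`, service))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, path, body))
	return rec.Code
}

func main() {
	reg := NewRegistry()
	mux := NewSidecarMux(reg)
	fmt.Println(Call(mux, "/pub", "com.example.UserService"), reg.Published("com.example.UserService"))
	fmt.Println(Call(mux, "/unpub", "com.example.UserService"), reg.Published("com.example.UserService"))
}
```

The point of the design is that the SDK only needs a thin HTTP client, while registry connections and their upgrades stay inside the sidecar.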

The picture on the right shows Mesh support done in a cloud native way, directly through an Istio extension. This solution is a capability contributed by our community partner Dmall (Duodian Life). Detailed technical solutions and usage can be found in “Dmall’s Practice on Service Mesh: Exploring Istio + MOSN in the Dubbo Scenario”.

Evolution of cloud native standards

As mentioned above, Ant Financial and other companies have put Mesh into production, but all of them did so in a traditional way. Of course, this reflects each company’s current situation. As the technology is explored further, the operability of cloud native Istio and the soundness of its service governance architecture are gradually changing for the better: its functionality is being completed, its performance is improving, and the complexity of deploying and operating it will be resolved. At the same time, as cloud native adoption deepens, an architecture that is not fully cloud native will inevitably hinder our progress. So we worked closely with the Istio community to build a global Service Mesh control plane, and with the cloud native network proxy MOSN to drive our evolution from the traditional Mesh to the cloud native Mesh. To this end, we carried out the following work:

  • Building a cloud native standard Sidecar;
  • Participating in and contributing to standardization.

For the first point, MOSN continues to align with Istio’s capabilities. This includes multi-Sidecar support on the Istio side and functional alignment on the MOSN side: control plane support for MOSN Sidecar injection, pilot-agent adaptation, Istio build adaptation, load balancing algorithms, the traffic management system, traffic detection, service governance, Gzip, and so on.

  • By April 2020: break down the required tasks and get Bookinfo running on Istio 1.4.x;
  • In June 2020: complete the features that HTTP strongly depends on, and be compatible with Istio 1.5.x under its new architecture;
  • Align HTTP features with Istio;
  • In September 2020: support the Istio pre-release.

In terms of standardization, we took part in discussions of UDPA-related specifications, proposed an API specification for rate limiting, and discussed it in community meetings.

In addition, MOSN has been actively communicating with the Istio community and seeking cooperation. Our goal is to become a Sidecar product officially recommended by Istio. We opened a related issue on Istio’s GitHub, which has attracted a lot of attention. I am also very glad that the official members gave detailed answers and discussed the question.

They raised a number of questions and concerns, and discussed them at Istio’s regular meetings.

A transcript of the discussion can be found at github.com/istio/istio…

After this exchange, we received the official feedback and suggestions, which gave us a very clear goal and motivation. We also have corresponding ideas and actions for the questions raised by Istio:

  • For the cost of test case coverage: maintenance cost can be reduced by decoupling Istio’s test cases from their binding to Envoy, or by developing a standard data plane test suite;
  • In addition, members of the MOSN community can join in the maintenance, further reducing the maintenance cost.

We will continue to invest resources in building our own capabilities while maintaining a collaborative relationship with the community, and we believe that when the time is right, deeper cooperation will follow.

Enhancing basic core capabilities

What is the future of Service Mesh, and how will it evolve? What capabilities does MOSN need in order to support the continuous evolution of Service Mesh? In the previous sections, we addressed the problems of not being fully cloud native and of low ecosystem integration through open source ecosystem building and the evolution of cloud native standards. For the other problems, and in light of Ant Financial’s own scenarios, we have built up many capabilities:

  • Flexible and convenient multi-protocol extension support;
  • Multi-form extensibility;
  • Message and P2P communication models;
  • OpenSSL support;
  • Transparent hijacking capability.

Protocol extensions

Achilles’ heel

I use the term “Achilles’ heel” to describe protocol extension, which shows how much pain we have suffered in this pit. Whether it is the “ancient” Apache HTTPD, the “medieval” Nginx, or the “modern” Envoy, these frameworks were designed for HTTP or other general-purpose protocols. Many of them offer extensive extension points, yet extending them with a private protocol is still difficult. Beyond forwarding the protocol itself, generic governance at the framework level cannot be achieved, so separate architectural support has to be built for each protocol’s behavior. The developer needs to understand the entire request life cycle, connection reuse, routing policy and so on, and the development cost is very high. Based on these practical pain points, we designed the MOSN multi-protocol framework, hoping to reduce the cost of onboarding private protocols and to accelerate the adoption of a universal Service Mesh architecture. For more details, please see the video of the talk given at the time: “Analysis of the Multi-protocol Mechanism of the Cloud Native Network Proxy MOSN”.

MOSN multi-protocol framework
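
To make the idea of a multi-protocol framework concrete, here is a minimal sketch in Go. It is an illustration under assumptions, not MOSN's real API: the Protocol interface, the Registry type and the DemoProtocol wire format (one magic byte, a 4-byte request ID, then the payload) are invented for this example. The point is that a private protocol only supplies matching and encoding/decoding, while the generic layers work on a protocol-independent Frame.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Frame is the protocol-independent view of a request that generic
// layers (routing, connection reuse, governance) could work on.
type Frame struct {
	RequestID uint32
	Payload   []byte
}

// Protocol is a hypothetical codec interface, not MOSN's real API.
type Protocol interface {
	Name() string
	Match(data []byte) bool            // does this byte stream belong to the protocol?
	Decode(data []byte) (Frame, error) // bytes to generic frame
	Encode(f Frame) []byte             // generic frame to bytes
}

var (
	ErrUnknownProtocol = errors.New("no registered protocol matches")
	ErrShortFrame      = errors.New("frame shorter than header")
)

// Registry holds the registered protocols and dispatches by Match.
type Registry struct{ protocols []Protocol }

// Register adds a protocol implementation to the registry.
func (r *Registry) Register(p Protocol) { r.protocols = append(r.protocols, p) }

// Decode finds the first matching protocol and decodes with it.
func (r *Registry) Decode(data []byte) (string, Frame, error) {
	for _, p := range r.protocols {
		if p.Match(data) {
			f, err := p.Decode(data)
			return p.Name(), f, err
		}
	}
	return "", Frame{}, ErrUnknownProtocol
}

// DemoProtocol is an invented wire format:
// 1 magic byte (0xAB) | 4-byte big-endian request ID | payload.
type DemoProtocol struct{}

func (DemoProtocol) Name() string { return "demo" }

func (DemoProtocol) Match(d []byte) bool { return len(d) > 0 && d[0] == 0xAB }

func (DemoProtocol) Decode(d []byte) (Frame, error) {
	if len(d) < 5 {
		return Frame{}, ErrShortFrame
	}
	return Frame{RequestID: binary.BigEndian.Uint32(d[1:5]), Payload: d[5:]}, nil
}

func (DemoProtocol) Encode(f Frame) []byte {
	buf := make([]byte, 5, 5+len(f.Payload))
	buf[0] = 0xAB
	binary.BigEndian.PutUint32(buf[1:5], f.RequestID)
	return append(buf, f.Payload...)
}

func main() {
	reg := &Registry{}
	reg.Register(DemoProtocol{})
	wire := DemoProtocol{}.Encode(Frame{RequestID: 7, Payload: []byte("ping")})
	name, frame, err := reg.Decode(wire)
	fmt.Println(name, frame.RequestID, string(frame.Payload), err)
}
```

With this split, adding another private protocol means writing one more Protocol implementation and registering it, without touching the dispatch code.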

Extensible modularity capabilities

As the business develops and our Service Mesh plans progress, MOSN needs to carry more and more basic capabilities that sink into it. Only by providing a flexible, efficient and stable extension mechanism can MOSN stay competitive and keep its long-term vitality.

From the beginning, MOSN drew on the strengths of Nginx’s and Envoy’s designs and provides a filter-based extension mechanism: Network Filters are used to create custom proxy logic, Stream Filters support rate limiting, authentication, injection and so on, and Listener Filters support transparent hijacking.
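
The Stream Filter idea can be sketched as a small filter chain in Go. This is a simplified illustration, not MOSN's real filter API: the StreamFilter interface, the Request type and the three filters (authentication, rate limiting, injection) are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a simplified stand-in for a proxied request.
type Request struct{ Headers map[string]string }

// StreamFilter is a hypothetical interface for illustration only.
type StreamFilter interface {
	OnRequest(req *Request) error
}

var (
	ErrUnauthorized = errors.New("unauthorized")
	ErrRateLimited  = errors.New("rate limited")
)

// AuthFilter rejects requests that do not carry the expected token.
type AuthFilter struct{ Token string }

func (f AuthFilter) OnRequest(req *Request) error {
	if req.Headers["x-token"] != f.Token {
		return ErrUnauthorized
	}
	return nil
}

// RateLimitFilter allows a fixed number of requests, then rejects.
type RateLimitFilter struct{ Remaining int }

func (f *RateLimitFilter) OnRequest(req *Request) error {
	if f.Remaining <= 0 {
		return ErrRateLimited
	}
	f.Remaining--
	return nil
}

// InjectFilter adds a header to every request that reaches it.
type InjectFilter struct{ Key, Value string }

func (f InjectFilter) OnRequest(req *Request) error {
	req.Headers[f.Key] = f.Value
	return nil
}

// Chain runs filters in order and stops at the first error.
type Chain []StreamFilter

func (c Chain) Handle(req *Request) error {
	for _, f := range c {
		if err := f.OnRequest(req); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	chain := Chain{
		AuthFilter{Token: "secret"},
		&RateLimitFilter{Remaining: 1},
		InjectFilter{Key: "x-mesh", Value: "true"},
	}
	for i := 0; i < 2; i++ {
		req := &Request{Headers: map[string]string{"x-token": "secret"}}
		fmt.Println(chain.Handle(req), req.Headers["x-mesh"])
	}
}
```

Each capability lives in its own filter, so adding or removing one changes the chain's configuration rather than the proxy core.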

But there is a problem: sometimes the capability we want to add already has a ready-made implementation, for example existing rate limiting or injection capabilities, and we would like MOSN to obtain it through a simple adaptation, even if the existing implementation is not written in Go. In other cases, certain capabilities, such as security-related ones, call for stricter control and higher standards.

For such scenarios, we introduced the MOSN Plugin mechanism, which lets us develop the capabilities MOSN needs independently, or adapt existing programs appropriately and bring them into MOSN.

The Plugin mechanism for MOSN consists of two parts:

  • The first is MOSN’s own Plugin framework, which extends MOSN by having an agent inside MOSN interact with an independent process;
  • The second is based on the Golang plugin framework and extends MOSN by loading dynamic libraries (.so). The dynamic library approach still has some limitations and is in the beta stage.

In addition, WebAssembly, currently a hot topic, is also a direction for future development. It already has mature support in many scenarios, and Golang also has a WASM port.

Message communication mode

As Service Mesh is adopted and the wave of practice grows, Mesh requirements are emerging for DB, cache and other forms of communication in addition to traditional RPC between services. Fortunately, these communication modes are similar to RPC, and supporting them in the Sidecar does not require much modification. Message communication, however, is different:

  • Stateful network model;
  • Message ordering;
  • Partitions are the atomic unit of load.

As a result, the message SDK cannot use Partition-ordered messages behind the Mesh, and meshed messages cannot be sent and received normally. In a message Pull/Push Consumer, the Partition is the basic unit of load balancing. A native Consumer needs to know how many Consumers in the same ConsumerGroup consume the same Partitions as itself, and each Consumer picks the Partitions to consume according to its own position in the group. The load balancing strategy used for messages is therefore no longer applicable under the Service Mesh architecture.
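
The position-based selection described above can be illustrated with a small Go sketch. It is a simplified round-robin variant written for this article, not the algorithm of any particular message middleware, and the function name AssignPartitions is hypothetical. Each Consumer sorts the group membership it can see, finds its own position, and takes every N-th Partition.

```go
package main

import (
	"fmt"
	"sort"
)

// AssignPartitions returns the partitions that consumer `self` should
// consume, given the group membership it can see. Every consumer runs the
// same function locally, so the result is only consistent if all of them
// see the same membership list.
func AssignPartitions(self string, consumers []string, partitions int) []int {
	sorted := append([]string(nil), consumers...)
	sort.Strings(sorted)
	idx := sort.SearchStrings(sorted, self)
	if idx == len(sorted) || sorted[idx] != self {
		return nil // self is not a member of the group
	}
	var owned []int
	for p := idx; p < partitions; p += len(sorted) {
		owned = append(owned, p)
	}
	return owned
}

func main() {
	group := []string{"consumer-b", "consumer-a", "consumer-c"}
	for _, c := range group {
		fmt.Println(c, AssignPartitions(c, group, 8))
	}
	// If a consumer's view of the group is incomplete (here it only
	// sees itself), it claims every partition.
	fmt.Println(AssignPartitions("consumer-a", []string{"consumer-a"}, 8))
}
```

The last call hints at why this client-side strategy is fragile behind a Sidecar: assuming the Consumer can no longer observe the real group membership, its view of its own position and of the Partitions it owns is no longer reliable.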

OpenSSL support

In this year’s plan, we will fully implement east-west traffic encryption based on Service Mesh to provide stronger protection for traffic in transit. We will also introduce the national cryptographic (Guomi) algorithms to improve security compliance, and build end-to-end trust on top of secure hardware. The cornerstone of all this is an efficient, powerful and stable cryptographic infrastructure, and the native Go TLS that MOSN uses has many problems:

  • Weak security capability: there is no software/hardware key protection mechanism;
  • Long iteration cycle: Go TLS does not fully support TLS 1.3 security features until Go 1.15+;
  • Poor suite support: only the typical ECDHE, RSA and ECDSA algorithms are supported;
  • Weak performance: for example, RSA in Go performs at less than 1/5 of the C implementation.

OpenSSL, as the big brother of cryptographic infrastructure, was the perfect choice. OpenSSL is widely used and has rich hardware acceleration engines, dedicated community maintenance, large and comprehensive suite support, and highly optimized algorithm performance. Of course, we tested and thought carefully about how to support OpenSSL. If we used traditional Cgo to take over the entire TLS flow, we would enjoy the convenience of integrating once and benefiting permanently, but we could not accept the performance loss brought by Cgo. Therefore, we finally adopted a hybrid scheme to implement the specific security capabilities.

Transparent hijacking

Although the community provides a non-invasive way to join the Service Mesh, the performance loss and the operation and maintenance cost of the native community solution are very high, so we did not use non-invasive access in practice. But as the business rolls out on a larger scale, non-invasive access becomes more urgent, and we need to address multi-environment adaptation, operability, and performance. We still use iptables as the data plane for traffic hijacking, but we have optimized it for different situations:

  • TProxy replaces DNAT to solve the conntrack connection tracking problem;
  • Hooking the connect system call removes the performance loss caused by outbound traffic traversing the protocol stack twice;
  • Fuzzy matching for blacklists and whitelists reduces the management cost of the whole rule set.

The development of traffic hijacking technology is closely tied to the adoption of Service Mesh. In the future, we will continue to evolve around environment adaptation, low latency, low management cost and other aspects, and build a multi-mode foundation composed of DNAT, TProxy, TC Redirect, Sockmap and other technologies. For scenarios with different kernel environments, performance requirements and management costs, the most appropriate hijacking technology is selected adaptively, continuously reducing the cost of joining the Service Mesh.

Service Mesh holds great promise

The above is our continued exploration of MOSN and Service Mesh since last year’s Singles’ Day. The overall milestones are as follows:

In my opinion, the Service Mesh architecture is to cloud native architecture what high-speed rail is to the national economy. We have lived through the decade of cloud computing, during which seemingly solid industry and technical barriers were broken down and classic concepts were questioned and challenged, so Service Mesh is bound to see big changes in the future. Ao Xiaojian has done an in-depth analysis of this in “Mecha: Taking Mesh to the End”. I will not repeat it here, and will only share some of my personal views.

First, in terms of trends, business and basic technology continue to decouple and cooperate; middleware continues to sink, and so does the basic layer of the business; basic services need to integrate better with the Mesh architecture to form a highly consistent ecosystem. Second, I think that as the boundary of the cloud native network expands, it is bound to bring scale effects. We need to solve basic problems such as performance, resource consumption and latency, by means of Kernel Bypass, Sidecar as Node, and hardware optimization. Finally, we believe that as cloud native evolves, the container network will merge with Service Mesh, and the network will change from IP-oriented to Identity- and Service-oriented. The Sidecar will sink into the system infrastructure and become the network stack of secure containers and the basic network unit of smart hardware devices.

When the Sidecar sinks and becomes part of the system, it evolves from a framework into a platform. Providing distributed primitive abstractions and remote APIs, as Dapr does, is one way to serve applications. In addition, we are trying an interface communication scheme based on shared memory. Eventually, business development will turn into Mesh-oriented programming, and the Mesh architecture will form a distributed microservice OS.

But there is No Silver Bullet. Although distributed systems have become the dominant form for new business, in many traditional areas the core systems still run on centralized architectures. What matters most for those systems is operating efficiency, high availability and other stability demands, which is exactly where a mature centralized architecture is strong. In front-end business, the bigger challenge is coping with rapid market changes, iterating quickly and seizing the market. Distributed architectures, microservice frameworks in particular, are built to help users iterate quickly and deliver business capabilities, and Service Mesh is now seen as the enabler of this architecture.

About the author

Xiao Han, alias Han Chang, joined Ant Financial in 2011. He has been engaged in research and development related to layer 4/7 network load balancing, high-performance proxy servers and network protocols. He is currently in charge of the application network group of the Trusted Native Technology Department of Ant Financial, and of MOSN, Ant Financial’s open source cloud native network proxy.

Series of articles: Ant Financial Service Mesh large-scale landing on Double 11

  • Ant Financial Service Mesh Large-scale Landing series: Quality
  • Ant Financial Service Mesh Large-scale Landing series: Control Plane
  • Ant Financial Service Mesh Large-scale Landing series: Operator
  • Ant Financial Service Mesh Large-scale Landing series: Gateway
  • Ant Financial Service Mesh Large-scale Landing series: RPC
  • Ant Financial Service Mesh Large-scale Landing series: Operation and Maintenance
  • Ant Financial Service Mesh Large-scale Landing series: Message
  • Ant Financial Service Mesh Large-scale Landing series: Core
  • From the person in charge of the Service Mesh rollout: Ant Financial’s four Double 11 exam questions

Financial-Grade Distributed Architecture (Antfin_SOFA)