Author | Ao Xiaojian

This article is adapted from a speech given by a senior technical expert at Ant Financial at QCon Shanghai 2018.

Hi, everyone. I am Ao Xiaojian from the Middleware team of Ant Financial. I am currently the PD of the Service Mesh project at Ant Financial, as well as the founder of the ServiceMesher China technology community and one of the earliest evangelists of Service Mesh technology in China. The topic I bring you today is “A Long Road to Travel: Ant Financial’s Service Mesh Practice and Exploration”.

At last year’s QCon Shanghai, I gave a speech titled “Service Mesh: The Next Generation of Microservices”. I wonder whether anyone here attended that talk? (Note: by a show of hands on site, about a dozen people had heard last year’s speech.)

Of course, today we are not going to continue the Service Mesh sermon. As Xiu Tao requested, this year we are going to talk about practice. So I will not go into what a Service Mesh is, what it does, and what it is good for, as I did last year. Instead, I will draw on Ant Financial’s practical experience over the past year and on Ant Financial’s SOFAMesh products to help you understand Service Mesh technology more deeply.

Before we start today’s content, let’s warm up by reviewing last year’s talk. I came to QCon last year to evangelize, and the core of that sermon was one question: what is Service Mesh?

To help you answer this question, here is a hint image that should look familiar to those of you who know Service Mesh.

Let’s review the formal definition of Service Mesh:

The Service Mesh is an infrastructure layer that handles communication between services. Modern cloud native applications have complex service topologies, and the service mesh is responsible for the reliable delivery of requests through them.

In practice, a service mesh is typically implemented as a set of lightweight network proxies that are deployed alongside applications and are transparent to them.

The parts in bold are the key:

  • The infrastructure layer: this is where the Service Mesh is positioned; I will cover this topic in detail in the last part of today’s presentation
  • Communication between services: this is the function and scope of the Service Mesh
  • Reliable delivery of requests: this is the goal of the Service Mesh
  • Lightweight network proxy: this is the deployment mode of the Service Mesh
  • Transparency to applications: an important feature of the Service Mesh; zero intrusion is one of its biggest advantages

Here’s what we’re going to do today:

  • First, a quick overview of our SOFAMesh project, to give you the background of the story
  • Then, why we chose the Golang language to implement the data plane; this has been the most questioned aspect of our product plan over the past year
  • Next, some typical problems and solutions we encountered while landing Service Mesh over the past year
  • Then we will explore the scope of inter-service communication, to see where Service Mesh can be applied
  • Next, the biggest realization from this year’s practice: the significance of infrastructure to the service mesh, which is the content I most want to share with you this year
  • Finally, a summary of today’s content and some resources

OK, let’s get started with the first part of the day, which is a quick introduction to SOFAMesh, aiming to give you some background before we move on to our various practices and explorations.

SOFAMesh is an open source Service Mesh product launched by Ant Financial. It can be understood, simply, as an enhanced version of Istio. We follow two principles:

  1. Follow the community

    This is reflected in the fact that SOFAMesh is forked from Istio and tracks the latest version of Istio to stay in sync with upstream.

    All of our changes to Istio are open source in the SOFAMesh project, and after verification we will contribute them back upstream to Istio.

  2. Tested by practice

    Everything comes from practice, not empty talk: land in real production, find problems, and solve them. When solving problems, we do not settle for workarounds; we try to dig into the essence of each problem and then pursue technical innovation to solve it.

Our principle: where Istio does well, we simply follow and stay consistent with it; where Istio falls short or is missing something, we try to improve and fix it.

All of this is grounded in actual production landing and consistent with the general direction technology is headed.

This is SOFAMesh’s product planning, currently in its first stage. The architecture continues Istio’s separation of the data plane and the control plane:

  1. Use Golang to develop a Sidecar to replace Envoy: this is our SOFAMosn project.
  2. Integrate Istio and SOFAMosn, extending and complementing them for the requirements and problems found during landing: this is our SOFAMesh project.

In this architecture, the biggest difference from Istio’s original is that instead of selecting Istio’s default integration Envoy, we developed a Sidecar called SOFAMosn in Golang to replace Envoy.

Why is that?

Our second part will answer that question.

MOSN stands for Modular Observable Smart Network, and the name says exactly what it is. The project has a very ambitious blueprint: it was launched jointly by Ant Financial’s system department and middleware department together with the infrastructure department of UC (Alibaba Digital Media & Entertainment). The intent is to re-implement the capabilities of the existing network and middleware layers on Golang, and to build it into the underlying platform of a new generation architecture, carrying the responsibilities of all kinds of services and communications.

The Sidecar pattern is currently one of MOSN’s main forms, modeled on the positioning of the Envoy project. We implemented Envoy’s xDS API to remain compatible with Istio.

Protocol support in Istio and Envoy centers on HTTP/1.1 and HTTP/2, which are first-class citizens there. REST over HTTP/1.1 and gRPC over HTTP/2: one is the most mainstream communication protocol in the community today, the other is the mainstream of the future, Google’s darling and the RPC scheme used by CNCF. These two make up the current golden combination of Istio and Envoy (and of CNCF projects in general).

SOFAMesh, on the other hand, faces a different situation from Istio/Envoy from the start: we need to support a wide range of protocols beyond REST and gRPC:

  • SOFARPC: This is the RPC protocol heavily used by Ant Financial (open source)
  • HSF RPC: This is the RPC protocol that is widely used within Ali Group (not open source)
  • Dubbo RPC: This is a widely used RPC protocol in the community (open source)
  • Other proprietary protocols: Over the past few months, we have received requests to run other TCP protocols, mostly proprietary, on SOFAMesh

To do this, we need to consider adding support for these communication protocols in SOFAMesh and SOFAMosn, especially so that our customers can easily extend support for various proprietary TCP protocols.

Why not just use Envoy?

Almost everyone who learns about SOFAMesh asks this question. It is where SOFAMesh has been most questioned and criticized, because Envoy’s performance so far really is excellent, and it is feature-rich, mature, and stable.

We too focused on Envoy during technology selection, and Envoy fits our needs very well, except in one respect: Envoy is written in C++.

So there is a choice to make: which programming language should the data plane use?

The data plane programming language choices of the major Service Mesh products on the market are listed in the figure.

  • First, the Java- and Scala-based options were excluded in the first pass: the JDK/JVM/bytecode approach has proven too heavy for the deployment and runtime footprint of a Sidecar
  • Nginmesh’s approach is somewhat unusual: a Golang agent fetches the information and then generates a configuration file for Nginx, which is not an orthodox Sidecar
  • Conduit (later renamed Linkerd 2.0) chose Rust, taking a different approach. Rust itself is extremely well suited to the data plane, but the popularity of the language and the size of its community are greatly lacking; choosing Rust means essentially no leverage from the community
  • Envoy chose C++
  • Huawei and Sina Weibo chose Golang

Before choosing, we had an in-depth discussion internally. The focus was: what should the underlying platform of the next generation architecture be? What should the programming language stack be? The consensus was Golang, with some Java.

For a typical scenario like Sidecar:

  • It requires high performance and low resource consumption, with lots of concurrency and network programming
  • It must be quick for the team, especially newcomers, to pick up
  • It needs to interact frequently with underlying infrastructure such as K8S, against the larger backdrop of Cloud Native
  • Very important: it must be accepted and quickly mastered by the community and potential future customers, without a high language barrier

If nothing else, the ideal programming language for the Sidecar scenario would be Golang.

But when it came to the actual selection, the Envoy decision was still difficult. The point is that C++ has a mature product, Envoy, that can be applied directly, while Golang had nothing to rival Envoy; we would have to start from scratch.

Both options have their advantages and disadvantages. In the short term:

  • Using Envoy directly: the advantages are a mature project with stable, excellent performance and low resource consumption, and it is Istio’s default Sidecar. It can be used as-is, is super easy to get started with, and offers quick returns for little investment
  • Developing our own Golang version of the Sidecar: all disadvantages in the short term. It is a brand-new project with a huge workload and real technical challenges, plus the extra work and maintenance cost of integrating with Istio ourselves. The biggest challenge is Envoy’s rich, even overabundant, feature set: aligning with Envoy takes enormous effort

In the short term, choosing Envoy is far more realistic and sensible than developing our own Golang version.

However, as mentioned earlier, we have a very ambitious vision for the MOSN project: we intend to re-implement and polish our existing network and middleware capabilities into the underlying platform of the next generation architecture, carrying the responsibilities of all kinds of services and communications. This is a long-term project, taking a year or two to build and meant to serve the needs of the next three, five, or ten years. If we chose Envoy as the base, everything would be fine in the short term: we would collect all kinds of dividends quickly and get established quickly.

But what are the consequences? Envoy is C++. Choosing Envoy means that the communication-layer core we accumulate and polish in the future would be based on C++, and our language stack would have to be dominated by C++ as a result, a significant departure from our established Golang + Java language stack plan.

And once the time scale stretches to three, five, or even ten years, the disadvantages of choosing Envoy emerge:

  • The development and maintenance cost of C++ is much higher than Golang’s. The longer the time span, the more changes, the more people involved, and the more usage scenarios, the more obvious the difference becomes
  • Judging from current requirements, our extensions to Envoy would be extensive, covering both communication protocols and features. Considering possible future innovation on the control plane, the data plane will inevitably have to evolve in step, and such changes will persist for a long time
  • Golang is also better suited to the cloud native era. If we choose Golang, the team we train can use it not only for the Sidecar but for other products as well. We could do the same with Envoy, but then our systems really would be C++ products from then on
  • In addition, Envoy officially offers only the Sidecar form, while our planned MOSN project covers a variety of service communication scenarios
  • How to coordinate with Envoy upstream in the future is another big challenge. In particular, we will have many innovative ideas down the road and want to allow quick trial and error to encourage innovation; choosing Envoy imposes many limitations in that regard

So in the end our choice was: do the hard thing first, with an eye on the future. Painfully (really painfully) give up Envoy, and build our SOFAMosn project in Golang.

My advice for those of you who also face the Envoy decision: it depends on whether you intend to dig in and modify it.

  • If you simply use it, or extend it only a little, then you are really only touching the small visible part of the iceberg; in that case it is advisable to just choose Envoy
  • If, like us, you see Service Mesh as the core of your future architecture, expect a great deal of change and extension, and do not want C++ to dominate your mainstream programming language stack, then consider our option

Of course, none of this is a problem for those of you whose primary programming language stack is C/C++.

In the third part of today’s talk, I will cover the typical problems SOFAMesh encountered during landing.

There are three main problems:

  1. Communication protocol extension

    As mentioned earlier, we will need to support a wide range of TCP protocols, including proprietary protocols. Of course, this should actually be classified as a requirement, which will be discussed later.

  2. Smooth migration of traditional architectures

    By legacy architecture we mean traditional SOA architectures, such as the many existing applications based on Dubbo, which we want to be able to run directly on the Service Mesh without having to do the microservice transformation first.

  3. Adapting to heterogeneous systems

    A heterogeneous architecture means that when we implement a Service Mesh, two systems exist side by side: new applications based on the Service Mesh, and old applications based on a traditional framework such as Dubbo or Spring Cloud.

    When we migrate applications, the stock of existing applications will be large, perhaps thousands, and they certainly cannot all be cut over in one night. There has to be a transition phase in which applications in the old and new systems communicate with each other, and the question is how best to do that.

    We are working on a plan, currently at the POC stage. The goal we have set ourselves is a solution that leaves existing application code unchanged and lets applications on both the old and new sides cut over freely, ensuring a smooth migration.

    Of course, the POC of this scheme is still in progress and the scheme has not been finalized, so we will not cover its details today. If you are interested, you can follow our progress later.

Today I will share the first two problems with you and expand on them in detail.

The first problem to be solved is how to quickly extend support for a new communication protocol.

This problem stems mainly from Istio’s current design. With the way Istio works today, adding a new communication protocol requires several large pieces of work:

  • Add the protocol Encoder and Decoder

That is, the protocol’s encoding and decoding; this goes without saying and must be added.

  • Modify configurations such as delivering Virtual hosts in the Pilot

  • Modify Sidecar such as Envoy, MOSN to implement request matching

The latter two involve heavy duplication: technically, the changes are similar to ones that already exist, yet each protocol needs its own new implementation. Because we have many protocols to support, the volume of changes is very large. Based on our earlier practice, adding a new communication protocol this way could take days of work, repeating a lot of code each time.

In the end we arrived at a generic solution called X-Protocol. I won’t go into the details here, just show the results: according to our latest validation, adding a new communication protocol takes roughly one to two hundred lines of code and can be done in an hour or two. Even with testing, we can add support for a new communication protocol to SOFAMesh basically within a day.
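To make that concrete, here is a minimal, self-contained sketch of what a protocol extension of this kind looks like. This is not SOFAMosn’s actual plugin API; the `Codec` interface, the registration mechanism, and the toy wire format are all hypothetical, invented only to illustrate the “implement a codec and register it by name” shape of the work.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Frame is a decoded request in our toy protocol.
type Frame struct {
	RequestID uint32
	Payload   []byte
}

// Codec is the (invented) extension point a new protocol implements.
type Codec interface {
	Name() string
	Encode(f *Frame) []byte
	Decode(data []byte) (*Frame, error)
}

// toyCodec implements a toy wire format:
// 4-byte payload length, 4-byte request id, then the payload.
type toyCodec struct{}

func (toyCodec) Name() string { return "toy-rpc" }

func (toyCodec) Encode(f *Frame) []byte {
	buf := make([]byte, 8+len(f.Payload))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(f.Payload)))
	binary.BigEndian.PutUint32(buf[4:8], f.RequestID)
	copy(buf[8:], f.Payload)
	return buf
}

func (toyCodec) Decode(data []byte) (*Frame, error) {
	if len(data) < 8 {
		return nil, errors.New("short frame")
	}
	n := int(binary.BigEndian.Uint32(data[0:4]))
	if len(data) < 8+n {
		return nil, errors.New("truncated payload")
	}
	return &Frame{
		RequestID: binary.BigEndian.Uint32(data[4:8]),
		Payload:   data[8 : 8+n],
	}, nil
}

// codecs is a registry keyed by protocol name.
var codecs = map[string]Codec{}

func Register(c Codec) { codecs[c.Name()] = c }

func main() {
	Register(toyCodec{})
	c := codecs["toy-rpc"]
	wire := c.Encode(&Frame{RequestID: 7, Payload: []byte("ping")})
	f, err := c.Decode(wire)
	if err != nil {
		panic(err)
	}
	fmt.Println(f.RequestID, string(f.Payload)) // 7 ping
}
```

A real extension would additionally wire the codec into request matching and routing; that is exactly the duplicated work X-Protocol factors out.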

The second problem to solve is how to bring applications with legacy architectures onto the Service Mesh.

A large number of applications are based on SOA frameworks. These applications were developed in the traditional SOA way, and moving them directly onto a Service Mesh such as Istio runs into problems: Istio uses K8S for service registration, and the K8S service registration model does not match the original SOA model.

In SOA frameworks, service registration is usually done at interface granularity: multiple interfaces are deployed in one application, so at runtime a single process hosts multiple interfaces (or services). Service registration, service discovery, and service invocation are all at interface granularity. After deployment on Istio, however, Istio registers at service granularity, so neither the registration model nor the interface-based invocation style matches: calls made through an interface simply fail.

As the code example on the left shows, Dubbo programs typically register, discover, and invoke by interface. Beyond the failing invocations, there is another problem: the service registration and discovery model changes from one-to-N (one process exposing N interfaces) to one-to-one (one service per process).
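The granularity mismatch can be illustrated with a tiny Go model. This is purely illustrative: the registry types and all names are invented for this example, and no real registration API looks like this.

```go
package main

import "fmt"

// SOA (e.g. Dubbo): registration is interface-granularity, so one process
// registers N interfaces. K8S/Istio: registration is service-granularity,
// one service per deployable unit.

type SOARegistry map[string][]string // process -> registered interfaces

type K8SRegistry map[string]string // service name -> Cluster IP

func main() {
	soa := SOARegistry{
		"user-app": {
			"com.example.user.UserQueryService",
			"com.example.user.UserAdminService",
			"com.example.user.UserAuditService",
		},
	}
	k8s := K8SRegistry{"user-service": "10.96.0.42"}

	// One process exposes three interfaces, but K8S sees a single service:
	fmt.Println(len(soa["user-app"]), "interfaces vs", len(k8s), "service")
}
```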

How do we solve this? The most orthodox approach is to do the microservice transformation first: change the SOA architecture into a microservice architecture, split existing applications into multiple microservice applications with one service (or interface) each, so that the application-to-service relationship becomes one-to-one and the registration models match.

However, this can be hard to execute, because microservice transformation is a time-consuming process. The actual requirement we encountered was: can we skip the microservice transformation and adopt the Service Mesh first? The features of Service Mesh, such as flow control and secure encryption, are very attractive. Can we move the application onto the Service Mesh, get it running first, and then work on the microservice transformation slowly?

This is the scenario we actually face, and we needed a solution to two problems: the registration models do not match, and the original interface-based calls fail.

We designed a solution called the DNS generic addressing scheme to support SOA frameworks such as Dubbo, allowing services to be invoked by interface name.

I won’t go into the details, just the basics: we add records to DNS for the three interface names shown in red at the bottom left, pointing each of them to the Cluster IP of the current service. In K8S, the Cluster IP is a very stable IP assigned to each service at deployment time.

With the DNS records added, we call through the interface name. In the middle, inside our Service Mesh, we complete the actual addressing based on the Cluster IP and run through all of Istio’s functions, exactly as if the call had been made with the service name.
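A rough sketch of the addressing idea, with a Go map standing in for real DNS records (in practice these would be served by the cluster DNS, e.g. CoreDNS; all service names, interface names, and IPs below are made up): the extra records make an interface name resolve to the same Cluster IP as the service name, so existing interface-based calls land on the same Sidecar-routed address.

```go
package main

import "fmt"

// Illustrative stand-in for DNS records. The scheme adds records so that
// each interface name resolves to the SAME Cluster IP as the service
// that hosts it.
var dnsRecords = map[string]string{
	// the normal K8S service record
	"user-service.default.svc.cluster.local": "10.96.0.42",
	// extra records added by the DNS generic addressing scheme
	"com.example.user.UserQueryService": "10.96.0.42",
	"com.example.user.UserAdminService": "10.96.0.42",
}

func resolve(name string) (string, bool) {
	ip, ok := dnsRecords[name]
	return ip, ok
}

func main() {
	byService, _ := resolve("user-service.default.svc.cluster.local")
	byInterface, _ := resolve("com.example.user.UserQueryService")
	// An interface-based call lands on the same address, so the Sidecar
	// can complete the real addressing from the Cluster IP as usual.
	fmt.Println(byService == byInterface) // true
}
```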

This feature is fully implemented in the current SOFAMesh, and you can try it out. We will later present this solution to the K8S or Istio community to see whether they are willing to accept this more general approach to addressing.

Here we put forward an idea: board the train first, buy the ticket later. “Boarding first” means getting onto the Service Mesh train first; “paying later” means paying the ticket of microservice splitting afterwards. The advantage is that you can benefit from the power of the Service Mesh before the big job of microservice splitting is done. It also makes adoption easier, because you are not forced to complete the entire microservice split before you can move onto the Service Mesh. With this solution, an application can run on the Service Mesh without being split into microservices, and the split can then proceed at leisure. That is the main motivation for this solution.

Of course, there are many technical implementation details that are not suitable to cover here. They also involve deeper details of the underlying technology of K8S and Istio, requiring a solid understanding of the network forwarding scheme of K8S kube-proxy and of Istio’s implementation. Here are a few articles; if you are interested in these technologies, you can read them for the technical details, which I will not expand on today.

MOSN and X-Protocol:

  • A deep dive into SOFAMosn, the Service Mesh data plane
  • A performance report on SOFAMosn, Ant Financial’s open source Golang Service Mesh data plane
  • An analysis of the generic protocol extension in Ant Financial’s open source SOFAMesh
  • Dubbo on X-Protocol: a demonstration of the X-Protocol example in SOFAMesh

X-Protocol introduction series:

  • Part 1: the DNS generic addressing scheme
  • Part 2: fast decoding and forwarding
  • Part 3: TCP protocol extension

To sum up, we have solved the following problems:

  1. A new communication protocol can be added to SOFAMesh quickly, in a matter of hours
  2. SOA applications can continue to be invoked through interfaces on SOFAMesh, without code changes
  3. SOA applications can move directly onto SOFAMesh without microservice transformation, and benefit ahead of time

The next problem deals with traffic hijacking.

An important feature of the Service Mesh is that it is non-intrusive, which is usually achieved through traffic hijacking. By hijacking traffic, the Service Mesh’s functionality can be inserted without the client or server being aware of it. It is especially suitable for features, such as secure encryption, that are completely separate from the application’s business logic.

However, Istio’s traffic hijacking scheme is not good enough: so far, iptables is the only solution Istio provides. This scheme has many problems, so we have pursued several ideas:

  1. Optimize iptables

    The main goal of optimizing iptables is to reduce the impact on the host.

    There are two ways to use iptables. One is pod-only: configure iptables inside the pod, which is Istio’s official practice, but changing the iptables configuration requires root permission. The other is to configure iptables on the host instead. After comparison, we still think doing it inside the pod is better, because the performance loss is lower, so we will use the pod-level scheme for now, but with optimizations, such as trimming the iptables modules used down to the minimum.

  2. Investigate IPVS solutions

    We are currently investigating IPVS. The main problem with the iptables scheme is deployment, because the iptables module is often restricted. Are there any operations engineers here? Is iptables enabled on your machines? What I can tell you is that, so far, on Ant Financial’s internal machines iptables is not only disabled, the entire iptables module has been removed, for the well-known reasons of performance, security, and maintainability. In short, we do not have this module inside Ant Financial.

    To address this, we are investigating IPVS as an alternative to iptables. This work is under way; the scheme has been verified, but some details still need polishing. More information will be shared later.

  3. Lightweight client practices

    Another practice is to avoid traffic hijacking altogether. Take the most typical RPC scenario: RPC usually always has a client. After moving to a Service Mesh, some functions of the original client, such as service discovery, load balancing, and rate limiting, can be stripped out, leaving a new lightweight client; but a client still exists.

    In that case, if the Sidecar’s access address is known, the client can send requests directly to the Sidecar without any traffic hijacking. The basic idea is to provide the Sidecar’s address through an environment variable or configuration, telling the client, for example, that the Sidecar is on port 8080 of localhost. The client SDK then simply reads the address and sends requests directly. This neatly bypasses the traffic hijacking problem.

    This is a solution we experimented with internally as an early alternative for multi-language clients. In practice, however, we found that traffic hijacking is still necessary in some cases; that is another topic to be explained in detail later.

But these three are not today’s focus. The focus is the following idea, Cilium + eBPF, which is the one we are watching most closely.

Cilium is a very new project, and its idea touches the TCP stack problem in the underlying communication.

The picture here shows the details of a network call under the iptables scheme and under the lightweight client scheme. On the left is the call between the client service and its Sidecar: you can see that it passes through the TCP stack twice, plus the iptables interception. The lightweight client scheme differs from the traffic hijacking scheme only in removing the iptables pass, avoiding iptables’ performance cost. But even without iptables, the request still goes through the whole call path: although loopback addresses are much faster than real network communication, the request still traverses the TCP stack twice, which carries a performance cost.

Cilium came up with a good idea: find a way around the TCP stack.

The advantage of the Cilium scheme is that request forwarding is completed at the socket level, with redirection implemented through SockMap technology. We will not expand on the technical details here; today is mainly about the benefit and value of the idea. The biggest benefit of the Cilium scheme is that it bypasses the two passes through the TCP stack, which leads to an unexpected, even counterintuitive result: hijacking with Cilium can be faster than a lightweight client with no hijacking at all! This could be a paradigm shift.

Think about it: traffic hijacking, as with iptables, means inserting a segment into the original call chain, adding overhead and degrading performance, right? That is the negative impression traffic hijacking most easily leaves: hijacking means cost, so the usual optimization is to reduce that cost, or to skip hijacking altogether to avoid it. Cilium gives the traffic hijacking problem a different answer: by bypassing the two TCP stack passes and other low-level details, the request reaches the Sidecar even faster!

We appreciate Cilium’s approach of reducing the performance penalty between the service and the Sidecar, because it addresses a critical issue in the Service Mesh: the trade-off between performance and architecture.

Those of you familiar with Service Mesh technology should know that Service Mesh is an art of trade-offs. Between performance and architecture, Service Mesh chooses to sacrifice performance for architecture. In a traditional intrusive framework, calls between client business code and framework code are method calls, so fast they can be ignored. The Service Mesh forcibly separates the framework and class libraries from the application, turning those method calls into remote calls, sacrificing the overhead of a remote call in exchange for optimization space across the whole architecture. This is the essential difference between a Service Mesh and a traditional intrusive framework, and the source of all the other differences.

This is one of the most important trade-offs in Service Mesh technology: the cost of a remote call is traded for a more resilient architecture and richer functionality.

The development of Service Mesh technology has accordingly gone in two broad directions: one is to keep reaping the architectural benefits, with more functions, richer usage scenarios, and various innovations; the other is to work hard on performance, minimizing the performance loss so that the benefits come at the lowest possible cost.

The four practices listed above are all steps along the second path: attempts to reduce the performance cost of traffic hijacking as much as possible.

Of course, Cilium still has problems in actual deployment. The biggest is its very high Linux kernel requirement: the minimum is 4.9, 4.10 is recommended, and some Cilium features require 4.14. Linux kernel 4.14 was only released in late 2017, and the latest kernel is only 4.18, so the kernel version Cilium requires is too new to be satisfied at deployment time. In addition, Cilium still has some security concerns, mainly because eBPF injects code directly into the kernel: good for efficiency, but inevitably a security risk.

Going forward we will track Cilium technology closely, and may also present other similar schemes. Interested students can follow our progress.

Let us continue with today’s fourth part: exploring the scope of inter-service communication.

Service Mesh initially focuses on east-west communication, that is, communication between services within the system, which is usually synchronous, using REST or RPC protocols.

In the practice of Service Mesh, we found that Service Mesh can provide the following functions:

  • Request forwarding: service discovery, load balancing, and so on
  • Routing: powerful Content Based Routing and Version Based Routing
  • Service governance: gray release based on routing, blue-green deployment, version management and control
  • Fault tolerance: rate limiting, circuit breaking, retries, and error injection for testing purposes
  • Security: identity, authentication, authorization, encryption, and so on

These capabilities can also be applied outside the Service Mesh’s original scope: we can introduce and reuse them in other areas, achieving inter-service communication coverage broader than east-west traffic alone.

The first direction to explore is the API Gateway, moving directly from east-west communication to its counterpart, north-south communication.

The main reason is that north-south communication and east-west communication overlap heavily in function: service discovery, load balancing, routing, gray release, security, authentication, encryption, rate limiting, circuit breaking... So it is a natural idea to reuse these capabilities for north-south communication.

In traditional intrusive frameworks, reuse of these capabilities is class-library based: the same libraries are introduced into the API Gateway implementation, Zuul being a typical example. Under the Service Mesh, the unit of reuse changes from class library to Sidecar: by using a Sidecar for north-south communication, the Sidecar’s request forwarding and service governance functions are reused.

The advantages of introducing the Service Mesh into the API Gateway are:

  • It unifies the two systems of microservices and API Gateway
  • It saves substantial learning/development/maintenance costs
  • The various features of the Service Mesh become available for north-south communication
  • The Service Mesh control plane also gains control over north-south communication

There are also some explorations in this direction:

  • Ambassador: a Kubernetes-native microservices API Gateway, an open source project built on Envoy
  • Gloo: a function gateway also built on Envoy, targeting not only traditional microservice API gateways but also functions under Serverless architectures
  • Kong: recently announced that with the upcoming 1.0 release it will no longer be an API Gateway but a service control platform. This is exploration in the reverse direction: cutting in from the API Gateway toward the Service Mesh.

Our idea is very clear: build a new API Gateway product based on SOFAMesh and SOFAMosn to unify east-west and north-south communication. The project has been launched and will be open-sourced in the future; if you are interested in this topic, keep an eye on it.

Some time ago, while we were weighing the Serverless direction, Google's new Serverless project Knative appeared; the timing could hardly have been better. Unlike other Serverless projects, Knative focuses on standardizing and normalizing the Serverless platform.

Knative is built on Kubernetes and Istio, with Istio handling communication between components. Within the Knative project there was a big debate about whether Istio should be introduced at all, because many felt Istio was too heavy a dependency for the relatively small set of features Knative needs from it. For those of us already running Istio, however, this is a non-issue.

At present, we are still exploring Serverless, especially Knative. Our initial ideas are as follows:

  • The Serverless direction is very important

    In particular, the emergence of Knative signals new approaches in the Serverless field and an opportunity to standardize and unify the Serverless platform

  • Kubernetes + Serverless + Service Mesh (especially an extended Service Mesh) is a strong combination

    From bottom to top (underlying infrastructure, inter-service communication, Function), they form a complete supporting stack for applications and systems.

In our follow-up product strategy, we will continue in-depth research on Knative, planning the product while building POCs. Landing the product against real business needs remains, of course, the baseline requirement. And quite naturally, we will replace stock Istio with our SOFAMesh and SOFAMosn.

For example, we are currently planning to try some typical Serverless scenarios:

  • Mini-programs
  • AI: Serverless AI Layer, a one-stop machine learning platform
  • Databus: big data processing

Here is the complete blueprint for the inter-service communication we are currently exploring and planning:

  • Service Mesh

    Responsible for east-west communication, which in practice is our SOFAMesh product, an expanded and enhanced version of Istio

  • API Gateway

    In charge of north-south communication. This is still under exploration: we are developing a new API Gateway product based on SOFAMosn and SOFAMesh

  • Serverless

    Responsible for asynchronous communication and the event-driven model, refining granularity from the service level down to the Function level; we are actively exploring and practicing with Knative

Here is our prediction: in the cloud native era, inter-service communication will belong to the Service Mesh, which strips this responsibility out of applications and sinks it downward.

That concludes the practical portion: the first four parts introduced our product practice, the problems we met while landing it, and the explorations under way. Part five is a little different and somewhat more abstract; its key point is the relationship between infrastructure and the Service Mesh, or what infrastructure means to the Service Mesh.

The keyword is Cloud Native. In June this year, the CNCF Technical Oversight Committee approved the definition of Cloud Native.

Here we'll focus on the sentence highlighted in red: cloud native technologies include containers, service meshes, microservices, immutable infrastructure, and declarative APIs.

For cloud native architecture, Ant Financial’s strategy is: embrace! Our future architecture will evolve in this direction.

For the cloud native technologies listed above:

  • Containers: Alibaba as a whole has very deep accumulation in container technology, practiced for many years, and the new Sigma 3.x will also be based on Kubernetes.
  • Microservices: their predecessor, SOA-style servitization, has been practiced at Alibaba for many years; Dubbo/HSF/SOFA are household names, and the transformation to microservices is under way.
  • Immutable infrastructure and declarative APIs: also technologies we highly recognize and have long practiced.

We understand the positioning of the Service Mesh as follows:

  • The Service Mesh is a vital link between the layers above and below
  • On one hand, it makes full use of the underlying system's capabilities
  • On the other, it provides a solid base for the applications above

One of the most important points I want to share with you today is this: the Service Mesh sinks into the infrastructure.

Looking at the evolution of the Service Mesh, from simple Proxy, to full-featured Sidecar (such as Linkerd and Envoy), to the second generation represented by Istio, the pattern is shown in the figure below:

  1. Step 1: Separate it from the application

    By turning the original in-process method invocation into a remote invocation and wrapping the class library's functionality in a Proxy, the Service Mesh separates inter-service communication from the application; it is no longer part of the application.

    This step is the easiest to accept, and also the easiest to implement: stand up a Sidecar or Proxy and move the existing class-library functionality into it.

  2. Step 2: Sink as an abstraction layer

    Once stripped out, these inter-service communication capabilities begin to sink, forming a distinct abstraction layer beneath the application: a dedicated infrastructure layer for inter-service communication. At this point the capabilities ship as a finished product, no longer as a standalone library or framework.

    The second step usually follows directly from the first: once inter-service communication has been extracted, continuing downward naturally turns it into an infrastructure layer.

  3. Step 3: Integrate into the infrastructure

    The layer keeps sinking, attaches to the underlying infrastructure, and becomes part of the platform, typically by integrating with Kubernetes.

    Istio's innovation here is substantial: not only the addition of a control plane, but also the integration with Kubernetes.
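Step 1 above can be illustrated with a small sketch: instead of linking a discovery and load-balancing library, the application sends every outbound call to a fixed local sidecar address and names only the logical target service. All names below are hypothetical, and the port merely echoes the outbound capture port Istio commonly uses; any co-located proxy address would do.

```go
package main

import "fmt"

// sidecarAddr is the only thing the application must know after step 1:
// every outbound call goes to the co-located proxy, which performs service
// discovery, load balancing, and routing on the app's behalf.
const sidecarAddr = "127.0.0.1:15001"

// targetURL builds the request target the app would use: the logical
// service name travels in the URL, and resolving it is the sidecar's job.
func targetURL(service, path string) string {
	return fmt.Sprintf("http://%s/%s%s", sidecarAddr, service, path)
}

func main() {
	fmt.Println(targetURL("inventory", "/items/42"))
	// prints "http://127.0.0.1:15001/inventory/items/42"
}
```

The contrast with the class-library approach is that nothing about discovery or balancing is compiled into the application; upgrading those capabilities means upgrading the sidecar, not redeploying the app.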

If you heard my talk at QCon last year, you may have noticed that my understanding of the Service Mesh has changed. At this time last year, I thought Istio's biggest innovation was the addition of the control plane. This year I see another key point: beyond the control plane, Istio has begun to integrate with Kubernetes and take full advantage of its capabilities. Kubernetes here stands for the underlying infrastructure onto which all these capabilities settle. In Istio there is a clear trend: the Service Mesh is merging into the underlying infrastructure and becoming part of the overall platform.

Note the subtle difference: steps 1 and 2 extract inter-service communication and sink it into an abstraction layer, but if you stop at step 2, that layer has nothing to do with the underlying infrastructure. A Linkerd or Envoy deployed this way has no relationship with the physical machine, virtual machine, or container it runs on, and exploits none of the underlying capabilities. Once you evolve to Istio (and now Linkerd 2.0 as well), you find yourself in step 3.

The future of the Service Mesh is to sink inter-service communication into the infrastructure and then build the whole architecture on the infrastructure's capabilities, rather than reducing the infrastructure to a bare operating-system abstraction ("give me CPU, give me memory, give me network, give me IO; everything else I'll do myself"). That approach does not fit the Service Mesh's future: it must integrate with the infrastructure.

Note how this differs from the traditional approach, not only technically but organizationally: it blurs the line between two traditionally separate departments. One does middleware, as I do (some companies call it the infrastructure team); the other, usually operations or systems, maintains the underlying infrastructure. In most companies these are organized separately: the people working on Kubernetes and the people working on microservice frameworks such as Dubbo and Spring Cloud are often in very different teams. Getting through step 3 requires these two departments to coordinate exceptionally well and cooperate closely to get things done.

This is the biggest lesson from our practice over the past year, and what I most hope to share with you in today's talk.

Here is a question: compared with traditional intrusive frameworks such as Spring Cloud and Dubbo,

what is the essential difference of the Service Mesh?

If I were answering this question last year, I would have told you: extract it, sink it down, and form a communication layer. Today I will add a second point: leverage the underlying infrastructure. This is something Dubbo and Spring Cloud have never done!

This is what I want to share with you today, and what I have learned from my practice in the past year:

The essential difference between the Service Mesh and Spring Cloud/Dubbo is not only that it takes inter-service communication out of the application, but that it sinks all the way down into the infrastructure layer and takes full advantage of the underlying infrastructure.

Finally, let’s sum up today’s content:

  • I introduced our SOFAMesh project. If you plan to use Service Mesh technology and want to adopt Istio, take a look at our project; it should make your landing smoother
  • I explained why we chose Golang, mainly as a long-term language-stack decision. If you are selecting a Service Mesh and, like us, intend to keep Golang and Java as your main language stacks, our plan may be a useful reference. We also hope you will build the open source SOFAMesh project with us
  • We shared some typical problems we encountered: how to support more communication protocols quickly, and how to let applications on a traditional SOA architecture benefit from the Service Mesh without code changes, achieving smooth migration. This should help anyone preparing an actual landing; time did not allow for details, so please check the materials after the conference or contact us directly
  • We discussed the scope of inter-service communication, from the original east-west traffic to north-south traffic and on to Serverless. Hopefully you now see more scenarios where the Service Mesh can be applied
  • Finally, my personal takeaway: for the Service Mesh to realize its full value, it must integrate with the underlying infrastructure and maximize the infrastructure's capabilities. This insight affects Service Mesh technology selection, product design, and even organizational relationships. It is very important, and I hope everyone interested will examine it seriously.

Service Mesh is a new thing, and new things always meet challenges and doubts, especially while they are not yet fully mature. And Cloud Native, standing behind the Service Mesh, is an unprecedented change.

We hold a wonderful vision of the future cloud native architecture, including Service Mesh, Kubernetes, microservices and more. But no new architecture or technology is achieved overnight; expecting smooth sailing would be naive.

Roads are made by people walking them, or wading through. As pioneers of Service Mesh technology in China, we admit the technology is not yet mature enough: many problems remain to be solved and many challenges to be faced. But we are confident we are on the right track, and every effort we make today brings us closer to the goal.

Lu Xun said: there were no roads on the earth to begin with, but when enough people walk, a road is made. More and more effort will pour into this Service Mesh direction, and this is a road we will eventually wade out!

The road ahead is long; let us march on!

The SOFAMesh and SOFAMosn projects are now open source on GitHub:

  • SOFAMesh: github.com/alipay/sofa…
  • SOFAMosn: github.com/alipay/sofa…

Welcome to follow the progress of these two projects, and a star in support would be great. Thank you very much!

We also hope you will join us in building these two projects. Issues and PRs are welcome!

For those interested in Service Mesh technology: the ServiceMesher community is a neutral, purely technical community that brings together most of China's Service Mesh practitioners. I am one of its founders, and its mission is to spread Service Mesh technology, strengthen exchange within the industry, foster open source culture, and promote Service Mesh adoption in enterprises.

Visit the community website www.servicemesher.com for technical information and community events, and follow the ServiceMesher WeChat public account for timely updates. We have a large translation team that, besides translating Service Mesh related technical blogs and news, is responsible for the day-to-day maintenance of the official documentation for the Envoy and Istio projects.

You are also welcome to join the ServiceMesher community's WeChat group; see the "Contact Us" page on servicemesher.com for instructions.

Finally, let me recommend my personal technical blog, skyao.io; you are welcome to visit and exchange ideas.

That's all for today. Thank you very much for listening, and see you next time!

About the author

Ao Xiaojian: senior coder with 16 years of software development experience, microservices expert, Service Mesh evangelist, and co-founder of the ServiceMesher community. He focuses on infrastructure and middleware, adheres to Cloud Native, practices Agile, and is an architect who stays on the development front line to hone his craft. He has worked at AsiaInfo, Ericsson, and Vipshop, and now works in Ant Financial's middleware team on products such as Service Mesh.

ServiceMesher community information

WeChat group: contact me to join

Community website: www.servicemesher.com

Slack: servicemesher.slack.com (invitation required)

Twitter: twitter.com/servicemesh…

GitHub: github.com/servicemesher

For more Service Mesh information, follow the WeChat public account ServiceMesher.