Ant Financial has accumulated years of experience in service-oriented architecture, which supports the annual Double 11 traffic peak. As a new direction for microservices, Service Mesh has become a hot topic in the field over the last two years. However, there is almost no reference experience on how to evolve from a classical service-oriented architecture toward Service Mesh, or what problems may be encountered along the way.

In this article, we will share the evolution of Service Mesh at Ant Financial, together with highlights of the Q&A between Ant Financial's senior technical experts and on-site attendees at the GIAC Global Internet Architecture Conference in June 2018.

Preface

Over the past period, Ant Financial has begun to use Service Mesh to help solve some architectural problems, and we have gained some experience in how to combine Service Mesh with a classical service-oriented architecture. We hope to share this practice with you, so that you can better understand Ant Financial's current service-oriented architecture, how Service Mesh solves the problems in that architecture, and some of the design considerations and future plans behind Ant Financial's actual implementation of Service Mesh. I also hope to share the status quo of Ant Financial's service-oriented architecture with the industry.

It has been nearly 10 years since Ant Financial moved from a monolithic application to a service-oriented architecture. Along the way, to meet Ant Financial's requirements, we also built a distributed architecture solution for finance, namely SOFA.

SOFA actually includes financial-grade distributed middleware, CI/CD, and PaaS platforms. The SOFA middleware includes the SOFABoot development framework, the SOFA microservice-related frameworks (RPC, service registry, batch framework, dynamic configuration, etc.), message middleware, distributed transactions, and distributed data access middleware.

All the middleware mentioned above is based on the Java technology stack. Currently, SOFA is used in more than 90% of Ant Financial's internal systems. The remaining 10% are developed with Node.js, C++, Python, and other technology stacks, and we would like to integrate them into the overall SOFA architecture. One way to do this is to rewrite a client for each piece of SOFA middleware in each of those languages.

In fact, this is exactly what we did before: Ant Financial had a Node.js client for the SOFA components. However, in recent years, with the rise of AI and other fields, C++ has been used in more and more places inside Ant Financial. Should we rewrite every SOFA middleware client in C++ as well? Supporting C++ systems this way first runs into a cost problem: each language gets its own set of middleware clients, and these clients are like chimneys that need to be maintained and upgraded independently. In terms of stability, the pitfalls the Java clients have already stepped into may have to be stepped into again by every other language.

With Service Mesh, we can move as many functions as possible from the middleware client into the Sidecar. In this way, one implementation serves all languages, which is a win in both cost and stability for the infrastructure team.
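
To make this concrete, here is a minimal sketch (with hypothetical port, header, and service names, not the actual SOFA client API) of what a "thin" client looks like once the heavy lifting moves into the Sidecar: the application, in any language, only needs to speak plain HTTP to a local port, and the Sidecar handles discovery, routing, and retries.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// callThroughSidecar sends a request to a local Sidecar listening on 127.0.0.1:12220
// (a hypothetical port). The target service is carried in a header; the Sidecar
// resolves it via the registry and forwards the request, so this "client" needs
// no service discovery, load balancing, or retry logic of its own.
func callThroughSidecar(service, method, payload string) (string, error) {
	req, err := http.NewRequest("POST", "http://127.0.0.1:12220/"+method, strings.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("X-Target-Service", service) // hypothetical header understood by the Sidecar
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	out, err := callThroughSidecar("com.example.AccountService", "queryBalance", `{"accountId":"123"}`)
	if err != nil {
		fmt.Println("call failed:", err)
		return
	}
	fmt.Println("response:", out)
}
```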

The other question is one that every company faces when moving to a cloud native architecture. Cloud native looks great, but how do we evolve to it? In particular, what should we do about legacy systems? One simple and crude way would be to rewrite everything on top of cloud-native infrastructure, but the cost would be very high, and rewriting means introducing bugs that could cause stability problems in production. So, is there a way for legacy systems to take advantage of cloud native more easily? With Service Mesh, we can attach a Sidecar to a legacy system so that it gains service discovery, rate limiting and circuit breaking, fault injection, and so on, without even modifying the legacy system's configuration.
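
As an illustration of how a Sidecar can layer such capabilities onto a legacy system without touching its code, here is a minimal sketch, with illustrative ports and limits, of a reverse proxy that adds simple rate limiting in front of an unmodified upstream. A real Sidecar adds circuit breaking, fault injection, and discovery at the same interception point.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"

	"golang.org/x/time/rate"
)

func main() {
	// The unmodified legacy service keeps listening on its original port (assumed 8080);
	// inbound traffic is pointed at the Sidecar instead.
	upstream, _ := url.Parse("http://127.0.0.1:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// A token-bucket limiter: at most 100 requests/second with a burst of 50 (illustrative numbers).
	limiter := rate.NewLimiter(100, 50)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "rate limited by sidecar", http.StatusTooManyRequests)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	// The Sidecar takes over the externally visible port (assumed 8081 here).
	log.Fatal(http.ListenAndServe(":8081", handler))
}
```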

Finally, one of the problems we ran into at Ant Financial during the service-oriented journey is middleware upgrades. Ant Financial evolved from a monolithic application to a service-oriented architecture, then to a unitized (cell-based) architecture, and then to an elastic architecture, and every step was accompanied by a large number of middleware upgrades. Each time the middleware ships a new version with new capabilities, the business systems also have to upgrade the middleware they depend on, and if a bug turns up in the process, they have to upgrade yet again. Not only do the middleware developers suffer, the application developers suffer as well.

In our evolution from a monolithic application to a service-oriented architecture, we went from several teams maintaining the same application to each team maintaining applications in its own domain and communicating with other teams through interfaces. This achieved the maximum level of decoupling between the business teams, but the infrastructure team is still coupled to every business team.

We tried various methods to solve this coupling problem during upgrades. One was to manage all the basic class libraries through CloudEngine, an application server we developed ourselves, to reduce the upgrade cost to users as much as possible: instead of relying on users to upgrade libraries one by one, we could upgrade them in one go.

However, as Ant's business kept growing, the number of teams, the scale of the business, and our delivery efficiency became the main contradiction. We wanted to develop infrastructure more efficiently, without the iteration of the infrastructure being constrained by this scale.

Later, OceanBase, a database developed in-house at Ant, also used a Proxy to shield applications from OceanBase's cluster load balancing, failover, and other logic, and the Sidecar model of Service Mesh follows exactly the same idea. We see that moving infrastructure capabilities out of the application and into the Sidecar is an industry-wide trend. In this way, the application and the middleware infrastructure become two separate processes, and from then on middleware and infrastructure can be upgraded independently, no longer tied to application releases. This not only frees both the application developers and the infrastructure team, it also makes the infrastructure team's delivery much faster: where it previously took six months, a year, or even longer for a new capability to reach all business systems, we can now make new capabilities available to all business systems within about a month. It also concentrates more capability in the infrastructure layer: we can place some of the key support points and logic of the architecture into the Sidecar, because the overall Ant Financial architecture carries a great deal of logic and capability, and we have a major responsibility to keep it moving forward quickly and flexibly.

Selection of Service Mesh

Earlier, I mentioned the problems Ant Financial encountered under its current service-oriented architecture and the problems we hoped to solve with Service Mesh. We then faced a very practical question: how should we choose a Service Mesh framework, and by what criteria should we measure it? Here I would like to share Ant Financial's thinking on this selection.

First of all, no architectural evolution happens overnight; it is a gradual process, and the larger the company, the more this has to be taken into account. So when selecting a framework, we need to consider how well the target framework fits the current architecture. On the other hand, as a company dealing with money, we need to pay special attention to whether the target framework has been validated at scale in production, and in what scenarios it has been validated.

At present, there are three mainstream Service Mesh frameworks in the industry: Istio, in which Google, IBM, and Lyft are all involved, and Linkerd and Conduit, two open-source Service Mesh frameworks from Buoyant.

First, let's take a look at Istio. Istio is the Service Mesh framework that attracts the most attention at present and is backed by top companies such as Google and IBM. It is also a complete solution, including both a Data Plane and a Control Plane. However, Istio has long been challenged on the Mixer part of its Control Plane. Istio's Mixer handles service authentication, quota control, tracing, metrics, and so on, and it is a central node: if caching is not enabled, every call has to pass through the Mixer, and even with caching enabled, requests inevitably still reach the Mixer. At Ant there are more than 20,000 services, and calls between services are very frequent, so the Mixer becomes a single point, and operating that single point and keeping it highly available becomes a real problem.
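
To see why the Mixer stays on the critical path even with caching, consider this simplified sketch of the check-with-cache pattern (an illustration of the pattern only, not Istio's actual code): cache hits avoid the remote call, but every cache miss still has to reach the central policy service, so its availability and latency directly affect service calls.

```go
package policycache

import (
	"errors"
	"sync"
	"time"
)

type checkResult struct {
	allowed   bool
	expiresAt time.Time
}

// policyChecker caches results of remote policy checks, the way a proxy might
// cache Mixer "Check" responses.
type policyChecker struct {
	mu    sync.Mutex
	cache map[string]checkResult
	// remoteCheck stands in for the RPC to the central policy service.
	remoteCheck func(key string) (bool, error)
}

func (p *policyChecker) Check(key string) (bool, error) {
	p.mu.Lock()
	if r, ok := p.cache[key]; ok && time.Now().Before(r.expiresAt) {
		p.mu.Unlock()
		return r.allowed, nil // cache hit: no remote call
	}
	p.mu.Unlock()

	// Cache miss: the request still depends on the central node being up and fast.
	allowed, err := p.remoteCheck(key)
	if err != nil {
		return false, errors.New("policy service unavailable: " + err.Error())
	}
	p.mu.Lock()
	p.cache[key] = checkResult{allowed: allowed, expiresAt: time.Now().Add(5 * time.Second)}
	p.mu.Unlock()
	return allowed, nil
}
```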

In addition, the performance of Istio has always been a concern, although it has improved with each release. Looking at Istio's performance data: about 700 TPS at 0.5.1, 1,000 TPS at 0.6.0, and 1,700 TPS at 0.7.1. A typical RPC communication framework delivers at least tens of thousands of TPS, so Istio's numbers look rather dismal and do not meet Ant's performance requirements at all.

Linkerd is one of the most mature Service Mesh frameworks in the industry, but it also has problems. First, Linkerd is derived from Twitter's Finagle, and its architecture is not open enough to adapt to Ant's environment. Linkerd also has no Control Plane layer, only the Sidecar, and its routing rule language, dtab, is quite difficult to understand. Linkerd is written in Scala and runs on the JVM. I took an image from a Linkerd blog post about how to optimize memory usage on the JVM, which is a common problem; from that picture we can see that Linkerd needs at least around 100 MB of memory, which is also why Buoyant does not officially recommend deploying Linkerd one-to-one with the application, recommending a DaemonSet deployment instead. One of the things we wanted was one-to-one deployment with the application, and that memory cost was too expensive for us; we wanted to keep the Sidecar around 10 MB.

Finally, let's look at Conduit. Conduit is a Service Mesh framework that the Linkerd team introduced a while ago, and it is not very mature yet. Secondly, Conduit's language of choice is Rust. Looking at Rust's ranking on the TIOBE index: Java has long been number one, C++ is number three, and Golang has risen to 14th on the back of the cloud infrastructure boom in recent years, while Rust, like many other niche languages, ranks below 50.

Therefore, we finally chose to develop our own Service Mesh. On the one hand, we judged the popular Service Mesh frameworks in the industry against the two criteria above, and none of them could fully meet our requirements. On the other hand, Ant Financial has a long and deep accumulation in service-oriented architecture, and the lessons learned there can also help us build our own Service Mesh framework.

Of course, we did not want to build a Service Mesh framework completely from scratch. We wanted to absorb the best ideas from existing Service Mesh frameworks, and at the same time follow the specifications of the Service Mesh community as much as possible.

Design of SOFA Mesh

First of all, SOFA Mesh actually reuses Istio's Control Plane components Pilot and Auth directly, because we do not think Istio has problems in this area; it even contains some very good design. For example, the universal data-plane API in the Pilot part is a very good design, and the Auth part of Istio takes full advantage of Kubernetes' security mechanisms.

For the Mixer, as I mentioned earlier, we felt there was a design problem, so our idea was to move the Mixer directly into the Sidecar.

Next, we all know that Istio's Sidecar is Envoy, which is written in C++, so how can we move the Mixer into the Sidecar? In fact, the Sidecar of SOFA Mesh is written in Golang, and that is what makes moving the Mixer into the Sidecar possible. Of course, we chose Golang for the Sidecar not only to move the Mixer into it, but also for other reasons. In the era of cloud computing, Golang has become the preferred language for building infrastructure: we see a lot of infrastructure written in Golang, including Docker, Kubernetes, and so on. Choosing Golang lets us fit better with these infrastructures in the cloud native era.
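
The following sketch, with made-up names rather than the actual SOFA Mesh code, illustrates the idea of folding Mixer-style capabilities (quota checks, metrics, tracing) into the Sidecar process itself as a chain of in-process filters, so no per-request RPC to a central node is needed.

```go
package mesh

import (
	"errors"
	"log"
	"time"
)

// Request is a simplified view of a proxied call.
type Request struct {
	Service string
	Caller  string
}

// Filter is one in-process stage; what used to be a remote Mixer check becomes a local call.
type Filter interface {
	Apply(req *Request) error
}

type quotaFilter struct {
	remaining map[string]int // caller -> remaining quota (illustrative only)
}

func (f *quotaFilter) Apply(req *Request) error {
	if f.remaining[req.Caller] <= 0 {
		return errors.New("quota exceeded for " + req.Caller)
	}
	f.remaining[req.Caller]--
	return nil
}

type metricsFilter struct{}

func (metricsFilter) Apply(req *Request) error {
	// In a real Sidecar this would update counters/histograms and report asynchronously.
	log.Printf("metric: call to %s from %s at %s", req.Service, req.Caller, time.Now().Format(time.RFC3339))
	return nil
}

// FilterChain runs every filter before the request is forwarded upstream.
type FilterChain []Filter

func (c FilterChain) Apply(req *Request) error {
	for _, f := range c {
		if err := f.Apply(req); err != nil {
			return err
		}
	}
	return nil
}
```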

In addition, Golang is much easier to learn and hire for than the C++ of Envoy, and Golang has a much smaller memory footprint than the JVM. Our Sidecar written in Golang currently peaks at 11 MB of memory under our TPS, which still leaves room for optimization, but is already roughly a tenth of what the JVM needs.

In addition, although we adopted Istio's Pilot, using it directly was not enough for internal use. First, in Kubernetes the Pilot connects directly to Kubernetes' service discovery mechanism, but SOFARPC, Weibo's Motan, and other domestic service frameworks actually follow a model where a single application exposes multiple services, whereas Kubernetes' service discovery is aimed at a single-application, single-service model, so the models do not match well. Moreover, SOFARegistry, SOFA's service registry, has been running at Ant Financial for many years and has been proven scalable and reliable in large-scale internal service scenarios. Here are some figures for SOFARegistry: there are about 20,000 registered services, and the pubs and subs in a single data center add up to the tens of millions. Based on these considerations, we chose to add a SOFARegistry Adapter to the Pilot so that it can obtain service registration information from SOFARegistry.
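
Conceptually, the adapter gives Pilot a pluggable source of service discovery data. The sketch below uses a hypothetical interface rather than Istio Pilot's actual Go API, but shows the shape of the integration: Pilot asks the adapter for services and instances, and the adapter answers from data subscribed from SOFARegistry, which naturally covers the one-application-many-services model.

```go
package registry

// ServiceInstance is a simplified view of one published service endpoint.
type ServiceInstance struct {
	ServiceName string // one application may publish many named services like this
	Address     string
	Port        int
	Labels      map[string]string
}

// DiscoverySource is a hypothetical stand-in for the discovery interface Pilot consumes.
type DiscoverySource interface {
	Services() ([]string, error)
	InstancesOf(service string) ([]ServiceInstance, error)
}

// sofaRegistryAdapter answers Pilot's queries from data subscribed from SOFARegistry.
type sofaRegistryAdapter struct {
	// instances is kept up to date by SOFARegistry push notifications (subscription callbacks).
	instances map[string][]ServiceInstance
}

func (a *sofaRegistryAdapter) Services() ([]string, error) {
	names := make([]string, 0, len(a.instances))
	for name := range a.instances {
		names = append(names, name)
	}
	return names, nil
}

func (a *sofaRegistryAdapter) InstancesOf(service string) ([]ServiceInstance, error) {
	return a.instances[service], nil
}
```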

Pilot also has another problem: originally, all service registration data is synchronized to every Pilot node, which puts a lot of pressure on the Pilot cluster. So we chose to synchronize only the necessary data to a given Pilot node, which relieves the Pilot's memory pressure.
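
How "only the necessary data" might be selected is sketched below, again with hypothetical names: a Pilot node tracks which services its connected Sidecars actually depend on, and subscribes to the registry only for that subset instead of mirroring the full 20,000-service dataset.

```go
package registry

import "sync"

// onDemandSyncer keeps one Pilot node subscribed only to the services its Sidecars need.
type onDemandSyncer struct {
	mu          sync.Mutex
	refCount    map[string]int        // service -> number of local Sidecars that depend on it
	subscribe   func(service string)  // start syncing this service from the registry
	unsubscribe func(service string)  // stop syncing this service
}

func newOnDemandSyncer(sub, unsub func(string)) *onDemandSyncer {
	return &onDemandSyncer{refCount: map[string]int{}, subscribe: sub, unsubscribe: unsub}
}

// SidecarNeeds is called when a Sidecar connects and reports its dependencies.
func (s *onDemandSyncer) SidecarNeeds(services []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, svc := range services {
		if s.refCount[svc] == 0 {
			s.subscribe(svc) // first consumer on this node: start syncing this service
		}
		s.refCount[svc]++
	}
}

// SidecarGone releases the dependencies when a Sidecar disconnects.
func (s *onDemandSyncer) SidecarGone(services []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, svc := range services {
		s.refCount[svc]--
		if s.refCount[svc] <= 0 {
			delete(s.refCount, svc)
			s.unsubscribe(svc) // no consumer left: stop syncing this service
		}
	}
}
```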

Finally, let me share a scenario specific to Ant Financial. Because of business and regulatory constraints, the network between certain zones may not be reachable, so if services need to call each other across environments, there must be a role that handles service access across environments. Based on the Sidecar concept, we proposed the role of the EdgeSidecar, which is very similar in technical implementation to the Sidecar deployed alongside an application, except that it plays an "edge" role responsible for cross-environment service communication.
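
A rough sketch of the EdgeSidecar idea, with hypothetical names and routing, is shown below: an "edge" proxy sits at the boundary of each environment and forwards calls whose target lives in another, otherwise unreachable, environment.

```go
package mesh

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// edgeSidecar forwards requests across environment boundaries. The mapping from
// environment name to that environment's edge entry point is illustrative only.
type edgeSidecar struct {
	peers map[string]*url.URL // e.g. "zone-b" -> the edge entry point of zone-b (hypothetical)
}

func (e *edgeSidecar) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	targetEnv := r.Header.Get("X-Target-Environment") // hypothetical header set by the calling Sidecar
	peer, ok := e.peers[targetEnv]
	if !ok {
		http.Error(w, fmt.Sprintf("no route to environment %q", targetEnv), http.StatusBadGateway)
		return
	}
	// Delegate the actual forwarding to a reverse proxy pointed at the peer edge.
	httputil.NewSingleHostReverseProxy(peer).ServeHTTP(w, r)
}
```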

So, the overall picture of SOFA Mesh looks roughly like this: we developed a Golang Sidecar and folded the Mixer into it to avoid Istio's performance problems; for the Pilot and Auth roles we chose to use Istio directly, adapting them to Ant's internal environment; and we added the EdgeSidecar role to the overall deployment to solve cross-environment service invocation.

I know you are very interested in how SOFA Mesh is being rolled out inside Ant. So far, we have covered the multi-language scenario, solving communication between other languages and SOFA, and have onboarded roughly 20 to 30 systems. We are now trying to use SOFA Mesh to better secure calls between services and to handle blue-green deployment, and in the near future we will also try to use SOFA Mesh for communication between heterogeneous systems.

Of course, the rollout of SOFA Mesh inside Ant cannot be separated from the open source community. In the next two or three months, we will open source SOFA Mesh and the results of our Service Mesh practice at Ant, to provide more references in this area.

As for the future, I think the convergence of middleware, as infrastructure, with the cloud platform is an irresistible trend. Beyond Service Mesh, there may also be Message Mesh, DB Mesh, and other products in the future; I know some teams in the industry have already started working in this direction. Finally, to summarize today's talk: first, the problems Service Mesh solves for Ant Financial include multi-language support, legacy systems, and the coupling between the infrastructure team and the business teams; second, in selecting a Service Mesh, we mainly considered compatibility with the current architecture, as well as the high availability and stability of the framework itself; and third, beyond Service Mesh there will likely be other meshes in the future, and further convergence between middleware and the underlying cloud platform is inevitable. Thank you very much!

The following is the Q&A about Service Mesh between Ant Financial's senior technical experts and on-site participants at the GIAC conference. We have selected some of the most popular questions to share with you.

I. Can you explain the high availability and security of the Mesh in more detail?

A: Recently, we have been working on security, which involves two aspects. One is authorization of the whole RPC service call, which can be handled directly in the Mesh using Istio's RBAC. The other is TLS authentication between Mesh and Mesh; Istio actually has some ready-made solutions here, and its integration with Kubernetes is very good, so these can be taken and used directly.
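
As a concrete, if simplified, example of the TLS side, here is a sketch of mutual TLS between two Sidecars using Go's standard library; the file paths and port are illustrative, and in Istio the certificates would be issued and rotated by the Auth component rather than loaded from static files like this.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// Load this Sidecar's certificate and the mesh CA (paths are illustrative).
	cert, err := tls.LoadX509KeyPair("/etc/certs/cert-chain.pem", "/etc/certs/key.pem")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("/etc/certs/root-cert.pem")
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	server := &http.Server{
		Addr: ":15006", // illustrative inbound port for mesh traffic
		TLSConfig: &tls.Config{
			Certificates: []tls.Certificate{cert},
			ClientCAs:    caPool,
			// Require the calling Sidecar to present a certificate signed by the mesh CA.
			ClientAuth: tls.RequireAndVerifyClientCert,
		},
	}
	log.Fatal(server.ListenAndServeTLS("", "")) // certificates come from TLSConfig
}
```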

II. How do you solve multi-version routing of services and multi-version routing of data units?

A: Service Mesh focuses on service invocation, so let me explain multi-version routing. Internally, we rarely use explicit service versions; more often there are different implementations of the same service. If you are familiar with Kubernetes Labels, you can borrow that design in the Mesh and distinguish implementations by different labels. There will be more sharing on this later.
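
A minimal illustration of label-based selection (hypothetical types, not Kubernetes or SOFA Mesh APIs) is below: instead of maintaining explicit version numbers, the Sidecar picks instances whose labels match the routing rule, for example implementation=gray or unit=RZ01.

```go
package mesh

// Instance is one endpoint of a service, with free-form labels such as
// {"implementation": "gray", "unit": "RZ01"} (values here are illustrative).
type Instance struct {
	Address string
	Labels  map[string]string
}

// selectByLabels returns the instances whose labels contain every key/value in the selector,
// mirroring how a Kubernetes-style label selector narrows down endpoints.
func selectByLabels(instances []Instance, selector map[string]string) []Instance {
	var out []Instance
	for _, inst := range instances {
		match := true
		for k, v := range selector {
			if inst.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, inst)
		}
	}
	return out
}
```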

III. Does Service Mesh mainly solve the problems of reliable request transmission and service governance?

A: Service Mesh provides a better way to solve reliable request transmission and service governance. Previously, if you wanted to build a complete service governance architecture, all of your upper-layer business systems had to integrate with the corresponding governance components. Now, with a Service Mesh, you can do service governance in the Sidecar. It does not solve new problems; it solves old problems in a better way.

IV. Why is the Control Plane important for the Mesh?

A: This actually involves the integration between the whole cloud platform and our whole service system. As you can see, the Pilot part, which is very strong in the original Istio design, is integrated with Kubernetes; without it, the Mesh is just another piece of upper-layer middleware. Of course, you could have no Control Plane layer and only the Sidecar, connected to the original service governance system; that works, and there is no big problem with it. But with the Control Plane, which defines a very general API, the architecture itself becomes closely bound to the overall architecture of the cloud platform and integrates with it much better. That is why we think the Control Plane layer is very important.

In addition, Istio's Control Plane is a big step forward in the standardization of microservices. It defines many standards for service discovery and for governance, and even though it is a bold concept with some assumptions, and we have seen some of its shortcomings, we want to work with the community to standardize this layer. As I shared at the beginning, the infrastructure is absorbing capability layer by layer; we believe more and more middleware will eventually sink into the infrastructure. There are also "cloud native" languages; we compiled some of them and found them slow and problematic today, but we think it is a direction worth pursuing: when you write in such a language, a lot of capability comes for free. We want to push the infrastructure up a little to play that role. This is what we see as the greatest value of the Control Plane.