Singles’ Day 2019 was an important moment for Ant Financial, which deployed Service Mesh at large scale and saw the event through smoothly. Right afterwards, we spoke with the person in charge of the rollout.

The interview opened with Hua Rou asking: “How does it feel to have been on duty this Double Eleven with Service Mesh in production?” “Service Mesh is really stable.”

About the person behind the rollout

Service Mesh is the core of Ant Financial’s next-generation architecture. This year Ant Financial rolled out Service Mesh at large scale; I was fortunate to lead the effort, face the challenge head-on, and pass the grand exam of Double 11 smoothly.

I personally focus on microservices. I have worked on service registration and service frameworks for many years, led the design and implementation of the fifth-generation SOFARegistry, and continue to explore new directions in the evolution of microservices architecture. I was also responsible for the architecture design and implementation of the internal Service Mesh effort in the evolution of Ant Financial’s fifth-generation architecture.

SOFAStack:github.com/sofastack

SOFAMosn:github.com/sofastack/s…

SOFARegistry:github.com/sofastack/s…

Service Mesh at Ant Financial

Ant Financial has been watching Service Mesh for a long time; it launched the ServiceMesher community in 2018, which now has over 4,000 active developers. On the technical side, our Service Mesh work has moved past the exploratory phase and entered deep water this year.

Singles’ Day 2019 was a big moment for us: a massive rollout, probably the largest Service Mesh practice in the industry so far. Facing world-class traffic as an engineer is both nerve-racking and exciting. What happens when Service Mesh meets Singles’ Day? What role does Service Mesh play in the continuous evolution of Ant Financial’s LDC architecture? We reveal the answers through four “Double 11 exam questions”.

Service Mesh background

The concept of Service Mesh has been popular in the community for a long time, and there are plenty of background articles on our public account, so I will not repeat them here.

istio.io/

The Istio architecture diagram clearly shows the two core concepts of Service Mesh: the data plane and the control plane. The data plane is the network proxy: it intercepts and forwards traffic on the service request path, and can perform service routing, link encryption, and service authentication along the way. The control plane is responsible for service discovery, service route management, and request metrics (whether metrics belong on the control plane is debated).

We will not repeat the benefits Service Mesh brings. Let’s look at Ant Financial’s data plane and control plane products, shown below:

**Data plane: SOFAMosn.** A high-performance network proxy that Ant Financial developed in Golang, serving as the Service Mesh data plane and carrying the massive core application traffic of Ant Financial’s Double Eleven.

**Control plane: SOFAMesh.** Based on Istio, streamlined down to Pilot and Citadel during the rollout, and integrated directly with the data plane to avoid the overhead of an extra hop.

2019 Service Mesh Double Eleven Exam Revealed

SOFAMosn and SOFAMesh went through numerous large-scale drills to ensure a smooth Double 11. This year, more than 100 core Ant Financial applications were fully connected to SOFAMosn for Double Eleven, with hundreds of thousands of Mesh containers in production. At peak, SOFAMosn carried tens of millions of QPS, with an average forwarding time of 0.2 ms.

Access at such scale confronted us with extremely complex scenarios, requiring full cooperation across teams and a data plane whose performance and stability could meet the demands of the big promotion. The whole process was extremely challenging. Below we share the problems we hit along the way and how we solved them.

Double Eleven exam questions

  1. How to maximize the business value of Service Mesh?
  2. How to connect hundreds of thousands of containers to SOFAMosn?
  3. How to upgrade hundreds of thousands of SOFAMosn instances?
  4. How to ensure the performance and stability of Service Mesh meet the standard?

Landing architecture

To better understand the solutions to the problems above, and the terms used in what follows, let’s first look at the overall architecture of the Service Mesh rollout:

The above architecture diagram is mainly divided into the following parts:

  1. Data plane: Using the Kubernetes Pod model, the standalone SOFAMosn image is scheduled together with the App image in the same Pod, sharing the same network namespace, CPU, and memory. Once SOFAMosn is in place, all of the App’s RPC traffic and message traffic flows through SOFAMosn; the App no longer interacts with the infrastructure directly. SOFAMosn connects directly to the service registry for service discovery, to Pilot for configuration delivery, and to the MQ Server for sending and receiving messages.
  2. Control plane: The control plane consists of components such as Pilot, Citadel, and service registry. It delivers service addresses, service routes, and certificates.
  3. Underlying support: Sidecar access and upgrades rely on Kubernetes capabilities, with Sidecar injection done via Webhook and Sidecar version upgrades via Operator; these operational actions are inseparable from the underlying support.
  4. Product layer: operational capabilities encapsulated on top of the atomic capabilities the lower layers provide, including monitoring integration that collects the Metrics exposed by the Sidecar for monitoring and alerting, plus traffic control, security, and other product capabilities.

Ant Financial’s answer sheet

1. How to maximize the business value of Service Mesh?

As technologists, we insist on not innovating for innovation’s sake: technology should serve and drive the business. With the changes in consumption habits and online behavior in recent years, the business scenarios Singles’ Day faces are far more complicated than expected. For example, you may have noticed live-stream shopping, such as Li Jiaqi’s Taobao streams: the hosts hand out red envelopes and launch new items continuously and in real time, producing business scenarios and volumes far more complex than a simple flash sale. The promotion patterns are also richer; different scenarios involve different applications, and each class of application needs sufficient resources to handle its own traffic peak.

Suppose the operations team schedules two activities at different points in time, mapping to two different classes of applications. If both classes are provisioned with ample resources in advance, they can ride out their peaks easily; but the peaks are short, so a large amount of resources sits idle most of the time, which naturally does not satisfy engineers.

So how do you get through the crunch without adding resources?

The core problem is how to relocate resources at scale within a short enough window, so that one batch of machines can carry different promotion peaks at different points in time.

What are the solutions to this challenge?

When we piloted Service Mesh during the 618 promotion, we explained why we wanted to do this. The core value of Service Mesh is decoupling the business from the infrastructure so that both sides can evolve in parallel and move forward quickly. So what concrete value can this parallel evolution bring to the business? Besides making basic components easier to upgrade, we also explored resource scheduling, as follows:

1.1 Relocation scheduling

When it comes to resource scheduling, the simplest approach is to move resources outright: after promotion A’s peak, scale down A’s applications to release their resources, then scale up promotion B’s applications onto them. This is straightforward, but when the resource pool is very large, scaling in and out is inefficient: applications must request resources and then start up, which takes a long time. If the interval between promotions is short, the relocation can easily take longer than the interval itself, so this scheme could not meet the demands of Double 11’s back-to-back promotions.

1.2 Time-sharing scheduling

The biggest problem with relocation scheduling is that applications must cold-start; the cold start itself, plus the warm-up after it, dominates the overall relocation time. If applications could hand over resources without restarting, overall scheduling would be much faster. Based on the idea that an application can complete a resource handover without restarting, we proposed a time-sharing scheduling scheme: through an overselling mechanism, time-sharing scheduling provisions enough containers up front.

We divide the application containers in the resource pool into two states (taking a 4C8G container as an example):

  • **Running:** a running application executes at full speed (up to 4C8G of resources available), with sufficient resources, and carries 100% of its traffic.
  • **Live:** a live application runs at low speed (up to 1C2G of resources available), carries only 1% of the traffic, and the remaining 99% is forwarded by SOFAMosn to running nodes.

The live and running states can be switched quickly. A switch only requires a JVM memory swap plus a change of the SOFAMosn-based traffic ratio, so resources can be shifted rapidly between applications A and B.

SOFAMosn’s ability to switch traffic ratios is the core of time-sharing scheduling. Only with this traffic forwarding can an application whose JVM memory has been swapped out keep a trickle of traffic to maintain its connections, such as those to the database. Through time-sharing scheduling we hit our Double Eleven targets while saving substantial cost, which is exactly the kind of business value Service Mesh is meant to bring.

Each company takes a different path to implementing Service Mesh. Our advice: do not innovate for innovation’s sake; explore the actual value Service Mesh can bring to your current business, set goals based on that value, and land it step by step.

2. How to reach the goal of connecting hundreds of thousands of containers to SOFAMosn?

For data plane access, those familiar with community solutions will think of Istio’s Sidecar Injector: Istio injects Envoy through a Sidecar Injector integrated with Kubernetes. The standard Kubernetes update behavior is Pod re-creation: a new Pod is created and the old one destroyed, so the Pod IP changes during the update, and so on.

In Ant Financial’s scenario, and at many domestic companies using Kubernetes to upgrade their internal operations stack, the underlying resource scheduling is replaced first and the upper PaaS products later, because the existing PaaS cannot be transformed into cloud-native form overnight. Replacing the underlying scheduler with Kubernetes first therefore makes sense. This path requires compatibility between the non-Kubernetes and Kubernetes systems, such as keeping the Pod IP unchanged across updates. With these capabilities, the upper PaaS can operate much as it did in traditional VM scenarios, while release deployment changes from uploading and unpacking a zip package to image-based releases; in more advanced scenarios, an Operator can drive a more end-state-oriented operations model.

When we introduced the landing architecture of the Service Mesh in Figure 4 above, we saw that SOFAMosn and the application App are scheduled in the same Pod. Before an application can be connected to SOFAMosn, then, it must first be packaged as an image so it can be managed by Kubernetes; after that comes SOFAMosn injection, followed by an in-place update.

2.1 Replacement Access

In the initial stage of the rollout, SOFAMosn was mainly connected via replacement access. To connect an application this way, we create a new Pod with SOFAMosn injected at creation time; once the new Pod is up, the old container is scaled down. We say “old container” rather than “old Pod” because, early in the rollout, our internal scheduler Sigma (Ant Financial’s Kubernetes distribution) was still gradually replacing the old underlying layer, and some containers were not yet managed by Kubernetes and had no Pod concept.

The biggest problem with replacement access is that it requires a sufficient resource buffer; otherwise new Pods cannot be scheduled and the rollout stalls. The access cycle is also long, including resource requests and the data initialization of all the surrounding systems attached to them. This approach therefore could not meet our target of connecting hundreds of thousands of containers in a short time.

2.2 In-place Access

Given the known problems of replacement access, we combined the effort with the transformation of the resource scheduling layer, greatly simplified the scheme, and provided in-place access: containers not yet managed by the Kubernetes cluster are first replaced with Kubernetes-managed Pods, and then SOFAMosn is injected by adding the Sidecar directly during a Pod upgrade. This eliminates the resource buffer cost of scaling out, and the whole access process becomes simple and smooth.

In-place access breaks the principle that injection happens only at Pod creation, which introduces extra complexity. For example, if the original Pod has 4C8G of resources, SOFAMosn cannot simply consume additional resources after injection: any new resources would escape the capacity control system. We therefore let the App container and the SOFAMosn container share resources. SOFAMosn and the App share the 4C of CPU with equal preemption weight; for memory, SOFAMosn’s limit is a quarter of the App’s, so with 8G of memory SOFAMosn sees 2G while the application still sees 8G. When the application uses more than 6G of memory, the OOM killer kills SOFAMosn first, so the application can still obtain memory.

The CPU preemption ratio and memory limit ratio above came out of actual load testing. We initially set SOFAMosn’s CPU to 1/4 of the application’s and its memory to 1/16, which worked well for low-traffic applications. But for core applications with tens of thousands of connections, memory and CPU were too tight; in particular, SOFAMosn’s weaker CPU contention lengthened RT and indirectly affected the application. After tuning and validating the parameters for stable operation, we settled on a 1:1 CPU share between SOFAMosn and the App, and a SOFAMosn memory limit of 1/4 of the App’s.

With in-place injection, we achieved a much smoother access mode, which greatly accelerated connecting massive numbers of containers to SOFAMosn.

3. How to upgrade hundreds of thousands of SOFAMosn instances?

Once massive numbers of containers are connected to SOFAMosn, the bigger challenge is upgrading the SOFAMosn version quickly. We have always emphasized that the core value of Service Mesh is decoupling the business from the infrastructure layer so both can move forward faster; but without fast upgrades, that speed is useless. Any software can introduce bugs, so we need a mechanism to upgrade quickly once a bug is fixed, or to roll back fast. SOFAMosn’s upgrade capability itself evolved from a perceptible upgrade to an imperceptible one.

3.1 Perceptible upgrade

A perceptible upgrade is an in-place Pod update: traffic to the Pod is shut off, the containers are upgraded (both the application and SOFAMosn are updated and restarted), and traffic is turned back on once the upgrade completes. The duration depends on the application’s startup time; some large applications take 2 to 5 minutes to start, which stretches the upgrade cycle, and the application perceives the traffic switch, hence the name. The advantage is that no traffic flows during the upgrade, so stability is higher.

3.2 Imperceptible upgrade

An imperceptible upgrade, also called a smooth upgrade, does not suspend traffic: through hot-upgrade techniques, the new SOFAMosn version dynamically takes over the traffic of the old version and completes the switch. This avoids application restart time; the upgrade time is spent mainly on migrating connections between the old and new SOFAMosn processes. For the implementation details, see my talk from GIAC in the first half of 2019, “Ant Financial Service Mesh landing practice and challenges | GIAC retrospective”.

3.3 Unattended

With fast upgrades in place, we also need to intercept risk during the upgrade process, so that we can push versions confidently. Following the idea of unattended operation, we use Metrics to compare traffic, success rate, and business indicators before and after a SOFAMosn upgrade, identifying and blocking risky changes in time. Only then can we truly upgrade rapidly with confidence.

4. How to ensure the performance and stability of Service Mesh meet the standard?

SOFAMosn’s performance on Singles’ Day was stable: tens-of-millions-level peak QPS passed through SOFAMosn, and the average RT of SOFAMosn’s internal processing stayed below 0.2 ms. We reached this result through repeated rounds of performance and stability optimization, inseparable from continuous load testing and hardening in production. Some of the improvements follow, for reference:

4.1 Performance Optimization

CPU optimization

Golang writev optimization: write multiple packets in a single call to reduce the number of syscalls. While on Go 1.9 we found a memory-leak bug in writev; internally we currently run Go 1.12 with the writev bugfix patched in, and the fix has been merged into Go 1.13. See the PR we submitted to Go at the time: github.com/golang/go/p…

Memory optimization

Memory reuse: parsing packets naively generates a large number of temporary objects. SOFAMosn reuses the packet bytes directly, pointing the necessary fields at the right offsets within the packet via unsafe.Pointer, avoiding temporary objects.

Latency optimization

Protocol upgrade for faster header reads: in the TR protocol, both the request header and the body are Hessian-serialized, which is expensive to parse; in the Bolt protocol, the header is a flat key-value map that is cheap to parse. Upgrading applications to the Bolt protocol improved performance.

Route caching: internal routing is complex (a single request often needs several routing strategies to determine its final destination). By caching routing results for identical conditions with a seconds-level TTL, we cut the full-link RT of one core link by 7%.

4.2 Stability Construction

  • Pod CPU/Mem quota configuration: Sidecar and App share CPU and Mem;
  • Surrounding operations tooling:
    • In-place injection;
    • Smooth upgrade;
    • Sidecar restart;
  • Monitoring:
    • System metrics: CPU, Mem, TCP, Disk IO;
    • Go runtime metrics: Processor, Goroutines, Memstats, GC;
    • RPC metrics: QPS, RT, connection count;
  • Bypass hardening:
    • Service registry performance improvements;

Review

  1. How to maximize the business value of Service Mesh? Raise efficiency while holding cost constant.
  2. How to connect hundreds of thousands of containers to SOFAMosn? Reduce the cost of access.
  3. How to upgrade hundreds of thousands of SOFAMosn instances? Minimize what the application perceives.
  4. How to ensure the performance and stability of Service Mesh meet the bar? Optimize performance and stability layer by layer.

Through the in-depth answers to the questions above, you can see why Ant Financial’s Service Mesh could support Double 11 stably, and we hope this brings some fresh thinking to your own Service Mesh rollout. For more on Service Mesh, follow our public account “Financial Distributed Architecture”.

Service Mesh Singles Day special event

This is the Service Mesh Meetup #8 special session, jointly held by CNCF, Alibaba, and Ant Financial.

Not every cloud can survive Double 11.

268.4 billion CNY in transactions, with Alibaba’s core systems 100% on the cloud.

Ant Financial rolled out Service Mesh at large scale on its core transaction links.

This time, let the Double 11 carnival continue, let cloud native pass the Double 11 test, and let cloud native reach developers.

You will gain three lessons:

  • Alibaba’s Double 11-hardened experience running Kubernetes at ultra-large scale
  • Ant Financial’s experience from the first large-scale Service Mesh rollout
  • Alibaba’s experience operating ultra-large DCP Kubernetes clusters

Time: 9:30-16:30, November 24, 2019 (Sunday)

Venue: Multi-function Hall, 2F, Block B, Pohang Center, Hongtai East Street, Dawangjing Science and Technology Business Park, Chaoyang District, Beijing

How to register: click “here” to reserve a seat

Financial Class Distributed Architecture (Antfin_SOFA)