The Story of Baidu Aifanfan and Servicemesh That Has to Be Told

Note: Reprinted from the “Aifanfan Technology” public account and also published on the “rack what structure” public account. You are welcome to follow both.

Overview

Servicemesh swept the world with the official release of Istio 1.0 in the summer of 2018, and adoption at companies in China has sprung up everywhere. The ideas behind Servicemesh have gradually been accepted and become popular. Driven by its own pain points and the characteristics of the ToB industry, Aifanfan, together with the company’s infrastructure team, officially launched its Servicemesh project at the end of August 2020 and completed the full cutover of its Java business applications in only 3 months, becoming Baidu’s first commercial production system running entirely on native Kubernetes + Istio.

1. Background

As a one-stop intelligent marketing and sales accelerator, Aifanfan aims to help enterprises achieve business growth. It keeps investing in communication, marketing, sales, insight and other areas while facing fierce competition in the ToB SaaS industry, which places high demands on system stability and R&D efficiency. Looking back, Aifanfan faces many technical challenges in supporting its business:

Multi-language governance is difficult. Services are written in Java, Golang, Node.js, Python and other languages. Service governance mainly supports Java; governance for the other languages is either done separately or basically missing, which brings high governance cost and system risk.
Business coupling. The current service governance framework follows the Smart Client model, which makes iterative upgrades hard to roll out. A service governance upgrade takes more than three months on average, bringing huge operation and maintenance costs.
Missing capabilities. The current framework lacks sufficient governance tools, such as rate limiting and circuit breaking, chaos engineering, canary release, service grouping, traffic recording and replay, and dynamic configuration.
Manual configuration. The current framework pushes governance granularity all the way down to the method level. With more than 2k methods to configure by hand, almost nothing is configured in practice, so the Aifanfan service governance platform is effectively unused. This has led to some serious online problems.

As a result, the marginal cost of governance does not fall but rises exponentially, and because the cost is so high, governance can only be done reactively, driven by problems. This is also the status quo of service governance at many companies in the industry. In the end, efficiency, stability and capability all face great challenges. At the same time, the high governance cost makes selling our product for multi-cloud or private deployment seem a long way off. The author calls this kind of governance sunken governance: it always looks as if governance is under way, but in fact it is always sinking.

2. Decision: Next generation service governance system

To solve the problems above and reverse the dilemma of sunken governance, we faced a difficult and unavoidable choice. Is there an approach that solves the multi-language and business-coupling problems brought by Smart Client and also offers rich governance capabilities at an appropriate granularity? And, given limited resources, can the problem be solved pragmatically? After rounds of screening and discussion, the answer gradually became clear: service mesh (Servicemesh). We chose Istio, the current de facto standard service mesh in the cloud native world.

2.1 What is a service mesh

The concept of Servicemesh (hereinafter Mesh) was formally proposed in the spring of 2017 and swept the world in the summer of 2018 with the official release of Istio 1.0, jointly developed by Google, IBM and Lyft. It mainly targets a major problem brought by Smart Client: the strong coupling between service governance and business code, and the low efficiency of governance across languages. The Mesh solution is to sink service governance capabilities into a Sidecar next to the application and let the Sidecar take over north-south and east-west traffic. The most direct benefit is decoupling: the application itself becomes a “black box”, overall service governance can be further standardized, and operational efficiency improves. On top of this, governance capabilities can be strengthened quickly, “develop once, available everywhere”, fully freeing up productivity, as shown in the figure below:

Istio can be logically divided into a data plane and a control plane, as shown in the following figure:

The data plane consists primarily of a set of intelligent proxies (Envoy by default) that manage network communication between microservices and collect and report telemetry for all mesh traffic.
The control plane manages and configures the proxies to route traffic.

2.2 The winding progress of service mesh

Service mesh is a new concept, but it is not a novel architectural design. More than a decade ago, Airbnb’s governance framework SmartStack, Ctrip’s OSP and the service governance solutions on various cloud platforms (Mesos/Marathon, K8S) already used a similar Local Agent architecture. At that time, however, the industry had not formed a unified standard, and the operational complexity put many people off. As K8S redefined infrastructure, service mesh emerged and redefined the Local Agent.

The rise of service mesh brought its own problems. Many people voiced concerns and doubts about the performance, stability and resource cost of the Mesh approach, which led to the failure of Linkerd and the winding evolution of the Istio architecture. After a long period in which the performance of its control plane was questioned, Istio finally and resolutely removed Mixer and introduced a WASM mechanism to strengthen plug-in capability on the data plane. This was a difficult and courageous step, but it also brings new risks. To this day, whether to use Mesh, when to use it, how to use it well, and its positioning and future are still widely debated. That is part of its appeal. Overall, the Istio open source community has shown a positive and open attitude, and we have reason to believe that Istio will continue to release more of its potential as it becomes the de facto standard for service mesh.

A general overview of Mesh adoption in the industry:

Tencent Cloud launched TCM based on Istio, which supports hosted or self-built clusters and can control traffic across multiple regions.
Ant’s SOFA MOSN reimplements Mesh in Golang and evolves independently; it has made a strong showing in China.
Meituan-Dianping is promoting its OCTO 2.0 service governance system, a Mesh transformation based on Envoy plus a self-developed control plane.
Baidu has two Mesh products, BMesh and TRW Mesh.
Toutiao and Kuaishou are building their own Mesh systems. Netease Xiaozhou has put Mesh into production, and Momo has built a Java version of Mesh.
Azure, AWS, and Google Cloud have launched Mesh products.

The overall situation is shown in the figure below:

We can further conclude that:

Envoy (Istio’s default proxy) has become the de facto standard;
The Istio project continues to iterate rapidly and is stable in production;
Mainstream cloud vendors worldwide and a large number of companies in China have put Mesh into production;
The current mainstream approach is Envoy (with secondary development) plus a self-developed control plane;
The industry is trying to reap more of the Mesh dividend by sinking middleware capabilities into the Mesh.

Our choices:

In terms of ROI, we did not want to build our own Mesh from 0 to 1. We wanted to concentrate more resources on business iteration, so we set our direction with the idea that “80% of the capabilities meet our needs, and the remaining 20% can be compromised on or enhanced”.
In terms of the language stack, resource control is particularly important because the Mesh is essentially a process that lives “parasitically” on the application’s machine. Choosing Java for Sidecar development is therefore unwise at this stage, and this is the main reason Linkerd 1.0 failed. So we did not plan to introduce a Mesh built on the Java technology stack.
In terms of the open source ecosystem, Istio has been refined over several years. Although it still has many imperfections, it has become the industry’s de facto Mesh standard thanks to its strong capabilities, the backing of major companies and an active ecosystem. We therefore wanted to build a Mesh system based on Istio.
In terms of cooperation with Baidu infrastructure, we discussed with colleagues on the infrastructure cloud native team whether to directly reuse the in-house Mesh products. Because of the “private/multi-cloud deployment” premise, Aifanfan wanted a lightweight deployment that leaves the original structure of the open source components unchanged, for example by avoiding coupling with Baidu-specific in-house infrastructure and staying fully native.

Aifanfan and the infrastructure team therefore agreed on the final solution: for the time being we would not use the infrastructure team’s Mesh directly. Instead, the infrastructure team would operate and maintain the K8S cluster and build the Calico network for us, and Baidu’s TRW product would be used to manage and control the cluster. On this basis, Aifanfan chose native Istio 1.7 components for the rollout.

2.3 Differences between ToB and ToC scenarios in the core demands on Mesh

In ToC scenarios, performance is usually a top consideration. Mesh’s current performance (RT & OPS) is not very impressive: the official solution can add latency ranging from a few milliseconds to around ten milliseconds, while the best self-developed or secondary-development solutions in the industry add 0.5 to 2 ms. In high-traffic ToC scenarios this can obstruct Mesh adoption, and questions such as how to migrate smoothly only come up after the performance problem is solved. In ToB SaaS scenarios, the core concern is portability: the product must support private and multi-cloud deployment well and be easy to port and maintain. By contrast, absolute Mesh performance is not the top consideration in the early and middle stages. In the middle and later stages, as middleware capabilities sink into the Mesh, higher performance requirements will gradually come onto the agenda. In short, the difference between the two is:

ToC scenario: performance takes precedence over portability
ToB scenario: portability takes precedence over performance

Aifanfan is a typical ToB scenario, so Mesh serves it well out of the box.

3. Practice: smooth migration and enabling services

3.1 Status quo of Aifanfan

Aifanfan currently has three IDCs, in North China, South China and East China, with 300+ K8S nodes, 300+ applications, 3k+ service points and 8k+ Pods, serving over 1 billion PV per day. Most of the main business products are deployed in the East China cluster, so this migration mainly targeted the East China cluster.

3.2 Smooth Migration

3.2.1 POC Verification

We chose Istio 1.7 and ran POC performance tests based on Aifanfan’s actual application scenarios. Single-machine performance meets Aifanfan’s current requirements for now: at about 100 QPS per machine, the performance loss from introducing Istio is less than 1%. The core capabilities of Istio were also verified.

3.2.2 Migration Scheme

The general principles of migration are as follows:

Monitoring first;
As little awareness required from the business side as possible;
As lossless as possible.

Based on these principles, the overall architecture of the migration scheme is as follows:

Overview of the overall scheme: Calico serves as the network layer for a new Mesh container-network cluster, and grayscale switching is done at the entrance gateway. Istio-gateway connects the two clusters, and fault tolerance is built into multiple links of the chain. The service governance infrastructure is rebuilt with Istio at its core. Throughout the process, the progress of the grayscale migration and the performance of the new cluster are visualized. CICD and the SDK shield the business side from as much migration detail as possible.

3.2.3 Migration Difficulties

In the process of implementation, we encountered the following main difficulties:

A closed traffic loop cannot be assumed. In a complex distributed topology it is very hard to select a complete closed-loop sub-topology for migration verification. If a service is migrated to the container-network cluster without preparation while one link of its call chain remains on the host-network cluster, an online incident can easily follow. To solve this problem:

We used Skywalking to observe the link topology and kept traffic from becoming too scattered in the early verification stage of the migration.
With the help of the old registry and a gray list, services in the container-network cluster call directly back to the host-network cluster when they access applications that are not yet in grayscale. In this way, services can be migrated without worrying about closing the traffic loop.
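
The gray-list fallback described above can be pictured with a small sketch. This is only an illustration under assumptions, not Aifanfan's actual SDK: the names `resolve_target`, `Target`, the `default` namespace and the dict-shaped registry are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    cluster: str  # "mesh" (container network) or "host" (host network)
    address: str


def resolve_target(service, gray_list, old_registry, namespace="default"):
    """Decide where a caller inside the mesh cluster should send a request."""
    if service in gray_list:
        # Already migrated: address the K8s Service and let the sidecar route it.
        return Target("mesh", f"{service}.{namespace}.svc.cluster.local")
    # Not migrated yet: call back to an instance found in the old registry.
    instances = old_registry.get(service, [])
    if not instances:
        raise LookupError(f"{service} is neither in the gray list nor in the old registry")
    return Target("host", instances[0])


if __name__ == "__main__":
    gray = {"crm-api"}                          # hypothetical migrated service
    registry = {"billing": ["10.0.0.8:8080"]}   # hypothetical old-registry entry
    print(resolve_target("crm-api", gray, registry))
    print(resolve_target("billing", gray, registry))
```

Because the decision is made per callee, a service can move to the new cluster even while parts of its call chain still live on the host network.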

The container network environment was unstable at the beginning. In the initial stage of migration, the new cluster occasionally suffered instability in Node, API Server and other infrastructure, which can cause serious business problems without intervention and a quick response. To address this, we implemented a number of availability safeguards, including:

At the infrastructure level, quickly stopping losses and optimizing when the API Server, etcd and similar components jitter, and developing corresponding stability SOPs;
At the gateway entrance, supporting grayscale and switchback for any product line at any grayscale ratio;
For idempotent requests, providing automatic fallback on failure;
For failed requests, providing automatic circuit breaking and recovery;
For scheduled tasks and asynchronous MQ consumers, which are easily missed, identifying them so that they are scaled down automatically during a one-click switchback;
For calls inside the Mesh container cluster, supporting connection/read timeouts and retries on the caller side.
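
Two of the safeguards above, per-product-line grayscale at the gateway and automatic fallback for idempotent requests, can be sketched as follows. This is a minimal illustration under assumptions: the function names, the ratio table and the rule of judging idempotency by HTTP method are hypothetical and are not taken from Aifanfan's gateway.

```python
import hashlib

# Assumption: idempotency is judged by HTTP method only.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def bucket(request_key):
    """Map a key (for example a user ID) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_cluster(product_line, request_key, gray_ratio):
    """Send the configured percentage of a product line's traffic to the mesh cluster."""
    percent = gray_ratio.get(product_line, 0)  # unknown product lines stay on the old cluster
    return "mesh" if bucket(request_key) < percent else "host"


def call_with_fallback(method, call_mesh, call_host):
    """Try the mesh cluster first; fall back to the host cluster only for idempotent requests."""
    try:
        return call_mesh()
    except ConnectionError:
        if method.upper() in IDEMPOTENT_METHODS:
            return call_host()
        raise


if __name__ == "__main__":
    ratios = {"crm": 30}  # hypothetical: 30% of "crm" traffic goes to the mesh cluster
    print(choose_cluster("crm", "user-42", ratios))
```

In this sketch a switchback is just a ratio change: setting a product line's percentage to 0 returns all of its traffic to the old cluster.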

Large-scale migration makes it hard to shield the business side from impact. The migration involves essentially all 300+ business applications, and the challenge is to minimize the cost to the business side and switch quickly while the business keeps iterating at high speed. To address this, we took three measures:

The SDK stays compatible by default wherever possible, avoiding large-scale changes by the business side;
At the CICD level, the deployment details of the old and new clusters are hidden, grayscale can be rolled out in batches by product line, and one set of templates controls the configuration of both clusters, making CICD fully transparent to the business side;
For urgent problems found during large-scale migration, the Launcher hot-loading mechanism provided by the Phoenix Nest business platform team automatically replaces and injects upgrade packages, enabling zero-intrusion replacement and rapid verification of new functions.

Governance challenges arising from the introduction of Istio. Istio brought disruptive changes to the original service governance framework built on the Smart Client concept, along with the corresponding adaptation and switching costs. We dealt with them as follows:

Concept change: the overall service governance concept and model are fully aligned with Istio, and ServiceID (method-level) governance is gradually abandoned.
Configuration optimization: introducing Istio adds two hops to every call link. With these two hops in mind, we reviewed the relationships among core settings such as connection/read timeouts, retries and TCP backlog size to avoid unnecessary stability failures.
Portal convergence: with Istio, most governance capabilities are operated through CRDs. We temporarily integrated the governance entry point into the CD system, prohibited core configuration changes in Kiali and other places, and eliminated uncontrolled online changes by converging on one portal.
Compromise and enhancement: Istio is very powerful, but some capabilities need further enhancement, such as rate limiting, circuit breaking and chaos engineering. After weighing the trade-offs, we gave up some functions for now (for example, cluster-level rate limiting) and supplemented others (for example, introducing Chaosmesh to strengthen chaos engineering). In this way we could start enjoying the Istio dividend quickly.
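
For the configuration review mentioned above, one relationship worth checking is the one between the caller's read timeout and the sidecar's timeout and retry settings. The sketch below encodes two rules of thumb that are assumptions of this example rather than rules stated by the author: the route timeout should cover the first try plus all retries, and the caller should not give up before the route timeout. Retry back-off is ignored, and all function names are hypothetical.

```python
def worst_case_ms(per_try_timeout_ms, retry_attempts):
    """Upper bound on time spent trying: the first try plus every retry (back-off ignored)."""
    return per_try_timeout_ms * (1 + retry_attempts)


def timeout_problems(caller_read_timeout_ms, route_timeout_ms, per_try_timeout_ms, retry_attempts):
    """Return human-readable warnings for inconsistent timeout and retry settings."""
    problems = []
    budget = worst_case_ms(per_try_timeout_ms, retry_attempts)
    if route_timeout_ms < budget:
        problems.append(
            f"route timeout {route_timeout_ms} ms cannot cover {1 + retry_attempts} tries "
            f"of {per_try_timeout_ms} ms each ({budget} ms)"
        )
    if caller_read_timeout_ms < route_timeout_ms:
        problems.append(
            f"caller read timeout {caller_read_timeout_ms} ms expires before "
            f"the route timeout {route_timeout_ms} ms"
        )
    return problems


if __name__ == "__main__":
    for warning in timeout_problems(1000, 1200, 500, 2):
        print(warning)
```

A check like this could run in the CD system before a configuration change is applied, which fits the portal-convergence idea of having a single place where governance settings are changed.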

3.2.4 Migration rhythm

The Mesh project was officially launched at the end of August. POC verification was completed in early September, and the MVP was delivered at the end of September, with 17% of Aifanfan applications switched. From October, the Mesh rollout was gradually expanded, and the stability of the new cluster was continuously strengthened while Istio capabilities were released. At the end of November, the business applications of the main East China cluster were switched over. With 5 people in total, the whole process from verification to switchover took only 3 months, making Aifanfan the first Baidu commercial production system running entirely on native Kubernetes + Istio.

3.3 Dividend Release

After completing the main Istio switchover, we did not stop but immediately began enabling the business to maximize Mesh’s value. On a standardized Mesh base we delivered nearly 20 features, helping the business improve in performance, stability, functionality and cost.

3.3.1 Full-link grayscale publishing

Take one case as an example. Aifanfan’s “full-link grayscale publishing” platform is built on Istio with an architecture of “grouping + multi-dimensional routing” over a shared, isomorphic foundation, which addresses the shortcomings of the Flagger/Helm approaches that are mainstream in the industry. A single architecture supports several core capabilities, including ABTest, canary, capacity assessment, multiplexing and Set (some capabilities are still under development), with centralized control over the full life cycle of group nodes and their traffic. On the server side, the FGR Operator coordinates K8S and Istio VS/DR (VirtualService/DestinationRule) resources and connects to monitoring, alerting and CICD. On the client side, it integrates with the front-end resource packaging and fetching process to mark users and distribute routes at user level. With Istio, our overall resource investment has been significantly reduced.
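
The kind of VS/DR pair that group-based routing relies on can be sketched as plain Python dictionaries. This is an illustration under assumptions, not the output of the FGR Operator: the header name `x-traffic-group`, the `group` pod label and the helper names are hypothetical, while the manifest fields follow Istio's `networking.istio.io/v1beta1` API. `pick_subset` is a toy matcher that only mimics exact header matching so the routing result can be checked locally.

```python
import json


def destination_rule(host, groups):
    """DestinationRule with one subset per group, selected by a pod label."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": f"{host}-groups"},
        "spec": {
            "host": host,
            "subsets": [{"name": g, "labels": {"group": g}} for g in groups],
        },
    }


def virtual_service(host, header, groups, default_group):
    """VirtualService that routes by header value and falls back to the default group."""
    rules = [
        {
            "match": [{"headers": {header: {"exact": g}}}],
            "route": [{"destination": {"host": host, "subset": g}}],
        }
        for g in groups
        if g != default_group
    ]
    # The last rule has no match clause, so it catches all remaining traffic.
    rules.append({"route": [{"destination": {"host": host, "subset": default_group}}]})
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-routing"},
        "spec": {"hosts": [host], "http": rules},
    }


def pick_subset(vs, headers):
    """Toy evaluation of the rules above: the first matching rule wins."""
    for rule in vs["spec"]["http"]:
        matches = rule.get("match")
        if matches is None or any(
            all(headers.get(name) == cond["exact"] for name, cond in m["headers"].items())
            for m in matches
        ):
            return rule["route"][0]["destination"]["subset"]
    raise LookupError("no route matched")


if __name__ == "__main__":
    vs = virtual_service("orders", "x-traffic-group", ["base", "canary"], "base")
    print(json.dumps(destination_rule("orders", ["base", "canary"]), indent=2))
    print(pick_subset(vs, {"x-traffic-group": "canary"}))
```

If every service on a call chain forwards the group header, a marked request stays inside its group from end to end, while unmarked traffic keeps flowing to the default group.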

3.3.2 How Aifanfan uses Istio today

Istio has rich governance capabilities in service connectivity, service discovery, service protection and service observability. Currently, Aifanfan’s use of Istio includes but is not limited to:

Service connection

Communication: long-lived connections over the native HTTP/1 protocol; service discovery based on K8s Service;
Load balancing: round robin (RR) by default, with consistent hashing for applications with special needs (such as Dataio, Aifanfan’s database middleware);
Routing and grouping: canary, test environment multiplexing, gateway ingress traffic routing, ABTest, direct connection from development machines, grayscale links, etc.

Service protection

Authorization: access control for calls to sensitive interfaces (such as fetching a user’s mobile phone number);
Rate limiting and circuit breaking: single-machine rate limiting based on connection count; circuit breaking based on slow calls and on error count/rate;
Fault injection: fault simulation for east-west traffic, with everything else supported by Chaosmesh/ChaosBlade.
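
Connection-based limiting and error-based circuit breaking map naturally onto an Istio DestinationRule. The sketch below builds such a manifest as a Python dictionary; the field names come from Istio's `networking.istio.io/v1beta1` API, while the helper name, the service name and every numeric value are assumptions chosen for illustration. It covers only connection limits and ejection on consecutive 5xx errors; breaking on slow calls is not expressed here.

```python
import json


def protection_rule(host, max_connections, max_pending, consecutive_5xx, ejection_seconds):
    """DestinationRule with a connection pool limit and outlier detection."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": f"{host}-protection"},
        "spec": {
            "host": host,
            "trafficPolicy": {
                # Limit concurrent connections and queued HTTP/1 requests per destination.
                "connectionPool": {
                    "tcp": {"maxConnections": max_connections},
                    "http": {"http1MaxPendingRequests": max_pending},
                },
                # Eject an instance after N consecutive 5xx responses, then let it recover.
                "outlierDetection": {
                    "consecutive5xxErrors": consecutive_5xx,
                    "interval": "10s",
                    "baseEjectionTime": f"{ejection_seconds}s",
                    "maxEjectionPercent": 50,
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical service and limits; kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(protection_rule("orders", 100, 50, 5, 30), indent=2))
```

Generating the manifest from a few parameters, rather than editing it by hand per method, matches the article's move away from method-level manual configuration.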

Service operation

Service management and control: instead of the open source Kiali console, node information is presented on Aifanfan’s one-stop platform, which provides basic one-stop management capabilities such as rate limiting and circuit breaking, configuration management and control, and service migration;
APM: for Istio’s own APM, logging is collected with an EFK stack, metrics are collected with Prometheus, and everything is managed in one place through Grafana. APM for business applications stays outside the Mesh for now and is handled by Skywalking + EFK + Prometheus + Grafana.

3.4 Benefits of switching to Servicemesh

Switching to Mesh marks another core milestone on Aifanfan’s cloud native journey. Aifanfan has deconstructed and begun to reshape its service governance, which has started to change the status quo of sunken governance. The earlier problems of multi-language governance, business coupling, missing capabilities and manual configuration have been greatly alleviated. In terms of functionality, more than 10 previously missing core governance capabilities were quickly added. In terms of efficiency, the service governance life cycle dropped from several months to minutes, and CI pipeline time was cut by 20%. Test environment multiplexing can overturn the existing development model by enabling parallel development and testing, while saving more than 30% of the waiting time for integration testing. In terms of stability, rate limiting, circuit breaking and chaos engineering give the business solid means of self-protection, and canary release makes online releases lossless for traffic, so R&D staff can say goodbye to late-night releases. Relying on the stability system built on Istio, Aifanfan’s overall stability has improved greatly. These are the benefits so far, and there will be many more in the future.

4. Conclusion: Stars and sea

From a pragmatic point of view, Aifanfan’s service governance still has many challenges to overcome one by one before it can get the most out of Istio’s core dividend. At the same time, we are not content to define Servicemesh merely as control of north-south and east-west traffic. When it comes to efficiency, the Servicemesh dividend can be released much further and solve a wider range of pain points. Sunken governance exists not only in distributed service frameworks but has long existed in all middleware. We are also aware that Istio itself and others in the industry are exploring this direction, and we believe it will become a mainstream trend in the context of “multi-language microservice architecture”. Driven by its own pain points, Aifanfan led the incubation and successful release of an APM Mesh, Apache Skywalking Satellite. We hope the Servicemesh system can truly become “the core of next-generation middleware governance”. I believe that, with the cooperation of other departments in the company, this will be achieved in the near future, so that we can say goodbye to sunken governance for good and accelerate the delivery of value to customers.

5. Author introduction

Orange, chief architect of the Baidu Aifanfan business unit, QCon Outstanding Producer, ArchSummit star lecturer, Tencent Cloud Most Valuable Expert and Apache committer, who has been in charge of platform, infrastructure, and operations and maintenance at several companies.