Abstract: Baidu’s past service governance, built on RPC frameworks, had many problems: the capabilities of the various frameworks were uneven, business service governance was inefficient, and global observability was insufficient. This article describes how Service Mesh was rolled out inside Baidu, using basic stability governance and traffic scheduling as the entry points for business adoption, and elaborates the overall technical scheme and a series of key technologies, such as extreme performance optimization, extended advanced policies, and the surrounding service governance ecosystem.

The full text is 6835 words, and the expected reading time is 13 minutes.

I. Background

========

Most of Baidu’s product lines have completed their microservice transformation, and tens of thousands of microservices place higher demands on the architecture’s service governance capabilities. Traditionally, service governance is handled by the RPC framework; over the years Baidu has internally spawned RPC frameworks in many languages, such as C++, Go, and PHP. Basic service governance capability is coupled to the RPC framework, and the capabilities of these frameworks are uneven, which brings pain points and challenges to the company’s overall service governance capability and efficiency:

1. Advanced architecture capabilities cannot be reused across languages and frameworks

For example, one product line suffered several avalanche failures in the past two years; advanced capabilities such as dynamic circuit breaking and dynamic timeout, already supported by other RPC frameworks, had to be built again from scratch in the PHP and Golang frameworks.

Likewise, common architectural degradation and stop-loss capabilities are repeatedly built in each product line, with interfaces that differ greatly. From an operations perspective, operators expect basic architectural stop-loss capabilities to be generalized across product lines and the interfaces to be standardized so as to reduce operation and maintenance costs.

2. The governance cycle for architectural fault tolerance is long, and the coverage of basic capabilities is low

With the full rollout of chaos engineering, higher requirements are placed on architecture capabilities. Most modules lack basic tolerance for single-point anomalies, slow nodes, and similar failures, so each module has to be pushed to fix them independently, which is costly and has a long rollout cycle.

For example, one product line’s governance transformation took two quarters to complete; the timeout and retry configurations of some recall services in recommendation are often improper, and centralized management and adjustment is costly.

3. Insufficient observability: is there a general mechanism to improve product-line observability?

For example, one recommendation service lacks an overall view of module call relationships and traffic; online faults are located by human experience, and building out a new data center is a long and inefficient process.

II. What Problems Does Service Mesh Solve?

=========================

To thoroughly solve these pain points in business service governance, we introduced Service Mesh. The basic idea is to decouple governance capability from the framework and sink it into the sidecar. Internally, we cooperated with multiple departments to build a common Service Mesh architecture that provides common basic stability capabilities and a unified traffic control interface.

What problems do we expect Service Mesh to solve inside the company? Two points:

1. A key component of basic stability capability: providing general fault-tolerance and fault-detection capabilities and a unified intervention and control interface for microservices;

2. The core system for traffic governance: connecting and hosting each product line end to end, and enabling observation and fine-grained scheduling of global traffic.

A Service Mesh is an infrastructure layer for handling service-to-service communication, responsible for reliable request delivery through the complex service topology of cloud-native applications. In practice, a Service Mesh is usually implemented as a set of lightweight network proxies deployed alongside the application but transparent to it.

III. Technical Challenges

==========

In the actual rollout of Service Mesh, we faced the following challenges:

· Low intrusiveness: Baidu has hundreds of product lines, tens of thousands of modules, and millions of instances. Letting businesses migrate seamlessly without changing code was the first factor we considered in the design.

· High performance: the online services of Baidu’s core product lines, such as recommendation and search, have very strict latency requirements; an added delay of a few milliseconds directly affects user experience and company revenue. From the business side, performance degradation caused by moving onto the mesh is unacceptable, so during the rollout we put great effort into optimizing mesh latency and reducing the performance loss after services are connected to the mesh.

· Heterogeneous system integration: first, we need to solve interoperability between the company’s multi-language frameworks; second, we need to unify interfaces and protocols so as to connect the multiple internal service governance systems, such as service discovery, traffic scheduling, and fault stop-loss systems.

· Reliability of the mesh itself: online businesses have high reliability requirements, so during the rollout we must fully account for the mesh’s own stability and avoid major incidents.

Summary: our requirement is a low-intrusion, high-performance Service Mesh architecture with complete governance capabilities that solves practical business problems.

IV. Overall Architecture

==========

· Technology selection: our base layer is built on the open-source Istio + Envoy components, adapted to internal components according to actual business scenarios. The main reasons for customizing on top of open source are to remain compatible with the community, keep protocols aligned with the open-source standards, absorb advanced community features, and contribute back to the community.

The overall architecture of our internal mesh is shown below and includes the following core components:

· Mesh Control Center:

· Access center: sidecar injection, sidecar version management, and a unified onboarding entry;

· Configuration center: the portal for stability governance and traffic governance; hosts connection relations, routing configuration, communication policies, and other settings;

· Operation and maintenance center: daily mesh operations, such as intervention and de-hijacking;

· Control plane: the Istio Pilot component, responsible for routing management, communication policy, and other functions;

· Data plane: the Envoy component, responsible for traffic forwarding, load balancing, and other functions;

· Dependent components: integration with the company’s Naming Service (service discovery), adaptation to the various internal RPC frameworks, the monitoring system, and underlying PaaS support;

· Surrounding governance ecosystem: the service governance ecosystem derived from the mesh’s unified control interface, such as the intelligent parameter tuning system, the automatic fault locating & stop-loss system, fault self-healing, and chaos engineering (fine-grained fault injection based on the mesh).

Next, we analyze the key technologies: access mode, performance optimization, stability governance, traffic governance, coordination with surrounding systems, and the mesh’s own stability guarantees.

4.1 Access Mode

============

The community’s iptables traffic hijacking scheme can suffer performance problems when there are too many iptables rules. In particular, when forwarding to tens of thousands of instances, iptables’ linear rule matching becomes a bottleneck and the forwarding latency grows too large to meet low-latency online scenarios.

Our solution: Envoy is integrated with the internal service discovery component through a local loopback IP scheme; service discovery requests are hijacked and answered with the loopback address, so business traffic is transparently redirected through Envoy. At the same time, the local naming agent periodically probes Envoy and automatically reverts to direct-connection mode if Envoy fails, so that an Envoy failure does not cause traffic loss.
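To make this mechanism concrete, here is a minimal Go sketch, purely illustrative and not the in-house implementation: the listener port, probe interval, and function names are assumptions. It shows a local naming agent that answers service-discovery lookups with Envoy’s loopback address while Envoy is healthy and falls back to the real backend list when the periodic probe fails.

```go
package main

import (
	"net"
	"sync/atomic"
	"time"
)

// envoyAddr is the loopback address the local Envoy listens on (assumed port).
const envoyAddr = "127.0.0.1:15001"

// envoyHealthy records the result of the latest probe.
var envoyHealthy atomic.Bool

// probeEnvoy periodically checks whether the local Envoy accepts connections;
// on failure the agent reverts to direct-connection mode.
func probeEnvoy(interval time.Duration) {
	for {
		conn, err := net.DialTimeout("tcp", envoyAddr, 200*time.Millisecond)
		if err == nil {
			conn.Close()
			envoyHealthy.Store(true)
		} else {
			envoyHealthy.Store(false) // fall back: hand out the real backends
		}
		time.Sleep(interval)
	}
}

// resolve is what the naming agent returns to the business process: Envoy's
// loopback endpoint (traffic transparently hijacked) while Envoy is healthy,
// otherwise the real backend instances.
func resolve(service string, realBackends []string) []string {
	if envoyHealthy.Load() {
		return []string{envoyAddr}
	}
	return realBackends
}

func main() {
	go probeEnvoy(time.Second)
	// Example lookup: either Envoy's loopback endpoint or the direct backends.
	_ = resolve("recall-service", []string{"10.0.0.1:8000", "10.0.0.2:8000"})
	select {}
}
```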

At the same time, for businesses that do not go through traffic hijacking, we designed a proxyless scheme: the RPC framework implements the standard Istio xDS protocol to connect to Pilot’s service governance channel, so hosted governance policies and parameters are distributed and take effect directly in the framework. Whether or not service traffic is hijacked, unified control and governance are achieved through the mesh’s standardized intervention entry. The proxyless scheme has so far been adapted in the internal C++ and other RPC frameworks and rolled out in search, recommendation, and other business lines.
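Baidu’s proxyless adaptation lives inside its internal C++ and other RPC frameworks and is not shown here. As a rough analogy only, the standard gRPC-Go proxyless pattern below illustrates the same idea of a client speaking xDS directly to the control plane; the target name and bootstrap file are assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the "xds" resolver and balancer
)

func main() {
	// GRPC_XDS_BOOTSTRAP must point to a bootstrap JSON describing the xDS
	// management server (Pilot, in this article's setup).
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The client resolves "xds:///recall-service" via the control plane, so
	// routing, load balancing, and policy come from xDS with no sidecar.
	conn, err := grpc.DialContext(ctx, "xds:///recall-service",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("xds dial failed: %v", err)
	}
	defer conn.Close()
	// ... create service stubs on conn as usual.
}
```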

Conclusion: with two access schemes, transparent migration based on service-discovery traffic hijacking and proxyless, business modules can access the mesh without modifying code, realizing a low-intrusion mode and reducing the cost of onboarding services onto the mesh.

4.2 Extreme Performance Optimization

==============

During the rollout we found that the community version of Envoy had high latency and resource consumption. In some large fan-out, complex scenarios, the latency introduced by traffic hijacking grew by nearly 5 ms and CPU consumption exceeded 20%, which could not meet the high-throughput, low-latency requirements of internal online services. We analyzed Envoy’s underlying model: Envoy is essentially a single-process, multi-threaded libevent model, where each event loop can use only one core and a blocked callback stalls the whole thread, which easily leads to high latency and poor throughput control.

Based on Envoy’s extension interfaces, we extended its network and threading model and introduced the high-performance bthread coroutine model that underlies bRPC; internally we call this the high-performance bRPC-Envoy version. At the same time, we connected it to Pilot so that users can switch online between the original libevent model and the bRPC thread model and easily opt in to the high-performance mode. Note: bRPC is Baidu’s internal high-performance C++ RPC framework, used by dozens of product lines across millions of instances, and has been open-sourced.

Test results show a 60%+ reduction in CPU usage, a 70%+ reduction in average latency, and a 75%+ reduction in long-tail latency compared with industry implementations such as the open-source community version and MOSN. This solves the problems the community version of Envoy cannot handle in large-scale, high-performance industrial scenarios and clears the obstacles to rolling out the mesh at scale.

At the same time, we are investigating new technologies such as eBPF and DPDK to further reduce latency and resource consumption. In current tests, eBPF improves forwarding performance by 20% over native loopback-IP forwarding, and DPDK improves it by 30% over the kernel stack (with CPU core binding).

4.3 Stability Governance

================

Internal online and offline services are co-located at large scale, and the online mixed-deployment environment is complex, which places high demands on module architecture stability. Based on the mesh, we provide general fault tolerance, fault detection, and unified intervention and degradation capabilities to improve the overall stability of product lines.

4.3.1 Fault tolerance for local failures

To enhance the architecture’s tolerance of everyday machine failures, we extended advanced stability policies in Envoy. For example, we added a dynamic retry circuit-breaking policy that computes a threshold over a sliding window and dynamically controls the retry ratio, so requests can be retried without the avalanche risk of excessive retries. In addition, we introduced a feedback-based advanced load-balancing policy that, according to custom error codes returned by downstream services, lowers the weight of or shields failing instances, protected by a circuit-breaker threshold to avoid mistakenly taking healthy instances out of service. After the rollout on our internal core product lines, modules’ fault tolerance under local failures and the overall resilience of the architecture improved greatly.

(As shown in the figure below, after one online core module was connected to the mesh, its availability rose from two nines to four nines.)
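The article does not spell out the retry circuit-breaking algorithm. One common way to realize the dynamic retry control described above is a sliding-window retry budget: retries are only allowed while they stay below a configurable fraction of recent requests, which is what suppresses retry storms. The Go sketch below is illustrative only, under that assumption.

```go
package main

import (
	"sync"
	"time"
)

// RetryBudget caps retries at maxRatio of the requests seen in the sliding
// window, so a burst of failures cannot amplify into a retry storm.
type RetryBudget struct {
	mu       sync.Mutex
	window   time.Duration
	maxRatio float64
	requests []time.Time
	retries  []time.Time
}

func NewRetryBudget(window time.Duration, maxRatio float64) *RetryBudget {
	return &RetryBudget{window: window, maxRatio: maxRatio}
}

// prune drops events older than cutoff (slices are kept in time order).
func prune(events []time.Time, cutoff time.Time) []time.Time {
	i := 0
	for i < len(events) && events[i].Before(cutoff) {
		i++
	}
	return events[i:]
}

// OnRequest records a normal (first-try) request.
func (b *RetryBudget) OnRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.requests = append(b.requests, time.Now())
}

// AllowRetry returns true only if one more retry keeps the retry/request
// ratio within the budget for the current window.
func (b *RetryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	cutoff := time.Now().Add(-b.window)
	b.requests = prune(b.requests, cutoff)
	b.retries = prune(b.retries, cutoff)
	if float64(len(b.retries)+1) > b.maxRatio*float64(len(b.requests)) {
		return false // budget exhausted: fail fast instead of retrying
	}
	b.retries = append(b.retries, time.Now())
	return true
}

func main() {
	budget := NewRetryBudget(10*time.Second, 0.1) // at most 10% retries
	budget.OnRequest()
	_ = budget.AllowRetry()
}
```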

For avalanche governance scenarios (our statistics on historical avalanche incidents in the company’s core product lines show that more than 90% of the cases stemmed from missing avalanche governance capabilities, such as retry storms, hanging timeouts, and missing degradation capabilities), we built advanced retry circuit breaking on the mesh to suppress retry storms and provided a dynamic timeout mechanism to prevent timeout inversion. After the large-scale rollout on core product lines, these capabilities cover more than 90% of the avalanche failure scenarios of the past two years, and the losses from avalanche incidents in 2020 fell by 44% compared with 2019.
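The dynamic timeout mechanism is likewise only described at a high level. One plausible sketch, an assumption rather than the in-house algorithm, keeps a bounded sample of recent latencies and sets the per-request timeout near a high quantile, clamped between a floor and a ceiling, so a slow downstream does not pin every caller at the worst-case static timeout.

```go
package main

import (
	"sort"
	"sync"
	"time"
)

// DynamicTimeout derives the timeout from observed latency instead of a
// static config value: timeout ≈ p99 of the recent window, clamped.
type DynamicTimeout struct {
	mu      sync.Mutex
	samples []time.Duration // recent observed latencies (bounded buffer)
	max     int
	floor   time.Duration
	ceil    time.Duration
}

// Observe records the latency of a completed call.
func (d *DynamicTimeout) Observe(lat time.Duration) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.samples = append(d.samples, lat)
	if len(d.samples) > d.max {
		d.samples = d.samples[1:]
	}
}

// Timeout returns the current adaptive timeout.
func (d *DynamicTimeout) Timeout() time.Duration {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.samples) == 0 {
		return d.ceil
	}
	sorted := append([]time.Duration(nil), d.samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	p99 := sorted[len(sorted)*99/100]
	if p99 < d.floor {
		return d.floor
	}
	if p99 > d.ceil {
		return d.ceil
	}
	return p99
}

func main() {
	dt := &DynamicTimeout{max: 1000, floor: 10 * time.Millisecond, ceil: 500 * time.Millisecond}
	dt.Observe(30 * time.Millisecond)
	_ = dt.Timeout()
}
```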

4.3.2 Local fault detection

In the past, fault detection relied on coarse, machine-granularity basic metrics. Lacking fine-grained metrics for detecting faulty container instances, faulty instances could not be found in time; it usually took hours to detect them. Based on fault detection policies extended in Envoy, we provide a general, fast, and direct fault discovery capability, integrated with the upper-layer fault self-healing system: the external self-healing system collects faulty instances through the Prometheus interface and, after aggregation and analysis, triggers PaaS migration of those instances. Business lines already on the mesh can discover and locate local anomalies quickly at almost zero cost, and the detection latency for faulty instances improved from hours to minutes.
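For illustration, exposing per-instance failure signals over the Prometheus interface could look roughly like the sketch below; the metric and label names are made up, and the real detection policy runs inside the extended Envoy.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// upstreamFailures counts failures per downstream instance so the external
// self-healing system can aggregate them and trigger a PaaS migration.
var upstreamFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "mesh_upstream_failures_total",
		Help: "Failures observed by the sidecar, per upstream instance and reason.",
	},
	[]string{"upstream_instance", "reason"},
)

func main() {
	prometheus.MustRegister(upstreamFailures)

	// The fault-detection policy would call this whenever a probe or a real
	// request to an instance fails.
	upstreamFailures.WithLabelValues("10.0.0.1:8000", "connect_timeout").Inc()

	// The self-healing system scrapes this endpoint and decides whether an
	// instance is faulty enough to be migrated.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```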

4.3.3 Unified intervention and degradation

For some large-scale faults, architectural fault tolerance alone is not enough; stability plans are needed to stop the loss, a typical example being the removal of weak downstream dependencies. In the past, each product line and module built its own degradation capability, the interfaces of different modules varied greatly, and as systems iterated the degradation capability could quietly decay, resulting in high operation and maintenance costs. With the mesh we implemented common degradation and intervention capabilities, such as traffic discarding in multi-protocol scenarios for a unified traffic degradation strategy, and unified timeout and retry intervention that takes effect within seconds.
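As a sketch of what a unified traffic-discarding switch might look like at a single hop, consider the Go middleware below. Everything here is illustrative: in the real system the percentage would be pushed from the mesh configuration center rather than set locally.

```go
package main

import (
	"math/rand"
	"net/http"
	"sync/atomic"
)

// dropPermille is the degradation knob: 0 = pass everything, 1000 = discard
// everything. In practice it would be updated by a configuration-center push.
var dropPermille atomic.Int64

// degrade wraps a handler and discards the configured fraction of traffic,
// answering immediately so the overloaded downstream is protected.
func degrade(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Int63n(1000) < dropPermille.Load() {
			http.Error(w, "degraded: request discarded", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	dropPermille.Store(300) // e.g. an operator intervention: drop 30% of traffic
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", degrade(backend))
}
```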

Rolling out the mesh gives multiple product lines unified intervention and control interfaces and consistent operation interfaces for stability plans, greatly improving service governance efficiency and shortening product lines’ governance iteration cycle from quarters to months.

For example, when one business line connected to the mesh in 2020, the architecture governance transformation of 20+ modules across four areas was completed within two weeks, whereas such a transformation usually took a quarter in the past.


4.4 Traffic Governance Capability

==============

· Traffic observability:

In the past, there was no general solution for building product lines’ upstream-downstream call chains and basic golden metrics; most were customized per RPC framework or business framework, and coverage of call chains and golden metrics was low. For example, one important product line involves more than 2,000 modules end to end, with a very complex call-chain topology; the source of specific traffic was not transparent, which seriously hurt operational efficiency. When building a new data center, the upstream-downstream connections were unknown, manual sorting introduced large errors, and building out one product line took nearly two months. In addition, fault locating and capacity management were inefficient because global observability was insufficient.

Our overall idea is to build a global ServiceGraph call chain centered on the mesh and complemented by the RPC frameworks.

· On the one hand, module link relations and link attributes are expressed abstractly through Istio’s internal CRDs, and on top of Istio we built our own mesh configuration center to shield the details of the underlying CRDs. With the configuration center as the single entry for connection hosting, the full-link call relations between modules are hosted centrally; new data center construction then uses the ServiceGraph to quickly build the new data center’s topology, which greatly improves construction efficiency and shortens the cycle.

· On the other hand, combining bRPC and the mesh, a standard golden-metric format was defined and a unified golden-metric data warehouse built to support upstream service governance work such as capacity analysis, fault locating, performance analysis, and fault injection. For example, the fault self-sensing and stop-loss system we are rolling out can automatically, quickly, and accurately detect and stop online faults based on the ServiceGraph (a minimal sketch of such an edge record follows below).
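The golden-metric schema is not published in this article; the sketch below is only a guess at what one edge of such a ServiceGraph record might carry (caller, callee, and per-edge golden signals), to make the “unified golden-metric data warehouse” idea concrete. All field names are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Edge is one upstream->downstream link in the global ServiceGraph, together
// with its golden metrics for a reporting interval. Field names are assumed,
// not the actual internal schema.
type Edge struct {
	Caller     string        `json:"caller"`
	Callee     string        `json:"callee"`
	Window     time.Duration `json:"window_ns"`
	QPS        float64       `json:"qps"`
	ErrorRatio float64       `json:"error_ratio"`
	AvgLatency time.Duration `json:"avg_latency_ns"`
	P99Latency time.Duration `json:"p99_latency_ns"`
}

func main() {
	e := Edge{
		Caller:     "recall-merger",
		Callee:     "recall-service",
		Window:     time.Minute,
		QPS:        12500,
		ErrorRatio: 0.002,
		AvgLatency: 8 * time.Millisecond,
		P99Latency: 35 * time.Millisecond,
	}
	b, _ := json.Marshal(e)
	fmt.Println(string(b)) // one row pushed to the golden-metric warehouse
}
```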

· Fine-grained traffic scheduling:

Most product lines in the company switch traffic as a whole at the entry point and lack the ability to finely schedule and control traffic inside module links. We combined the mesh’s traffic scheduling capability with the internal service discovery components and integrated a series of traffic-switching platforms, unifying the traffic scheduling entry into the Mesh control center. Combined with the global call chain provided by the ServiceGraph, fine-grained traffic scheduling over module connection relations is realized. In addition, we implemented instance-level fine-grained traffic scheduling and traffic replication based on the mesh, which are typically used for fine-grained traffic evaluation, offline load testing, and traffic diversion scenarios.
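A minimal illustration of instance-level weighted traffic splitting follows; this is only a sketch of the routing primitive, while the real capability lives in Envoy and the mesh control center, and all names and weights here are invented.

```go
package main

import (
	"fmt"
	"math/rand"
)

// WeightedInstance pairs an instance address with a routing weight pushed
// from the traffic-scheduling entry (the Mesh control center in the article).
type WeightedInstance struct {
	Addr   string
	Weight int
}

// pick selects an instance with probability proportional to its weight, the
// primitive behind fine-grained traffic evaluation and diversion.
func pick(instances []WeightedInstance) string {
	total := 0
	for _, in := range instances {
		total += in.Weight
	}
	n := rand.Intn(total)
	for _, in := range instances {
		if n < in.Weight {
			return in.Addr
		}
		n -= in.Weight
	}
	return instances[len(instances)-1].Addr
}

func main() {
	instances := []WeightedInstance{
		{Addr: "10.0.0.1:8000", Weight: 95}, // stable instances
		{Addr: "10.0.0.9:8000", Weight: 5},  // e.g. 5% diverted for evaluation
	}
	fmt.Println(pick(instances))
}
```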


4.5 Coordination with the Surrounding Ecosystem

==============

Based on the unified control interface provided by the mesh, a surrounding service governance ecosystem has been derived. Typical scenarios include automatic tuning of governance parameters, automatic fault stop-loss, and fault self-healing.

· Automatic parameter tuning system

Service governance parameters (timeouts, retry thresholds, weights, and so on) used to be configured manually and depended entirely on human experience; improper configuration frequently undermined the effectiveness of governance capabilities. Meanwhile, online environments differ greatly, and static configuration cannot adapt to complex online changes. We designed a dynamic parameter tuning system whose core idea is to adjust governance parameters in real time through the mesh’s unified governance interface, driven by real-time feedback from online metrics. For example, the retry circuit-breaking threshold for a downstream is tuned dynamically according to its CPU utilization, and a downstream’s weights are adjusted dynamically according to differences in machine load.

After the rollout on core product lines, automatic tuning has completely replaced manual tuning, realizing adaptive adjustment of service governance parameters.
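A toy version of the feedback loop described above is sketched below, assuming a metric source and a push interface that do not exist under these names; the thresholds and ratios are also invented for illustration.

```go
package main

import (
	"time"
)

// fetchDownstreamCPU would read the downstream's CPU utilization (0-1) from
// the monitoring system; hard-coded here as a stand-in.
func fetchDownstreamCPU(service string) float64 { return 0.85 }

// pushRetryRatio would write the new retry-budget ratio through the mesh's
// unified governance interface; a no-op stand-in here.
func pushRetryRatio(service string, ratio float64) {}

// tuneLoop lowers the allowed retry ratio when the downstream is hot and
// restores it when load falls, so retries never pile onto an overloaded service.
func tuneLoop(service string) {
	for range time.Tick(30 * time.Second) {
		cpu := fetchDownstreamCPU(service)
		switch {
		case cpu > 0.8:
			pushRetryRatio(service, 0.01) // nearly disable retries under pressure
		case cpu > 0.6:
			pushRetryRatio(service, 0.05)
		default:
			pushRetryRatio(service, 0.10) // normal retry budget
		}
	}
}

func main() {
	tuneLoop("recall-service")
}
```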

· Automatic fault-aware stop-loss system

Traditional online fault locating depends on human experience, and deep customization of stability plans within a product line strongly relies on experienced engineers, with a high ramp-up cost for newcomers. Plan-based stop-loss operations are scattered across documents, hard to maintain, and may fail or decay as the business iterates.

Based on the mesh’s general intervention capability and unified control interface, we developed an automatic stop-loss system for fault plans. Combined with the ServiceGraph mentioned above, which provides the global call chain and golden metrics, it automatically senses common faults and automatically executes the stop-loss plan, reducing the MTTR of fault stop-loss. It is also connected to chaos engineering: end-to-end fault injections regularly trigger plan drills to prevent the plans from decaying. At present, the system is typically applied in scenarios such as weak-dependency degradation and fine-grained traffic scheduling. We expect that by the end of the year most online faults in the product lines on the mesh can be handled automatically.

· Unified protocols and coordination with surrounding systems

The mesh configuration center provides standard traffic control and service governance interfaces (such as the traffic degradation interface) to coordinate with the surrounding ecosystem, including automatic parameter tuning, fault-aware stop-loss, fault self-healing, and traffic scheduling.

Based on the open-source xDS protocol, the data-plane protocol is unified and connected to the surrounding RPC frameworks, achieving unified control of the different RPC frameworks.

4.6 Guaranteeing the Mesh’s Own Stability

Internal key services such as search and recommendation have high stability requirements. Migrating them onto the mesh online is like changing a car’s wheels on the highway; services must not be harmed. Therefore, stability work was one of our key concerns during the mesh rollout.

First, we guarantee the reliability of traffic forwarding through a multi-level fallback mechanism. For local failures, such as problems with an individual Envoy’s configuration or process, Envoy’s own fallback mechanism automatically reverts to direct-connection mode without human intervention. For some large-scale faults, such as a failure of the Envoy fallback mechanism itself (for example, oscillating between hijack and non-hijack modes), the external intervention platform can push a forced-direct blacklist with one click, forcing Envoy into direct-connection mode, with product-line-wide stop-loss time controlled within 5 minutes. In extreme cases, such as Envoy crashing at large scale where the external intervention interface may also fail, we have a final backstop: forcing PaaS to kill Envoy in batches and fall back to direct-connection mode.

Second, for service governance configuration releases, the core idea is to control the fault isolation domain: configuration releases go through the mesh configuration center and are rolled out gradually by percentage (gray release). At the same time, we built a one-stop mesh access platform to publish the mesh progressively and control the blast radius of Envoy upgrades on the business. We also introduced a Monitor module that performs regular end-to-end inspections, such as configuration consistency, Envoy node service anomalies, and version consistency.

Finally, we regularly and proactively inject faults through chaos engineering, such as simulating failures of Envoy, Pilot, and the configuration center, to rehearse extreme anomaly cases and prevent our stability architecture from decaying.

V. Summary

========

Starting from the end of 2019, in less than two years dozens of internal product lines have been connected to the mesh; in some core product lines the backbone modules are more than 80% covered, and the daily hosted traffic exceeds 100 billion requests. Newly onboarded modules obtain basic stability governance and traffic scheduling capabilities at almost zero cost. Together with the surrounding ecosystem, we built a one-stop mesh access platform that provides each business line with a low-intrusion, low-cost, standardized service governance solution, systematically solves the basic availability problems of each product line, significantly reduces the cost and cycle of governance iteration, and improves the overall stability of the system.

Recruitment information

If you are interested in microservices, get in touch and let’s talk about the many possibilities ahead. Whether you work on backend, front end, big data, or algorithms, there are several open positions waiting for you; you are welcome to submit your resume. Please follow the Baidu Geek public account of the same name.
