Zheng Jimin joined the Domestic Hotel Quotation Center team in August 2019 and is mainly responsible for quotation-related system development and architecture optimization. He has a strong interest in high concurrency and high availability, with experience building highly available distributed systems handling tens of millions of orders per day. He enjoys studying algorithms, twice reached the Asian regional round of the ACM-ICPC programming contest, and won first prize in Qunar's first Hackathon competition.


Background

We previously introduced our cache governance practice; for details, see "Domestic Hotel Stability Governance Practice: Cache Governance". After cache governance, we didn't stop there.

Our applications also rely on many external components and interfaces, and provide interfaces to external callers. All of these dependencies are subject to failure, and in some scenarios the impact of a failure can be significant. So after cache governance, we moved on to broader stability governance.

This article focuses on the governance of inter-system dependencies: the external components a system depends on, the external interfaces it calls, and the interfaces it provides to others, covering Dubbo, Http, DB, MQ, and so on.


Governance plan

Service grading and dependency governance

1) Grade each application (P1, P2, or P3, where P1 > P2 > P3) according to how core it is to the business and its impact, and sort out the levels of its dependencies.

2) P1 applications should be deployed across multiple data centers, and under normal circumstances the number of online machines any P1 application has in any single data center should not exceed half of that application's total online machines. After this adjustment, the impact on core services is significantly reduced when the network or individual components of a single data center fail.

3) Weaken strong dependencies so they can be degraded; make weak dependencies asynchronous and circuit-breakable; and eliminate unnecessary dependencies entirely. For P3 application interfaces called by P1 application interfaces, we handled call exceptions properly in advance and added circuit-breaker support; in an online drill we took every P3 application machine offline, and the P1 application interfaces were unaffected. For calls to P1 application interfaces, we prepare degradation plans in advance and ensure the calling side can recover quickly from faults through multiple channels and multiple copies. To reduce both the number of failures and their impact, assess the impact of strong- and weak-dependency failures in advance and prepare countermeasures.
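The "circuit-break and degrade" pattern above can be sketched in plain Java. This is not our production implementation (which sits behind the unified component wrappers); the class name, threshold, and cooldown here are illustrative: after a few consecutive failures the breaker opens and the fallback is returned immediately, and a cooldown lets the call be retried later.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Minimal circuit-breaker sketch for a P1 -> P3 call: open after N
// consecutive failures, serve the fallback while open, retry after cooldown.
public class CircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMillis;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private volatile long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    public boolean isOpen() {
        return consecutiveFailures.get() >= failureThreshold
                && System.currentTimeMillis() - openedAt < cooldownMillis;
    }

    // Run the remote call; on exception or open circuit, return the fallback.
    public <T> T callWithFallback(Supplier<T> remoteCall, T fallback) {
        if (isOpen()) {
            return fallback;            // degrade immediately, skip the P3 call
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures.set(0); // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (consecutiveFailures.incrementAndGet() == failureThreshold) {
                openedAt = System.currentTimeMillis();
            }
            return fallback;
        }
    }
}
```

Because the fallback path never touches the failing dependency, taking the P3 machines offline in a drill leaves the P1 interface serving degraded but valid responses.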

Rate limiting

This mainly addresses the scenario where an application may be overwhelmed by a sudden surge in traffic. We chose the Sentinel component and wrapped it behind a unified access layer. The main features we use include:

1) Application-level single-machine Dubbo dynamic rate limiting: rate limits for all Dubbo interfaces can be configured dynamically.

2) Application-level single-machine Http dynamic rate limiting: rate limits for all Http interfaces can be configured dynamically.

3) Business-instrumented rate limiting: single-node (or cluster) rate limiting based on specific parameters. For a core interface, we can have not only an overall interface-level limit but also limits keyed on particular parameter values. A typical example: the hotel quotation interface distinguishes by parameter whether a request comes from the APP or the PC. At present APP requests far outnumber PC requests, so we can apply a separate rate limit to PC requests without affecting APP-side traffic.

4) Cluster-level dynamic rate limiting of application interfaces: from a system-protection perspective this is not strictly necessary and is only useful in some special scenarios. At present we mainly use the first three methods.

The Sentinel component lets us cap traffic directly at the interface level. If abnormal traffic comes in, we can configure rules to reject requests beyond a certain volume and provide service as best we can within the current cluster capacity. This way, an abnormal surge of traffic cannot knock out every machine in the application cluster, leaving the application unable to serve or with degraded capability.
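As a rough sketch of the parameter-level limiting described above (this is not Sentinel's real API; in Sentinel this corresponds to hot-parameter flow rules behind our unified wrapper), each parameter value such as "APP" or "PC" can get its own per-second quota held in a map so it can be changed at runtime:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Fixed-window, per-parameter-value rate limiter sketch: a flood of "PC"
// requests is rejected once its quota is used, without touching "APP" traffic.
public class ParamRateLimiter {
    private final Map<String, Integer> limits = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private volatile long windowStart = System.currentTimeMillis();

    // Limits live in a map so they can be adjusted dynamically at runtime.
    public void setLimit(String paramValue, int maxPerSecond) {
        limits.put(paramValue, maxPerSecond);
    }

    public synchronized boolean tryAcquire(String paramValue) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) {   // roll over the 1-second window
            counters.clear();
            windowStart = now;
        }
        Integer limit = limits.get(paramValue);
        if (limit == null) {
            return true;                   // no rule for this value: allow
        }
        AtomicInteger c = counters.computeIfAbsent(paramValue, k -> new AtomicInteger());
        return c.incrementAndGet() <= limit;
    }
}
```

A production limiter would use a sliding window or token bucket rather than this fixed window, which allows bursts at window boundaries; the isolation between parameter values is the point being illustrated.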

Note that rate limiting, by its nature, affects request processing and may hurt the user experience, so we prefer not to apply it when the system can cope. Application traffic may rise sharply during certain holidays or events; this is normal user traffic, and we expect the application to handle it. For such cases we need not only rate-limiting measures prepared in advance, but also advance traffic estimates and load testing to evaluate whether the application needs to scale out and by how many machines.

Dubbo governance

This mainly targets scenarios where a Dubbo thread pool is exhausted or a downstream Dubbo service times out.

1) Dubbo thread pool monitoring: this point is easily overlooked. Try to avoid having to hastily add machines or raise the thread count when the online Dubbo thread pool runs short; monitoring also makes capacity assessment easier. By default, all services on the provider side share one thread pool, so each additional interface consumes resources from that default pool. Qunar has made some internal changes to the Dubbo thread pool, and the consumer-side business thread pool is also shared.

2) Dubbo thread pool isolation: core interfaces can be isolated, as can individual non-core interfaces. This prevents a failing non-core interface from filling up the Dubbo thread pool and hurting the service capability of core interfaces. For example, suppose a core application adds a non-core Dubbo interface with large payloads and long response times. That interface keeps occupying the provider's thread pool resources, so the application's core Dubbo interfaces cannot get threads to serve requests. In that case you can move the non-core Dubbo interface into a separate thread pool so the core interfaces are unaffected.

3) Configure Dubbo interface timeouts reasonably: the consumer's timeout must be set based on the interface's normal response time and must not be greater than the provider's.
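The thread pool isolation in point 2 can be sketched with plain JDK executors (Dubbo's own dispatching is configured differently; pool names and sizes here are illustrative): the slow non-core interface gets a small, bounded pool whose overflow is rejected, so even when it is saturated the core pool still has free threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Thread-pool isolation sketch: core and non-core work run on separate pools.
public class ThreadPoolIsolation {
    // Core interfaces keep the main pool to themselves.
    public static final ExecutorService corePool = Executors.newFixedThreadPool(8);
    // Non-core work gets a small pool with a bounded queue; overflow is
    // rejected instead of dragging down core traffic.
    public static final ExecutorService nonCorePool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(4),
            new ThreadPoolExecutor.AbortPolicy());

    // Submit to the isolated pool; report rejection so the caller can degrade.
    public static boolean submitNonCore(Runnable task) {
        try {
            nonCorePool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    // Run a core task and wait briefly for its result (sketch helper).
    public static String runCore(Callable<String> task) {
        try {
            return corePool.submit(task).get(1, TimeUnit.SECONDS);
        } catch (Exception e) {
            return "error";
        }
    }
}
```

Even with the non-core pool fully saturated (all threads busy, queue full), core tasks still complete immediately, which is exactly the failure containment the text describes.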

Http governance

This targets scenarios where calls to external Http interfaces produce many timeouts and exceptions.

1) Timeout configuration check: set a proper timeout (longer than P99) based on the interface's actual behavior, and support dynamic adjustment of timeout parameters. At present, timeouts are the most common type of exception across our group's applications. If the timeout is too large, requests keep occupying the main threads and eventually drag down the service.

2) Asynchronization: asynchronous invocation is recommended, with support for switching between synchronous and asynchronous modes.

3) Retry: check whether each call can be, and needs to be, configured with retries. Retrying synchronous interface calls is not recommended (it easily compounds timeouts). In practice, many calls succeed after one or two retries, but you must confirm with the downstream whether the interface is safe to retry.

4) Isolation: for asynchronous calls, apply thread pool isolation and client isolation so that calls to different interfaces do not affect each other.
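The retry guidance in point 3 can be sketched as a small helper. This is an illustrative utility, not our production client: the idempotency flag and attempt limit are assumptions, and the key property is that a call the downstream has not confirmed as retry-safe is attempted exactly once.

```java
import java.util.concurrent.Callable;

// Bounded-retry sketch: retry only calls confirmed idempotent by the downstream.
public class RetryingCaller {
    public static <T> T call(Callable<T> httpCall, boolean idempotent, int maxAttempts) {
        int attempts = idempotent ? Math.max(1, maxAttempts) : 1; // never retry non-idempotent calls
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return httpCall.call();
            } catch (Exception e) {
                last = e;   // e.g. a timeout; try again if attempts remain
            }
        }
        throw new RuntimeException(last);
    }
}
```

A real implementation would also add backoff between attempts and retry only on retryable errors (timeouts, connection resets), not on business failures.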

DB governance

DB governance is quite similar to the earlier cache governance: the main focus is quick recovery after component failures, plus multiple copies and multiple shards for core data storage.

1) High availability guarantee: store data in multiple copies (master/slave DB storage plus a Redis cache), recover quickly and degrade gracefully, and reduce the service impact of possible slow queries, DB outages, and the like.

2) Clean up storage of useless data, for example removing the cache of international hotel data from domestic-hotel-related services.

3) Monitor the average result-set size per request (raise an alarm when it exceeds a threshold), implemented as a cross-cutting aspect via a MyBatis Interceptor.
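Our real implementation of point 3 hooks into MyBatis' Interceptor mechanism; as a standalone sketch of the same idea, a JDK dynamic proxy around a DAO interface can count the rows each query returns and flag calls above a threshold. The interface, threshold, and counter here are all illustrative.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Result-set-size monitoring sketch: wrap any DAO-style interface and count
// queries whose List result exceeds the configured row threshold.
public class RowCountMonitor {
    public static final AtomicInteger oversizedQueries = new AtomicInteger();

    @SuppressWarnings("unchecked")
    public static <T> T wrap(Class<T> daoInterface, T target, int rowThreshold) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result = method.invoke(target, args);
            if (result instanceof List && ((List<?>) result).size() > rowThreshold) {
                oversizedQueries.incrementAndGet();  // in production: emit metric + alarm
            }
            return result;
        };
        return (T) Proxy.newProxyInstance(
                daoInterface.getClassLoader(), new Class<?>[]{daoInterface}, handler);
    }
}
```

The MyBatis version intercepts `Executor.query` instead of wrapping a DAO, but the measurement logic is the same: inspect the returned list's size and record when it is abnormally large.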

MQ governance

This mainly targets scenarios where a single MQ cannot send messages properly or consumption backlogs build up heavily. For example, a ZK failure in a single data center once caused sends to some MQ topics to fail.

First verify that MQ can quickly switch to a healthy channel, for example by masking messages from a failed data center or drifting messages to a healthy one.

In particular, for core scenarios have multiple channels in place, such as sending messages to different topics or using two different MQ components. In that case, take care that consumption is idempotent, so that a failure of one cluster does not affect the core flow.
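With messages duplicated across two channels, the consumer must deduplicate by a business message id so the side effect runs exactly once. A real system would keep the seen-id set in a shared store with a TTL (for example Redis); this in-memory sketch only illustrates the idea.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Idempotent-consumption sketch: the same message arriving via topic A and
// topic B triggers the business logic only once.
public class IdempotentConsumer {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true if the message was handled, false if it was a duplicate.
    public boolean consume(String messageId, Runnable businessLogic) {
        if (!processed.add(messageId)) {
            return false;          // already handled via the other channel
        }
        businessLogic.run();
        return true;
    }
}
```

Note that `Set.add` returning false on duplicates gives an atomic check-and-mark, so two channels delivering concurrently still execute the side effect once.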

Other

1) Check and improve monitoring: success and exception QPS, latency, and other metrics for Dubbo interfaces, Http interfaces, DB-related MyBatis method interfaces, and so on.

2) Add appCode-dimension component and call monitoring panels, making routine inspection easier and allowing a quick overview during faults.

3) Check timeout configurations for correctness and rationality: generally the connect timeout should be short, while the socket timeout can be larger depending on the business. In real development, many people copy a demo or code from elsewhere and never check whether those numbers are reasonable.
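The connect-versus-socket distinction in point 3 can be shown with the JDK's own `HttpURLConnection`: a short connect timeout fails fast when the peer is unreachable, while the larger read (socket) timeout is sized to the interface's response time. The concrete values here are illustrative, not recommendations.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Timeout configuration sketch: short connect timeout, business-sized read timeout.
public class TimeoutConfig {
    public static HttpURLConnection configure(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(500);   // connect: short, fail fast
            conn.setReadTimeout(3000);     // socket read: sized above the interface's P99
            return conn;
        } catch (IOException e) {
            throw new IllegalArgumentException("bad url: " + url, e);
        }
    }
}
```

Whatever the client library, both knobs usually exist under similar names; the point is to set them deliberately per interface rather than inheriting whatever the copied demo used.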


The governance process

The overall process closely resembles the earlier cache governance: roughly "sort out scenarios → determine the scheme → develop and self-test → test → go online → online drills and improvement". There are differences, though. This time, each application under governance was released online in two or more stages:

1) First add the rate-limiting components and related component monitoring: this ensures abnormal traffic can be blocked from the start, and accumulates monitoring data for subsequent governance.

2) Then optimize parameters and processes based on the now-complete monitoring indicators.

With that done, we move on to online drills. Where results exceeded expectations, we revised the plan, continued optimizing, and re-drilled; where results met expectations, we collated them into a common governance scheme that serves as a standard for governing future applications and new dependencies.


Conclusion

After each failure, we hold a fault review and propose improvements. Looking back over these improvements, there is much we can do before failures happen to reduce their impact or even prevent them entirely. The purpose of stability governance is to prepare solutions and methods for common problems and scenarios before faults occur, so as to minimize the number, duration, and impact of failures.

Inter-system dependency governance is not a one-off task; our current work covers only existing dependencies. In the future, we plan to mark these dependencies automatically and report them to a dedicated service-governance system (or related systems) for management. That way we can not only dynamically identify new dependencies, but also quickly take effective measures when failures occur.

After finishing inter-system dependency governance, we continued on to governing resources inside the system, mainly by means of degradation, circuit breaking, and isolation; we will cover this in detail in the next article.