This article is based on a talk given by Xunwei, a senior test and development engineer at Meituan, at the 43rd Meituan Technical Salon: "The Road to Quality Assurance for Meituan Finance's Ten-Million-Level Transaction System". It introduces the stability challenges faced by Meituan's intelligent payment business and focuses on the methods and practices QA applies in stability testing.

Background

Meituan Pay carries all of Meituan's transaction flow and, by usage scenario, can be divided into online payment and intelligent payment. Online payment supports users' online consumption scenarios, processes all of Meituan's online transactions, and provides payment capability to business lines such as group buying, food delivery, and hotel and travel. Intelligent payment supports in-store consumption scenarios, processes all of Meituan's offline transactions, and provides merchants with efficient, intelligent cashier solutions through smart POS terminals, QR-code payment, box payment, and other methods. As a newly expanded business scenario, intelligent payment also became one of Meituan's fastest-growing businesses last year.

Challenges

With the rapid growth of the business, the complexity of the system behind a seemingly simple payment keeps increasing. This shows up in several ways: more business entry points at the top, a richer set of underlying payment channels, and, under a microservices architecture, vertical layering and horizontal splitting of services. Dependencies on external systems (marketing center, membership center, risk control, etc.) and on internal infrastructure (queues, caches, etc.) keep growing, and the full link now spans more than 20 core service nodes, so business complexity is high.

In addition, the technical team grew from a few people to nearly a hundred in a short period of time, which is itself a potential source of instability. For a while the whole system was in a state where touching one part affected everything else: even if we made no release to our own systems, the business could be affected without warning by problems in infrastructure or in upstream and downstream services.

Learning from the pain, we reviewed the online incidents that had occurred and analyzed the causes affecting service stability. The data showed that 72% of the serious failures were caused by third-party services and infrastructure, corresponding to some typical incident scenarios: a third-party payment channel becomes unstable; infrastructure such as a message queue becomes unstable and triggers an avalanche across the whole system; and after a dependency recovers, our own business struggles to recover immediately.

The solution

Based on these issues, we launched a stability-building project with a clear goal: improve service availability, gradually raising it from two nines to three nines and then working toward four nines. The two core strategies in this process are flexible availability, which means preserving the availability of core functions, or at least the core user experience, as much as possible when part of the system is lost, so as to reduce the impact; and fast recovery, which means using tools and mechanisms to quickly locate and fix faults and shorten troubleshooting time.
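
For a rough sense of what these targets mean (a back-of-the-envelope calculation, not part of the original talk), each availability level translates into a yearly downtime budget:

```latex
% Yearly downtime budget D for availability A (one year \approx 525{,}960 minutes)
D = (1 - A) \times 525{,}960\ \text{min}
\qquad
\begin{aligned}
A = 99\%\ \text{(two nines)}     &:\ D \approx 5{,}260\ \text{min} \approx 3.7\ \text{days} \\
A = 99.9\%\ \text{(three nines)} &:\ D \approx 526\ \text{min} \approx 8.8\ \text{hours} \\
A = 99.99\%\ \text{(four nines)} &:\ D \approx 52.6\ \text{min}
\end{aligned}
```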

Around these two strategies, the common measures in stability building include rate limiting, circuit breaking with degradation, and capacity expansion, which give the system flexible availability; and fault-response SOPs and automated fault handling, which enable fast recovery when a fault occurs. QA focuses on validating these "common measures". Based on our experience, we mainly rely on "three swords": fault drills, online pressure testing, and a continuous operation system.
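
To make "circuit breaking with degradation" concrete, here is a minimal sketch of the pattern (an illustration only, not Meituan's actual implementation; the payment functions referenced in the comment are hypothetical):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trip after N consecutive failures,
    then serve a degraded fallback until a cool-down period elapses."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failure_count = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, primary, fallback):
        # Breaker open and still cooling down: degrade immediately.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout_s:
                return fallback()
            self.opened_at = None  # half-open: give the dependency one chance
        try:
            result = primary()
            self.failure_count = 0  # a success closes the breaker
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            return fallback()

breaker = CircuitBreaker()
# Example (hypothetical functions): pay via an unstable third-party channel,
# and fall back to prompting the user to choose another payment method.
# breaker.call(lambda: pay_via_channel_a(order), lambda: suggest_other_channels(order))
```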

The origin of fault drills

Take a real case. While handling an unstable online payment channel, the developers executed a contingency plan that had been tested before release (the server shuts down the channel, and the client is expected to gray out that payment channel and prompt the user to choose another payment method), only to find that the plan did not take effect: after the server-side operation, the payment channel was still available on the client. The plan worked correctly in normal scenarios but failed in the fault scenario.

This is where fault drills come in: we need to reproduce the failure scenario as faithfully as possible to truly verify that a contingency plan is effective.

The overall plan for fault drills

The overall plan for fault drills is divided into three parts:

  • The load generation module restores the system's real operating scenarios as faithfully as possible (core business flows must be covered).
  • The fault injection module consists of a fault injection tool and a fault sample library (covering dependencies on external services, basic components, machine rooms, and the network, with a focus on timeouts and exceptions); a minimal injection sketch follows this list.
  • The business verification module combines automated test cases with the various monitors.
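
As a rough illustration of the fault injection idea (a sketch under assumptions; the marketing-center client in the comment is hypothetical, and this is not the actual injection tool), a dependency call can be wrapped so that a configurable share of calls is turned into a timeout or an exception drawn from the fault sample library:

```python
import random
import time

class FaultInjector:
    """Wrap a dependency call and inject timeouts or exceptions
    according to a fault-sample configuration."""

    def __init__(self, fault_rate=1.0, mode="timeout", delay_s=3.0):
        self.fault_rate = fault_rate  # share of calls to disturb
        self.mode = mode              # "timeout" or "exception"
        self.delay_s = delay_s        # simulated extra latency

    def call(self, func, *args, **kwargs):
        if random.random() < self.fault_rate:
            if self.mode == "timeout":
                time.sleep(self.delay_s)  # simulate a slow dependency
            elif self.mode == "exception":
                raise RuntimeError("injected fault: dependency unavailable")
        return func(*args, **kwargs)

injector = FaultInjector(fault_rate=0.5, mode="timeout", delay_s=2.0)
# Example (hypothetical client): make half of the calls to the marketing center
# slow, then check whether the caller degrades as its protection plan promises.
# injector.call(marketing_client.query_discount, order_id)
```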

To run fault drills more efficiently, our strategy is to conduct them in two phases. First, fault drills are run against a single system: starting from the fault sample library, all of that system's protection plans are covered comprehensively. On that basis, a full-link fault drill is conducted, focusing on failures of core services and verifying the fault tolerance of upstream and downstream services.

The effect of fault drills

As it turned out, the fault drills did bring us plenty of "surprises" and exposed many hidden risks. Three types of problems are listed here: database master-slave delay affecting transactions; services not being degraded when infrastructure failed; and inappropriate timeout settings for dependent services combined with insufficiently considered rate-limiting policies.

The origin of online pressure testing

In the face of exponential business growth, we must know how much traffic our systems can carry, so QA needs an accurate and efficient way to assess system capacity. The difficulties we ran into included long links with many nodes, complicated services, and large differences between the offline and online environments. Weighing test effectiveness against cost, we decided to do pressure testing online, and to do it across the full link.

The overall plan for online pressure testing

Our implementation of full-link pressure testing does not differ much from the mainstream schemes in the industry. Following the pressure-testing process: first, model the scenarios so that the online system's real operating conditions are reproduced as faithfully as possible; second, construct the base data, which must match the required data types and volumes to avoid data hotspots; next, construct the traffic, building or replaying read and write traffic while marking and desensitizing the pressure-test traffic; then execute the pressure test, collecting the service status and resource usage of every node on the link; and finally, generate the pressure-test report.
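
One key detail is marking pressure-test traffic so that downstream services recognize it, keep propagating the mark, and route writes to shadow tables instead of polluting production data. A minimal sketch of the idea (the header name, table-suffix convention, and masking rule are illustrative assumptions, not Meituan's actual conventions):

```python
# Minimal sketch of shadow-traffic marking, routing, and desensitization.

STRESS_TEST_HEADER = "X-Stress-Test"  # mark carried by every pressure-test request

def mark_request(headers: dict) -> dict:
    """Attach the pressure-test mark when generating load."""
    marked = dict(headers)
    marked[STRESS_TEST_HEADER] = "1"
    return marked

def is_stress_traffic(headers: dict) -> bool:
    """Downstream services check the mark and must pass it on to their own calls."""
    return headers.get(STRESS_TEST_HEADER) == "1"

def route_table(base_table: str, headers: dict) -> str:
    """Route writes from pressure-test traffic to shadow tables."""
    return f"{base_table}_shadow" if is_stress_traffic(headers) else base_table

def desensitize(record: dict) -> dict:
    """Mask sensitive fields before recorded traffic is replayed."""
    masked = dict(record)
    if "phone" in masked:
        masked["phone"] = masked["phone"][:3] + "****" + masked["phone"][-4:]
    return masked

# A write issued by a marked request lands in the shadow table.
headers = mark_request({})
assert route_table("payment_order", headers) == "payment_order_shadow"
```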

Based on the full-link online pressure-testing solution, single-link and layered pressure tests can be run flexibly according to business needs. More importantly, pressure tests can be combined with online fault drills to verify protection plans such as rate limiting and circuit breaking under more realistic conditions.

The effect of online pressure testing

Through full-link online pressure testing, we gained a clear picture of system capacity on the one hand, and on the other hand uncovered latent problems in how the online system runs, problems that are generally high risk. Three examples: infrastructure issues, such as unbalanced load across machine rooms and serious database master-slave delay; service issues, such as improper thread-pool configuration and the need for database splitting; and contingency-plan issues, for example rate-limiting thresholds set too low, with some teams not even aware that they were already close to the limit.

The origin of the continuous operation system

The stability building of intelligent payment was first run as a special project lasting nearly three months. The results were good, so we extended it from intelligent payment to the entire financial services platform, running it again as a virtual project team for another three months. It is true that most existing stability problems can be solved through such projects, but as the business develops and the system iterates, stability building is bound to be long-term work. Therefore QA, together with SRE, DBA, and RD, established a preliminary continuous operation system for stability and keeps improving it.

The overall plan for the continuous operation system

Here are the three strategies of the continuous operation system:

Turn processes into standardized tools, reducing reliance on individual awareness as much as possible and lowering the cost of communication and maintenance.

For example, configuration changes are treated like code going online and must be submitted for review as PRs; code-specification checking is turned into tooling, extracting coding best practices into rules wherever possible and evolving manual review into tool-based checking.
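
A minimal sketch of turning a best practice into a tool rule that could run in CI against a configuration PR (the config format and the rule itself are illustrative assumptions):

```python
import json
import sys

def check_dependency_timeouts(config: dict) -> list:
    """Rule: every declared downstream dependency must set an explicit, sane timeout_ms."""
    violations = []
    for dep in config.get("dependencies", []):
        name = dep.get("name", "?")
        if "timeout_ms" not in dep:
            violations.append(f"dependency '{name}' has no timeout_ms")
        elif dep["timeout_ms"] > 2000:
            violations.append(f"dependency '{name}' timeout too long: {dep['timeout_ms']}ms")
    return violations

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        problems = check_dependency_timeouts(json.load(f))
    for p in problems:
        print("RULE VIOLATION:", p)
    sys.exit(1 if problems else 0)  # a non-zero exit blocks the PR
```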

Visualize quality metrics, extract indicators, and use the data to drive the resolution of related problems through a PDCA closed loop.

For example, together with SRE and DBA we extracted stability-related indicators from online operation and maintenance, such as the number of slow database queries and the response time of core service interfaces, and monitor these indicators in real time to push the related problems toward resolution.
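
A minimal sketch of this kind of indicator monitoring (the metric names, thresholds, and alert hook are illustrative assumptions):

```python
# Threshold-based monitoring of stability indicators.

THRESHOLDS = {
    "db.slow_query_count_per_min": 10,     # slow queries per minute
    "core_api.p99_response_time_ms": 500,  # p99 latency of a core interface
}

def evaluate(metrics: dict) -> list:
    """Compare the latest metric values against thresholds and return breaches."""
    breaches = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            breaches.append((name, value, limit))
    return breaches

def alert(breaches) -> None:
    for name, value, limit in breaches:
        # In practice this would notify the owning team and open a tracking item
        # so the problem is driven to resolution (the "Act" step of PDCA).
        print(f"[ALERT] {name}={value} exceeds threshold {limit}")

# One tick of the monitoring loop with sampled values.
alert(evaluate({"db.slow_query_count_per_min": 25, "core_api.p99_response_time_ms": 320}))
```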

Normalize drills and pressure tests, reducing their cost so that they can be executed routinely.

For example, automatically trigger drill alarms to verify how well each team's emergency SOP actually performs when executed.
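
A minimal sketch of such an automatically triggered drill (the alarm channel and acknowledgement hook are illustrative assumptions): fire a clearly labeled drill alarm and measure how long the on-call team takes to acknowledge it, which gives one data point on whether the emergency SOP works in practice.

```python
import time

def send_drill_alarm(team: str) -> None:
    """Send a clearly labeled drill alarm to the team's on-call channel."""
    print(f"[DRILL] simulated failure alarm sent to {team}")

def check_ack() -> bool:
    """Stub: replace with a real query of the alarm platform's acknowledgement state."""
    return False

def wait_for_ack(timeout_s: float = 600.0, poll_s: float = 5.0):
    """Poll for acknowledgement; return elapsed seconds, or None if none arrived."""
    start = time.time()
    while time.time() - start < timeout_s:
        if check_ack():
            return time.time() - start
        time.sleep(poll_s)
    return None

send_drill_alarm("payment-core")
elapsed = wait_for_ack(timeout_s=30.0)
print("SOP drill result:",
      "no acknowledgement within the window" if elapsed is None else f"acknowledged in {elapsed:.0f}s")
```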

Based on the above three strategies, we built a continuous operation system for stability. The emphasis is on a closed loop: from quality measurement and evaluation, to problem analysis and resolution, and finally to consolidating the resulting methods and tools; along the way, platform building is used to put operation data to work, improve operation tools, and raise operating efficiency.

The effectiveness of the continuous operation system

Briefly, the current continuous operation system covers risk assessment, quality benchmarking, problem follow-up, and the accumulation of best practices.

Future plans

To sum up, this is the key work of intelligent payment QA in stability building. There are three main directions for future work: first, improve test effectiveness by continuously expanding the fault sample library and optimizing the drill tools and the pressure-testing solution; second, continue platform building, covering both the operation platform and the data platform; third, move toward intelligent operation, progressing step by step from manual operation through automated operation to attempts at intelligent operation.

About the author

Xunwei, a senior test and development engineer at Meituan, is in charge of testing the intelligent payment business of the financial services platform. He joined Meituan-Dianping in 2015.

Recruitment

If you want to learn the technical landscape of Internet finance and experience the explosive growth of an Internet finance business, and if you want to help us guarantee the high quality of our business products, you are welcome to join the Meituan Finance engineering quality team. Interested candidates can send their resumes to: fanxunwei#meituan.com.