Background

Meituan Takeout launched in November 2013 and has grown rapidly ever since, repeatedly setting new records. On May 19, 2018, its peak daily order volume exceeded 20 million, making it the largest food delivery platform in the world. This rapid growth places higher demands on system stability. Giving online users a stable experience and keeping the full service chain highly available requires not only back-end support but also comprehensive technical safeguards on the client. Compared with the server, the client runs in a very different environment: it has many uncontrollable factors and limited ability to respond to sudden problems. Building a client-side high-availability system that keeps services stable is therefore not only a technical challenge for engineers but also part of the delivery platform's core competitiveness.

Approach to building a high-availability system

A well-designed large client system is usually developed by a number of separate teams, each with clearly defined responsibilities. Loose coupling between business modules lets each module change in isolation, which improves both development flexibility and system robustness. This is the overall business architecture of Meituan Takeout. As a whole, it is built around the commodity transaction chain (merchant recall, commodity display, and transaction). Locally, it is divided into several independently operated and maintained units according to business characteristics and team boundaries. Keeping each of these units simple is a prerequisite for reliability, and it lets us stay focused on feature iteration and the related engineering work.

We divide the life cycle of a problem into three stages: discovery, localization, and resolution. Continuous investment in these three stages forms the core of the Meituan Takeout high-availability system.

Panorama of the Meituan Takeout quality assurance system

This is a panorama of the overall quality system of the Meituan Takeout client. The overall idea covers three areas: monitoring and alarms, the log system, and disaster recovery.

By collecting and reporting data on business stability, basic capability stability, and performance stability, we refine the standards used to measure the quality of the client system. By setting baselines and applying business-specific models to monitor and alarm on these indicators, the client can sense the stability of the core link at minute-level granularity. The log system gives the whole system the ability to extract key clues and locate problems quickly along multiple dimensions. Once a problem is located, we carry out disaster recovery operations according to the online operation and maintenance specifications of Meituan Takeout (degrading, switching channels, or limiting traffic) to keep the core link stable.

Monitoring & Alarm

The monitoring system sits at the bottom of the service reliability hierarchy and is essential to operating a reliable, stable system. To keep the full service chain highly available, faults must be detected before users notice them. Without monitoring, we cannot tell whether the client is serving users properly.

Monitoring can be divided into system monitoring and service monitoring. System monitoring covers basic capabilities such as end-to-end success rate, service response time, network traffic, and hardware performance. It emphasizes non-intrusive, customizable system-level monitoring, focuses on the layer beneath business applications, and mostly operates at the level of a single system. Service monitoring analyzes how a service is running over a given period. It is built on top of system monitoring: it takes the metrics computed by system monitoring, adds service-specific instrumentation, combines and analyzes data across multiple systems, and provides real-time monitoring and alarms according to the corresponding service model. By timeliness, service monitoring can be further divided into real-time and offline service monitoring.

  • Real-time service monitoring: real-time data is collected and analyzed to help discover and locate online problems quickly, with alarm mechanisms and intervention responses (manual or automatic) that keep faults from spreading.
  • Offline service monitoring: data collected over a period of time is mined, aggregated, and analyzed to identify possible service problems and to optimize or improve service monitoring.

Most business monitoring at Meituan Takeout is real-time. Building on Meituan's unified system monitoring foundation, Meituan Takeout worked with other departments of the company to adapt, co-build, integrate, and reuse parts of the monitoring infrastructure, forming a closed loop of monitoring, logging, and retrieval. On top of that, we built real-time business monitoring tailored to the delivery business process. Offline business monitoring relies mainly on user behavior statistics and on mining and analyzing business data, and it helps inform product design and operations strategy. At present, this part is mainly provided by the Meituan Takeout data group. It is worth noting that business monitoring which merely collects and displays information, without triggering an intervention or without triggering one immediately, is better described as business analysis. Examples include consumption from a campaign in a specific region, order volume by region, the conversion rate of a specific path, and exposure click-through rate. Unless such data is used to judge the real-time health of the system and supports production system maintenance, this kind of monitoring is better handled offline.

We divide client stability indicators into three dimensions: business stability, basic capability stability, and performance stability. For each kind of indicator we use a different collection scheme to extract and report the data, and aggregate it into different systems. Once the metrics are defined, we set baselines and alarm strategies based on the specific business model. The Meituan Takeout client has more than 40 quality metrics, 25 of which support minute-level alarms. Alarms are delivered by email, IM, or SMS depending on the urgency level. As a result, our team can detect changes in the key indicators that affect core link stability in a timely manner.
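The baseline-plus-alarm idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `AlarmRule`, `Severity`, and `evaluate`, the "relative drop below baseline" rule, and the mapping from severity to channels are hypothetical and are not taken from Meituan's monitoring system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical mapping from urgency level to notification channels.
CHANNELS = {
    Severity.LOW: ["email"],
    Severity.MEDIUM: ["email", "im"],
    Severity.HIGH: ["email", "im", "sms"],
}


@dataclass
class AlarmRule:
    metric: str          # hypothetical metric name
    baseline: float      # expected value of the metric for this minute
    max_drop: float      # tolerated relative drop below baseline (0.05 = 5%)
    severity: Severity


def evaluate(rule: AlarmRule, observed: float) -> List[str]:
    """Return the channels to notify, or an empty list if the metric is healthy."""
    if rule.baseline <= 0:
        return []  # no meaningful baseline, so nothing to compare against
    drop = (rule.baseline - observed) / rule.baseline
    if drop > rule.max_drop:
        return CHANNELS[rule.severity]
    return []
```

A rule with a baseline of 0.99 and a tolerated drop of 5% stays silent at 0.98 and fires on every configured channel at 0.90. A real system would likely derive the baseline from historical data for the same time of day rather than from a constant.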

A complete monitoring and alarm system is very complex, so we must pursue simplicity in its design. The book Site Reliability Engineering: How Google Runs Production Systems gives the following principles for setting alarms:

  • The rules that most often catch real faults should be as simple, predictable, and reliable as possible.
  • Data collection, aggregation, and alarm configuration that is rarely used should be removed regularly (the standard for some SRE teams is to delete anything that has gone unused for a quarter).
  • Metrics that are collected but not exposed on any monitoring dashboard or used by any alarm rule should be removed regularly.

Through the monitoring and alarm system, the Meituan Takeout client team found more than 20 problems affecting core link stability in the second half of 2017, including crawler, traffic, carrier 403, and performance problems. All of these problems have since been fixed.

Log system

An important function of the monitoring system is raising emergency alarms in production. Once a fault occurs, someone must investigate the alarm, determine whether the fault is real, and decide whether to take specific mitigation measures until the root cause is found.

The process for basic localization and deeper debugging must be kept very simple and must be understood by everyone on the team. The log system plays a decisive role in simplifying this process.

The log system of Meituan Takeout falls into three categories: the full log system, the individual log system, and the abnormal log system. The full log system collects overall indicators, such as network availability and event-tracking availability. It shows the overall picture of the system and its fluctuations and helps determine the scope of a problem. The abnormal log system collects abnormal indicators, such as oversized image problems, sharing failures, and location failures. It gives us the context of an exception quickly so that we can analyze and solve the problem. The individual log system extracts key information about an individual user so that specific customer complaints can be analyzed. Together, these three types of logs make up a complete client log system.

A typical use of logging is handling individual customer complaints and resolving the potential system problems behind them. The individual log system simplifies the steps engineers take to extract key clues and makes problem localization and analysis more efficient. In this area, Meituan Takeout uses the Logan service developed by the Dianping platform. As Meituan's underlying mobile logging library, Logan aggregates logs from many connected systems, such as end-to-end logs, user behavior logs, code-level logs, and crash logs. All of these logs are stored locally and protected by multiple encryption mechanisms and a strict permission audit process. They are retrieved and analyzed only when a customer complaint is being handled, which protects user privacy.

By designing and implementing the core link log scheme for Meituan Takeout, we connected the underlying data of the systems involved in a user transaction, such as orders, the user center, the Crash platform, and the Push backend. By publishing a standard problem analysis manual, we standardized the analysis and handling of common individual problems. By developing a log retrieval SOP and practicing it regularly, we greatly improved our ability to trace problems online, and the majority of daily customer complaints can be located within 30 minutes. Along the way, every problem affecting core link stability that surfaced through an individual case has been improved or fixed.

Troubleshooting is a critical skill in operating large-scale systems. With systematic tools and methods, rather than experience or luck alone, this skill can be self-taught or taught within a team.

Disaster recovery

Different levels of service call for different measures to stop losses effectively. For non-core dependencies, degradation keeps a reduced but usable service available to users. For core dependencies, multi-channel backup keeps the transaction path highly available. For abnormal traffic, multi-dimensional traffic limiting maximizes service availability and preserves a good user experience. In short, there are three aspects: degradation of non-core dependencies, backup of core dependencies, and overload protection through traffic limiting. We elaborate on each of them below.

Degradation

Here we use the overall system structure diagram of the Meituan Takeout client to give an overview of non-core dependency degradation. The red part in the middle of the figure contains the core nodes, that is, the core link of the takeout business: location, merchant recall, commodity display, and order placement. The blue parts are the key services that the core link depends on. The yellow parts are services that can be degraded. By sorting out the dependency relationships and reworking the communication protocol between the front end and the back end, we made the client's non-core dependencies degradable. On the back end, multi-level caching together with shielding and isolation strategies provides degradation both within a business module and between businesses. Together these form the overall degradation system of the Meituan Takeout client.

On the right is the flow chart of the business/technology degradation switch in the Meituan Takeout client. By combining push and pull with a cache update strategy, we can synchronize degradation configuration within minutes and stop losses quickly.

At present, more than 20 businesses and capabilities in the Meituan Takeout client support degradation. Effective degradation has helped us avoid one S2 accident and several S3 and S4 accidents. In addition, the overall degradation switch solution was packaged as an SDK, Horn, which has been adopted by other core business applications of the group, such as hotel and travel and finance.
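The combination of push, pull, and a cached copy of the switch configuration can be sketched as follows. This is a minimal illustration under stated assumptions, not the Horn SDK: the class `DegradeSwitches`, its methods, the 60-second refresh interval, and the "keep the last known configuration if a pull fails" behavior are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class DegradeSwitches:
    """Client-side cache of degradation switches, refreshed by push or pull."""

    def __init__(self, fetch: Callable[[], Dict[str, bool]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # pull path: asks the config service
        self._ttl = ttl_seconds      # how long a cached copy is trusted
        self._clock = clock          # injectable so the sketch is testable
        self._cache: Dict[str, bool] = {}
        self._fetched_at: Optional[float] = None

    def on_push(self, switches: Dict[str, bool]) -> None:
        """Push path: a server-initiated update replaces the cache at once."""
        self._cache = dict(switches)
        self._fetched_at = self._clock()

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._fetched_at is not None and now - self._fetched_at < self._ttl:
            return
        try:
            self._cache = dict(self._fetch())
            self._fetched_at = now
        except Exception:
            # Assumption: if the pull fails, keep serving the last known
            # configuration instead of blocking the caller.
            pass

    def is_degraded(self, feature: str) -> bool:
        """True if the named feature should currently be switched off."""
        self._refresh_if_stale()
        return self._cache.get(feature, False)
```

With a 60-second interval, a switch flipped on the server reaches a client within about a minute through the pull path alone, and immediately when a push arrives. Unknown features default to "not degraded" in this sketch, which is a design choice rather than something the source specifies.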

Backup

For backup of core dependencies, we focus here on the multiple network channels of Meituan Takeout. The network channel is a core dependency of the client and the least controllable part of the whole link. Problems occur often: network hijacking, carrier failures, and even physical fiber cuts can seriously affect the stability of the core link. Reliable multi-channel backup is therefore necessary to handle network problems.

This is the schematic diagram of the multi-channel network backup in Meituan Takeout. The client has four network channels: Shark, HTTP, HTTPS, and HTTP DNS. The Shark long-connection channel is the main channel, and the other three serve as backups. With the complete switching process, network channels can be switched per city within minutes when network indicators plummet. By developing an emergency SOP for faults and running continuous drills, we improved both our ability and our speed in solving problems and handled all kinds of network anomalies effectively. Our approach to network channel switching has also been shared with other departments of the group, where it supports business development.
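A main channel with ordered backups can be sketched as a selector that falls through to the next channel when the recent success rate of the current one drops. This is only an illustration: the source describes per-city switching driven by network indicators, whereas the sketch below makes a purely local decision, and the class `ChannelSelector`, the thresholds, and the order of the backup channels are all assumptions.

```python
from collections import deque
from typing import Deque, Dict, List


class ChannelSelector:
    """Pick the first channel whose recent success rate is acceptable."""

    def __init__(self, channels: List[str], window: int = 100,
                 min_success_rate: float = 0.9, min_samples: int = 20) -> None:
        if not channels:
            raise ValueError("at least one channel is required")
        self._order = list(channels)          # main channel first, then backups
        self._min_success_rate = min_success_rate
        self._min_samples = min_samples
        # Sliding window of recent request outcomes per channel.
        self._results: Dict[str, Deque[bool]] = {
            name: deque(maxlen=window) for name in self._order
        }

    def record(self, channel: str, ok: bool) -> None:
        """Record the outcome of one request sent over the given channel."""
        self._results[channel].append(ok)

    def _healthy(self, channel: str) -> bool:
        results = self._results[channel]
        if len(results) < self._min_samples:
            return True  # too little data to judge, so assume healthy
        return sum(results) / len(results) >= self._min_success_rate

    def current(self) -> str:
        """Return the preferred healthy channel, or the main one if none is."""
        for name in self._order:
            if self._healthy(name):
                return name
        return self._order[0]
```

For example, `ChannelSelector(["shark", "https", "http"])` keeps returning `"shark"` until its windowed success rate falls below 90%, then returns `"https"`. A production design would also need a way to probe the main channel again after a switch, which this sketch leaves out.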

Rate limiting

Service overload is another typical kind of accident. In most cases, the cause is that a few interfaces called by a few callers perform poorly, which degrades the performance of the corresponding service. If the caller has no effective fault tolerance or degradation, measures that normally reduce the error rate, such as retrying a failed request, make the service perform even worse and can even affect normal service calls.
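A short calculation shows why retries amplify load on a struggling service. It assumes that each attempt fails independently with the same probability and that the client retries at most a fixed number of times. Under those assumptions the expected number of attempts per request is 1 + p + p^2 + ... + p^r for failure rate p and r retries. The function name `expected_attempts` is hypothetical.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per request when every failure is retried.

    Assumes independent attempts with a constant failure rate; the k-th
    retry only happens if all k earlier attempts failed.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must not be negative")
    return sum(failure_rate ** k for k in range(max_retries + 1))
```

With a 50% failure rate and up to 3 retries, each user request turns into 1.875 attempts on average, so a service that is already overloaded receives almost twice the traffic. If every attempt fails, the load is multiplied by `max_retries + 1`.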

Order volume at Meituan Takeout is very high during peak periods, and the business system is extremely complex. Past experience shows that if abnormal traffic surges during a service peak and brings servers down, the loss is immeasurable.

The front-end and back-end teams of Meituan Takeout therefore jointly developed a "flow control system" to control traffic in real time. It keeps the business system running stably day to day, and it also provides a graceful degradation plan when the business system has problems. It preserves business availability as far as possible and gives users a good experience while minimizing losses.

In this system, the back-end service identifies and labels traffic by category and tells the front end the category through a unified protocol. The front end then applies multi-level flow control checks and handles each kind of traffic differently: showing a verification code, queueing the request, processing it directly, or discarding it. The system supports multi-stage flow control schemes for different scenarios, which effectively intercepts overload traffic and prevents a system avalanche. In addition, the system supports flow control monitoring per interface, so the effect of flow control can be observed and system anomalies detected in time. The scheme has held up through several failures caused by abnormal traffic growth.
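The front-end side of this scheme can be sketched as a two-step decision: map the category reported by the back end to an action, then apply a local check before queueing. The four actions come from the text above, but the category labels, the function name `decide`, the queue-capacity check, and the choice to process traffic with an unknown label are assumptions for illustration.

```python
from enum import Enum


class Action(Enum):
    PROCESS = "process"    # handle the request directly
    CAPTCHA = "captcha"    # ask the user to pass a verification code
    QUEUE = "queue"        # make the request wait its turn
    DISCARD = "discard"    # drop the request


# Hypothetical category labels delivered by the back end.
POLICY = {
    "normal": Action.PROCESS,
    "suspicious": Action.CAPTCHA,
    "overflow": Action.QUEUE,
    "malicious": Action.DISCARD,
}


def decide(label: str, queue_length: int, queue_capacity: int) -> Action:
    """Choose how to treat a request given its back-end label.

    First level: the label picks a base action (unknown labels are
    processed, an assumption that favors availability).
    Second level: a request that should queue is discarded instead when
    the local queue is already full.
    """
    action = POLICY.get(label, Action.PROCESS)
    if action is Action.QUEUE and queue_length >= queue_capacity:
        return Action.DISCARD
    return action
```

For instance, `decide("overflow", 3, 10)` queues the request, while `decide("overflow", 10, 10)` discards it because the queue is full.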

Release

As the takeout business has grown, the number of users and orders of Meituan Takeout has reached a considerable scale, so releasing a version or feature directly to everyone has a wide impact and carries high risk. Version grayscale and feature grayscale are release methods that provide a smooth transition: an online A/B experiment in which some users continue to use product (feature) A while others start using product (feature) B. If all indicators are stable and the results meet expectations, the scope is expanded until all users have moved to B; otherwise, the release is rolled back. Grayscale release protects system stability: problems can be found and fixed, and strategies adjusted, during the initial trial so that the impact does not spread.

Version grayscale and feature grayscale are both fairly mature in the Meituan Takeout client. For version grayscale, iOS uses the phased release mode officially provided by Apple, while Android uses EVA, a package management backend developed by Meituan. Both support gradually increasing the rollout volume. For feature grayscale, a feature release switch configuration system releases features along user characteristic dimensions (such as city or user ID). The configuration system has two separate environments, test and online, and fixed release windows, which keeps releases standardized. The monitoring infrastructure likewise supports monitoring by the same user characteristic dimensions (such as city or user ID), so grayscale anomalies that would not show up in the overall figures are not missed. In addition, for both version grayscale and feature grayscale we have a minimum grayscale period and a rollback mechanism, which keeps the whole grayscale release process controllable and minimizes the impact of problems.
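Releasing a feature along user dimensions such as city and user ID can be sketched with a deterministic hash bucket. This is a common technique and not necessarily how Meituan's configuration system works: the `Rollout` type, the SHA-256 bucketing, and the percentage field are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    feature: str                      # hypothetical feature name
    cities: frozenset = frozenset()   # empty set means "all cities"
    percent: int = 0                  # share of users enabled, 0 to 100


def bucket(feature: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for the given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(rollout: Rollout, user_id: str, city: str) -> bool:
    """True if this user in this city should see the new feature."""
    if rollout.cities and city not in rollout.cities:
        return False
    return bucket(rollout.feature, user_id) < rollout.percent
```

Because the bucket depends only on the feature name and user ID, a user who is enabled at 10% stays enabled when the rollout grows to 50%, and setting `percent` back to 0 acts as a rollback. Including the feature name in the hash keeps different features from always selecting the same users.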

Online operations

How a failure is handled when it arrives is the most critical link in the whole quality assurance system. No one is born able to handle emergencies perfectly; handling problems well takes constant practice. Around the problem life cycle of discovery, localization, and resolution (including prevention), the Meituan Takeout client team has established a complete set of procedures and specifications for dealing with the online problems that affect link stability. The overall idea is to establish norms, prepare in advance, respond effectively, and review afterwards. Different problems call for different solutions at different stages, so a complete incident management strategy should be defined in advance and its smooth execution ensured. Regular drills greatly reduce the mean time to recovery and help keep the core link of Meituan Takeout highly stable.

Future outlook

The Meituan Takeout business is still growing rapidly, and the technical systems that support it are becoming more complex as it grows. As we continue to build the client's high-availability system, we hope to put together an intelligent operations system that helps engineers quickly and accurately identify anomalies in each subsystem of the core link, find the root cause, and automatically execute the corresponding recovery plan. This would further shorten service recovery time and avoid or reduce online accidents.

Admittedly, the industry has explored automated operation and maintenance extensively, but most of that work focuses on back-end services, and there are few results on the front end. Our delivery technology team is exploring this area as well and is still at the stage of building the foundations. We welcome peers in the industry to exchange ideas with us.

References

  1. Site Reliability Engineering: How Google Runs Production Systems
  2. Meituan-Dianping mobile base log library: Logan
  3. Mobile network optimization practice of Meituan-Dianping

About the authors

Chen Hang, senior technical expert at Meituan. He joined Meituan in 2015 and now leads the iOS team of Meituan Takeout. He has a deep understanding of mobile architecture evolution, monitoring, alarms, backup and disaster recovery, and online operation and maintenance for mobile clients.

Fuqiang, senior engineer at Meituan. He joined Meituan in 2015 and was one of the early developers of the Takeout iOS app. He currently leads the iOS infrastructure team of Meituan Takeout and is responsible for takeout infrastructure and advertising operations.

Xu Hong, senior engineer at Meituan. He joined Meituan in 2016 and is a core developer on the Takeout iOS team, responsible for mobile APM performance monitoring and high-availability infrastructure support.

Recruitment

Meituan is looking for senior engineers and technical experts in iOS, Android, and FE. Positions are based in Beijing, Shanghai, and Chengdu. Please send your resume to chenhang03#meituan.com.