A practical case study: building Meituan’s terminal active monitoring platform

On January 5, 2018, Li Yanqing, a senior technical expert at Meituan, gave a talk titled “Construction of Terminal Active Monitoring Platform” at the 2018 Mobile Technology Innovation Conference. As the exclusive video partner, IT Dakashuo (WeChat ID: Itdakashuo) is authorized to release the video with the review and approval of the organizer and the speaker.

Talk video and slides: t.cn/RdtLzmF

Abstract

This talk describes how Meituan built its terminal active monitoring platform, covering the limitations of existing passive monitoring and the difficulties of implementing active monitoring.

Why do active monitoring?

Many companies have monitoring systems, including front-end performance monitoring, business data monitoring, back-end API quality of service monitoring, etc., which are very important from an operations and maintenance perspective.

Monitoring is generally concentrated on the server side, so why do we also want active monitoring on the terminal? Before getting into that, consider the two scenarios shown above. In one, a user hit a blank screen during the ticket refund process; in the other, the application took too long to load.

A blank screen on the front end can have many causes. It could be a problem with WebView initialization. And now that pages are built with frameworks such as React and Vue, the page is effectively one large module that serves as the entry point to every other module, with no static content, so when it fails it is impossible to tell where the problem came from. Problems like this are hard to catch in the testing phase because they do not occur for most users.

After investigating the two cases, we found that the blank screen was caused by carrier hijacking. The second problem arose because the city the user had selected in the Meituan app did not match the city reported by positioning, which affected the data returned by the data interface.

Passive monitoring

Monitoring systems in the industry currently fall into three broad categories. The first is business monitoring, which every company needs, such as order and GMV monitoring. The second is customer service; whether a company has it depends mainly on the nature of the business. The third is user feedback, the most direct form of which is user reviews on the App Store. All three are passive.

Passive monitoring has three main problems. First, it lags: work on a fix starts only after the problem has occurred and done damage. Second, problems are hard to discover: common problems have already been fixed through testing before launch, so the ones that appear after launch are not widespread. Third, problems are hard to reproduce: given the variety of users, devices, and network environments, a problem can arise in many different ways, and even when users report it, the environment in which it occurred is difficult to recreate.

What to do?

What active monitoring does

Since passive monitoring has these problems, it is time to start active monitoring, which is designed to find problems before users do. Our initial goal was to be able to monitor blank screens, carrier hijacking, client interface errors, and so on.

Goals of active monitoring

Active monitoring has to meet several goals. The first is sample rate: covering as many samples as possible. The second is timeliness. Problems can also be spotted from fluctuations in passively monitored business data, but how sensitive that is varies with the circumstances. The third is coverage, meaning both the entire business flow and the various platforms. The last is automation, which minimizes the time spent on human intervention.

Let’s look at the specific issues behind these four goals. Sample rate covers four aspects: platform, device, network, and mode of operation. On the platform side, mobile has higher priority than PC, and the network environment cannot be limited to WiFi. Timeliness requires real-time detection, a corresponding alarm system, and someone on duty. For automation, stability comes first, followed by log storage (because some problems cannot easily be reproduced) and the choice of a suitable testing framework. Finally, coverage should be wide enough to include search, payment, refund, verification codes, and so on.

This involves a lot of problems and difficulties, so we started with a relatively simple version. For sample rate, although we have a large number of Android users, Android fragmentation is severe, so in the initial stage we chose only the PC and iOS platforms. For devices, the automated flows run on iPads and MacBook Pros. For automation, we use Appium plus WDA on iOS and Google’s Headless Chrome (Puppeteer) on PC.

PC flow chart

Let’s look at the initial plan for the PC. It starts with a set of business cases that drive the whole flow from start to finish, much like automated test cases. The left part of the figure runs on Node and uses Headless Chrome to support packet capture, coding (handling verification codes), and diff.

Packet capture records every request, converts it to HAR (HTTP Archive) format, and stores it for later analysis.
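As a rough illustration of the capture-and-store step, the sketch below turns one captured request/response pair into a trimmed-down, HAR-style JSON record. It assumes the stored format resembles HAR (HTTP Archive); the function names `to_har_entry` and `to_har_log` and the creator name are hypothetical, and many fields that the full HAR specification requires (timings, cookies, body sizes) are omitted for brevity.

```python
import json
from datetime import datetime, timezone


def to_har_entry(method, url, status, started, duration_ms,
                 req_headers=None, resp_headers=None):
    """Build one HAR-style entry from a captured request/response pair.

    Hypothetical helper: only a small subset of HAR fields is filled in.
    """
    def pairs(headers):
        # HAR stores headers as a list of {"name", "value"} objects.
        return [{"name": k, "value": v} for k, v in (headers or {}).items()]

    return {
        "startedDateTime": started.isoformat(),
        "time": duration_ms,
        "request": {"method": method, "url": url, "headers": pairs(req_headers)},
        "response": {"status": status, "headers": pairs(resp_headers)},
    }


def to_har_log(entries, creator="active-monitor-sketch"):
    """Wrap entries in the top-level HAR 'log' object (creator name is made up)."""
    return {
        "log": {
            "version": "1.2",
            "creator": {"name": creator, "version": "0.1"},
            "entries": list(entries),
        }
    }


if __name__ == "__main__":
    entry = to_har_entry(
        "GET", "https://example.com/app.js", 200,
        datetime(2018, 1, 5, tzinfo=timezone.utc), 12.5,
        req_headers={"Accept": "*/*"},
    )
    # Serialize for storage; a real system would write this to its log store.
    print(json.dumps(to_har_log([entry]), indent=2))
```

Storing each step's traffic in a structured form like this is what makes later analysis possible when a problem cannot be reproduced on demand.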

Coding handles the internal verification codes, mainly so that the flow can run unattended.

Diff addresses carrier hijacking. We first run through the whole business flow, then grab all the JS and CSS that were loaded, and finally diff them against the latest business code in the git repository. If any code has been tampered with, an alert is raised in real time.

On top of this sits a messaging system that collects messages and notifies everyone responsible. It connects to an alarm system that alerts people internally by text message or email. A logging system is also needed, to record both the progress of the business flow and the packets captured at each step.
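The diff step described above can be sketched as a comparison between the assets captured during a run and trusted copies of the same files. This is a minimal sketch under stated assumptions: the trusted copies are assumed to have already been read from the repository into memory, the comparison is by SHA-256 digest rather than a line-level diff, and `find_tampered` is a hypothetical name, not the platform's actual code.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex digest used as a content fingerprint."""
    return hashlib.sha256(data).hexdigest()


def find_tampered(captured: Dict[str, bytes], trusted: Dict[str, bytes]) -> List[str]:
    """Return the sorted paths whose captured content differs from the trusted copy.

    A captured asset with no trusted counterpart is also reported, since an
    injected script would show up as an unknown file.
    """
    tampered = []
    for path, body in captured.items():
        expected = trusted.get(path)
        if expected is None or sha256_hex(body) != sha256_hex(expected):
            tampered.append(path)
    return sorted(tampered)


if __name__ == "__main__":
    trusted = {"app.js": b"console.log(1)", "app.css": b"body{}"}
    captured = {
        "app.js": b"console.log(1);inject()",  # modified in transit
        "app.css": b"body{}",                  # unchanged
        "ad.js": b"x",                         # not part of the release
    }
    print(find_tampered(captured, trusted))
```

A byte-level comparison like this only works when the served files are identical to the repository copies; it breaks down if the build output varies between deployments.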

Finally, there is the duty system. A designated person is on duty each week, and every day the duty system sorts the alarms by level.

Problems

When we finished the first phase of the project, we found that it was barely usable and still had problems. First, the sample was still too small: with only iPads and MacBook Pros, the probability of finding problems was too low. Second, the flow was often interrupted; for example, the payment flow could not be automated at the point where a verification code had to be entered. Third, the app sometimes crashed when running on the iPad, which also interrupted the flow. Fourth, cases changed frequently: whenever a business function changed, the cases had to change with it. Fifth, the JS served in production is minified and obfuscated, so the diff came out different every time and was meaningless. Finally, there were a large number of alarms, few of them valid, and they had to be screened manually.

Meituan Dianping cloud testing

To solve the problem of too few samples, we connected to Meituan-Dianping’s internal cloud testing system. It is an automated test system consisting of a Jenkins instance, multiple servers, a variety of test devices, and a storage system, and automated test tasks can be submitted to it automatically or manually.

iOS flow chart

The figure above shows the second-phase iOS solution, after connecting to Meituan-Dianping cloud testing. Jenkins first runs the tasks automatically, which triggers a batch of test cases; the flow is then automated through Appium and WDA, with the test iPads connected through a USB hub. The messaging, alarm, and duty systems are unchanged, but a proxy and a server have been added. The proxy captures packets inside the app, including requests to the server for JS, CSS, and images.

To automate payment, enabling the Alipay whitelist turned out to be enough to eliminate the verification code. The JS diff verifies correctly once it compares ASTs (abstract syntax trees) instead of raw text.
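To illustrate why comparing syntax trees is more robust than comparing text, the sketch below applies the idea to Python source using the standard-library `ast` module. This is an analogy only: the talk is about JavaScript, and a real implementation would need a JavaScript parser. The helper name `same_structure` is hypothetical. Note that this simple comparison ignores whitespace, comments, and redundant parentheses, but it does not handle the identifier renaming that obfuscation introduces, which would need extra normalization.

```python
import ast


def same_structure(source_a: str, source_b: str) -> bool:
    """Return True when both sources parse to the same syntax tree.

    ast.dump omits line and column attributes by default, so formatting
    differences do not affect the result.
    """
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


if __name__ == "__main__":
    original = "def f(x):\n    return x + 1\n"
    reformatted = "def f(x):  # same logic, different layout\n\n    return (x+1)\n"
    modified = "def f(x):\n    return x + 2\n"

    print(same_structure(original, reformatted))  # formatting-only change
    print(same_structure(original, modified))     # real code change
```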

Cases change frequently

Our solution to frequently changing cases is arguably aggressive: iOS, Android, and H5 handle only display, and a separate business-logic layer is written entirely in JS, which also solves the problem of making the client dynamic. The drawback is that the Android WebView supports only ES5.

A large number of alarms

The large number of alarms is handled mainly through manual labeling and a decision tree. Reviewers first label alarms by hand, and the decision tree then classifies alarms as valid or invalid. Once enough labels have accumulated, the decision tree classifies every error of a given type as invalid.
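The accumulate-then-suppress behavior described above can be sketched with a simple threshold rule. This is a deliberate simplification standing in for the decision tree, whose features and training procedure the talk does not describe: the class name `AlarmTriage`, the `min_labels` threshold, and the rule "all reviews so far were invalid" are all assumptions made for illustration.

```python
from collections import defaultdict


class AlarmTriage:
    """Track manual review labels per error type and auto-dismiss noisy types.

    Hypothetical sketch: a fixed threshold replaces the real decision tree.
    """

    def __init__(self, min_labels=20):
        # Number of manual reviews required before a type can be auto-dismissed.
        self.min_labels = min_labels
        self._counts = defaultdict(lambda: {"valid": 0, "invalid": 0})

    def record_review(self, error_type, is_valid):
        """Store one reviewer decision for an alarm of the given type."""
        key = "valid" if is_valid else "invalid"
        self._counts[error_type][key] += 1

    def classify(self, error_type):
        """Return 'invalid' once enough reviews exist and none were valid."""
        counts = self._counts.get(error_type)
        if counts and counts["valid"] == 0 and counts["invalid"] >= self.min_labels:
            return "invalid"
        return "needs_review"


if __name__ == "__main__":
    triage = AlarmTriage(min_labels=3)
    for _ in range(3):
        triage.record_review("image_timeout", is_valid=False)
    triage.record_review("js_error", is_valid=True)

    print(triage.classify("image_timeout"))
    print(triage.classify("js_error"))
```

Keeping a human in the loop until a type has a consistent review history is the conservative choice here: an error type with even one valid alarm continues to go to manual review.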

Practice and Effect

After two phases of practice, automation is in place on iOS and PC, and flow coverage has reached 95%. Alarms are generally raised within 5 minutes. Four problems have been found, one of them serious. What we can now monitor includes blank screens, page performance, resource loading, business interfaces, and JS errors.

Although two phases of active testing have been completed, some issues remain. First, the sample rate is still insufficient compared with the massive number of users. Second is the region problem: on PC a region can be simulated through a proxy, but on iOS it is not easy to solve. Third is the strong coupling to the business: each business needs its own set of active tests.

In the future, we plan to add the Android platform, solve the region problem, and let other businesses connect automatically.

The whole effort can be summed up in two points. First, combine active and passive monitoring: when business volume is large, passive monitoring is actually more sensitive, so each should play to its strengths. Second, combine active monitoring with automated testing.

This figure summarizes our selection scheme along two dimensions: how common a problem is and how large the business volume is. Common problems can be caught directly through automated and manual testing. Passive monitoring is more sensitive when business volume is high and the problem is common.