Design and Practice of iQIYI's Intelligent Front-End Exception Monitoring Platform

Background

Front-end monitoring covers three areas: exception monitoring, performance monitoring (First Meaningful Paint, First Contentful Paint, and so on), and behavioral data monitoring (PV, UV, time on page, and so on). Front-end exception monitoring watches for cases where a page misbehaves and fails to produce the expected result because front-end code threw an exception or an interface request failed. Judged by the kinds of problems that exceptions cause, a front-end exception monitoring platform should help developers handle problems including, but not limited to, the following:

As the business grows, many important but low-frequency error events neither lower the interface success rate nor trigger an alert, so the problems are not discovered in time.
Some problems reported by users cannot be reproduced, even with similar data constructed under a test account.
An interface error suggests that the front end passed a parameter incorrectly, yet the front-end code for that request shows no logic problem, so a lot of time goes into guessing at a scenario that reproduces the error and finding the root cause.

The industry offers many mature solutions that cover front-end exception monitoring, but these platforms have drawbacks, chiefly that they are hard to use and customize:

Almost all of these platforms are paid products, sold either bundled with cloud services or separately.
These platforms are not open source, so most users cannot adapt them to their own projects through secondary development.
Most of them do not support private deployment, so users cannot get at the monitoring data for further analysis.
Their customization capabilities are weak. Even on the few platforms that customize well, achieving the desired monitoring effect requires complex customization or deeper intrusion into business logic.

So we built our own intelligent front-end monitoring platform at iQIYI, named Hawkeye. The platform is now used by most of the business lines of iQIYI's content creation, distribution, and monetization platforms. After several months of use, it has helped teams find many problems in time and has given business owners a deeper understanding of how their systems run as a whole. It plays a valuable role in safeguarding online projects.

This article describes the platform from the perspectives of design and practice.

Introducing the front-end exception monitoring platform

Hawkeye is a front-end exception monitoring platform designed to help detect problems in time and speed up troubleshooting in business projects. It handles a wide range of business scenarios and is particularly well suited to monitoring services that are relatively low-frequency but important.

Hawkeye has the following three advantages:

It provides an aggregated problem list, supports monitoring and alerting isolated by service type, and supports interface error monitoring configured by business error code, which helps front-end and back-end developers discover system problems more easily and sooner.
It records the user operations and interface requests that precede an error event, and connects front-end and back-end links by generating Trace Ids or recording the Trace Ids returned by interfaces, providing more clues for troubleshooting.
Integration is simple and quick, with little intrusion into business code.

The overall architecture, shown in Figure 1-1, has three parts: the JS SDK, the back-end collection service, and the monitoring backend.

Figure 1-1 Overall architecture of Hawkeye

The JS SDK collects and reports exceptions. It can collect JS runtime exceptions (TypeError, ReferenceError, and other exceptions captured by listening for error events with window.addEventListener, plus Promise exceptions captured by listening for unhandledrejection events with window.addEventListener), interface request exceptions, static resource loading exceptions, front-end framework errors, and the user operations and requests that precede an exception. The collection service first validates and cleans the data, then raises alarms according to an alarm strategy and sends messages to the monitoring backend. The monitoring backend runs a real-time computing engine built on the company's big data platform and stream computing platform; it processes the incoming exception messages and supplies data to the monitoring management pages, statistical reports, and the alarm platform.
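The collection step can be sketched as follows. This is a minimal illustration, not the Hawkeye SDK itself: the report shape `ExceptionReport`, the function `installCollectors`, and the caller-supplied `send` callback are all hypothetical names. The listener target is passed in as a parameter so the sketch does not depend on browser typings; in a page it would be `window`.

```typescript
// Hypothetical report shape; the field names are illustrative, not Hawkeye's schema.
interface ExceptionReport {
  kind: "js" | "promise" | "resource";
  type: string; // for example "TypeError"
  message: string;
  page: string;
  time: number;
}

// Minimal structural type for the object that receives the listeners.
interface ListenerTarget {
  addEventListener(name: string, handler: (event: any) => void, capture?: boolean): void;
}

// Turn an "error" event into a report.
function fromErrorEvent(event: any, page: string): ExceptionReport {
  const time = Date.now();
  const err = event && event.error;
  if (err instanceof Error) {
    return { kind: "js", type: err.name, message: err.message, page, time };
  }
  // Resource load failures carry no error object, only the failing element.
  const element = event && event.target;
  const src = element && (element.src || element.href);
  if (src) {
    return { kind: "resource", type: "ResourceError", message: String(src), page, time };
  }
  const message = String((event && event.message) || "unknown error");
  return { kind: "js", type: "Error", message, page, time };
}

// Turn an "unhandledrejection" event into a report.
function fromRejection(event: any, page: string): ExceptionReport {
  const reason = event && event.reason;
  const isError = reason instanceof Error;
  return {
    kind: "promise",
    type: isError ? reason.name : "UnhandledRejection",
    message: isError ? reason.message : String(reason),
    page,
    time: Date.now(),
  };
}

function installCollectors(
  target: ListenerTarget,
  getPage: () => string,
  send: (report: ExceptionReport) => void
): void {
  // Capture phase, because resource load errors do not bubble.
  target.addEventListener("error", (event) => send(fromErrorEvent(event, getPage())), true);
  target.addEventListener("unhandledrejection", (event) => send(fromRejection(event, getPage())));
}
```

In a browser this would be wired up once at startup, for example `installCollectors(window, () => location.href, report => queue.push(report))`, where `queue` is whatever reporting buffer the SDK uses.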

The following sections introduce four aspects of Hawkeye's design that are key to its two main goals: detecting problems in time and speeding up troubleshooting in business projects.

1. Intelligent event aggregation

When a platform-type project has high page views and complex business logic, a single type of problem can generate a large number of events, which makes it hard for developers to review them and spot problems. We therefore classify events by rule according to error type, error message, and the address of the visited page, generate one error ID for events of the same type, and store it in Redis.
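One way to picture this aggregation is a fingerprint computed from the normalized fields. The sketch below is an assumption about how such rules might look, not Hawkeye's actual rules: the normalization patterns, the FNV-1a style hash, and every name in it are hypothetical, and an in-memory map stands in for Redis.

```typescript
interface CollectedEvent {
  project: string;
  type: string;    // error type, for example "TypeError"
  message: string; // error message
  pageUrl: string; // address of the visited page
}

// Replace volatile fragments so events that differ only in IDs collapse together.
function normalizeMessage(message: string): string {
  return message.replace(/\d+/g, "<n>").trim();
}

// Drop the query string and hash, and mask numeric path segments.
function normalizePage(pageUrl: string): string {
  const path = pageUrl.split(/[?#]/)[0] || "";
  return path.replace(/\/\d+(?=\/|$)/g, "/<id>");
}

// FNV-1a style 32-bit hash over UTF-16 code units; any stable hash would do.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

function errorIdFor(event: CollectedEvent): string {
  const parts = [
    event.project,
    event.type,
    normalizeMessage(event.message),
    normalizePage(event.pageUrl),
  ];
  return fnv1a(parts.join("|"));
}

// In-memory stand-in for the Redis store: error ID to occurrence count.
class InMemoryErrorIdStore {
  private counts = new Map<string, number>();

  record(event: CollectedEvent): { errorId: string; count: number } {
    const errorId = errorIdFor(event);
    const count = (this.counts.get(errorId) || 0) + 1;
    this.counts.set(errorId, count);
    return { errorId, count };
  }
}
```

With these rules, two TypeError events raised on `/video/123/edit` and `/video/456/edit` with messages that differ only in a numeric ID fall under the same error ID, so the problem list shows one entry with a count instead of two separate rows.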

Figure 1-2 Problem management

2. Monitoring and alerting isolated by service type, with interface error monitoring configured by business error code

If a single-page application with more than ten service types is monitored as one project, service owners struggle to pick out the problems that concern them. We therefore support mapping configurations between service types, the error codes returned by interfaces, and alarm topic IDs. Problem collection, monitoring and alerting, and statistical analysis are all customized according to this configuration, as shown in Figure 1-3.

Figure 1-3 Monitoring by service configuration

The details are as follows:

The front end intercepts Ajax/Fetch requests and collects interface error information based on the service error codes passed in, an approach that does not intrude on business logic. Listening with window.addEventListener can reveal that a network request failed, but it only catches failed Ajax requests and cannot determine the HTTP status code. In addition, some business interfaces return HTTP status code 200 and put the actual business error code in the response body. For Ajax request monitoring we therefore override XMLHttpRequest to collect errors. To keep service-isolated monitoring low-intrusion, we use the service error code configuration to decide which interface responses count as errors.
The to-do list is displayed according to the service configuration, helping each service owner keep track of the day's project problems.
Within one platform project, alarms are separated by business line, so the alarms of different businesses do not affect each other.
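The configuration-driven check described above might look like the sketch below. The rule shape `ServiceRule`, the body field name `code`, and the functions `classifyResponse` and `wrapFetch` are assumptions made for illustration; the article does not give Hawkeye's configuration format. The sketch wraps a fetch-like function because that is shorter to show; an XMLHttpRequest override could call the same classifier once a request finishes.

```typescript
// Hypothetical configuration shape: one rule per service type.
interface ServiceRule {
  service: string;      // service (business) type
  urlPattern: RegExp;   // requests that belong to this service (no "g" flag)
  errorCodes: string[]; // business codes in the response body that count as errors
  alarmTopicId: string; // where alarms for this service are sent
}

interface InterfaceError {
  service: string;
  alarmTopicId: string;
  url: string;
  httpStatus: number;
  code: string;
}

// Decide whether a finished request is an interface error for a configured service.
// `codeField` is an assumed name for the business code field in the JSON body.
function classifyResponse(
  url: string,
  httpStatus: number,
  bodyText: string,
  rules: ServiceRule[],
  codeField: string = "code"
): InterfaceError | null {
  const rule = rules.find((r) => r.urlPattern.test(url));
  if (!rule) return null;
  const base = { service: rule.service, alarmTopicId: rule.alarmTopicId, url, httpStatus };
  // Status 0 stands for a request that never received a response.
  if (httpStatus === 0 || httpStatus >= 400) {
    return { ...base, code: "HTTP_" + httpStatus };
  }
  let body: unknown;
  try {
    body = JSON.parse(bodyText);
  } catch {
    return null; // not JSON, so there is no business code to inspect
  }
  if (typeof body !== "object" || body === null) return null;
  const code = String((body as Record<string, unknown>)[codeField]);
  return rule.errorCodes.indexOf(code) >= 0 ? { ...base, code } : null;
}

interface ResponseLike {
  status: number;
  clone(): { text(): Promise<string> };
}

// Wrap a fetch-like function so every response is checked against the rules.
function wrapFetch<R extends ResponseLike>(
  original: (url: string, init?: unknown) => Promise<R>,
  rules: ServiceRule[],
  report: (error: InterfaceError) => void
): (url: string, init?: unknown) => Promise<R> {
  const check = (url: string, status: number, text: string): void => {
    const error = classifyResponse(url, status, text, rules);
    if (error) report(error);
  };
  return (url, init) =>
    original(url, init).then(
      (response) => {
        // Read a clone so the caller can still consume the original body.
        response
          .clone()
          .text()
          .then((text) => check(url, response.status, text))
          .catch(() => undefined); // monitoring must never break the page
        return response;
      },
      (reason) => {
        check(url, 0, "");
        throw reason; // do not swallow the caller's error
      }
    );
}
```

Because each rule carries its own alarm topic ID, an error classified under one service is routed only to that service's alarm topic, which is what keeps the alarms of different businesses from affecting each other.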

3. Collecting the error context and connecting front-end and back-end links with Trace Ids

Some exceptions are not caused by a single user action or interface request, but by an interface timeout that preceded the exception or by a particular sequence of actions. The platform therefore needs to collect the user operations and requests that came before the current error, along with request durations, to help locate faults quickly.
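A common way to keep this kind of context is a small bounded buffer of recent actions and requests that is attached to the report when an error occurs. The sketch below assumes that approach; the `Breadcrumb` shape, the class name, and the default capacity of 20 are hypothetical and not taken from Hawkeye.

```typescript
// Hypothetical context entries: a user action or a finished request.
type Breadcrumb =
  | { kind: "action"; time: number; description: string }
  | { kind: "request"; time: number; url: string; status: number; durationMs: number };

// Keeps only the most recent entries so memory use stays bounded.
class BreadcrumbTrail {
  private items: Breadcrumb[] = [];
  private capacity: number;

  constructor(capacity: number = 20) {
    this.capacity = capacity;
  }

  add(item: Breadcrumb): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift(); // drop the oldest entry
    }
  }

  // Copy of the current context, oldest first, to attach to an error report.
  snapshot(): Breadcrumb[] {
    return this.items.slice();
  }
}
```

Recording the request duration in each entry is what lets a slow or timed-out interface call show up in the context of a later exception.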

Figure 1-4 Error context

When an interface request fails, Hawkeye records the Trace Id returned by the back end, linking front-end and back-end monitoring together. When the front end detects a problem, the whole front-to-back link can be viewed directly from the Trace Id, and service logs can be correlated to analyze the failing link and locate the problem quickly.

Hawkeye also supports generating the Trace Id on the front end by setting an HTTP request header (as shown in Figure 1-5). The header follows the data specification of Rover, the company's full-link tracing system.
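Front-end Trace Id generation could be sketched as below. The Rover header specification is not described in this article, so the header name `X-Trace-Id` and the 32-character hexadecimal format are placeholders chosen purely for illustration, as are the function names.

```typescript
// Hypothetical header name; the real one is defined by the Rover specification.
const TRACE_HEADER = "X-Trace-Id";

// Generate a 32-character hexadecimal ID.
// Math.random is enough for correlation, but it is not cryptographically strong.
function generateTraceId(random: () => number = Math.random): string {
  let id = "";
  for (let i = 0; i < 32; i++) {
    id += Math.floor(random() * 16).toString(16);
  }
  return id;
}

// Return a copy of the headers with the trace header added, plus the ID itself
// so the SDK can store it alongside the request in the error context.
function withTraceId(
  headers: Record<string, string>,
  traceId: string = generateTraceId()
): { headers: Record<string, string>; traceId: string } {
  return { headers: { ...headers, [TRACE_HEADER]: traceId }, traceId };
}
```

Keeping the generated ID next to the request record means that, if the request later fails, the error report already carries the ID needed to look up the back-end side of the link.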

Figure 1-5 Trace Id generated on the front end

4. Building the monitoring backend on the company's big data platform and stream computing platform

The Hawkeye exception monitoring system is built on the company's big data platform and stream computing platform. Figure 1-6 shows the technical architecture of the monitoring backend.

Figure 1-6 Technical architecture of the monitoring backend

Data source: Exception events reported by the front end first enter a message queue, which serves as the unified data source for subsequent storage and computation. The message queue also provides peak shaving, preventing the service from being overwhelmed by the large number of exception events reported during peak periods.

Engine: The engine layer is divided into a storage engine and a computing engine.

Storage engine: A suitable data store is selected for each kind of data according to its type and characteristics.

- Redis is used to generate the error ID of an exception event. A unique error ID is generated from dimensions such as project, exception type, and exception value, and exception events of the same type are aggregated under the same error.
- Elasticsearch (ES) stores a subset of the fields of each exception event, mainly for retrieval in the Hawkeye monitoring platform.
- HBase stores all fields of each exception event. Reported events include the exception stack, context, and similar information, so the data volume is large and the data is not used for retrieval, which makes HBase, designed for massive data, a good fit.
- MySQL stores configuration information, such as project settings and alarm configurations.

Computing engine: The stream computing engine Flink computes and aggregates the reported exception events in real time.
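The article does not show the Flink job itself. As a rough illustration of the kind of real-time aggregation involved, the sketch below counts events per error ID in fixed, non-overlapping time windows and flags the counts that reach a threshold. It is plain TypeScript rather than Flink code, and the window size, threshold, and all names are assumptions.

```typescript
interface AggregatableEvent {
  errorId: string;
  time: number; // event time in milliseconds
}

interface WindowCount {
  windowStart: number; // start of the window in milliseconds
  errorId: string;
  count: number;
}

// Count events per error ID in fixed, non-overlapping (tumbling) windows.
function countPerWindow(events: AggregatableEvent[], windowMs: number): WindowCount[] {
  const counts = new Map<string, WindowCount>();
  events.forEach((event) => {
    const windowStart = Math.floor(event.time / windowMs) * windowMs;
    const key = windowStart + ":" + event.errorId;
    const entry = counts.get(key);
    if (entry) {
      entry.count += 1;
    } else {
      counts.set(key, { windowStart, errorId: event.errorId, count: 1 });
    }
  });
  const result: WindowCount[] = [];
  counts.forEach((entry) => result.push(entry));
  return result;
}

// Windows whose count reaches the threshold are candidates for an alarm.
function overThreshold(counts: WindowCount[], threshold: number): WindowCount[] {
  return counts.filter((entry) => entry.count >= threshold);
}
```

A streaming engine performs the same grouping incrementally as events arrive instead of over a finished array, but the output has the same shape: one count per error ID per window.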

Outlook

Hawkeye is now connected to most of the business lines of iQIYI's content creation, distribution, and monetization platforms, and it has greatly improved operations and maintenance efficiency in daily work, specifically:

You can check the current problem list (Figure 1-2) to discover problems before users report them. Business developers can filter the list to their own service's problems, and back-end developers can set Service API Error (interface errors) as their default view.
You can use the stack trace, user device environment information, Trace Id, and error context to locate and troubleshoot faults quickly, as shown in Figure 1-4 and Figure 1-5. Full-link monitoring based on the Trace Id significantly improves the efficiency of troubleshooting interface errors.
You can analyze a project's running status through statistics along various dimensions. Hawkeye data is easy to feed into visualization platforms for statistical analysis of problem trends and of how each problem is being resolved. For example, we used Grafana's multi-data-source support and reporting features to build and display data reports, as shown in Figure 1-7.

Figure 1-7 Statistical analysis

As described above, Hawkeye already meets the monitoring requirements of our daily business, but several directions are still worth considering. For example, weighing SDK size, level of demand, and development cost, we can consider whether to implement the following:

Page crashes: After a page crashes, the normal reporting flow cannot run. To report information from before the crash, you can combine beforeunload with sessionStorage and report when the page is next opened. If the project is a PWA (Progressive Web App), it can also report from a Service Worker.
Merging logs: When traffic is high, many logs are reported, or users repeatedly trigger the same error, we are considering whether to merge several logs into one before reporting.
HTTP/2: Header compression and multiplexing in HTTP/2 would make reporting monitoring data more efficient.
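The first two ideas can be sketched as follows. None of this exists in Hawkeye yet according to the article, so every name here is hypothetical. The crash check assumes the storage entry survives until the page is next opened; the store is passed in as a parameter (in a browser it would be `sessionStorage`) so the sketch does not depend on browser typings.

```typescript
// Minimal structural type matching the parts of sessionStorage that are used.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const SESSION_KEY = "hawkeye:running-page"; // hypothetical key name

// Call on page load. Returns the page recorded by a previous load that never
// exited cleanly (a suspected crash), or null if there was none.
function checkPreviousCrash(store: KeyValueStore, currentPage: string): string | null {
  const previous = store.getItem(SESSION_KEY);
  store.setItem(SESSION_KEY, currentPage); // mark this load as running
  return previous;
}

// Call from a beforeunload listener: a normal exit clears the marker.
function markCleanExit(store: KeyValueStore): void {
  store.removeItem(SESSION_KEY);
}

interface MergedReport {
  errorId: string;
  count: number;
  firstSeen: number;
}

// Merge repeated occurrences of the same error into one entry per flush.
class ReportBatcher {
  private pending = new Map<string, MergedReport>();

  add(errorId: string, time: number): void {
    const entry = this.pending.get(errorId);
    if (entry) {
      entry.count += 1;
    } else {
      this.pending.set(errorId, { errorId, count: 1, firstSeen: time });
    }
  }

  // Return the merged entries in first-seen order and start a new batch.
  flush(): MergedReport[] {
    const batch: MergedReport[] = [];
    this.pending.forEach((entry) => batch.push(entry));
    this.pending.clear();
    return batch;
  }
}
```

The batcher would typically be flushed on a timer or when it reaches a size limit, so a burst of identical errors costs one report with a count rather than one request per occurrence.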

In the future we will continue to optimize and extend Hawkeye, for example by providing a mini program SDK as soon as possible and supporting more web frameworks, to help developers on more technology stacks find problems in time and resolve them quickly. Open-sourcing the platform will also be put on the agenda as soon as possible so that more development teams can use it and learn from it.
