Business Front-End Error Monitoring System (Sentry) Strategy Upgrade

Background

Front-end error monitoring is becoming more and more important in day-to-day work. Its benefits are as follows:

Collect front-end page errors
Assist in locating and analyzing errors
Detect errors before users do

This differs from the traditional feedback mechanism, in which developers passively wait for users to report a problem before troubleshooting it. With monitoring, the process is shorter and the experience is better.

Currently, our front-end error monitoring system is based on Sentry. We also compared and tried several mature monitoring products in the industry, such as Badjs, FrontJS, and Fundebug, but each was either cumbersome to use or not open source, so we finally chose Sentry. After a period of use, we have found Sentry to be a relatively mature and complete front-end error monitoring solution that basically meets the requirements of our projects.

For a business team, however, online quality and the speed of problem resolution matter even more (efficiency is life). Every merchant is important to us, and we do not want frequent online problems or slow fixes to make merchants think we are an unprofessional outfit.

We therefore carried out a series of optimizations and upgrades, covering changes to the projects, the reported information, and the Sentry configuration.

Monthly online bug line chart

Since we rolled out the strategies described below in July, the number of errors first discovered by online users has stayed at 2 or fewer per month, a clear improvement.

We will talk about our strategy in detail below.

Why should the reporting strategy be adjusted?

We summarized some of the issues we encountered with Sentry as follows:

Information collection is chaotic (all error information is mixed together):

Locating problems is relatively slow;
The scope of impact is hard to assess;
Error frequency cannot be counted.

Monitoring is incomplete:

Mini programs lack monitoring;
Interfaces are not monitored;
404 requests lack monitoring.

Warning emails are too frequent (which can cause developer fatigue).

Of course, compared with having no error monitoring at all, this setup was already very useful.

However, if we can solve all of these problems, we can not only improve online quality and the efficiency of handling problems, but also avoid some problems during development, detect problems before users do, and produce a common solution for our wider front-end organization.

How to solve it?

Offense instead of defense (proactive reporting)
Multi-dimensional tags & auxiliary error information & custom error grouping rules
Modify email sending rules (reporting the right things is key)
All-round monitoring compatibility

Offense instead of defense (proactive reporting)

This step means intruding into project code. Front-end infrastructure work has generally aimed to be non-intrusive to the business, but in practice it is occasionally necessary to touch business code when doing so brings considerable benefit. What we can do is keep the intrusion, and the pollution it causes, to a minimum. Our changes to the projects were as follows.

Using React as an example, we did the following; the Vue projects were handled similarly:

Page changes

Add error catching component:

Component error catching & page error catching:

This approach not only catches errors effectively and distinguishes error levels, but also prevents an error in a child component from breaking the rendering of the entire page and causing a white screen.
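
As a rough illustration of the idea (not the team's actual components), the framework-free sketch below wraps a render function so that a failure is reported with a level and replaced by fallback content. In React this role is normally played by an error boundary, a class component that implements `componentDidCatch`. The names `renderSafely`, `ErrorLevel`, and `Reporter` are hypothetical, and rendering is reduced to string building so the example runs on its own.

```typescript
// Hypothetical error levels: a failed widget is less severe than a failed page.
type ErrorLevel = "component" | "page";

interface CaughtError {
  level: ErrorLevel;
  name: string; // which component or page failed
  error: Error;
}

// Stand-in for the real reporter (for example, a wrapper around Sentry).
type Reporter = (caught: CaughtError) => void;

// Run a render function. On failure, report the error with its level and
// return fallback content instead of letting the error blank the whole page.
function renderSafely(
  name: string,
  level: ErrorLevel,
  render: () => string,
  fallback: string,
  report: Reporter,
): string {
  try {
    return render();
  } catch (e) {
    const error = e instanceof Error ? e : new Error(String(e));
    report({ level, name, error });
    return fallback;
  }
}

// Usage: a broken child component does not take the page down.
const reported: CaughtError[] = [];
const recordReport: Reporter = (caught) => {
  reported.push(caught);
};

const html = renderSafely(
  "OrderPage",
  "page",
  () => {
    const header = renderSafely("Header", "component", () => "<header/>", "", recordReport);
    const list = renderSafely(
      "OrderList",
      "component",
      () => {
        throw new Error("list data is undefined");
      },
      "<p>Failed to load</p>",
      recordReport,
    );
    return header + list;
  },
  "<p>Page unavailable</p>",
  recordReport,
);
```

Because the failing child is caught at the component level, the page-level wrapper never sees the error and the rest of the page still renders.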

Interface monitoring

Why are we doing interface monitoring?

To assist back-end error monitoring and log troubleshooting by providing more useful information;
To monitor abnormal states of interfaces and services, and use them to find weaknesses in existing code, servers, and product logic;
To encourage front-end developers to pay attention to online problems and interface errors, and to integrate better into the business.

Because all of our projects share the same request wrapper SDK, this was surprisingly easy to implement.
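
A minimal sketch of what such a hook might look like, assuming the shared request SDK can call one function for every response. The response shape, the convention that business code `0` means success, and the names `inspectResponse` and `createResponseHook` are all assumptions made for illustration.

```typescript
// Hypothetical response shape: HTTP status plus a business code and message.
interface ApiResponse {
  url: string;
  httpStatus: number;
  code: number;
  message: string;
}

interface InterfaceIssue {
  kind: "http" | "business";
  url: string;
  detail: string;
}

// Decide whether a response counts as an interface error worth reporting.
function inspectResponse(res: ApiResponse): InterfaceIssue | null {
  if (res.httpStatus < 200 || res.httpStatus >= 300) {
    return { kind: "http", url: res.url, detail: "HTTP " + res.httpStatus };
  }
  // Assumption: business code 0 means success in this hypothetical API.
  if (res.code !== 0) {
    return { kind: "business", url: res.url, detail: "code " + res.code + ": " + res.message };
  }
  return null;
}

// Build the hook once inside the shared request SDK so that every caller
// gets interface monitoring without changing business code.
function createResponseHook(reportIssue: (issue: InterfaceIssue) => void) {
  return function onResponse(res: ApiResponse): ApiResponse {
    const issue = inspectResponse(res);
    if (issue) {
      reportIssue(issue);
    }
    return res; // the response is passed through unchanged
  };
}

// Usage: collect issues in memory instead of sending them anywhere.
const issues: InterfaceIssue[] = [];
const onResponse = createResponseHook((issue) => {
  issues.push(issue);
});

onResponse({ url: "/api/order/list", httpStatus: 200, code: 0, message: "ok" });
onResponse({ url: "/api/order/save", httpStatus: 200, code: 5001, message: "param too long" });
onResponse({ url: "/api/video/info", httpStatus: 502, code: 0, message: "" });
```

Keeping the classification in a separate pure function makes it easy to adjust what counts as an abnormal interface state without touching the SDK plumbing.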

Multi-dimensional tags & auxiliary error information & custom error grouping rules

Advantages:

Quickly locate the problem (within 1 minute) and quickly assess the scope of impact;
Provide more of the information needed for problem analysis, helping to solve problems quickly;
Organize the error list and view error frequency, in order to reduce risks in code, services, and product logic.

For example: quickly view the distribution of errors according to tags

Override error reporting methods

The goal is to make the reported error information easier to use and the auxiliary information more complete.

Among them:

Tags: error tags – for quickly locating errors
Extra: auxiliary information – for assisting with troubleshooting
Fingerprint: custom error grouping – keeps unrelated errors from being mixed together

Of course, the default error reporting can stay enabled at the same time, and tags and extra can be set on it as well. Its main purpose is to catch the errors that the active reporting misses.
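
The sketch below shows one way an overridden reporting method could attach tags, extra, and a fingerprint to each error. The `ScopeLike` and `SentryLike` interfaces are declared locally and only mirror the shape of `withScope`, `setTag`, `setExtra`, `setFingerprint`, and `captureException` from Sentry's JavaScript SDK, so the example runs without the SDK installed. `createReporter`, `ReportOptions`, the recording fake, and every tag name and value are hypothetical.

```typescript
// Local stand-ins that mirror the shape of the Sentry SDK calls used here.
interface ScopeLike {
  setTag(key: string, value: string): void;
  setExtra(key: string, value: unknown): void;
  setFingerprint(fingerprint: string[]): void;
}
interface SentryLike {
  withScope(callback: (scope: ScopeLike) => void): void;
  captureException(error: Error): void;
}

// Hypothetical options accepted by the overridden reporting method.
interface ReportOptions {
  tags?: Record<string, string>; // for filtering and quick location
  extra?: Record<string, unknown>; // auxiliary troubleshooting data
  fingerprint?: string[]; // custom grouping key
}

function createReporter(sentry: SentryLike) {
  return function reportWithContext(error: Error, options: ReportOptions = {}): void {
    const tags = options.tags ?? {};
    const extra = options.extra ?? {};
    // The scope changes apply only to events captured inside this callback.
    sentry.withScope((scope) => {
      for (const key of Object.keys(tags)) {
        const value = tags[key];
        if (value !== undefined) scope.setTag(key, value);
      }
      for (const key of Object.keys(extra)) {
        scope.setExtra(key, extra[key]);
      }
      if (options.fingerprint) scope.setFingerprint(options.fingerprint);
      sentry.captureException(error);
    });
  };
}

// A recording fake so the example can run without the real SDK.
interface SentEvent {
  error: Error;
  tags: Record<string, string>;
  extra: Record<string, unknown>;
  fingerprint: string[];
}
const sentEvents: SentEvent[] = [];
let activeScope: { tags: Record<string, string>; extra: Record<string, unknown>; fingerprint: string[] } | null = null;

const fakeSentry: SentryLike = {
  withScope(callback) {
    const draft = { tags: {} as Record<string, string>, extra: {} as Record<string, unknown>, fingerprint: [] as string[] };
    activeScope = draft;
    callback({
      setTag: (key, value) => { draft.tags[key] = value; },
      setExtra: (key, value) => { draft.extra[key] = value; },
      setFingerprint: (fingerprint) => { draft.fingerprint = fingerprint; },
    });
    activeScope = null;
  },
  captureException(error) {
    const data = activeScope ?? { tags: {}, extra: {}, fingerprint: [] };
    sentEvents.push({ error, tags: data.tags, extra: data.extra, fingerprint: data.fingerprint });
  },
};

// Usage: tag names and values here are purely illustrative.
const reportWithContext = createReporter(fakeSentry);
reportWithContext(new Error("save failed"), {
  tags: { page: "order-detail", errorType: "interface" },
  extra: { requestUrl: "/api/order/save", userId: "u-1001" },
  fingerprint: ["interface", "/api/order/save"],
});
```

With a fingerprint such as the one above, every failure of the same interface would land in one group regardless of the error message, which is what keeps the issue list from becoming a mix of unrelated entries.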

We eventually extracted the overridden methods into a common SDK that can be used across our wider front-end organization.

The resulting Sentry issue list

Before: all information is mixed together, errors of the same type are not grouped, and dates are not clearly distinguished, so the trend of errors cannot be seen.

After: errors of the same type are grouped together, with more useful tags for quickly locating them.

Modifying email rules (reporting the right things is key)

Code level

Control email sending via the isSendMail=1 tag
Control how often isSendMail=1 is set

For example, set isSendMail=1 when a page shows a white-screen error, or when the same user hits other errors of the same type more than 3 times during one page view.
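
The rule in the example above could be sketched roughly as follows. Only the `isSendMail` tag name and the threshold of 3 come from the text. `createMailTagger`, the error-type strings, and the choice to set the tag only on the occurrence that first exceeds the threshold (so one page view produces at most one alert per error type) are assumptions.

```typescript
// Create one tagger per user page view, so counts reset on navigation.
function createMailTagger(threshold: number = 3) {
  // Occurrences of each error type during this page view.
  // Object.create(null) avoids collisions with inherited keys like "constructor".
  const counts: Record<string, number> = Object.create(null);

  // Returns the extra tags to attach to this error report.
  return function tagsFor(errorType: string, isWhiteScreen: boolean): Record<string, string> {
    const seen = (counts[errorType] ?? 0) + 1;
    counts[errorType] = seen;

    // White-screen errors always trigger an email. Other errors trigger one
    // only on the occurrence that first exceeds the threshold (assumption).
    if (isWhiteScreen || seen === threshold + 1) {
      return { isSendMail: "1" };
    }
    return {};
  };
}

// Usage: the fourth "api-timeout" in one page view is the first to be tagged.
const tagsFor = createMailTagger(3);
const timeoutTags = [1, 2, 3, 4, 5].map(() => tagsFor("api-timeout", false));
const whiteScreenTags = tagsFor("render-crash", true);
```

A Sentry alert rule that filters on this tag can then send email only for tagged events, while untagged events are still recorded in the issue list.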

Sentry system rule configuration

This greatly reduces the overall frequency of error emails and lets developers focus on the errors we care about most. Further rules can use the tags reported from the front end to reduce alert frequency even more.

All-round monitoring compatibility

Because these compatibility schemes were completed together with other teams and the support team, only the design diagrams of the underlying principle are shown here. The subsequent upgrade strategy is the same as described above.

Mini program monitoring scheme

404 Monitoring Scheme

What are the benefits?

Emails focus on what matters
Problems are located within 1 minute
Abnormal services can be traced
Problems in newly launched requirements are discovered in real time
All-round monitoring is achieved
The scope of impact is assessed quickly (to prioritize problem solving)
Risky code logic is discovered and collected (used for pre-commit checks to reduce online errors)

Real cases

2019.06.18 – iOS 10 page compatibility problem (email warning)

After a new requirement was launched at 10 p.m., we received a warning email: there was a problem on iOS 10. The scope of impact was quickly assessed.

2019.07.22 – Video MD5 does not exist (reported by user)

The error records were quickly found using the user ID.

They showed a problem with the parameters being passed, yet there was no logic problem in the front-end code. We eventually traced it to other publishing entry points and found that the publishing function provided by the middle platform could lose the MD5 parameter.

2019.07.20 – Order placement failed after a marketing page promotion (interface exception after launch)

Working with the back end, we found that one of the parameters was too long and exceeded the database storage limit.

We now require that the Sentry error reports be watched after every requirement is launched, so that errors are not left exposed to users.

2019.08.06 – Interface error message shown to all users (incomplete code logic)

We found that the front end was sending the same request several times at once while the back-end interface had no lock, so the error came from the conditions under which records were written to the database. Reviewing the code turned up many similar risks, including missing boundary-value checks. In the end we compiled a set of front-end development risk standards and developed a set of pre-commit rules that check for risky code (not code style or syntax; if you are interested, we can arrange for the colleagues involved to write this up and share it), in order to avoid risky code and reduce online errors.
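
As an illustration of the kind of risk involved, and not a reproduction of the team's risk standard or pre-commit rules, the sketch below shows a simple guard that refuses to start a request while an identical one is still in flight. `createRequestGuard` and the request keys are hypothetical.

```typescript
// Tracks which requests are currently in flight, keyed by a caller-chosen
// string such as method plus URL plus a hash of the payload.
function createRequestGuard() {
  const inFlight: Record<string, boolean> = Object.create(null);

  return {
    // Returns true if the caller may send the request, false if an
    // identical request is already pending.
    tryStart(key: string): boolean {
      if (inFlight[key]) return false;
      inFlight[key] = true;
      return true;
    },
    // Must be called when the request settles, on success or on failure.
    finish(key: string): void {
      delete inFlight[key];
    },
  };
}

// Usage: a double click on "submit" only sends one request.
const guard = createRequestGuard();
const requestKey = "POST /api/order/create";

const firstClick = guard.tryStart(requestKey); // allowed
const secondClick = guard.tryStart(requestKey); // rejected while pending
guard.finish(requestKey); // response arrived
const laterClick = guard.tryStart(requestKey); // allowed again
```

A front-end guard like this reduces duplicate submissions but does not replace locking or idempotency checks on the back end.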

Conclusion

Every round of troubleshooting is a gain for us. Many problems can be avoided in advance, and even when they cannot be avoided, we can use Sentry to solve them more efficiently.

Finally, embracing problems is what makes us grow.