This article introduces a more focused grayscale monitoring and alarm configuration.

Background

Looking back at the past three years, the total number of front-end faults is not large, but the data behind them shows that front-end production safety, especially sub-domain high availability, is at a relatively low level: the group-wide rate of faults discovered by monitoring is 46.8%, while for front-end faults the rate is only 22.7%, far below the expected level of monitoring coverage. We therefore began to focus on front-end quality control, mainly through monitoring and alarms, and over time achieved certain results. Analyzing several missed online problems, especially the relatively serious ones that no alarm reported (white screens, failed jumps, etc.), we found they share the following traits:

  1. They came from new changes
  2. They were not across the board: only part of the traffic, under specific conditions, was affected
  3. They could have been discovered during the release phase, but were instead left online for some time

Therefore, in the next phase, besides making alarm configuration more comprehensive, we need to focus on grayscale monitoring.

The importance of grayscale monitoring



In terms of stability

  1. Limitations of pre-delivery testing: it cannot fully cover online user scenarios (including diverse user behaviors, rich client devices, massive business data, etc.)
  2. Timeliness: during the grayscale window, engineers pay closer attention to problems and respond more promptly
  3. Timely stop-loss: catching problems during the small-scale trial phase avoids the much larger impact of a full release

In terms of efficiency gains

  1. Multi-device testing efficiency: on some multi-device shopping-guide pages, more than 80% of the test points can be covered in 10% of the testing time; the remaining 90% of the time tends to go to isolated device-specific abnormal scenarios, which can instead be covered by grayscale verification
  2. Grayscale verification: after a problem found in grayscale is fixed, besides pre-release testing, some non-main-flow scenarios can continue to be verified with a small-scale grayscale release

The effect of grayscale monitoring is very clear: taking our detail page as an example, across 27 monitored releases in 4 months, 5 problems were found during the grayscale stage and only 1 problem was missed, and that one did not affect the main flow.

How to monitor the grayscale stage

Log monitoring process in the grayscale stage

Grayscale monitoring means keeping up a certain frequency of monitoring reports from the start of the grayscale release until it reaches 99%. Why send an analysis report rather than rely on alarms alone?

  1. There are now so many alarms and so much noise that engineers easily start ignoring them subconsciously
  2. Sending a monitoring analysis report adds a sense of ritual and makes people actually read its contents
  3. Some problems are hard to surface through alarms but are obvious in an analysis report
  4. The ARMS system already has mature alarm capabilities and the relevant alarms are configured, so here we focus on the analysis report

See the following figure for the specific steps (a minimal automation sketch follows the list):

  1. A grayscale release triggers log monitoring; the first grayscale batch is 5%
  2. After 10 minutes (generally while the grayscale level is held at 5%-20%), a log analysis report is issued automatically, listing all the data and anomalies found by the analysis (see Figure 2 for details)
  3. If a risk is confirmed, roll back to 0% grayscale, then fix the bug → regression test → release to grayscale again, and so on
  4. If no risk is confirmed, continue to expand the grayscale while keeping up high-frequency monitoring
  5. If everything stays healthy up to 99% grayscale, release fully online
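The loop above can be automated. Below is a minimal TypeScript sketch of the report loop, assuming hypothetical helpers getGrayPercent, buildAnalysisReport, and sendToGroup that wrap the release platform and log analysis; it illustrates the flow rather than our actual implementation.

```typescript
// Minimal sketch of the grayscale report loop. The three declared helpers are
// hypothetical wrappers around the release platform and log analysis, not a real API.

type Report = { newErrors: number; riskFound: boolean; summary: string };

declare function getGrayPercent(app: string): Promise<number>;       // current grayscale ratio, 0-100
declare function buildAnalysisReport(app: string, sinceMs: number): Promise<Report>;
declare function sendToGroup(report: Report): Promise<void>;         // push the report to the team chat

const REPORT_INTERVAL_MS = 10 * 60 * 1000; // one analysis report every 10 minutes during grayscale

export async function monitorGrayRelease(app: string): Promise<void> {
  let lastReportAt = Date.now();
  for (;;) {
    const percent = await getGrayPercent(app);
    if (percent >= 99) break; // 99%+: grayscale finished, hand over to the normal alarm setup
    if (percent === 0) break; // rolled back to 0%: fix the bug, regression test, release again

    const report = await buildAnalysisReport(app, lastReportAt);
    await sendToGroup(report); // report lists all data and anomalies found by the analysis
    lastReportAt = Date.now();

    if (report.riskFound) {
      console.warn(`[gray-monitor] risk confirmed for ${app}: roll back to 0% and fix`);
    }
    await new Promise((resolve) => setTimeout(resolve, REPORT_INTERVAL_MS));
  }
}
```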

Monitoring indicators and exception analysis

We collect SLS logs and analyze API errors, JS errors, traffic, business tracking points (buried points) and performance tracking points respectively. New errors are especially important during the grayscale stage; existing errors and overall totals are compared period-over-period, day-over-day, and week-over-week. The specific data breaks down as follows.

API errors

Because the platform's API error statistics are defined differently from what we actually need (see the figure below), we mainly look at new errors together with day-over-day, week-over-week, and period-over-period data (a small computation sketch follows this list)

  • Error rate: mainly compared against the same period. Why not look at the API success rate instead? Because the failure rate is far more sensitive: a success rate dropping from 99.5% to 99% is only a 0.5-point change, while the corresponding failure rate rising from 0.5% to 1% is a 100% increase. For example, one of our detail-page interfaces held a 99.5% success rate all year round; a release with a front-end bug only dropped it to 99.3%, yet it affected more than 1 million users in a single day
  • Number of errors: the count of new errors for an API; 1 to 2 errors per 10 minutes is treated as WARN level
  • Number of affected users: the number of users hit by a new API error message
    • A. Combined with the error count, this helps determine whether a large number of errors is concentrated on a few individual users
    • B. If the user count carries more weight than the error count, the impact is wider
  • Call volume: abnormal call volume can also reveal front-end bugs. Zero usually means an error is preventing the call from being made at all, while an abnormally high volume usually means repeated calls
    • Case: 2020.12.01 – an order-result polling bug was detected in the exception log
    • Watching the logs, we found the call volume of one interface surged compared with normal days; checking the raw logs showed the same user repeatedly requesting the same interface, which suggested a problem in the polling logic
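To make the "failure rate, not success rate" point and the error-count-versus-users comparison concrete, here is a small sketch over a hypothetical log record shape (ApiLog and its fields are assumptions for illustration):

```typescript
// Sketch of the API error metrics above, over a hypothetical log record shape.
interface ApiLog {
  api: string;
  uid: string;      // user id
  success: boolean;
}

// Failure rate is far more sensitive than success rate: 99.5% -> 99% success is a
// 0.5-point drop, but the same change seen as 0.5% -> 1% failure is a +100% increase.
export function failureRateChange(baseline: ApiLog[], current: ApiLog[]): number {
  const rate = (logs: ApiLog[]) =>
    logs.filter((l) => !l.success).length / Math.max(logs.length, 1);
  const before = rate(baseline);
  const after = rate(current);
  return before === 0 ? Infinity : (after - before) / before; // e.g. (0.01 - 0.005) / 0.005 = 1.0
}

// Error count vs. affected users: many errors from one user usually means a retry or
// polling loop; many users with a few errors each means the impact is wide.
export function errorImpact(logs: ApiLog[], api: string) {
  const errors = logs.filter((l) => l.api === api && !l.success);
  return { errorCount: errors.length, affectedUsers: new Set(errors.map((l) => l.uid)).size };
}
```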

JS errors

  • Error rate: a significant day-over-day or period-over-period increase requires attention
    • Case: during one grayscale release, monitoring found the error rate had risen to 10% and the error count had grown to 56,000 (5.6W)

  • Number of errors: the count of errors; 1 to 2 per 10 minutes is treated as WARN level
  • Number of affected users: the number of users hit by a new error message
    • A. Combined with the error count, this helps determine whether a large number of errors is concentrated on a few individual users
    • B. If the user count carries more weight than the error count, the impact is wider
    • Case: 2020.11.26, a release of the item detail page; the problem was found at 25% grayscale (a defensive sketch follows this list)
    • Error: "split is not a function", because the SPM parameter on some auction URLs was parsed as an array rather than a string
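As an illustration of the class of bug in the 2020.11.26 case (not the actual patch), the sketch below guards against a query parameter that may arrive as either a string or an array before calling split:

```typescript
// Sketch of a defensive guard for the "split is not a function" class of error:
// on some URLs the same query key appears more than once, so parsers return an array.
export function firstSpmSegment(spm: string | string[] | undefined): string | undefined {
  const value = Array.isArray(spm) ? spm[0] : spm; // normalize the array form to a single string
  if (typeof value !== "string") return undefined; // tolerate missing or odd values instead of throwing
  return value.split(".")[0];
}
```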

Traffic anomalies

  • Mainly look at PV and UV, but exclude the influence of unusual factors such as promotion campaigns and large traffic diversions from the Mobile Taobao app on the comparisons. Only when the day-over-day, week-over-week, and period-over-period figures all deviate significantly is the traffic judged abnormal (see the noise handling section later), as sketched below.
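A minimal sketch of the "only flag when all comparisons deviate" rule; the TrafficSample shape and the 30% threshold are illustrative assumptions:

```typescript
// Sketch: flag PV/UV as abnormal only when period-over-period, day-over-day and
// week-over-week comparisons all deviate beyond a threshold (30% here, purely illustrative).
interface TrafficSample {
  current: number;           // e.g. PV in the latest window
  prevPeriod: number;        // the window just before (period-over-period)
  sameTimeYesterday: number; // day-over-day baseline
  sameTimeLastWeek: number;  // week-over-week baseline
}

export function isTrafficAbnormal(s: TrafficSample, threshold = 0.3): boolean {
  const deviates = (baseline: number) =>
    baseline > 0 && Math.abs(s.current - baseline) / baseline > threshold;
  // Requiring all baselines to deviate filters out promotions or one-off traffic
  // diversions that only distort one of the comparisons.
  return deviates(s.prevPeriod) && deviates(s.sameTimeYesterday) && deviates(s.sameTimeLastWeek);
}
```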

Business tracking-point anomalies

These are custom business tracking points, which make it easy to produce statistics carrying business attributes (a minimal report sketch follows the list below)

  • Total success tracking points: judged abnormal only when the data deviates significantly from the same time of the previous day, week, and month respectively (see noise handling below for details)
    • Case: 2020.12.01, we found that the connection to Apush was 100% down. Below are the statistics for the opcode that sends Apush messages; the sample size stayed at 0 until the problem was fixed on 12.02

  • Total anomaly tracking points: anomaly tracking data is analyzed according to the specific business scenario
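For the custom business tracking points, a minimal sketch of what such a report might look like; the endpoint, field names and scene strings are assumptions, not our real logging SDK:

```typescript
// Sketch of a custom business tracking point carrying business attributes.
// The endpoint and field names are illustrative, not our real logging SDK.
interface BusinessPoint {
  scene: string;                   // e.g. "detail.apush.connect"
  success: boolean;                // counted into the success or the anomaly total
  attrs?: Record<string, string>;  // business attributes, e.g. { opcode: "..." }
}

export function reportTrackingPoint(point: BusinessPoint): void {
  // A success point whose sample size stays at 0 (as in the Apush case) is itself a signal.
  navigator.sendBeacon("/log/business", JSON.stringify({ ...point, t: Date.now() }));
}
```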

Performance monitoring

The front end adds tracking points at each stage of the page's loading chain and reports them for statistical analysis. It is recommended to observe performance changes over a longer window (the chart shown here is a daily trend chart, for illustration only; during the grayscale stage, compare the grayscale period against the data from before grayscale, over a cycle of more than 2 days).

  • Page full load time (a computation sketch of these metrics follows this list):
    • P50 is the value at the 50th percentile of all samples sorted ascending. For example, p50 = 1324 in the figure below means 50% of pages open in under 1324 milliseconds
    • P70 is the value at the 70th percentile of all samples sorted ascending. In the figure below, 70% of pages open in under 1730 milliseconds
    • ValidAvg is the average after removing spikes (abnormal data over 15 seconds)
    • WellRate is the proportion of loads under 2 s; the figure below shows that about 78.16% of users load the page within 2 s

(The trend in the figure below shows that the 12.24 detail-page release degraded front-end performance; the cause needs to be investigated.)

  • Page white-screen rate: counted when the full load time exceeds 5 s (from the user's perceived experience, by 5 s they have already given up and left; a looser definition is anything over 15 s)
  • Interface response time: the response time of an interface also affects front-end performance. The average is calculated after excluding noise (over 10 s); success time and failure time can be examined separately for details
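The load-time metrics above (P50, P70, ValidAvg, WellRate, white-screen rate) can be computed directly from the reported samples; a minimal sketch, assuming a non-empty array of full-load times in milliseconds:

```typescript
// Sketch of the load-time metrics above, computed from an array of full-load times
// in milliseconds (assumes a non-empty sample).
const percentile = (sortedAsc: number[], p: number): number =>
  sortedAsc[Math.min(sortedAsc.length - 1, Math.floor((sortedAsc.length * p) / 100))];

export function loadTimeStats(loadTimesMs: number[]) {
  const sorted = [...loadTimesMs].sort((a, b) => a - b);
  const valid = sorted.filter((t) => t <= 15_000); // ValidAvg drops spikes over 15 s
  return {
    p50: percentile(sorted, 50),                                              // 50% of pages open faster than this
    p70: percentile(sorted, 70),
    validAvg: valid.reduce((sum, t) => sum + t, 0) / Math.max(valid.length, 1),
    wellRate: sorted.filter((t) => t < 2_000).length / sorted.length,         // share of loads under 2 s
    blankScreenRate: sorted.filter((t) => t > 5_000).length / sorted.length,  // white-screen: loads over 5 s
  };
}
```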

Performance monitoring and analysis will be documented in more detail later.

Eliminating noise to improve the effectiveness of risk insight

Exception scenarios with a high probability of false positives

(The following situations are generally handled so that they do not trigger false alarms; one of the measures is sketched after the table.)

| Problem | Example | Measure |
| --- | --- | --- |
| Business fluctuates differently across time periods | Violent fluctuation in the early-morning hours, much more stable in the daytime | Thresholds are adjusted dynamically and automatically to avoid false positives |
| Business spikes and drops sharply | A burst of bids as an auction approaches its fixed closing time, and spikes and drops within 1 hour caused by client push notifications and merchant activities on the hour | If the metric falls back near the baseline, it is not reported |
| Business drops at scheduled times | A sudden drop around a fixed time window, such as 01:00~07:00 | Configure those time points not to be reported |
| Major promotions | During Double 11, Double 12, and red-envelope campaigns, the data is inconsistent with daily patterns | Data fluctuations on those specific days are monitored in real time and automatically routed to the big-promotion model |
| Transient jitter in error-code monitoring | Error codes often jitter briefly and then return to their normal level | The system filters jitter that matches similar history, and raises an alarm only when the jitter is too large or the error persists |
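As one example of these measures, here is a sketch of the transient-jitter handling in the last row: alarm only when the error level stays elevated for several consecutive windows. The window count and the idea of a per-time-period threshold are illustrative assumptions, not the system's actual algorithm.

```typescript
// Sketch of the transient-jitter measure in the last row: alarm only when the error
// count stays above the threshold for several consecutive windows.
export function shouldAlarm(
  recentWindows: number[],  // error counts of the latest windows, oldest first
  threshold: number,        // baseline threshold, adjustable per time period
  sustainedWindows = 3      // how many consecutive windows must exceed it (illustrative)
): boolean {
  if (recentWindows.length < sustainedWindows) return false;
  return recentWindows.slice(-sustainedWindows).every((count) => count > threshold);
}
```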

Invalid data that needs to be eliminated

API errors

Long-lived connections, and interface data collected from outside the Taobao domain, are usually filtered out directly with a like('m.taobao.com/%') condition

JS errors

A. Log data from interactions with the client, WebView framework data, etc.

B. Uncaught TypeError: Cannot read property '0' of undefined, caused by the user's swipe gesture on the header area

C. Errors caused by long connections, WebSockets, and back-end interfaces

For categories A, B, and C, see the example below

D. Scalpers and robots

In general, most of these can be filtered out by taking the number of affected users into account, as in the example below: the JS error rate is very high, but the number of people actually affected is only 1. Weeding out invalid data is a process that takes time to polish.

Recommended: proactively defined tracking points plus a universal catch-all

This is the best way to eliminate invalid data, but it requires sorting out the scenarios and adding the tracking points manually; a minimal sketch follows.
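The sketch below illustrates the idea under stated assumptions: the report() sender, the endpoint, and the scene names are hypothetical. Proactively reported errors carry an explicit scene name, while window error and unhandledrejection listeners act as the universal catch-all for everything not explicitly covered.

```typescript
// Sketch of "proactively defined tracking points plus a universal catch-all".
// The report() sender, scene names and endpoint are illustrative assumptions.
function report(scene: string, message: string): void {
  navigator.sendBeacon("/log/jserror", JSON.stringify({ scene, message, t: Date.now() }));
}

// 1. Proactively defined points: wrap known risky steps with an explicit scene name,
//    so the report is immediately attributable and never counted as noise.
export async function loadDetailData<T>(fetchDetail: () => Promise<T>): Promise<T> {
  try {
    return await fetchDetail();
  } catch (err) {
    report("detail.loadData", String(err));
    throw err;
  }
}

// 2. Universal catch-all: anything not explicitly covered is still reported, tagged
//    so it can be triaged (and filtered) separately from the defined points.
window.addEventListener("error", (e) => report("fallback.onerror", e.message));
window.addEventListener("unhandledrejection", (e) =>
  report("fallback.unhandledrejection", String(e.reason))
);
```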

Conclusion

For front-end quality, grayscale monitoring has a clear effect in both the stability and the efficiency directions, and we highly recommend it. It does require cooperation from business front-end developers and testers, and sometimes even back-end developers. Beyond grayscale monitoring, we also have monitoring alarms, online inspection, performance analysis, and several other front-end quality solutions that together provide all-round protection.