Having recently built monitoring and alarms for a front-end project, this article summarizes the approach

When are front-end monitoring alarms needed

When a project's traffic grows and its business value reaches a certain level, occasional front-end faults in production begin to cause small amounts of business loss. At that point you need to set up front-end monitoring alarms to avoid unacceptable losses from a critical fault later on

Monitoring alarms enable developers to quickly discover and resolve online faults, ensuring the business can develop stably

Think of it as an online safeguard for project stability; by contrast, the team's technical reviews, code reviews, RD self-testing, and QA regression testing are offline safeguards for project stability

What are front-end monitoring alarms

Monitoring: report the various kinds of information generated while the project runs. When an online fault occurs, developers can use this information to quickly find, locate, and fix the problem

Alarm: based on configured rules, notifications are triggered from the reported information, so developers can detect online faults without watching dashboards by eye, improving the efficiency of fault detection

Monitoring

Front-end monitoring consists of application monitoring (white box) and business monitoring (black box).

Application monitoring:

  1. Node server: CPU, disk, and memory usage, and the response status of external interfaces (a sketch follows this list)
  2. Client: static resource success rate, runtime exceptions, interface request and response status, and the status of dependent services
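
As a rough sketch of the Node-server side (the reporting endpoint and interval are assumptions, not part of any particular platform):

```typescript
// Minimal Node-side sketch: sample CPU load and memory and ship them to a monitoring endpoint.
// The endpoint URL and the 60s interval are assumptions for illustration.
import os from 'node:os';

setInterval(() => {
  const metrics = {
    loadAvg1m: os.loadavg()[0],                          // CPU load average over the last minute
    freeMemMB: Math.round(os.freemem() / 1e6),           // free host memory
    rssMB: Math.round(process.memoryUsage().rss / 1e6),  // resident memory of this process
    time: Date.now(),
  };
  fetch('http://monitor.internal/report', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(metrics),
  }).catch(() => { /* swallow reporting failures so monitoring never breaks the service */ });
}, 60_000);
```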

Black box monitoring: traffic at the key nodes of business processes

Alarms

  1. Subdivide the information: split the reported exception information into dimensions such as page, exception level, and exception name, and combine them into alarm indicators
  2. Configure alarm modes: assign alarm severities based on how much a fault impacts the business

How to implement front-end monitoring alarms

Have goals first. Without goals, results cannot be measured

The goal could be to keep business loss below a small threshold, for example 50 lost orders

Then identify the scenarios in the business process that could produce that level of loss, work out the corresponding volume of abnormal reports, and make sure the monitoring alarm surfaces the problem before that critical value is reached
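
As a hedged illustration with made-up numbers: if one abnormal order-submit report roughly corresponds to one lost order, the alarm threshold has to sit well below the 50-order ceiling.

```typescript
// Hypothetical numbers, purely to illustrate how a threshold can be derived from the loss goal.
const lossCeiling = 50;      // maximum acceptable number of lost orders
const ordersPerReport = 1;   // assumed: one abnormal submit report ≈ one lost order
const safetyFactor = 0.1;    // alarm at 10% of the ceiling, leaving time to react
const alarmThreshold = Math.ceil((lossCeiling / ordersPerReport) * safetyFactor); // => 5 reports
```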

Monitoring

Monitoring reports can be divided into the following categories:

  1. Request: static resources and interfaces
  2. Exception: runtime exceptions
  3. Business: visits to the key nodes of core processes

General reported information includes: time, page URL, interface URL, error message, error type, error level, User Agent, gray-release flag, webVersion, and userId.

The gray-release flag lets developers pull only the reports coming from gray (canary) traffic during the gray-release verification stage, improving the efficiency of finding problems
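
As a sketch, the report payload could be shaped like this (the field names are illustrative, not a fixed schema):

```typescript
// Illustrative shape of a monitoring report; field names are assumptions, not a fixed schema.
interface MonitorReport {
  time: number;            // timestamp of the event
  pageUrl: string;         // page on which the event occurred
  apiUrl?: string;         // interface URL, for request-related reports
  errorMessage?: string;
  errorType?: string;      // e.g. 'request' | 'runtime' | 'business'
  errorLevel?: 'info' | 'warning' | 'error';
  userAgent: string;
  isGray: boolean;         // whether the session belongs to gray (canary) traffic
  webVersion: string;
  userId?: string;
}
```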

Request

  1. Hook XHR: open, send, load, error, abort, and onreadystatechange
  2. Wrap fetch to capture request information (see the sketch below)
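
A minimal sketch of wrapping fetch, assuming a report() helper that ships data to the monitoring platform; hooking XHR works the same way by wrapping open/send and listening to its events:

```typescript
// Minimal sketch: wrap window.fetch to capture request status; report() is a placeholder helper.
const originalFetch = window.fetch.bind(window);

const toUrl = (input: RequestInfo | URL) =>
  typeof input === 'string' ? input : input instanceof URL ? input.href : input.url;

window.fetch = async (input: RequestInfo | URL, init?: RequestInit) => {
  const start = Date.now();
  try {
    const response = await originalFetch(input, init);
    report({ apiUrl: toUrl(input), status: response.status, duration: Date.now() - start });
    return response;
  } catch (err) {
    // Network error: the request never produced an HTTP response.
    report({ apiUrl: toUrl(input), status: 0, error: String(err), duration: Date.now() - start });
    throw err;
  }
};

function report(data: Record<string, unknown>): void {
  // Placeholder: ship the data to the monitoring platform, e.g. via sendBeacon.
  navigator.sendBeacon('/monitor/report', JSON.stringify(data));
}
```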

Report the intercepted information to the data platform. The following metrics need to be configured:

  1. Network success rate: reflects network health; watch the XHR HTTP status
  2. Service success rate: reflects service health; when the XHR status is 200, watch the business code in the response (a sketch of both metrics follows)
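
A sketch of the two metrics, assuming each report carries the HTTP status and, when present, the business code from the response body (0 = success is an assumed convention):

```typescript
// Illustrative metric definitions over a batch of request reports.
interface RequestReport { status: number; code?: number; }

const networkSuccessRate = (reports: RequestReport[]) =>
  reports.filter(r => r.status === 200).length / reports.length;

const serviceSuccessRate = (reports: RequestReport[]) => {
  const httpOk = reports.filter(r => r.status === 200);
  return httpOk.filter(r => r.code === 0).length / httpOk.length;
};
```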

Exception

Listen for the error event and report non-Promise errors; listen for the unhandledrejection event and report Promise errors
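
A minimal sketch of the two listeners, reusing the placeholder report() helper from the earlier sketch:

```typescript
// Capture runtime (non-Promise) errors.
window.addEventListener('error', (event) => {
  report({
    errorType: 'runtime',
    errorMessage: event.message,
    pageUrl: location.href,
    time: Date.now(),
  });
});

// Capture unhandled Promise rejections.
window.addEventListener('unhandledrejection', (event) => {
  report({
    errorType: 'unhandledrejection',
    errorMessage: String(event.reason),
    pageUrl: location.href,
    time: Date.now(),
  });
});
```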

Business

Sort out the business processes, find the key nodes of each one, such as the exposure of the result page in a core flow, and send a report to the data platform when they are hit
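
A sketch of a business-node report; the node name is a made-up example:

```typescript
// Report a business key node, e.g. when the result page of a core flow is exposed.
function reportBusinessNode(node: string): void {
  report({
    errorType: 'business',
    node,                        // e.g. 'order_result_exposed' (hypothetical name)
    pageUrl: location.href,
    time: Date.now(),
  });
}

// Call it when the result page of the order flow becomes visible.
reportBusinessNode('order_result_exposed');
```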

Alarms

Two principles

  1. Better too many than too few: alarms for core processes can be redundant, but they must never be missing (even if the back end already has an alarm, add one at the front end as well; I have seen a serious online fault where the back-end alarm happened to be broken at exactly the wrong moment).
  2. Core first: with limited people and time, cover the core processes first, accumulate experience with data reporting and alarm configuration, form best practices, and then gradually expand the scope

Alarm configuration

  1. Request: static resource success rate, network success rate of core-flow interfaces, service success rate of core-flow interfaces, network errors, timeouts, XHR status other than 200, and XHR 200 with an error business code
  2. Exception: unhandledrejection and non-unhandledrejection errors
  3. Business: exposure of the home page, key components, and result pages of each core process

Alarm rules

  1. Dimensions: alarm rules are split into dimensions around the core business; common dimensions are page, exception name, and exception level
  2. Indicators: commonly used indicators are error count and continuous fluctuation percentage; set indicator thresholds based on how the data fluctuates with and without faults

Alarm levels and channels

  1. Levels: from high to low, P0, P1, and P2. Each level corresponds to the urgency of the fault, how fast the notification is delivered, the expected developer response and handling time, and the acceptable rate of false positives
  2. Channels: IM app, SMS, and phone call

Take P1 as an example: a core path fluctuates, notification goes out via IM app and SMS, it must be handled immediately, and zero false positives are expected

A concrete P1 rule: JS exception, Error level, excluding XX pages, raw error count triggered 3 times in the last 5 minutes, notified via IM app and SMS
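
The same P1 rule written as a configuration sketch; the schema is an assumption, not any particular platform's API:

```typescript
// Illustrative alarm-rule configuration mirroring the P1 example above.
const p1JsExceptionRule = {
  dimensions: { errorType: 'runtime', errorLevel: 'error', excludePages: ['XX'] },
  indicator: 'error_count',      // raw error count (alternative: fluctuation percentage)
  threshold: 3,                  // fire when the count reaches 3...
  window: '5m',                  // ...within a 5-minute window
  level: 'P1',
  notify: ['im', 'sms'],         // IM app and SMS
};
```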

Fault drills

Write self-test cases for the goal and use Puppeteer to run fault drills, testing both the effectiveness of the monitoring alarms and the developers' fault-handling awareness (a Puppeteer sketch follows the list below)

  1. Abnormal request reporting: intercept the request or override its return value
  2. JS runtime exception reporting: inject a JS failure
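
A minimal Puppeteer sketch of the first drill: abort a core interface request and check that the corresponding report and alarm fire. The page address and URL pattern are assumptions.

```typescript
// Fault drill sketch: simulate a network failure on a core interface with Puppeteer.
import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (request.url().includes('/api/order/submit')) {
      request.abort();        // the core interface fails at the network level
    } else {
      request.continue();
    }
  });

  await page.goto('https://example.com/order');   // hypothetical page of the core flow
  // ...drive the flow to the point where the intercepted request is sent,
  // then confirm the monitoring report goes out and the alarm is triggered.

  await browser.close();
})();
```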