Background

My recent work has mainly involved the school open platform (kaiping). In short, kaiping is a platform that provides in-app capabilities to third-party applications (similar to WeChat or Feishu mini programs). To let third-party applications integrate with kaiping stably and reliably, we need to provide some basic underlying capabilities, and application monitoring is an indispensable part of that. How to monitor and manage third-party applications in kaiping is still at the preliminary research stage, so I studied the background knowledge of front-end monitoring. Since our company already has a very mature APM platform, many of the theories and code snippets below refer to our APM Web SDK source code.

The monitoring process

  1. Data collection: determine which indicators need to be collected and how to collect them.
  1. Data reporting: report the data collected in the previous step according to a specific strategy.
  1. Data cleaning and storage: after receiving the reported data, the server cleans and stores it.
  1. Data consumption: the data is eventually visualized as charts, tables, and other forms on monitoring platforms such as Slardar Web, which provide consumption capabilities such as monitoring and alerting.

The process above does not look complicated, but every step hides a lot of technical detail. This article mainly focuses on data collection and reporting from the front-end perspective.

Data collection

The first step of good front-end monitoring is to clarify which data is worth collecting. Broadly speaking, monitoring data in the front-end environment can be divided into environment information, exception data, and performance data:

Environment information

Collected monitoring data generally carries some common environment information, which provides additional dimensions that help users locate and resolve problems. The following figure lists some common environment information:

Exception monitoring

JS exceptions

Script Error

Before discussing how to collect JS exception information, note that if the error information is incomplete before collection, the data is useless even if it is collected. There is exactly such a scenario: when an error occurs in a script loaded from a different origin (for example, the page's JS is hosted on a CDN), the browser, for security reasons, will not expose the error details and only reports a bare "Script error.".

Therefore, if you want your page's detailed error information to be captured by the monitoring SDK, you need to add the crossorigin="anonymous" attribute to the script tags on your page, and the service hosting the script must set the CORS response header Access-Control-Allow-Origin: *. This is the first bit of preparation for JS exception monitoring.
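As a minimal illustration of this setup (the CDN URL below is a placeholder), a script can also be loaded cross-origin from code like this:

```ts
// Minimal sketch: load a cross-origin script so that window.onerror receives
// full error details instead of a bare "Script error.". The URL is a
// placeholder; the server hosting it must also respond with
// Access-Control-Allow-Origin: * (or the page's origin).
const script = document.createElement('script');
script.src = 'https://cdn.example.com/app.js'; // hypothetical CDN address
script.crossOrigin = 'anonymous';              // same as crossorigin="anonymous" in HTML
document.head.appendChild(script);
```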

Compile-time error

Common JS errors can be divided into compile-time errors and runtime errors. Compile-time errors are surfaced at the IDE or build level and generally never make it to production, so they are not monitored.

Note that a JSON.parse failure, for example, is a runtime error, not a compile-time error.

For exception monitoring we mainly focus on JS runtime errors. The handling in most scenarios is as follows:

| Error scenario | How to report |
| --- | --- |
| Scenario 1: synchronous runtime exceptions the code is aware of | wrap in try-catch and report the error manually |
| Scenario 2: runtime exceptions with no manual catch (including asynchronous errors, but not Promise exceptions) | listen via window.onerror |
| Scenario 3: Promise exceptions the code is aware of | report after catching with promise.catch |
| Scenario 4: Promise exceptions with no manual catch | listen for the unhandledrejection event on the window object |

Overall, the monitoring SDK globally captures and reports the exceptions that business code is not aware of, and usually also provides a manual reporting interface for exceptions the code catches itself.
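A minimal sketch of the two "global" scenarios in the table above (scenarios 2 and 4), assuming a hypothetical report() helper that the SDK would provide, might look like this:

```ts
// Hypothetical reporting helper; a real SDK would queue and batch this.
declare function report(payload: Record<string, unknown>): void;

// Scenario 2: runtime exceptions with no manual catch.
window.onerror = (message, source, lineno, colno, error) => {
  report({
    kind: 'js_error',
    message: String(message),
    source,
    lineno,
    colno,
    stack: error?.stack,
  });
  // Return false so the browser still prints the error to the console.
  return false;
};

// Scenario 4: Promise exceptions with no manual catch.
window.addEventListener('unhandledrejection', (event: PromiseRejectionEvent) => {
  const reason = event.reason;
  report({
    kind: 'unhandled_rejection',
    message: reason instanceof Error ? reason.message : String(reason),
    stack: reason instanceof Error ? reason.stack : undefined,
  });
});
```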

SourceMap

Assume the JS exceptions on the page have already been collected and reported. When the data is finally consumed, what you want to see is of course the original source of the error and its call stack, but the JS code that actually threw the error has been minified and transformed beyond recognition. We therefore need the SourceMap generated in the build phase to reverse-map the error and recover the context of the original error message.

Taking Sentry as an example (Slardar works similarly), the general process is as follows:

  1. The collection side collects the error information and sends it to the monitoring platform server.
  1. The build service uploads the SourceMap files to the monitoring platform server and deletes them locally after the upload; the packaged JS files do not carry a SourceMap URL at the end, which avoids leaking the SourceMap as much as possible.
  1. The server uses the source-map tool, combined with the SourceMap and the raw error information, to locate the original source code (see the sketch below).
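On the server side, the reverse parsing can be done with the source-map npm package. The sketch below assumes a hypothetical loadSourceMap() helper that fetches the previously uploaded SourceMap for a given JS file:

```ts
// Server-side sketch using the `source-map` package (0.7.x API).
import { SourceMapConsumer, RawSourceMap } from 'source-map';

// Hypothetical helper: fetch the uploaded SourceMap for the given JS file
// from wherever the monitoring platform stored it.
declare function loadSourceMap(jsUrl: string): Promise<RawSourceMap>;

async function resolveOriginalPosition(jsUrl: string, line: number, column: number) {
  const rawMap = await loadSourceMap(jsUrl);
  // `line` is 1-based and `column` is 0-based, matching the values found
  // in the reported error stack.
  return SourceMapConsumer.with(rawMap, null, (consumer) =>
    consumer.originalPositionFor({ line, column })
  );
}
```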

Static resource loading exceptions

Static resource loading exceptions can be caught in two ways:

  1. The onerror handler of the element on which the loading exception occurs.
  1. The error event triggered by a resource loading failure does not bubble, so use window.addEventListener('error', cb, true) to catch it in the capture phase.

The first method is too intrusive and not elegant; current mainstream solutions use the second one for monitoring:

Catching a static resource loading exception
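A minimal sketch of this capture-phase listener, again assuming a hypothetical report() helper, might look like this:

```ts
declare function report(payload: Record<string, unknown>): void; // hypothetical helper

// Resource loading errors do not bubble, so the listener must be registered
// in the capture phase (third argument = true).
window.addEventListener(
  'error',
  (event: Event) => {
    // Ordinary JS runtime errors arrive as ErrorEvent; skip them here.
    if (event instanceof ErrorEvent) return;
    const target = event.target;
    if (
      target instanceof HTMLScriptElement ||
      target instanceof HTMLImageElement ||
      target instanceof HTMLLinkElement
    ) {
      report({
        kind: 'resource_error',
        tag: target.tagName.toLowerCase(),
        url: 'src' in target ? target.src : target.href,
      });
    }
  },
  true
);
```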

APM platforms usually also record the loading details of all static resources. The principle is to collect the basic loading information of static resources through the PerformanceResourceTiming API, which will not be expanded on here.

Request exceptions

AJAX or Fetch requests in the business may behave unstably in different network or client environments, and such instability is hard to reproduce or perceive through local testing, so HTTP requests need to be monitored online: error logs are collected by reporting HTTP request exceptions, and then analyzed and alerted on.

A request exception usually means that the HTTP request failed, or that the status code it returned is not 2xx.

So how do we monitor request exceptions? A common approach is to rewrite the native XMLHttpRequest object and fetch method, and implement status-code listening and error reporting in the proxy:

Overriding the XMLHttpRequest object
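A simplified sketch of the XHR proxy (a real SDK records many more fields) could be:

```ts
declare function report(payload: Record<string, unknown>): void; // hypothetical helper

interface MonitoredXHR extends XMLHttpRequest {
  __monitor?: { method: string; url: string; start: number };
}

const originalOpen = XMLHttpRequest.prototype.open;
const originalSend = XMLHttpRequest.prototype.send;

XMLHttpRequest.prototype.open = function (this: MonitoredXHR, ...args: any[]) {
  // Remember the method and URL for later reporting.
  this.__monitor = { method: String(args[0]), url: String(args[1]), start: 0 };
  return originalOpen.apply(this, args as any);
};

XMLHttpRequest.prototype.send = function (this: MonitoredXHR, ...args: any[]) {
  if (this.__monitor) this.__monitor.start = performance.now();
  // `loadend` fires for success, HTTP errors, network errors and aborts alike.
  this.addEventListener('loadend', () => {
    const meta = this.__monitor;
    if (!meta) return;
    const duration = performance.now() - meta.start;
    const failed = this.status === 0 || this.status >= 400;
    report({ kind: failed ? 'xhr_error' : 'xhr_ok', ...meta, status: this.status, duration });
  });
  return originalSend.apply(this, args as any);
};
```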

Rewriting the fetch method
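And a corresponding sketch for fetch:

```ts
declare function report(payload: Record<string, unknown>): void; // hypothetical helper

const originalFetch = window.fetch;

window.fetch = async (input: RequestInfo | URL, init?: RequestInit): Promise<Response> => {
  const url =
    typeof input === 'string' ? input : input instanceof URL ? input.href : input.url;
  const start = performance.now();
  try {
    const response = await originalFetch.call(window, input, init);
    const duration = performance.now() - start;
    report({
      kind: response.ok ? 'fetch_ok' : 'fetch_error',
      url,
      status: response.status,
      duration,
    });
    return response;
  } catch (err) {
    // Network failures reject instead of returning a Response.
    report({ kind: 'fetch_error', url, error: String(err), duration: performance.now() - start });
    throw err;
  }
};
```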

Of course, after rewriting these methods we can monitor not only abnormal requests but also naturally collect the status of normal responses. For example, Slardar analyzes the duration of all reported requests to compute the proportion of slow requests:

PS: if the monitoring data itself is reported through XHR or fetch, the reporting requests will also be intercepted, so a layer of filtering can optionally be applied.

Stutter exceptions

Stutter means the screen fails to have the next frame ready when the monitor refreshes, so the same frame is displayed several times in a row and the user feels the page is not smooth; this is also known as frame dropping. The common measure of whether a page stutters is FPS.

How to get the FPS

An FPS indicator is shown in the Rendering panel of Chrome DevTools, but there is currently no corresponding API in the browser standard, so it has to be computed manually. This is usually simulated with requestAnimationFrame: the browser executes the rAF callback before the next repaint, so counting the number of rAF executions per second gives the FPS of the current page.

Calculating FPS with rAF
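A minimal sketch of the rAF-based FPS counter, reporting one sample per second through a hypothetical report() helper:

```ts
declare function report(payload: Record<string, unknown>): void; // hypothetical helper

let frames = 0;
let lastTime = performance.now();

function tick(now: number): void {
  frames += 1;
  // Roughly once per second, convert the frame count into an FPS value.
  if (now - lastTime >= 1000) {
    const fps = Math.round((frames * 1000) / (now - lastTime));
    report({ kind: 'fps_sample', fps });
    frames = 0;
    lastTime = now;
  }
  requestAnimationFrame(tick);
}

requestAnimationFrame(tick);
```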

How to report "real stutter"

Technically, an FPS lower than 60 could be considered a stutter. But in real environments many user behaviors cause FPS fluctuations, and blindly reporting every case with an FPS below 60 would produce a lot of invalid data. So "real stutter" needs to be redefined based on actual user experience. Here is the reporting strategy of the company's APM platform:

  1. Page FPS consistently lower than expected: the FPS of the current page stays below 20 for 3 consecutive seconds.
  1. Stutter caused by user interaction: after a user interaction, rendering the next frame takes more than 16ms + 100ms.

Crash exceptions

A web page crash means the page stops responding entirely while it is running. Two situations usually cause it:

  1. An infinite loop in the JS main thread triggers the browser's protection policy, which ends the current page process.
  1. Out of memory.

When a crash happens the main thread is blocked, so crash monitoring can only be done in a Worker thread that is independent of the JS main thread. We can use Web Worker heartbeat detection to continuously probe the main thread: if the main thread has crashed, it stops responding, and the crash exception can then be reported from the Worker thread. Here is Slardar's detection strategy:

  • JS main thread:
    • sends a heartbeat to the Web Worker at a fixed interval (2s).
  • Web Worker:
    • checks for the heartbeat periodically (2s);
    • if no heartbeat message is received within a certain period (6s), the page is considered to have crashed;
    • when a crash is detected, reports the exception through an HTTP request.

Crash detection
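A minimal sketch of this heartbeat scheme, with the Worker built from an inline Blob and a placeholder reporting endpoint:

```ts
// The Worker tracks the last heartbeat; 6s of silence is treated as a crash.
const workerSource = `
  let lastHeartbeat = Date.now();
  let reported = false;
  self.onmessage = () => { lastHeartbeat = Date.now(); };
  setInterval(() => {
    if (!reported && Date.now() - lastHeartbeat > 6000) {
      reported = true;
      // Placeholder endpoint; a real SDK would attach session/page info.
      fetch('https://monitor.example.com/crash', {
        method: 'POST',
        body: JSON.stringify({ kind: 'crash', ts: Date.now() }),
      });
    }
  }, 2000);
`;

const blob = new Blob([workerSource], { type: 'application/javascript' });
const worker = new Worker(URL.createObjectURL(blob));

// The JS main thread sends a heartbeat every 2s; if it is blocked or the page
// has crashed, the heartbeats stop and the Worker reports the crash.
setInterval(() => worker.postMessage('heartbeat'), 2000);
```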

Performance monitoring

Performance monitoring is not just about “page speed”. It is about measuring performance from a user experience perspective. Currently, the mainstream standard in the industry is Core Web Vitals, which is newly defined by Google:

  • Loading: LCP
  • Interactivity: FID
  • Visual Stability: CLS

You can see that indicators we were familiar with in the past, such as FP, FCP, FMP, and TTI, are not part of the latest standard. I think these indicators still have some reference value, so I will also introduce them below. (Sorry, Google 🙉)

Loading

Indicators related to loading include FP, FCP, FMP, and LCP. First, let's look at the ones we are already familiar with:

FP/FCP/FMP

A picture that’s been around for a long time

  • FP (First Paint): the first time the current page is painted. The period from when the page is requested to FP is generally regarded as the white-screen time. Simply put, FP is when pixels first start to be rendered on the screen.
  • FCP (First Contentful Paint): the first time the page paints any content, where "content" means text, images, SVG, or canvas elements.

Both of these metrics are obtained through the PerformancePaintTiming API:

Obtaining FP and FCP through the Paint Timing API
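A sketch of reading the two paint entries with a PerformanceObserver:

```ts
declare function report(payload: Record<string, unknown>): void; // hypothetical helper

new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.name is 'first-paint' (FP) or 'first-contentful-paint' (FCP).
    report({ kind: 'paint', name: entry.name, startTime: entry.startTime });
  }
}).observe({ type: 'paint', buffered: true });
```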

Let’s look at the definition of FMP and how to obtain it:

  • FMP (First Meaningful Paint): the first time meaningful content is painted, i.e. when the overall layout and text have been rendered and the user can see the main content of the page. Product teams often care about this metric.

Calculating FMP is relatively complicated because the browser does not provide a corresponding API. Let's first look at the figures below:

They show how the page renders over time:

  1. At 1.577 seconds, the page renders a search box; 60 layout objects have been added to the layout tree.
  1. At 1.760 seconds, the entire page header has been rendered, with 103 layout objects in total.
  1. At 1.907 seconds, the main content of the page has been drawn, and 261 layout objects have been added to the layout tree. From the user-experience perspective, this point in time is the FMP.

You can see that the number of layout objects is highly correlated with how complete the page looks. A commonly accepted way to compute FMP in the industry is: the paint time right after the largest layout change during page loading and rendering is the FMP of the current page.

The implementation monitors DOM changes of the whole document through a MutationObserver, computes a score for the current DOM tree in the callback, and takes the moment when the score changes most dramatically as the FMP.

As for how to score the current DOM tree, the LightHouse source code uses a weight based on the depth of the current node; see the LightHouse source for details.

curNodeScore = 1 + 0.5 * depth
domScore = sum of the scores of all child nodes
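A rough sketch of this MutationObserver-based approach follows; it only applies the depth-weighted formula above, while real implementations add many more rules (element visibility, tag filtering, and so on):

```ts
// Score the DOM tree on each mutation and remember the moment of the largest
// score increase as an FMP candidate. Walking the whole tree on every mutation
// is exactly the kind of cost discussed below.
function scoreNode(node: Element, depth: number): number {
  let score = 1 + 0.5 * depth; // the depth-weighted formula above
  for (const child of Array.from(node.children)) {
    score += scoreNode(child, depth + 1);
  }
  return score;
}

let lastScore = 0;
let maxIncrease = 0;
let fmpCandidate = 0;

const observer = new MutationObserver(() => {
  if (!document.body) return;
  const score = scoreNode(document.body, 0);
  if (score - lastScore > maxIncrease) {
    maxIncrease = score - lastScore;
    fmpCandidate = performance.now(); // largest layout change so far
  }
  lastScore = score;
});

observer.observe(document, { childList: true, subtree: true });
// After the page settles (e.g. on the load event), fmpCandidate approximates the FMP.
```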

This computation is expensive and may not be accurate, and LightHouse 6.0 has explicitly dropped FMP as a scoring item. So in concrete business scenarios it is recommended to mark the FMP point manually based on the actual situation, which is both more accurate and more efficient.

LCP

LCP (Largest Contentful Paint) is the performance metric that replaces FMP. It measures when the largest content element in the viewport becomes visible, and can be used to determine when the main content of the page has finished rendering on the screen.

Use the Largest Contentful Paint API together with a PerformanceObserver to get the LCP value:

Getting LCP
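A sketch of collecting LCP; the last candidate entry observed before the page is hidden is usually taken as the final value:

```ts
declare function report(payload: Record<string, unknown>): void; // hypothetical helper

let lcp = 0;
new PerformanceObserver((list) => {
  const entries = list.getEntries();
  const last = entries[entries.length - 1];
  if (last) lcp = last.startTime; // keep the latest (largest) candidate
}).observe({ type: 'largest-contentful-paint', buffered: true });

// Report the final LCP value when the page is hidden.
document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden' && lcp > 0) {
    report({ kind: 'lcp', value: lcp });
  }
});
```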

Interactivity

TTI

TTI (Time To Interactive) is the time from when the page starts loading until it is fully interactive. A smaller TTI means users can operate the page earlier and have a better experience.

A "fully interactive" page means:

  1. The page already displays useful content.
  1. The event handlers of the visible elements on the page have been registered.
  1. Event handlers can start executing within 50ms of the event occurring (there is no Long Task on the main thread).

TTI’s algorithm is a bit complicated. Let’s look at the steps below:

TTI schematic

Long Task: a task that blocks the main thread for 50 milliseconds or more.

  1. Starting from the FCP time, search forward for a quiet window of at least 5s. (Definition of a quiet window: there is no Long Task within the window and no more than 2 network requests in flight.)
  1. Once the quiet window is found, search backward from it for the most recent Long Task; the end time of that Long Task is the TTI.
  1. If no Long Task is found all the way back to the FCP, the FCP time is used as the TTI.

The implementation relies on the Long Tasks API and the Resource Timing API; interested readers can try implementing it themselves.
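As a starting point, Long Tasks can be collected like this (a sketch only; a full TTI computation would also track in-flight requests via the Resource Timing API):

```ts
// Collect Long Tasks (>= 50ms) as the raw material for a TTI computation.
const longTasks: { start: number; end: number }[] = [];

new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    longTasks.push({ start: entry.startTime, end: entry.startTime + entry.duration });
  }
}).observe({ entryTypes: ['longtask'] });

// A TTI implementation would scan longTasks (together with the number of
// in-flight requests) for a 5s quiet window after FCP, then take the end of
// the last Long Task before that window as the TTI.
```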

FID

FID (First Input Delay) measures the delay between the user's first interaction with the page and the moment the browser can actually start processing the event handler in response to that interaction.

It is obtained with the concise PerformanceEventTiming API: the callback fires when the user first interacts with the page (clicking a link, typing text, etc.) and gets a browser response.

Getting FID
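A sketch of collecting FID from 'first-input' entries:

```ts
declare function report(payload: Record<string, unknown>): void; // hypothetical helper

new PerformanceObserver((list) => {
  for (const entry of list.getEntries() as PerformanceEventTiming[]) {
    // FID = when the browser could start running the handler - when the input happened.
    const fid = entry.processingStart - entry.startTime;
    report({ kind: 'fid', value: fid, eventType: entry.name });
  }
}).observe({ type: 'first-input', buffered: true });
```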

As to why FID is used in the new standard rather than TTI, there may be several factors:

  • FID requires the user to actually interact with the page; only a user interaction produces a FID report, whereas TTI needs no interaction at all.
  • FID reflects the user's first impression of the page's interactivity and responsiveness; a good first impression helps the user build a good impression of the application as a whole.

Visual Stability

CLS

CLS (Cumulative Layout Shift) measures the cumulative layout shift score of every unexpected layout shift that occurs over the entire life cycle of a page. The smaller the CLS score, the more stable your page is.

As complicated as it sounds, here’s a quick explanation:

  • Unstable element: a visible element that shifts by a large amount without being caused by a user operation.
  • Layout shift score: the fraction of the viewport affected by the element's shift from its original position × the fraction of the viewport the element moved by.

For example, an element that occupies 50% of the viewport height and shifts down by 25% of the viewport affects 75% of the viewport (the union of its old and new positions), giving a score of 0.75 × 0.25 = 0.1875. That is greater than the 0.1 threshold in the standard, so the page would be considered not visually stable enough.

Use the Layout Instability API and PerformanceObserver to get the CLS:

Getting CLS
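A sketch of accumulating CLS from 'layout-shift' entries; the LayoutShift entry type is not yet in TypeScript's standard DOM typings, so a minimal interface is declared here:

```ts
declare function report(payload: Record<string, unknown>): void; // hypothetical helper

interface LayoutShiftEntry extends PerformanceEntry {
  value: number;
  hadRecentInput: boolean;
}

let cls = 0;
new PerformanceObserver((list) => {
  for (const entry of list.getEntries() as LayoutShiftEntry[]) {
    // Shifts right after user input are expected, so they are excluded.
    if (!entry.hadRecentInput) cls += entry.value;
  }
}).observe({ type: 'layout-shift', buffered: true });

// Report the accumulated score when the page is hidden.
document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') report({ kind: 'cls', value: cls });
});
```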

A few thoughts: after reading a lot of reference material, I think performance monitoring is a long-term, practice-and-business-oriented effort. The mainstream industry standards keep changing, and we still do not really know which indicators best match user experience; I also have some doubts about whether it is worth spending a lot of effort implementing indicators such as FMP and FPS that have no official API. In any case, optimizing based on the business attributes of your own pages and on user feedback is undoubtedly the better choice. (Even better: deeply understand browser rendering principles and write pages with excellent performance, putting the APM folks out of a job.)

Data reporting

Once you have all the error, performance, user behavior, and environment information, you need to consider how to report it. In theory an ordinary AJAX request works, but some reports happen when the page is being closed (unload), and those requests will be cancelled by the browser's policy. Common strategies are:

  1. Prefer navigator.sendBeacon, which was created to address exactly this problem: it sends data asynchronously to the server via HTTP POST without delaying page unload.
  1. If that API is not supported, dynamically create an img tag and pass the data by concatenating it into the URL.
  1. Fall back to synchronous XHR, which delays page unload; however, many browsers now forbid this behavior.

Slardar takes the first approach, falling back to XHR when sendBeacon is not supported, which also explains the occasional log loss.
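A sketch of the fallback chain (the endpoint is a placeholder, and the synchronous-XHR fallback is omitted since browsers increasingly forbid it):

```ts
const REPORT_URL = 'https://monitor.example.com/report'; // placeholder endpoint

function send(data: Record<string, unknown>): void {
  const body = JSON.stringify(data);

  // 1. Prefer sendBeacon: it is queued by the browser and survives page unload.
  if (navigator.sendBeacon && navigator.sendBeacon(REPORT_URL, body)) {
    return;
  }

  // 2. Fall back to an image beacon, passing the data in the query string
  //    (only suitable for small payloads).
  const img = new Image();
  img.src = `${REPORT_URL}?data=${encodeURIComponent(body)}`;
}
```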

Because monitoring data is usually of a very large magnitude, we cannot simply report each piece of data as soon as it is collected; some optimizations are needed:

  • Request aggregation: multiple pieces of data can be aggregated and reported in a single request to reduce the number of requests. For example, open any page that has Slardar integrated and inspect the batch request and its request body:

  • Sampling rate: crashes and exceptions are sampled at 100%, while a sampling rate can be set for custom logs to reduce the number of requests. The general idea is shown in the sketch below:
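A sketch of per-category sampling; the rates below are illustrative, not the platform's real configuration:

```ts
// Crashes and JS errors are always reported; custom logs are sampled.
const sampleRates: Record<string, number> = {
  crash: 1,
  js_error: 1,
  custom: 0.1, // report roughly 10% of custom logs
};

function shouldReport(kind: string): boolean {
  const rate = sampleRates[kind] ?? 1;
  return Math.random() < rate;
}

// Usage: only enqueue a log if it passes sampling.
// if (shouldReport('custom')) queue.push(log);
```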

Conclusion

The purpose of this article is to provide a relatively systematic view of front-end monitoring and help you understand what we can and need to do in this field. In addition, a better understanding of page performance and exception handling helps both when developing applications (fewer bugs, consciously writing high-performance code) and when developing your own monitoring SDKs.

How to design a monitoring SDK is not the focus of this article, and there may be other solutions for the definition and implementation details of some indicators. A complete and robust front-end monitoring SDK involves many technical details, such as which configuration options each indicator should expose, how to design the reported dimensions, and how to handle compatibility; all of these need to be polished and optimized in real business scenarios to mature.

References

Google Developer

❤️ Thank you

That is all for this sharing. I hope it helps you.


Welcome to follow the ELab team public account for more good articles from big tech companies~

We are from the front-end department of ByteDance, responsible for the front-end development of all ByteDance education products.

We focus on accumulating and sharing professional knowledge and cases around product quality, development efficiency, creativity, and cutting-edge technology, contributing experience and value to the industry, including but not limited to performance monitoring, component libraries, multi-terminal technology, Serverless, visual page building, audio and video, artificial intelligence, and product design and marketing.

Bytedance’s internal promotion code: 7EZKXME

Post links: jobs.bytedance.com/campus/posi…