Preface

The front end is all about experience, and experience is all about performance.

Most users adopt front-end performance monitoring (RUM) to verify front-end performance and quality through RUM's quality evaluation system. A series of indicators directly reflect that performance and quality, so understanding the page performance indicators is particularly important. Front-end Performance Monitoring (RUM) is Tencent Cloud's quality and performance monitoring platform for front-end pages, focused on improving user experience.

In layman's terms, a user wants to know whether a page can be reached and how fast it loads. How do you measure that? A neutral referee is needed, and RUM plays the role of that referee.

This article combines the source code of the front-end monitoring SDK, Aegis, with Google's latest page performance specifications to cover two themes:

  1. The specifications and calculation rules for key front-end page performance indicators.
  2. How to read RUM's visualization charts, and how to use the chart data to optimize a project.

What are the page performance metrics?

There are many complicated indicators in front-end monitoring, such as white screen time, first screen time, FCP, FMP, LCP, FID, TTFB and so on, which are hard for most people to keep track of. We have picked out some of the most common and useful ones to explain.

  • Network connection waterfall (TL;DR)

To explain these metrics, we need to start with the network connection waterfall graph, which anyone who knows anything about page performance has seen before.

The performance.timing property in the browser corresponds to this diagram one-to-one; we print it out alongside to compare against the data.

  • navigationStart: the Unix timestamp at which the unload of the previous document ends. If there is no previous document, this value equals fetchStart.
  • unloadEventStart: the timestamp at which the previous page's unload event fired (only if the previous page is same-origin with the current page); 0 if there is no previous page or the previous page is on a different origin.
  • unloadEventEnd: the timestamp at which the unload event callbacks of the previous page finished executing.
  • redirectStart: the time at which the first HTTP redirect starts. Non-zero only for redirects within the same origin; otherwise 0.
  • redirectEnd: the time at which the last HTTP redirect completes. Non-zero only for redirects within the same origin; otherwise 0.
  • fetchStart: the time at which the browser is ready to fetch the document with an HTTP request, before checking any local cache.
  • domainLookupStart/domainLookupEnd: the start/end time of the DNS lookup. If a local cache is used (i.e. no DNS query) or the connection is persistent, both equal fetchStart.
  • connectStart: the time at which the HTTP (TCP) connection starts or is re-established. For a persistent connection, this equals fetchStart.
  • connectEnd: the time at which the HTTP (TCP) connection is established (handshake completed). For a persistent connection, this equals fetchStart.
  • secureConnectionStart: the start time of the HTTPS handshake; 0 if the connection is not secure.
  • requestStart: the time at which the browser starts requesting the actual document (after the connection is established), including reads from the local cache.
  • responseStart: the time at which the first byte of the response is received, including reads from the local cache.
  • responseEnd: the time at which the response is fully received (last byte), including reads from the local cache.
  • domLoading: the time at which parsing of the DOM tree starts; document.readyState becomes "loading" and a readystatechange event fires.
  • domInteractive: the time at which parsing of the DOM tree completes; document.readyState becomes "interactive" and a readystatechange event fires. Note that only DOM parsing is finished at this point; in-page resources have not started loading yet.
  • domContentLoadedEventStart: after DOM parsing completes, the time at which in-page resource loading starts, just before the DOMContentLoaded event fires.
  • domContentLoadedEventEnd: after DOM parsing completes, the time at which in-page resources (e.g. JS scripts) have finished loading.
  • domComplete: the time at which the DOM tree is parsed and resources are ready; document.readyState becomes "complete" and a readystatechange event fires.
  • loadEventStart: the time at which the load event is sent to the document, i.e. when the load callbacks start executing.
  • loadEventEnd: the time at which the load event's callbacks finish executing.

According to the definitions above, we can summarize the common page-metric calculation formulas:

    getPerformanceTiming() {
      const t = performance.timing;
      const times = {};
      // Total page load time
      times.loadPage = t.loadEventEnd - t.navigationStart;
      // Time to parse the DOM tree
      times.domReady = t.domComplete - t.responseEnd;
      // Redirect time
      times.redirect = t.redirectEnd - t.redirectStart;
      // DNS lookup time
      times.lookupDomain = t.domainLookupEnd - t.domainLookupStart;
      // Time to first byte
      times.ttfb = t.responseStart - t.navigationStart;
      // Content download time
      times.request = t.responseEnd - t.requestStart;
      // Execution time of the load event callback
      times.loadEvent = t.loadEventEnd - t.loadEventStart;
      // DNS cache time
      times.appcache = t.domainLookupStart - t.fetchStart;
      // Execution time of the unload event callback
      times.unloadEvent = t.unloadEventEnd - t.unloadEventStart;
      // Time to establish the TCP connection (handshake completed)
      times.connect = t.connectEnd - t.connectStart;
      return times;
    }
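The formulas above can also be expressed as a pure helper that accepts any performance.timing-shaped object, which makes them easy to verify against mocked data. This is a sketch: computeTimings is an illustrative name, not part of any SDK. Note that performance.timing is deprecated in modern browsers in favor of the PerformanceNavigationTiming entry returned by performance.getEntriesByType('navigation').

```javascript
// A minimal, testable sketch of the formulas above. It takes any object
// shaped like performance.timing, so it also works with mocked data.
function computeTimings(t) {
  return {
    loadPage: t.loadEventEnd - t.navigationStart,          // total page load
    domReady: t.domComplete - t.responseEnd,               // DOM tree parsing
    redirect: t.redirectEnd - t.redirectStart,             // redirect time
    lookupDomain: t.domainLookupEnd - t.domainLookupStart, // DNS lookup
    ttfb: t.responseStart - t.navigationStart,             // time to first byte
    request: t.responseEnd - t.requestStart,               // content download
    loadEvent: t.loadEventEnd - t.loadEventStart,          // load callback
    appcache: t.domainLookupStart - t.fetchStart,          // cache check
    unloadEvent: t.unloadEventEnd - t.unloadEventStart,    // unload callback
    connect: t.connectEnd - t.connectStart                 // TCP handshake
  };
}
```

In a real page you would simply call computeTimings(performance.timing).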

What performance indicators does RUM use?

Below, we take an in-depth look at RUM's performance indicators through its main charts.

  • The page loads the waterfall diagram

The waterfall graph shows how a website's resources are downloaded and parsed by the engine, covering eight performance indicators such as first byte and request response. It lets us see the order of and dependencies between resources, helps pinpoint where important events happen during loading, and makes it easy for users to see how well their site performs and exactly what is slowing it down.

The calculation rules of Aegis SDK source code are as follows:

    if (!t) return;
    // Sometimes t.loadEventStart - t.domInteractive returns a large negative
    // number, so fall back to an average value in that case
    let resourceDownload = t.loadEventStart - t.domInteractive;
    if (resourceDownload < 0) resourceDownload = 1070;
    result = {
      dnsLookup: t.domainLookupEnd - t.domainLookupStart,
      tcp: t.connectEnd - t.connectStart,
      ssl: t.secureConnectionStart === 0 ? 0 : t.requestStart - t.secureConnectionStart,
      ttfb: t.responseStart - t.requestStart,
      contentDownload: t.responseEnd - t.responseStart,
      domParse: t.domInteractive - t.domLoading,
      resourceDownload
    };

Note: resourceDownload sometimes comes out as a very large negative number, so a simple compatibility fix is applied and the average of several samples is used instead. The calculation rules for the other page performance indicators can be read directly from the code. RUM also reports a "first screen time", so how does the Aegis SDK calculate this indicator?

  1. By default, the MutationObserver API is used to monitor DOM changes on the browser's document object, counting only DOM elements within the first screen. With the DOM change time as the X axis and the number of DOM changes per unit time as the Y axis, the SDK draws a curve, finds the highest point of DOM changes, and treats that moment as the first-screen completion time.
  2. If developers feel this algorithm is inaccurate and want to mark the DOM elements themselves, they can add the AEGIS-FIRST-SCREEN-TIMING attribute (e.g. <div AEGIS-FIRST-SCREEN-TIMING>) to identify an element as a key first-screen element; the SDK then considers the first screen complete as soon as that element appears. They can also add the AEGIS-IGNORE-FIRST-SCREEN-TIMING attribute to exclude an element from the first-screen calculation.
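The peak-finding step of rule 1 can be sketched as a pure function. This is a simplified illustration, not the actual Aegis implementation: estimateFirstScreen and the sample shape are assumptions, and the filtering down to first-screen elements is omitted.

```javascript
// Given (time, count) samples of DOM mutations per unit time, take the
// moment with the most DOM changes as the estimated first-screen time.
function estimateFirstScreen(samples) {
  if (!samples.length) return 0;
  let best = samples[0];
  for (const s of samples) {
    if (s.count > best.count) best = s; // keep the highest point on the curve
  }
  return best.time;
}
```

In the browser, the samples would be collected from a MutationObserver registered via observe(document, { childList: true, subtree: true, attributes: true }).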

In addition to the data above, RUM also derives the following page performance indicators from the reported data. The formulas are:

  1. First byte (TTFB) = DNS + SSL + TCP + TTFB
  2. DOM Ready = DNS + SSL + TCP + TTFB + ContentDownload + DomParse
  3. Page fully loaded = DNS + SSL + TCP + TTFB + ContentDownload + DomParse + ResourceDownload
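As a quick sanity check, the three formulas can be written over the per-phase durations of the Aegis result object shown earlier (compositeMetrics is an illustrative name, not part of the SDK):

```javascript
// Composite page metrics built from the per-phase durations of the
// waterfall (field names follow the Aegis result object shown earlier).
function compositeMetrics(p) {
  const firstByte = p.dnsLookup + p.ssl + p.tcp + p.ttfb;
  const domReady = firstByte + p.contentDownload + p.domParse;
  const fullLoad = domReady + p.resourceDownload;
  return { firstByte, domReady, fullLoad };
}
```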
  • Basic indicator of a good website – Web Vitals

The first-screen algorithm above is the Aegis SDK's own invention. Because user scenarios vary endlessly, it cannot cover them all, and not every developer accepts the algorithm. This is where Web Vitals comes in.

What is Web Vitals?

Google’s definition is the Essential metrics for a healthy site.

Why define a new set of metrics?

In the past, too many metrics were needed to measure a good website. The introduction of Web Vitals simplifies the learning curve. The site owner only needs to pay attention to the performance of Web Vitals metrics.

Currently, Google’s Web Vitals source code provides 5 metrics, which are:

  1. CLS (Cumulative Layout Shift): CLS measures the cumulative layout shift over the lifetime of a page. The score ranges from zero upward, where 0 means no shift and higher values mean larger layout shifts.
  2. FCP (First Contentful Paint): FCP measures the time from when the page starts loading until any part of its content appears on screen, including text, images (including background images), svg elements, or non-white canvas elements.
  3. FID (First Input Delay): the time between the user's first interaction with the page (clicking a link, tapping a button, etc.) and the browser's response to that interaction. The measurement targets whichever interactive element the user clicks first.
  4. LCP (Largest Contentful Paint): LCP measures the time from when the user requests the URL until the largest visible content element in the viewport is rendered. The largest element is usually an image or video, but it can also be a large block-level text element.
  5. TTFB (Time To First Byte): TTFB is the total time from sending the page request to receiving the first byte of the response, including DNS resolution, TCP connection, sending the HTTP request, and receiving the first byte of the response message.
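The thresholds Google publishes for the three core metrics can be turned into a small rating helper. This is a sketch based on the thresholds documented on web.dev at the time of writing (LCP 2.5s/4s, FID 100ms/300ms, CLS 0.1/0.25); rate and THRESHOLDS are illustrative names, not part of any RUM or web-vitals API.

```javascript
// Google's published Web Vitals thresholds: [good, poor] boundaries.
// Times are in milliseconds; CLS is unitless.
const THRESHOLDS = {
  LCP: [2500, 4000],
  FID: [100, 300],
  CLS: [0.1, 0.25]
};

// Rate a metric value as "good", "needs improvement", or "poor".
function rate(metric, value) {
  const [good, poor] = THRESHOLDS[metric];
  if (value <= good) return 'good';
  if (value <= poor) return 'needs improvement';
  return 'poor';
}
```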

RUM currently supports the three most important of these metrics: LCP, FID and CLS.

It can be seen that, unlike the previously vague definition of "first screen time", Google has given Web Vitals very clear metric definitions and officially provides algorithm support. So can we simply replace our hand-written "first screen algorithm" with LCP? Not yet: because of the compatibility limitations of the PerformanceObserver API underlying LCP, it will not fully replace the algorithm in the short term.

However, in the foreseeable future, Web Vitals will become the dominant metric in the industry, and at that point, we can move on and embrace open source algorithms.

How to analyze performance data & guide development optimization?

With RUM as a useful tool, you can use data to guide development and decision making.

After our project is connected to RUM, how to optimize the project according to the data displayed by RUM?

  • Targeted optimization

The following is an example of a targeted optimization that a team invited us to do for their project.

First of all, the developers' core demand is a fast, high-performing page, and the figure above is anything but fast. As you can see, the first screen time is 4.8s, the LCP is over 4s (rated "POOR"), and the CLS data is not good either. We started by analyzing the surface data: we compared the data of all pages under this project, examined them through the "Top Page Access" tab, and sorted them in reverse order by "first screen time".

Here we found the first problem: the developers reported pages from multiple projects under the same ID, so a few poorly performing pages dragged down the overall data. We therefore suggested that users separate reporting IDs according to code and business organization as much as possible, making it easier to pinpoint problems.

With the interference from those pages excluded, let's analyze network and regional factors.

As can be seen from the figure above, network conditions and regional differences have little influence on the data on the first screen of the page.

Going back to the waterfall diagram, you can see that the project's main bottleneck is the "resource loading" time. By analyzing how resources load on the user's page, the cause was found immediately.

The user's project uses the React framework. Without server-side rendering, the page renders only after the main JS file has loaded. However, most of the JS is packaged into a single bundle, producing a large JS file that becomes the bottleneck of page rendering. In addition, the JS files were found to be served without HTTP/2.

  • Resource loading optimization

According to the above data, we suggest users to make the following optimization:

  1. Split the bundle: package common external dependencies as a vendor chunk and load components asynchronously.
  2. Remove unnecessary packages. For example, the user imported the whole of lodash; switching to lodash-es lets webpack do tree shaking. Remove moment, which was introduced just to format one time value; drop jQuery, which was introduced just to query one element and is not worth the cost.
  3. Use webpack-bundle-analyzer to inspect the packaged code and see which packages are unneeded or can be split out.
  4. On the protocol side, fully adopt HTTP/2, merge some small static resources, and inline some small SVGs as Base64.
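As an illustration of suggestion 1, a vendor split in webpack might look like the following. Treat this as a sketch rather than a drop-in config: the exact options depend on your webpack version.

```javascript
// Hypothetical fragment of a webpack.config.js "optimization" section:
// everything under node_modules is split into a separate "vendor" chunk
// so the business bundle stays small and caches independently.
const splitChunksConfig = {
  splitChunks: {
    cacheGroups: {
      vendor: {
        test: /[\\/]node_modules[\\/]/, // match deps on any OS path style
        name: 'vendor',
        chunks: 'all'
      }
    }
  }
};
// In webpack.config.js: module.exports = { optimization: splitChunksConfig };
```

Components can then be loaded asynchronously, e.g. with React.lazy(() => import('./Sidebar')) behind a Suspense boundary.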

This simple analysis located the low points in page performance, and the user optimized accordingly. The effect was significant: soon after the full release, the data improved greatly, the first screen time dropped from 4.8s to 3.2s, and most importantly, the "resource loading" time was cut in half.

Then we found another problem: the user's "resource loading time" had dropped significantly, but why hadn't the "first screen time" dropped accordingly? Analyzing the user's page, we found that after loading, it executes a great deal of JS logic, including data reporting, user behavior collection, loading the sidebar, pop-up ads and so on. Two problems arise.

  1. The page's main process is seriously blocked, and some of the Aegis SDK's logic is affected during execution, so it actually runs later than scheduled; the reported "first screen time" is therefore later than the real one.
  2. After the first screen finishes, the page keeps loading many DOM elements, i.e. there are many DOM changes, so the first screen time calculated by the Aegis SDK is later than the real "first screen time".

We recommended that the user run all non-essential work in timers to improve page performance and user experience. Following this advice, the user greatly improved the page's "first screen time" through timers and asynchronous refactoring.
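A minimal sketch of that suggestion: run only render-critical work synchronously and push the rest into a timer. schedulePageWork and the task shape are illustrative names; in the browser, requestIdleCallback is often an even better fit than setTimeout where it is supported.

```javascript
// Run render-critical tasks synchronously; defer everything else
// (reporting, behavior collection, sidebar, ads) into a timer so it
// no longer blocks the page's main process.
function schedulePageWork(tasks) {
  const critical = tasks.filter(t => t.critical);
  const deferred = tasks.filter(t => !t.critical);
  critical.forEach(t => t.run());
  setTimeout(() => deferred.forEach(t => t.run()), 0);
  return { critical: critical.length, deferred: deferred.length };
}
```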

At this point, the "first screen time" was already half of what it was before optimization. For most users, a 50% performance improvement would be enough to stop there, but the user's other indicators were still imperfect: the CLS score remained POOR.

  • CLS index optimization

CLS reflects layout shifts on the page. A further quick analysis showed that the user has a long list that forms the page's main rendered content. The problem with this list is that the developer did not paginate it, because the amount of data is small, usually 4 to 10 items.

Without pagination, the length of the list cannot be determined when rendering begins, so the page shifts heavily when the list is rendered after the data arrives, and there are also a large number of DOM changes.

This is the core cause of the large CLS, and it also drives up the "first screen time", as does asynchronously loaded content such as the ad widgets mentioned earlier.
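For reference, here is a simplified sketch of how layout-shift entries accumulate into a CLS score: shifts caused by recent user input are excluded. The current CLS definition additionally groups shifts into session windows, which this sketch omits; accumulateCLS is an illustrative name.

```javascript
// Sum layout-shift entry values, skipping shifts that happened shortly
// after user input (those do not count toward CLS).
function accumulateCLS(entries) {
  return entries
    .filter(e => !e.hadRecentInput)
    .reduce((sum, e) => sum + e.value, 0);
}
```

In the browser, these entries come from a PerformanceObserver observing { type: 'layout-shift', buffered: true }.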

Suggestions to users are as follows:

  1. Fix the list's height up front (add pagination), use a skeleton screen to smooth the loading state, and reduce DOM changes.
  2. Give the ad widget an absolute layout so it is taken out of the document flow, reducing DOM changes.
  3. For other elements such as images, set explicit width and height attributes so the browser can reserve the visual space before the image renders into place.
  4. Implement some element changes through CSS rather than using JS to change element attributes.

After this round of optimization, the user's first screen and CLS data changed dramatically, reaching the industry's mainstream level. Finally, let's look at the overall data.

Based on the current data, there is still room for improvement in the page's performance, and further optimization requires an understanding of the Chrome Performance tools. We will cover that topic in another article, so stay tuned.

Conclusion

The above are just a few of the optimizations we have done with RUM, and the knowledge involved is familiar to front-end developers. Our aim is to use good tools and data to guide decisions and reach an optimized result. To quote Lord Kelvin: if you cannot measure it, you cannot improve it.

About the author

Li Zhen is a senior engineer at Tencent Cloud.