Background

ByteDance's online Web products have grown to a very large scale, serving hundreds of millions of users.

As the number of users grows, so does the need to measure site experience, because users compare your product with the best Web sites they use every day. Before you start optimizing, you need relevant monitoring data so that you can apply the right remedy.

Performance is key to user retention. Numerous studies have documented the relationship between site performance and business results: poor performance can cost your site users, conversions, and reputation. Error monitoring lets developers find and fix problems as early as possible. It is not realistic to rely on users to report problems: when users hit a blank screen or an interface error, most will retry a few times, lose patience, and simply close your site.

ByteDance's development team has gradually built a performance monitoring platform based on the experience monitoring needs of dozens of internal products. After continuous refinement, it has been officially released on Volcano Engine as Application Performance Monitoring Full-Link Edition. This article focuses on what kind of monitoring platform it is and which pain points it can help enterprises solve.

Product description

Application Performance Monitoring Full-Link Edition is ByteDance's enterprise-level technical service platform. It provides enterprises with APM services covering the quality and performance of application services, as well as custom event tracking.

Built on aggregated analysis of massive amounts of data, the platform helps customers discover many kinds of abnormal problems, raises alarms promptly, and assigns issues for handling. It also provides rich attribution capabilities, including but not limited to anomaly analysis, multidimensional analysis, custom reports, and single-point log query. Combined with flexible reporting, these let you follow the trend of every indicator. For details about more functions, see the function modules of each monitoring sub-service.

Product highlights

This section covers only the product-level highlights of Application Performance Monitoring Full-Link Edition. Further technical highlights and advantages are explained in detail in each functional module.

Lower integration costs: a non-invasive SDK

You only need a few lines of initialization code to integrate the SDK.

npm install @apm-insight-web/rangers-site-sdk

// Introduce the following code at the beginning of the project
import vemars from '@apm-insight-web/rangers-site-sdk/private'

vemars('config', {
  app_id: {{your appid}},
  serverDomain: {{private deployment server address}},
})

Or load it directly from a CDN through a JavaScript snippet:

<!-- Add the following code at the top of the page's <head> tag -->
<script>
(function(i,s,o,g,r,a,m){
  i["RangerSiteSDKObject"]=r;
  (i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)}),(i[r].l=1*new Date());
  (a=s.createElement(o)),(m=s.getElementsByTagName(o)[0]);
  a.async=1; a.src=g; a.crossOrigin="anonymous";
  m.parentNode.insertBefore(a,m);
  i[r].globalPreCollectError=function(){i[r]("precollect","error",arguments)};
  if(typeof i.addEventListener==="function"){
    i.addEventListener("error",i[r].globalPreCollectError,true);
    i.addEventListener("unhandledrejection",i[r].globalPreCollectError)
  }
  if("PerformanceLongTaskTiming" in i){
    var g=i[r].lt={e:[]};
    g.o=new PerformanceObserver(function(l){g.e=g.e.concat(l.getEntries())});
    g.o.observe({entryTypes:["longtask"]})
  }
})(window,document,"script",{{your CDN address}},"RangersSiteSDK");
</script>
<script>
window.RangersSiteSDK("config", {
  app_id: {{Your app_id}},
  serverDomain: {{Private deployment server address}},
});
</script>

Richer error-scene restoration capabilities

Application Performance Monitoring Full-Link Edition not only helps you find all kinds of abnormal problems with no blind spots, but also provides rich scene-restoration capabilities, including but not limited to stack traces and user interaction replay.

More flexible sampling methods to save money

Application Performance Monitoring Full-Link Edition provides sampling configuration, including sampling by function module and sampling by user, which helps you save on event volume.

A platform this complete needs a mature methodology behind it. From the very start of the platform's design, we worked out detailed technical designs and measurement standards. The following sections introduce these designs and the principles behind them.

How do you measure the Web experience

Site experience

First, in terms of site experience, Web Vitals defines the LCP, FID, and CLS metrics, which have become a mainstream standard in the industry.

Building on long experience in optimizing experience metrics, the latest core experience metrics focus on loading, interactivity, and visual stability. Loading speed determines how soon users see visible content, interaction speed determines how soon users perceive the page as responsive, and visual stability measures the negative impact of layout jitter on users.

This results in the following three indicators:

Largest Contentful Paint (LCP)

Largest Contentful Paint measures loading performance. This metric reports the time at which the largest image or block of text visible in the viewport is rendered. To provide a good user experience, LCP should be within 2.5 seconds.

First Input Delay (FID)

First Input Delay measures interactivity. FID measures the time from when a user first interacts with a page (for example, when they click a link, click a button, or use a custom JavaScript-driven control) to the time when the browser can actually begin to respond to that interaction. To provide a good user experience, sites should strive to keep FID within 100 milliseconds.

Cumulative Layout Shift (CLS)

Cumulative Layout Shift measures visual stability. CLS measures the largest burst of layout shift scores across all unexpected layout shifts that occur over the lifespan of a page. To provide a good user experience, sites should strive for a CLS score of 0.1 or lower.
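The three targets above can be expressed as a simple check. The following is a minimal sketch, not part of any SDK: the helper name `meetsGoodThresholds` and the shape of the input object are assumptions made for illustration, while the limits are the ones quoted in the text.

```javascript
// Targets quoted in the text for a good user experience.
const GOOD_THRESHOLDS = {
  lcp: 2500, // Largest Contentful Paint, in milliseconds
  fid: 100,  // First Input Delay, in milliseconds
  cls: 0.1,  // Cumulative Layout Shift, unitless score
};

// Hypothetical helper: reports, per metric, whether a measured value
// is within its target.
function meetsGoodThresholds(sample) {
  const result = {};
  for (const name of Object.keys(GOOD_THRESHOLDS)) {
    result[name] = sample[name] <= GOOD_THRESHOLDS[name];
  }
  return result;
}

console.log(meetsGoodThresholds({ lcp: 1800, fid: 240, cls: 0.05 }));
// { lcp: true, fid: false, cls: true }
```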

Error monitoring

From the perspective of error monitoring: once a page reaches hundreds of millions of visits, no matter how many rounds of unit tests, integration tests, and manual tests are run before release, some edge-case paths will inevitably go untested, and there will be occasional, hard-to-reproduce failures that seem to defy explanation. Even if such errors occur in only 0.1% of page views, a site with hundreds of millions of page views will see hundreds of thousands of failures.

This is where a well-developed error monitoring system comes in handy.

We provide macro indicators for JavaScript errors, static resource errors, and request errors, such as error count, error rate, number of affected users, and proportion of affected users. By keeping a close watch on the remaining errors and their impact on users, we help developers fix problems as soon as possible.
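To make the four macro indicators concrete, here is a small sketch of how they could be derived from raw records. The input shape (one record per page view with a `userId` and an `errors` count) and the definition of error rate as the share of page views with at least one error are assumptions for illustration; the platform's exact definitions are not given in this article.

```javascript
// Hypothetical input: one record per page view, carrying the user who made
// it and the number of errors captured during that view.
function summarizeErrors(pageViews) {
  const allUsers = new Set();
  const affectedUsers = new Set();
  let errorCount = 0;
  let viewsWithError = 0;

  for (const view of pageViews) {
    allUsers.add(view.userId);
    errorCount += view.errors;
    if (view.errors > 0) {
      viewsWithError += 1;
      affectedUsers.add(view.userId);
    }
  }

  return {
    errorCount,
    // Assumption: error rate = share of page views with at least one error.
    errorRate: pageViews.length ? viewsWithError / pageViews.length : 0,
    affectedUsers: affectedUsers.size,
    affectedUserRatio: allUsers.size ? affectedUsers.size / allUsers.size : 0,
  };
}

console.log(
  summarizeErrors([
    { userId: "u1", errors: 0 },
    { userId: "u1", errors: 2 },
    { userId: "u2", errors: 0 },
    { userId: "u3", errors: 1 },
  ])
);
```

In this sample, three errors occur across two of four page views, and two of three users are affected.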

For request monitoring, to further safeguard the user's experience of fetching data, we additionally refine the indicators related to request success rate and slow requests.

SDK collection

With these metrics in hand, let's take a look at how the SDK collects them.

What indicators need to be collected?

  • RUM (Real User Monitoring) indicators, including FP, TTI, FCP, FMP, FID, and MPFID.
  • Navigation Timing indicators for all stages, including DNS, TCP, and DOM parsing.
  • JS errors, which can be broken down into runtime exceptions and static resource exceptions.
  • Request status codes. Once collected and reported, they can be used to analyze request exceptions and related information.

How are these indicators collected?

The collection of RUM indicators relies mainly on the Event Timing API.

Take the FID indicator as an example. First create a PerformanceObserver and listen for first-input entries. When a first-input entry arrives, subtract the time the event occurred (startTime) from the time the browser started processing it (processingStart), both provided by the Event Timing API. The difference is the FID.

// Create the Performance Observer instance.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const FID = entry.processingStart - entry.startTime;
    console.log("FID:", FID);
  }
});

// Start observing first-input entries.
observer.observe({
  type: "first-input",
  buffered: true,
});

Navigation Timing metrics can be obtained via the PerformanceTiming interface. Taking the load time calculation as an example:

function onLoad() {
  var now = new Date().getTime();

  var page_load_time = now - performance.timing.navigationStart;

  console.log("User-perceived page loading time: " + page_load_time);
}
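The per-stage indicators mentioned earlier (DNS, TCP, DOM parsing) can be derived from the same interface by subtracting pairs of timestamps. The sketch below uses real PerformanceTiming field names, but the returned property names and the exact stage boundaries (for example, DOM parsing as domInteractive minus domLoading) are one common convention assumed here, not necessarily the platform's definitions.

```javascript
// Derive per-stage durations (ms) from a PerformanceTiming-like object.
function stageDurations(t) {
  return {
    dns: t.domainLookupEnd - t.domainLookupStart,
    tcp: t.connectEnd - t.connectStart,
    request: t.responseStart - t.requestStart,
    response: t.responseEnd - t.responseStart,
    domParse: t.domInteractive - t.domLoading,
    load: t.loadEventStart - t.navigationStart,
  };
}

// In a browser: stageDurations(performance.timing) after the load event.
// Sample values are used here so the sketch also runs outside a browser.
const sampleTiming = {
  navigationStart: 1000,
  domainLookupStart: 1005,
  domainLookupEnd: 1025,
  connectStart: 1025,
  connectEnd: 1060,
  requestStart: 1062,
  responseStart: 1150,
  responseEnd: 1190,
  domLoading: 1160,
  domInteractive: 1400,
  loadEventStart: 1900,
};

console.log(stageDurations(sampleTiming));
// { dns: 20, tcp: 35, request: 88, response: 40, domParse: 240, load: 900 }
```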

JavaScript runtime errors can be captured through the window.onerror callback:

window.onerror = function (message, source, lineno, colno, error) {
  // Construct the exception data format and report it
};

Listen for unhandled Promise rejections (async errors) via the unhandledrejection event:

window.addEventListener("unhandledrejection", (event) => {
  // Construct the exception data format and report it
});
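Static resource exceptions need slightly different handling: a failed image, script, or stylesheet fires an error event on the element itself, and that event does not bubble, so it is only visible to a listener registered in the capture phase. The sketch below shows one way to tell the two kinds apart. The function name `classifyErrorEvent`, the output shape, and the `report` function in the usage comment are hypothetical.

```javascript
// Distinguish a static resource load failure from a JavaScript runtime error.
// For resource errors the event target is the failing element, not window.
function classifyErrorEvent(event, win) {
  const target = event.target;
  if (target && target !== win && (target.src || target.href)) {
    return {
      kind: "resource",
      tag: String(target.tagName || "").toLowerCase(),
      url: target.src || target.href,
    };
  }
  return {
    kind: "runtime",
    message: event.message,
    source: event.filename,
    lineno: event.lineno,
    colno: event.colno,
  };
}

// Browser usage (report is a hypothetical upload function):
//   window.addEventListener("error", (e) => report(classifyErrorEvent(e, window)), true);

// Stand-in objects so the sketch also runs outside a browser.
const fakeWindow = {};
console.log(
  classifyErrorEvent({ target: { tagName: "IMG", src: "https://example.com/a.png" } }, fakeWindow)
);
console.log(
  classifyErrorEvent(
    { target: fakeWindow, message: "x is not defined", filename: "app.js", lineno: 10, colno: 5 },
    fakeWindow
  )
);
```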

Request status codes can be captured by overriding window.fetch and the XMLHttpRequest object. Taking the overridden fetch as an example, here is simplified code:

const _fetch = window.fetch;

window.fetch = (req: RequestInfo, options: RequestInit = {}) => {
  // Omit some logic...

  return _fetch(req, options).then(
    // success: a response was received, res.status holds the status code
    (res) => {
      // Report the completed request
      return res;
    },
    // fail: the request itself failed (for example, a network error)
    (res) => {
      // Report the failed request information
      return Promise.reject(res);
    }
  );
};
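The XMLHttpRequest side can be handled in a similar way by wrapping `open` and `send` on the prototype. This is a minimal sketch under assumptions, not the SDK's actual implementation: `patchXHR`, `report`, and the `_monitor` property are hypothetical names, and `FakeXHR` exists only so the example runs outside a browser.

```javascript
// Wrap an XMLHttpRequest-like class so that every finished request is
// passed to `report`.
function patchXHR(XHRClass, report) {
  const originalOpen = XHRClass.prototype.open;
  const originalSend = XHRClass.prototype.send;

  XHRClass.prototype.open = function (method, url, ...rest) {
    // Remember the request details for the report.
    this._monitor = { method, url };
    return originalOpen.call(this, method, url, ...rest);
  };

  XHRClass.prototype.send = function (...args) {
    const startedAt = Date.now();
    // "loadend" fires after success, error, abort, and timeout alike.
    this.addEventListener("loadend", () => {
      report({
        ...this._monitor,
        status: this.status, // 0 when no response was received
        duration: Date.now() - startedAt,
      });
    });
    return originalSend.apply(this, args);
  };
}

// Minimal stand-in for XMLHttpRequest that "responds" synchronously.
class FakeXHR {
  constructor() {
    this.status = 0;
    this.listeners = [];
  }
  open() {}
  addEventListener(type, fn) {
    if (type === "loadend") this.listeners.push(fn);
  }
  send() {
    this.status = 404;
    this.listeners.forEach((fn) => fn());
  }
}

const reports = [];
patchXHR(FakeXHR, (entry) => reports.push(entry));

const xhr = new FakeXHR();
xhr.open("GET", "/api/items");
xhr.send();
console.log(reports[0].method, reports[0].url, reports[0].status); // GET /api/items 404
```

In a browser, the call would be `patchXHR(XMLHttpRequest, report)`, made once before any requests are issued.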

Server-side processing

After the SDK data is collected, it is sent to the server for collection, cleaning and storage.

After receiving the data, the server runs cleaning tasks on it, such as real-time dimension enrichment and stack trace restoration. The data is then written to different types of storage according to the platform features it serves:


  • Data collection layer: a stateless API service with light logic. It only authenticates and unpacks the data reported by the SDK, and then writes the data into the Kafka message queue for consumption by the data cleaning layer.

  • Data cleaning layer: the logical center of data processing. It provides stack formatting, stack restoration (SourceMap parsing), dimension enrichment (IP -> geographic location, user-agent -> device information), and other processing. It provides the data behind the platform's multidimensional analysis, statistics, and drill-down.

  • Storage layer: the platform selects different types of storage solutions for different functional requirements, so that platform queries return in real time, within seconds.

    • OLAP: We chose ClickHouse as the storage solution for data analysis. ClickHouse's strong performance, together with targeted optimizations inside ByteDance, lets us handle hundreds of billions of records per day with second-level query latency.
    • KV: ByteDance's self-developed high-performance KV store holds the data index information, with the details stored in HDFS. Together they implement detail-tracking features such as the platform's single-point query.
    • ES: For customized log analysis and search scenarios, Elasticsearch is used to implement flexible log search and analysis.
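As an illustration of the dimension enrichment step in the cleaning layer, the sketch below derives coarse device information from a user-agent string. It is deliberately simplified and is not the platform's parser: the function names and output fields are assumptions, and a production parser handles far more cases.

```javascript
// Hypothetical cleaning step: map a user-agent string to coarse dimensions.
function parseUserAgent(ua) {
  let os = "unknown";
  // Order matters: Android user agents contain "Linux",
  // and iOS user agents contain "Mac OS X".
  if (/Windows NT/.test(ua)) os = "Windows";
  else if (/Android/.test(ua)) os = "Android";
  else if (/iPhone|iPad|iPod/.test(ua)) os = "iOS";
  else if (/Mac OS X/.test(ua)) os = "macOS";
  else if (/Linux/.test(ua)) os = "Linux";

  let browser = "unknown";
  // Order matters: Edge and Chrome user agents also contain "Safari".
  if (/Edg\//.test(ua)) browser = "Edge";
  else if (/Chrome\//.test(ua)) browser = "Chrome";
  else if (/Firefox\//.test(ua)) browser = "Firefox";
  else if (/Safari\//.test(ua)) browser = "Safari";

  return { os, browser };
}

// Replace the raw user-agent field with the derived dimensions.
function cleanEvent(rawEvent) {
  const { userAgent, ...rest } = rawEvent;
  return { ...rest, ...parseUserAgent(userAgent || "") };
}

console.log(
  cleanEvent({
    type: "js_error",
    message: "x is not defined",
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
  })
);
// { type: 'js_error', message: 'x is not defined', os: 'Windows', browser: 'Chrome' }
```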

For alarms, we implemented an abstract alarm query engine (whose bottom layer can be adapted to different data sources) that analyzes the data in real time. We support flexible alarm rule configuration and integration with various third-party notification platforms as notification channels.
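The idea of an engine that is independent of the underlying data source can be sketched as follows. Everything here is hypothetical and greatly simplified: the class name, the rule fields, and the `query(metric, windowMs)` adapter interface are assumptions, and a real adapter would query a store asynchronously rather than return values from memory.

```javascript
// Rules are evaluated against whatever a data source adapter returns,
// so the engine does not depend on a particular store.
class AlarmEngine {
  constructor(dataSource, notify) {
    this.dataSource = dataSource; // must expose query(metric, windowMs)
    this.notify = notify; // called once per triggered rule
  }

  evaluate(rules) {
    const triggered = [];
    for (const rule of rules) {
      const value = this.dataSource.query(rule.metric, rule.windowMs);
      const hit = rule.operator === ">" ? value > rule.threshold : value < rule.threshold;
      if (hit) {
        triggered.push({ rule: rule.name, value });
        this.notify(rule, value);
      }
    }
    return triggered;
  }
}

// In-memory adapter standing in for a real store.
const memorySource = {
  values: { js_error_rate: 0.03, request_success_rate: 0.999 },
  query(metric) {
    return this.values[metric];
  },
};

const sentMessages = [];
const engine = new AlarmEngine(memorySource, (rule, value) => {
  sentMessages.push(`ALARM ${rule.name}: ${rule.metric} = ${value}`);
});

const triggered = engine.evaluate([
  { name: "JS error spike", metric: "js_error_rate", operator: ">", threshold: 0.01, windowMs: 60000 },
  { name: "Request success drop", metric: "request_success_rate", operator: "<", threshold: 0.99, windowMs: 60000 },
]);
console.log(triggered); // only the "JS error spike" rule fires
```

Swapping the data source means providing another object with the same `query` method. The rules and the notification callback stay unchanged.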

At the SDK configuration level, we implemented platform-based sampling rate configuration through the SDK Setting service, which manages and controls the reported data in real time.
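A sampling decision driven by such a configuration might look like the sketch below. The config shape and function names are assumptions for illustration, not the SDK's actual format. Sampling by user is made deterministic with a simple FNV-1a style hash so that the same user always gets the same decision, while sampling by function module is applied per event.

```javascript
// Deterministic hash of a string into [0, 1) (FNV-1a style, 32-bit).
function hashToUnit(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 0x100000000;
}

// Hypothetical config shape, as it might be delivered by a settings service.
const samplingConfig = {
  userSampleRate: 0.5, // share of users that report at all
  moduleSampleRates: { jsError: 1, performance: 0.1 }, // per-module event rates
};

function shouldReport(config, userId, moduleName, random = Math.random) {
  // Sampling by user: stable for a given user ID.
  if (hashToUnit(userId) >= config.userSampleRate) return false;
  // Sampling by function module: decided per event.
  const rate = config.moduleSampleRates[moduleName] ?? 1;
  return random() < rate;
}

console.log(shouldReport(samplingConfig, "user-42", "performance"));
```

Because the user-level decision depends only on the user ID, a sampled user's session stays complete instead of losing random events.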

Visualization platform

After the collection and reporting, cleaning and storage, and statistical analysis described above, the data has to be presented to users for consumption, so the functions of the visualization platform are just as important. The following sections introduce each function of the platform in detail.

Performance analysis

The performance analysis module is divided into two sub-modules: page loading and static resource performance.

Page load monitoring tracks the performance of front-end pages during user sessions. It covers two groups of indicators: real user performance indicators and page technical performance indicators. Through page load monitoring you get a comprehensive picture of the load time users perceive and of overall page performance.

Real user performance indicators (the RUM indicators mentioned above), together with some additional indicators defined by the platform itself, include the following:

  • First Paint (FP): the point in time at which the first rendering takes place.
  • First Contentful Paint (FCP): the point in time at which content is first rendered.
  • First Meaningful Paint (FMP): the time from when the user starts loading the page to when the first screen of the page is rendered.
  • First Input Delay (FID): the delay of the user's first interaction during page loading. FID affects a user's first impression of the interactivity and responsiveness of a page.
  • Max Potential First Input Delay (MPFID): the maximum delay a user may encounter when interacting during page loading.
  • Time to Interactive (TTI): the time from when the page starts loading until it is fully interactive.
  • First load bounce rate: the rate at which users leave before the first page has fully loaded.
  • Slow start ratio: the proportion of page views (PV) that take more than 5 s to fully load.
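Of the platform-defined indicators, the slow start ratio is the simplest to compute. The sketch below follows the definition above (page views taking more than 5 s to fully load). The input shape with a `loadTimeMs` field is an assumption for illustration.

```javascript
// Threshold from the definition above: more than 5 seconds to fully load.
const SLOW_LOAD_MS = 5000;

// Share of page views whose full load exceeded the threshold.
function slowStartRatio(pageViews) {
  if (pageViews.length === 0) return 0;
  const slow = pageViews.filter((pv) => pv.loadTimeMs > SLOW_LOAD_MS).length;
  return slow / pageViews.length;
}

console.log(
  slowStartRatio([
    { loadTimeMs: 1200 },
    { loadTimeMs: 5400 },
    { loadTimeMs: 3100 },
    { loadTimeMs: 8000 },
  ])
); // 0.5
```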

The page shows the status of each of these indicators at a glance.

Page technical performance indicators

The indicators in the page technical performance section follow the definitions in the Navigation Timing specification.


The slow loading list shows the pages that load slowly so that you can optimize them accordingly. It gives the specific URLs: click a URL to open its details page and analyze where its load time goes.

In the multidimensional analysis function, you can query the dimension distribution and proportion of all session performance indicators. You can find and locate an exception through dimension analysis.
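The core of such a dimension breakdown is a group-and-count over sessions. The following sketch is illustrative only: the function name, the session fields, and the sample data are assumptions.

```javascript
// Count sessions per value of one dimension and compute each value's share.
function distributionBy(sessions, dimension) {
  const counts = new Map();
  for (const session of sessions) {
    const value = session[dimension] ?? "unknown";
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  return [...counts]
    .map(([value, count]) => ({ value, count, share: count / sessions.length }))
    .sort((a, b) => b.count - a.count);
}

// Hypothetical sample: sessions that exceeded some performance threshold.
const slowSessions = [
  { browser: "Chrome", region: "Beijing" },
  { browser: "Chrome", region: "Shanghai" },
  { browser: "Safari", region: "Beijing" },
  { browser: "Chrome", region: "Beijing" },
];

console.log(distributionBy(slowSessions, "browser"));
// [ { value: 'Chrome', count: 3, share: 0.75 }, { value: 'Safari', count: 1, share: 0.25 } ]
```

A value that dominates one dimension (here, a single browser) is a natural starting point for locating an exception.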

Static resource performance monitoring provides functions similar to those above, and additionally supports other views through waterfall charts and time distribution charts.

Exception monitoring

Exception monitoring is divided mainly into JS error monitoring, request error monitoring, and static resource error monitoring. At the macro level, each of these modules is viewed mainly through error count, error rate, number of affected users, and proportion of affected users.

JS error monitoring provides JavaScript error monitoring and analysis capabilities and also supports reporting custom errors. It is divided into an overview of dashboard-level indicators and detailed issue analysis.

Each issue supports status management and handler assignment.

You can also query the device and version information of each error event in an issue. Click the UUID or session ID to jump to single-point tracing and query the detailed logs of that user or session. The following are also available:

  • Error stack: the stack at the point where the error occurred. If a SourceMap has been uploaded, the original stack can be viewed.
  • Breadcrumb: records of user operations before and after the error. In addition to the request types automatically collected by the system, interaction event types from custom tracking points are also supported.
  • User-defined dimensions: in addition to the dimensions automatically collected by the system, user-defined dimensions can be reported.
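On the SDK side, a breadcrumb trail is commonly kept as a bounded buffer of the most recent actions, which is then attached to an error report. The sketch below illustrates that idea under assumptions: the class name, the entry shape, and the buffer limit are hypothetical, not the SDK's actual design.

```javascript
// Keep only the most recent `limit` actions so that an error report
// can carry the trail leading up to the failure.
class Breadcrumbs {
  constructor(limit = 20) {
    this.limit = limit;
    this.items = [];
  }

  add(type, data, timestamp = Date.now()) {
    this.items.push({ type, data, timestamp });
    if (this.items.length > this.limit) this.items.shift();
  }

  // Copy of the trail to attach to an error report.
  snapshot() {
    return this.items.slice();
  }
}

const trail = new Breadcrumbs(3);
trail.add("click", { selector: "#buy" });
trail.add("request", { url: "/api/order", status: 500 });
trail.add("custom", { name: "coupon_applied" });
trail.add("click", { selector: "#retry" });

console.log(trail.snapshot().map((b) => b.type)); // [ 'request', 'custom', 'click' ]
```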

The static resource error and request error modules are similar to the JS error module, providing an overview, issue management, and detailed analysis.

Single-point tracing

Single-point tracing is used to investigate problems that a specific user ran into while using the product. Currently, you can query the logs of a single user.

By entering the UUID (user ID) or Session_ID (session ID), you can query the front-end logs of a single user within a period of time. By restoring the operation path of the user, you can better locate the root cause of events.
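Conceptually, this query filters logs by user or session within a time range and orders them by time to reconstruct the operation path. The sketch below illustrates that over an in-memory array. The function name and log fields are assumptions, and the real platform serves this from its index and detail storage rather than from memory.

```javascript
// Return one user's (or one session's) logs within [from, to], oldest first.
function traceLogs(logs, { uuid, sessionId, from, to }) {
  return logs
    .filter(
      (log) =>
        (uuid === undefined || log.uuid === uuid) &&
        (sessionId === undefined || log.sessionId === sessionId) &&
        log.timestamp >= from &&
        log.timestamp <= to
    )
    .sort((a, b) => a.timestamp - b.timestamp);
}

// Hypothetical sample logs.
const sampleLogs = [
  { uuid: "u1", sessionId: "s1", timestamp: 300, type: "js_error" },
  { uuid: "u2", sessionId: "s9", timestamp: 150, type: "click" },
  { uuid: "u1", sessionId: "s1", timestamp: 100, type: "pageview" },
  { uuid: "u1", sessionId: "s2", timestamp: 900, type: "pageview" },
  { uuid: "u1", sessionId: "s1", timestamp: 200, type: "request" },
];

console.log(traceLogs(sampleLogs, { uuid: "u1", from: 0, to: 500 }).map((l) => l.type));
// [ 'pageview', 'request', 'js_error' ]
```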

Alarms

A complete alarm system is indispensable for a monitoring platform. Setting up business-specific alarm mechanisms for all kinds of data and exceptions helps you discover and solve problems as early as possible.

In the alarm task, you can create alarm policies and manage existing policies.

It is easy to connect to an alarm notification bot in Feishu, which posts to the Feishu group as soon as an alarm fires.

Of course, you can also view the alarm overview and history list in the platform.

Conclusion

Application Performance Monitoring Full-Link Edition is a self-developed application performance monitoring product built by the terminal technology team on years of experience with ByteDance's internal mobile apps such as TikTok, Toutiao, and Feishu. It also draws on practice with many external customers, such as Hupu, Jikubang, and Zhenyun Technology, to provide one-stop APM services for enterprises and developers.

Currently, Application Performance Monitoring Full-Link Edition offers new users a limited-time 30-day free trial covering App monitoring, Web monitoring, Server monitoring, and mini program monitoring. App monitoring and Web monitoring each include 5 million events. Server monitoring and mini program monitoring have no event limit during the trial period.


Product experience address: www.volcengine.com/products/ap…