Introduction:


This article describes the background, design, and implementation of a front-end monitoring system built on Sentry, together with the lessons learned and the problems encountered along the way.


Background

What is monitoring?

Monitoring literally has two parts: watching and controlling. What we watch is the code; what we control is quality. Monitoring is a tool, not a goal. We do not monitor for the sake of having a monitoring feature, but to understand how pages really behave, so that code quality can be kept under control.

Monitoring is different from statistics. Statistics focus on aggregate access data over a period of time; they do not need to be reported in real time and can be delayed or batched. Monitoring, by contrast, focuses on how the page behaves at runtime.

Across the "big front end" as a whole, monitoring covers many scenarios. For example, the server side monitors the stability and performance of its interfaces, and native clients monitor crashes and app performance. For the web front end, the focus is on runtime performance and error reporting in production.

Monitoring is useful both offline (before release) and online (in production). Offline, it can support automated testing and catch obvious runtime errors before the code goes live, acting as a safeguard against regressions. Online environments are more complex: regions, devices, networks, and browsers all differ, so the same code can behave differently in different environments. We want to understand these differences and assess where a page needs optimizing.

Why we built it

The company's existing statistics platforms, such as Thunder and Spy, can collect performance indicators and front-end exception information. However, the exception information they collect is mostly used for statistics, such as the number of errors reported or the number of errors of a given type, and it generally does not include detailed error information.

When an error occurs in production, we can only use Chrome DevTools together with the source map to see where it happened, and production errors are often hard to reproduce.

Given the two problems above (the error information collected by the statistics platforms is not detailed enough, and it cannot be traced back to a line in the source code), we chose to build a front-end exception monitoring service on Baidu Cloud based on the open source Sentry.


Design

Choosing a solution

Let's first introduce what Sentry offers.

Sentry is a centralized log management system. Compared with other systems, it does the following things well.

  1. Rich SDKs. Projects in different languages integrate through the corresponding SDK. The JS SDK, for example, not only collects detailed device information but also records the user's actions, such as which button click led to an error
  2. Alerts can be configured on the collected data, closing the monitoring loop
  3. Tag support. For example, we can intercept errors in one place, attach a CUID tag, and later filter errors by CUID in the admin console
  4. An easy-to-use admin console. Sentry ships with a management back end that is easy to get started with
  5. API support. All operations are exposed through APIs, so users can build custom interfaces on top of them
  6. An active community. The Sentry team is very active on GitHub, and issues are mostly addressed the same day
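
The tag workflow in item 3 can be modelled in a few lines of plain JavaScript. This is only a sketch of the idea (attach a CUID tag when an error is intercepted, filter by it later). The helper names `withTags` and `filterByTag` and the CUID values are hypothetical, and none of this is the Sentry SDK's own API.

```javascript
// Plain-JavaScript model of tagging and filtering; not the Sentry SDK API.

// Returns a copy of the event with the extra tags merged in.
function withTags(event, tags) {
  return { ...event, tags: { ...(event.tags || {}), ...tags } };
}

// Returns only the events whose tag `key` equals `value`.
function filterByTag(events, key, value) {
  return events.filter((e) => e.tags !== undefined && e.tags[key] === value);
}

// Usage with made-up CUID values.
const taggedEvents = [
  withTags({ message: 'x is not defined' }, { cuid: 'device-a' }),
  withTags({ message: 'Cannot read property y' }, { cuid: 'device-b' }),
  withTags({ message: 'Script error.' }, { cuid: 'device-a' }),
];
const fromDeviceA = filterByTag(taggedEvents, 'cuid', 'device-a');
```

Filtering on a single tag like this is what makes it practical to follow up on a report from one specific device.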


Next, K8S.

First, a word about the popular Docker (an open source application container engine). Docker's slogan is "Build Once, Run Anywhere": an environment is built once and can then run quickly anywhere Docker is supported, which removes much of the pain of setting up deployment environments. In practice, however, applying Docker to a concrete business is hard, because orchestration, management, scheduling, and similar concerns are not easy to handle. So K8S (Kubernetes, a container-based cluster management platform) came into being, and it is now used in a large number of real projects.


Finally, Baidu Cloud.

A quick plug for the company's own cloud platform: Baidu Cloud. Baidu Cloud provides rich basic services, such as the container service CCE, the domain name resolution and traffic scheduling service ITM, and the data analysis service Sugar. Building on these services, we can quickly implement what we need.

  1. CCE provides the basic K8S container environment
  2. ITM's intelligent domain name resolution and scheduling routes users to machines in different data centers based on their region and network
  3. Sugar supports drag-and-drop building of custom reports from database data


Overall architecture

Sentry supports both source deployment and container (Docker, K8S) deployment. To address practical concerns such as stability, multi-machine and multi-data-center deployment, load balancing, and intelligent scheduling, we deployed Sentry in containers on the K8S service provided by Baidu Cloud, as shown in Figure 1.




Figure 1: Sentry Baidu Cloud deployment architecture


Figure 2 shows where Sentry sits in the overall project from a business perspective.


Figure 2: Overall project architecture diagram

Running the Sentry service in this architecture has several advantages:

  1. It saves a lot of manpower. Operating on Baidu Cloud K8S removes the labor cost of day-to-day machine operations, database maintenance, traffic scheduling, and so on;
  2. It saves a lot of machines. With K8S, machines can be scaled out or in dynamically as the business requires;
  3. It is built on an open source project with a rich community, so problems are discussed collectively and solved quickly;
  4. It covers a wide range of businesses. The core service provides data collection and analysis, while the access layer can be customized.


Implementation

With the design above sorted out, the platform was built quickly. During the trial run, however, every stage of the monitoring pipeline exposed problems to some degree. We analyzed and optimized each stage further; the main ones are described below.

Reporting

Monitoring information is reported through the SDK. On the front end, for example, Sentry provides an SDK that is meant to be loaded first in the page so that it can catch as many errors as possible.

<html>
	<head>
	<title>Monitor alarm</title>
	<script src="https://xx/sentry.js"></script>
	</head>

	<body>
	......
	</body>
</html>




In practice, I found that the bundled sentry.js is around 20 KB. Putting it at the top of the page is disastrous for page performance, because a script loaded in the head blocks rendering. So we optimized for this case by loading the SDK later.

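<html>
	<head>
	<title>Monitor alarm</title>
	<script>
		// Assumption: win and ers were not defined in the original snippet.
		// Here win is the window and ers is a hypothetical key for the error buffer.
		const win = window;
		const ers = '__ers';
		win[ers] = win[ers] || [];

		// Buffer at most 10 early errors until the SDK has loaded.
		let estack = err => {
			return win[ers].length < 10 ? win[ers].push(err) : false;
		};

		// Catch synchronous script errors.
		win.onerror = (a, b, c, d, error) => {
			estack(error);
			return true;
		};

		// Catch unhandled promise rejections.
		win.addEventListener('unhandledrejection', error => {
			error.preventDefault();
			estack(error);
			return true;
		});
	</script>
	</head>

	<body>
	<script src="https://xx/sentry.plus.js" async></script>
	</body>
</html>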




  1. Custom handlers intercept script errors and unhandled promise rejections. After the page has loaded, the buffered errors are reported together
  2. sentry.js is upgraded to sentry.plus.js, which picks up the errors already buffered and reports them in detail.

With this change to the SDK, its impact on page performance is greatly reduced, and we can use the Sentry service with more confidence.
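
The buffer-then-flush idea can be sketched in isolation as follows. This is a minimal model under stated assumptions, not the actual sentry.plus.js code: `createErrorBuffer` and the `report` callback are hypothetical names, and the cap of 10 simply mirrors the inline snippet.

```javascript
// Hypothetical sketch: a bounded buffer for early errors plus a flush step.
const MAX_BUFFERED = 10; // same cap as the inline snippet

function createErrorBuffer(limit = MAX_BUFFERED) {
  const items = [];
  return {
    // Stores the error and returns true, or returns false once the cap is hit.
    push(err) {
      if (items.length >= limit) return false;
      items.push(err);
      return true;
    },
    // Hands every buffered error to `report` once, then empties the buffer.
    flush(report) {
      const pending = items.splice(0, items.length);
      pending.forEach((err) => report(err));
      return pending.length;
    },
    get size() {
      return items.length;
    },
  };
}

// Usage: 12 early errors arrive before the SDK is ready; only 10 are kept.
const earlyErrors = createErrorBuffer();
for (let i = 0; i < 12; i++) {
  earlyErrors.push(new Error(`early error ${i}`));
}
const reportedMessages = [];
const flushedCount = earlyErrors.flush((err) => reportedMessages.push(err.message));
```

Capping the buffer keeps a page that throws in a loop from growing memory without bound while the SDK is still loading.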

Log collection

After the SDK is integrated, logs are collected through a POST request. When an error occurs while the page is running, the data is sent automatically through this interface.

This raises the question of whether sampling should be supported. Without sampling, a high-traffic page triggers log collection very often, and a large volume of logs is reported with little useful information in it. Sampling solves that problem. The SDK supports sampling itself and only needs a simple option at initialization:

Sentry.init({
	dsn: 'xxx',
	sampleRate: 0.5
});


This sample rate is the probability that an error is reported each time one is triggered at runtime. The SDK is still loaded on every page, though, and for pages that care about performance it is preferable that only a small share of traffic loads the SDK at all, so that most page views are unaffected. Our pages are currently rendered on the first screen from PHP Smarty templates, so we changed how reporting is enabled:

{%assign var="random" value="{%math equation=rand(1,100)/100%}"|string_format:"%.2f"%}
{%assign var="sample" value="0.5"|string_format:"%.2f"%}
{%if $random < $sample%}
<script type="text/javascript" src="https://xx/sentry.plus.js?v={%$smarty.now%}" crossorigin="anonymous"></script>
{%/if%}

With this change, the impact of the monitoring code on the business is further reduced, and it is easier for business pages to adopt.
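
The difference between the two sampling levels can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the production logic: all function names are hypothetical, and a simple seeded generator stands in for the random source so that runs are repeatable.

```javascript
// Small seeded pseudo-random generator returning values in [0, 1).
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// True with probability `rate`.
function sampled(rate, rng) {
  return rng() < rate;
}

// Error-level sampling: every page loads the SDK,
// and each error is reported with probability `rate`.
function simulateErrorLevel(pageViews, errorsPerPage, rate, rng) {
  let sdkLoads = 0;
  let reported = 0;
  for (let p = 0; p < pageViews; p++) {
    sdkLoads++;
    for (let e = 0; e < errorsPerPage; e++) {
      if (sampled(rate, rng)) reported++;
    }
  }
  return { sdkLoads, reported };
}

// Page-level sampling: the decision is made once per page view,
// and pages that are not sampled never load the SDK.
function simulatePageLevel(pageViews, errorsPerPage, rate, rng) {
  let sdkLoads = 0;
  let reported = 0;
  for (let p = 0; p < pageViews; p++) {
    if (!sampled(rate, rng)) continue;
    sdkLoads++;
    reported += errorsPerPage;
  }
  return { sdkLoads, reported };
}

const errorLevel = simulateErrorLevel(1000, 2, 0.5, makeRng(1));
const pageLevel = simulatePageLevel(1000, 2, 0.5, makeRng(1));
console.log(errorLevel, pageLevel);
```

With error-level sampling the SDK is loaded on all 1000 page views; with page-level sampling only the sampled share of page views pays the loading cost.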


Log storage

Sentry uses PostgreSQL. In the K8S deployment, it can be configured to use the database service provided by Baidu Cloud.

postgresql:
  enabled: false
  nameOverride: sentry-postgresql
  postgresqlDatabase: sentry
  postgresqlHost: 192.168.1.1
  postgresqlPassword: xxx123
  postgresqlPort: 3306
  postgresqlUsername: sentrydb


The database is also expensive, accounting for about 1/3 of overall resource consumption. The more services are onboarded, the more data is stored, and the 50 GB purchased at the start was used up quickly.

I analyzed which data is actually worth keeping. The raw error data, especially the details of each individual error, does not need to be stored long term. Report-style results, such as the overall error situation for a given quarter, do need to be kept, so they should be computed and saved ahead of time.

The first option is the big spender's approach: expand the database and keep paying for more capacity. This is purely a matter of budget.

The second option: from the errors already collected, compute the results that the reports need and store them separately, then clean the database periodically, for example with a script that regularly purges data older than 10 days.
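
The second option can be sketched as two small steps: summarize first, then decide what is old enough to purge. This is an illustrative model only. The event shape (`type`, `timestamp`), the function names, and the sample data are assumptions, and a real job would read from and write to the database rather than in-memory arrays.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns a label such as "2020-Q1" for a Date, using UTC.
function quarterOf(date) {
  const quarter = Math.floor(date.getUTCMonth() / 3) + 1;
  return `${date.getUTCFullYear()}-Q${quarter}`;
}

// Counts errors per quarter and per error type.
// This small summary is what would be stored separately.
function summarize(events) {
  const summary = {};
  for (const { type, timestamp } of events) {
    const key = quarterOf(new Date(timestamp));
    summary[key] = summary[key] || {};
    summary[key][type] = (summary[key][type] || 0) + 1;
  }
  return summary;
}

// Splits events into those to keep and those older than `days` days.
function splitByAge(events, nowMs, days = 10) {
  const cutoff = nowMs - days * DAY_MS;
  return {
    keep: events.filter((e) => e.timestamp >= cutoff),
    purge: events.filter((e) => e.timestamp < cutoff),
  };
}

// Usage with made-up events.
const cleanupTime = Date.UTC(2020, 2, 31);
const storedEvents = [
  { type: 'TypeError', timestamp: Date.UTC(2020, 0, 15) },
  { type: 'TypeError', timestamp: Date.UTC(2020, 2, 1) },
  { type: 'ReferenceError', timestamp: Date.UTC(2020, 2, 30) },
];
const quarterlyReport = summarize(storedEvents);
const retention = splitByAge(storedEvents, cleanupTime);
```

Running the summary before the purge is the important ordering: once the detailed rows are gone, the report can no longer be recomputed.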


Here's a catch: Sentry officially provides a command for cleaning the database:

# Enter the container
docker exec -ti XXXX /bin/bash
# Clean up
sentry cleanup --days 0


Although the data is deleted, the PG database's disk space is not released. This is because the cleanup removes rows with DELETE, and for DELETE and UPDATE operations PostgreSQL only marks the affected rows as dead without releasing disk space. Run the following command to reclaim the space.


sudo -u postgres vacuumdb -U postgres -d sentry -t nodestore_node -v -f --analyze


Report display

As mentioned earlier, Sentry provides a simple and friendly admin console. The project overview shows reports that reflect the collected data, as shown in Figure 3.



Figure 3: Dashboard display

In practice, especially when reporting to management, this display turned out not to be enough. After some investigation, we found that Baidu Cloud provides the Sugar service, which can connect directly to the database, aggregate exactly the data we want, and produce the corresponding reports. This work is still being explored, but the expected results are promising. The configuration is shown in Figure 4 and the expected effect in Figure 5.




Figure 4: Configuration content



Figure 5: Expected effect

Monitoring alarm

To close the monitoring loop, we can configure email alerts in the Sentry admin console based on different rules. The email service that Sentry officially recommends has changed over time, from the original Exim4 to Mailgun. Weighing the cost, we ultimately chose to run Exim4 in Docker. This gives us timely notification of errors; see Figure 6 for the effect.




Figure 6: Monitoring effect

These are some of the problems we ran into on the critical path in practice, shared here for reference.


Reflections

Significance of monitoring

Profit

A company exists to make a profit. On the one hand it has to control costs, and we can see QA headcount shrinking; on the other hand we need the code running in production to be more stable. What can we do? One effective answer is to strengthen monitoring.

Efficiency

Monitoring lets us find problems quickly and proactively.

The collected data can be classified and filtered, and alert notifications can then be configured on it. We set different thresholds for different monitored items; when a threshold is reached, the responsible developers are notified immediately so they can analyze and fix the problem.
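
A threshold rule of this kind can be sketched as a count over a recent time window. This is a minimal illustration, not Sentry's alert engine: the rule fields (`name`, `errorType`, `windowMs`, `threshold`, `owner`), the addresses, and the sample events are all hypothetical.

```javascript
// Returns one alert per rule whose error count in the recent window
// reaches the rule's threshold.
function evaluateRules(rules, events, nowMs) {
  const alerts = [];
  for (const rule of rules) {
    const windowStart = nowMs - rule.windowMs;
    const count = events.filter(
      (e) =>
        e.type === rule.errorType &&
        e.timestamp > windowStart &&
        e.timestamp <= nowMs
    ).length;
    if (count >= rule.threshold) {
      alerts.push({ rule: rule.name, owner: rule.owner, count });
    }
  }
  return alerts;
}

// Usage with made-up rules and events (timestamps in milliseconds).
const alertRules = [
  { name: 'js-type-errors', errorType: 'TypeError', windowMs: 60000, threshold: 3, owner: 'fe-team@example.com' },
  { name: 'api-errors', errorType: 'ApiError', windowMs: 60000, threshold: 2, owner: 'api-team@example.com' },
];
const recentEvents = [
  { type: 'TypeError', timestamp: 100000 }, // outside the window
  { type: 'TypeError', timestamp: 950000 },
  { type: 'TypeError', timestamp: 960000 },
  { type: 'TypeError', timestamp: 990000 },
  { type: 'ApiError', timestamp: 980000 },
];
const firedAlerts = evaluateRules(alertRules, recentEvents, 1000000);
```

Each alert carries the owner from its rule, which is what lets the notification go straight to the responsible developers.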


Monitoring gives a full picture of how pages are running.

The collected information includes basic user data such as the region the user visits from, the device model, and the app version. We can also collect the page's runtime performance. Performance is money: how fast a page opens affects users to some extent.


Monitoring can guide later project development.

After errors have accumulated for a while, we can produce a report summarizing the causes of problems in a given period, the errors reported, and the solutions.

Standards

From the perspective of the R&D process (as shown in the figure below), a project is not finished the moment it goes live. We need to monitor how pages run in production, review promptly based on the monitoring data, and iterate on optimizations, forming a continuous closed loop for R&D.

Taken together, the points above make monitoring a guide for future process and mechanisms, a code of conduct for developers, and one of the references that managers and senior engineers can draw on when making decisions at critical moments. Monitoring gives a more intuitive, all-round understanding of the product.

Middle platform strategy

The middle platform has recently been the company's strategic direction for technology. In my view, monitoring can be delivered as a middle platform solution, and each department can build customized extensions on top of the middle platform service.

In fact, the Sentry system described here could be offered as a middle platform service inside the company. Each department can customize the SDK for its business, and customize reports and alerts for its data needs. Sentry as a whole is a black box that exposes API services. Thanks to the convenience of K8S, a department only needs to provide the budget, apply for machines, and add them to or remove them from the K8S cluster.

There are already many monitoring tools on the market, for example FrontJS, Arms, and Yue Ying, but unfortunately most of them are paid. Company-level products generate relatively large data volumes, so the cost is significant (though investment in this area is often worthwhile), and data output is sometimes slow, so these tools often end up replaced by new products after a while. There are also some front-end monitoring platforms inside the company, but data collection, storage, filtering, display, alerting, and the other stages of the monitoring pipeline each consume manpower and money, and all of them lack the middle platform mindset.

Conclusion

Building and rolling out the whole monitoring pipeline taught me two things. First, the technology provided by cloud services brings real convenience to products, and this is surely the direction of the future. Second, technology detached from the business is unreliable: technology serves the business, and what we build must be validated in the business, so that the technology improves and the business's practical problems get solved.