

The text of the talk follows.

This is the abridged text version of the share by Neng Xiang, lecturer of the fifth session, the Front-end Monitoring System Construction special (please see the video for the complete version):

Hello everyone, I’m Zhou Nengxiang from Beibei Group. Today I’m going to share how we analyze errors and raise alarms based on error logs. I currently work on the big front-end architecture team at Beibei Group, focusing on improving developer efficiency through technical products and maintaining the client-side foundation architecture.

Overview

Allan has already talked about how we built an error monitoring platform from zero to one; I’m going to talk about the practical problems we ran into on top of that existing monitoring and the optimizations we made to address them. I’ll cover our specific optimizations from four aspects: data log cleaning, auxiliary analysis capabilities, user interface presentation, and error alarm tracking.

Problems encountered and evolution

Data cleaning

Let’s take a look at the problems we encountered and how we solved them, starting with the data cleaning phase.


First we ran into the problem that iOS error logs are hard to read. The stack in a raw iOS error log is really just a base address plus an address offset, and troubleshooting against such a stack is very painful. Worse, because code changes between application versions produce different offsets, the same error has different offsets in different versions, so there is no way to aggregate errors at all. We therefore upload a symbol table for each iOS release and run a symbolication task that converts the base address and offset into a concrete file, method, and line number. This lets developers locate the cause of an error easily, and it gives the same error the same stack across versions, laying the foundation for later error aggregation.
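As a rough illustration only (not our actual implementation), symbolication can be sketched as a lookup from an address offset into the uploaded symbol table for that app version; the types and field names below are assumptions.

```typescript
// Hypothetical shape of one entry of an uploaded symbol table; field names are assumptions.
interface SymbolEntry {
  startOffset: number; // offset range covered by this symbol
  endOffset: number;
  file: string;
  method: string;
  line: number;
}

type SymbolTable = SymbolEntry[]; // one table per app version / binary image

// Convert a raw frame ("base address + offset") into file/method/line.
function symbolicateFrame(table: SymbolTable, offset: number): string {
  const entry = table.find(e => offset >= e.startOffset && offset < e.endOffset);
  if (!entry) return `0x${offset.toString(16)} (unsymbolicated)`;
  return `${entry.method} (${entry.file}:${entry.line})`;
}
```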


Speaking of error aggregation, let’s look at the pitfalls we ran into during that stage. Our aggregation evolved from an initial hash over the entire report content, to some general optimizations that improved the aggregation rate, and finally to targeted optimizations for a few special cases.


Take an Android error log as an example. This is a real log we reported, consisting of three parts: the error name, the error description, and the error stack. The error name is the class name of the captured error; the error description describes the error in detail and usually contains the cause plus some related feature information; and the error stack is the key to troubleshooting.

The most basic aggregation method is to hash the entire reported content, compute its MD5 value, and aggregate all errors with the same hash into the same error type. Simple and crude. The problem is that the aggregation rate turns out to be very low: many instances of the same error end up aggregated as different errors.
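A minimal sketch of this naive aggregation in Node.js (the report shape and field names are illustrative, not our real schema):

```typescript
import { createHash } from "crypto";

interface ErrorReport {
  name: string;        // error name, e.g. the exception class
  description: string; // error description, including feature info
  stack: string;       // error stack
}

// Naive aggregation key: hash everything that was reported.
function naiveAggregationKey(report: ErrorReport): string {
  const raw = `${report.name}\n${report.description}\n${report.stack}`;
  return createHash("md5").update(raw).digest("hex");
}
```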


So we optimized for this situation. Here is another log for a similar error, and we found that only the error description differed. It is an index-out-of-range error, and its description contains the illegal index value at the time of the error, along with some view and page information. These feature details may be useful when analyzing a single error, but for aggregation they are noise we do not need. The parts that stay the same are the error name and the error stack, so we hash the error name and the error stack together to get the MD5 value. This effectively reduces aggregation failures caused by differing feature information in the error description.


Now that the error description is dropped, only the error name and the error stack remain. Is the aggregation perfect now? Unfortunately, no. We grabbed another error log, manually determined that it was the same error, and yet the stack was different. The root cause is that the client code runs on different systems: the app may run on Android 5, Android 6, Android 7, and so on, and the system class code differs between system versions. The method call chain of the error is exactly the same, but the line numbers inside the system classes change, so the same error produces different stacks on different system versions. The solution is to strip the line numbers from all system-class frames: find the system-class line numbers with a regular expression and remove them. We then hash the error name together with the stack that has system line numbers removed, and aggregating on this MD5 value effectively reduces aggregation failures caused by different system versions.
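To sketch this refinement (the concrete regex and the list of system package prefixes are assumptions; the real patterns depend on the stack format): hash the error name plus the stack after stripping line numbers from frames that belong to system classes.

```typescript
import { createHash } from "crypto";

// Frames from system packages (assumed prefixes; the real list depends on the app).
const SYSTEM_PREFIXES = ["android.", "java.", "androidx.", "com.android."];

function isSystemFrame(frame: string): boolean {
  return SYSTEM_PREFIXES.some(p => frame.trim().startsWith("at " + p));
}

// Remove the ":<line>" part only for system frames, e.g.
// "at android.view.View.performClick(View.java:7448)" -> "...(View.java)"
function normalizeStack(stack: string): string {
  return stack
    .split("\n")
    .map(f => (isSystemFrame(f) ? f.replace(/:\d+\)/, ")") : f))
    .join("\n");
}

function aggregationKey(name: string, stack: string): string {
  return createHash("md5").update(`${name}\n${normalizeStack(stack)}`).digest("hex");
}
```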


Does removing system line numbers completely solve aggregation across different systems? Well, no. Look at this example. Sometimes the call stack of certain methods varies between system versions not only in line numbers but also in the method chain itself, and such stacks still cannot be aggregated well. For this situation we found that if the error stack contains our business code and the business call chain is the same, we can compute the hash from the error name plus only the business-code frames in the stack, which further reduces aggregation failures across system versions. This, of course, sacrifices some aggregation accuracy: there is a small probability of aggregating errors that are not actually the same. It can be further refined by also taking the system frame closest to the business code. In practice we found this situation rarely happens, and the occasional mis-aggregation hurts our ability to find and solve problems far less than failing to aggregate identical errors does. After all, our ultimate goal is to better discover and solve online problems, so we can safely use this relatively aggressive aggregation algorithm.
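A hedged sketch of this more aggressive strategy (the business package prefix is an assumption): keep only the frames that belong to our own business code and hash them together with the error name, falling back to the full stack when no business frame is present. As the text notes, one could additionally keep the system frame closest to the business code.

```typescript
import { createHash } from "crypto";

const BUSINESS_PREFIX = "com.beibei."; // assumed business package prefix

function businessFrames(stack: string): string[] {
  return stack.split("\n").filter(f => f.includes("at " + BUSINESS_PREFIX));
}

// Fall back to the full stack when the error contains no business frame at all.
function aggressiveAggregationKey(name: string, stack: string): string {
  const frames = businessFrames(stack);
  const basis = frames.length > 0 ? frames.join("\n") : stack;
  return createHash("md5").update(`${name}\n${basis}`).digest("hex");
}
```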


There are also out-of-memory errors in the error log, on Android for example. For this kind of error, the method at the top of the stack often has no actual problem; a memory leak somewhere else simply caused this line of code to run out of memory when it executed. If we aggregated these by stack, ten OOMs could easily be aggregated into ten different errors. That is obviously not what we want and does not help subsequent investigation. So we aggregate all out-of-memory errors together by error name and investigate them by counting feature indexes of the error logs, such as which page, device model, or system version most of the errors occur on. We cannot rigidly stick to the aggregation algorithm defined earlier and aggregate these unremarkable stacks as-is.
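A short sketch of this special case (the `OutOfMemoryError` name check is an assumption about the Android error naming):

```typescript
import { createHash } from "crypto";

// OOM errors are aggregated by name only; the throwing stack is often not the real culprit.
function oomAwareKey(name: string, stack: string): string {
  const basis = name.includes("OutOfMemoryError") ? name : `${name}\n${stack}`;
  return createHash("md5").update(basis).digest("hex");
}
```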

Data persistence


We apply peak clipping when errors arrive in large numbers, so that the monitoring system does not go down under a flood of errors, right when it is needed the most, which is what we experienced with Sentry.

Our peak-clipping mechanism works like this: each run of the data-processing polling task pulls the latest 10,000 error logs from ES, and after aggregation writes at most 200 events of each error type into the concrete event database; for the rest, we simply increment the count of that error type.

The reason is that what we care about most is always the latest errors: we cannot process the backlog in chronological order, because there is a pile of unprocessed errors ahead of us. We also consider 200 concrete error events enough for analyzing a given error, so when errors arrive in large numbers we record only 200 concrete events. This relieves pressure on the database by avoiding massive reads and writes, and it lets the data-processing task finish faster so the alarm reaches the responsible person as soon as possible.
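A rough sketch of this peak-clipping write path (the storage interface and method names are placeholders, not our real schema):

```typescript
const MAX_EVENTS_PER_ISSUE = 200; // cap on stored concrete events per aggregated error

// Hypothetical storage interface standing in for the real event database.
interface IssueStore {
  countEvents(issueId: string): Promise<number>;
  insertEvent(issueId: string, event: unknown): Promise<void>;
  incrementCounter(issueId: string, by: number): Promise<void>;
}

// `batch` is the aggregated result of the latest 10,000 logs pulled from ES.
async function persistBatch(
  store: IssueStore,
  batch: { issueId: string; event: unknown }[],
): Promise<void> {
  for (const { issueId, event } of batch) {
    const stored = await store.countEvents(issueId);
    if (stored < MAX_EVENTS_PER_ISSUE) {
      await store.insertEvent(issueId, event);  // keep the full event for analysis
    } else {
      await store.incrementCounter(issueId, 1); // beyond the cap, only bump the count
    }
  }
}
```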

Auxiliary error analysis


For the front end, we record the Ajax requests the user sends on the page, click events, and console error logs. When an error occurs, we associate these behavior logs with the error log, so that when troubleshooting a front-end fault, developers can use the user’s behavior to better analyze the cause of the problem.
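A minimal browser-side sketch of this kind of behavior recording (the breadcrumb buffer and the reporting hook are assumptions, not our SDK’s actual API):

```typescript
type Breadcrumb = { type: "xhr" | "click" | "console"; detail: string; ts: number };
const breadcrumbs: Breadcrumb[] = [];

function record(type: Breadcrumb["type"], detail: string): void {
  breadcrumbs.push({ type, detail, ts: Date.now() });
  if (breadcrumbs.length > 50) breadcrumbs.shift(); // keep only the most recent actions
}

// Record clicks.
document.addEventListener("click", e => {
  record("click", (e.target as HTMLElement)?.tagName ?? "unknown");
}, true);

// Record Ajax requests by wrapping fetch (XHR can be patched similarly).
const originalFetch = window.fetch;
window.fetch = (...args: Parameters<typeof fetch>) => {
  record("xhr", String(args[0]));
  return originalFetch(...args);
};

// Record console errors.
const originalConsoleError = console.error;
console.error = (...args: unknown[]) => {
  record("console", args.map(String).join(" "));
  originalConsoleError(...args);
};

// When an error is reported, attach the recent breadcrumbs to the error log.
window.addEventListener("error", e => {
  const payload = { message: e.message, breadcrumbs: [...breadcrumbs] };
  // send `payload` to the monitoring backend here...
  console.log(payload);
});
```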


This kind of requirement matters even more when analyzing client issues. We found that when analyzing client problems, the information in the error log alone is usually not enough, because it lacks the user’s browsing path, operation behavior, and similar context. Many client errors only occur under specific trigger conditions, and the error stack alone makes them hard to reproduce. The troubleshooting experience is therefore poor: you either have to guess how the error occurred, or dig the user’s behavior logs out of a massive pile of behavior logs yourself.

So we built a log link for the client. On each cold start we generate a link id with a UUID, and all subsequent behavior logs, crash logs, and network logs carry this link id. We record some key nodes: page navigation, network environment changes, failed network requests, and certain user operations. This way we can directly associate the link id with the related behavior logs, which makes later troubleshooting much easier. For example, our client had a WEEX error whose only content was “WEEX JS content is illegal, cannot render”. We had no idea how this error occurred and could not reproduce it. By examining the concrete behavior logs, we found that before the WEEX error occurred the user had entered H5 pages. We then pulled the log links of every user who hit this error and confirmed that before the error occurred they had all visited H5 pages served over HTTP, and we also found an advertisement request. So we concluded that the error was caused by network hijacking: the injected advertisement JS contained a jump link that our Webview happened to treat as one of our WEEX pages and navigated to, and since the WEEX renderer could not recognize the ad page’s content, the error occurred. With the bugly information alone we could not reproduce or solve this problem, so the behavior log link proved a good way to analyze and resolve errors.
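A hedged sketch of attaching a per-cold-start link id to every log (the UUID source, log shape, and example events are illustrative only):

```typescript
import { randomUUID } from "crypto"; // in a browser, crypto.randomUUID() would be used instead

// Generated once per cold start; every subsequent log carries it.
const linkId = randomUUID();

type LinkedLog = { linkId: string; ts: number; kind: string; detail: string };

function logEvent(kind: string, detail: string): LinkedLog {
  return { linkId, ts: Date.now(), kind, detail };
}

// Key nodes recorded along the link: page navigation, network changes, failed requests, crashes.
logEvent("page", "enter H5 page over http");
logEvent("network", "request to ad domain");
logEvent("crash", "WEEX JS content is illegal");
```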

Data visualization


Error list: retrievable, sortable, shows the key error information.
Error details: error stack, user behavior, feature information, retrievable.
Trends: error trend, event trend.

The visualization requirements of the error monitoring platform fall into three parts. For the error list, we want multi-condition retrieval and sorting by various dimensions, so that the key error information is surfaced. For error details, we want to easily view the error stack, associate it with user behavior, confirm the characteristic information of the error, and provide some retrieval ability over the aggregated error. Finally, we want trend charts for all errors and for each aggregated error, so that we can spot any error fluctuations.

Error list page: error trend, retrieval area, error list.
Error details page: retrieval area, event trend, event information, feature information, event list.

So we grouped these functions into two pages. The error list page carries the overall error trend, the more comprehensive retrieval capabilities, and the error list. The error detail page carries relatively simple retrieval capabilities, the event trend, the specific event information, the error’s feature information, and the list of concrete events under the aggregated error.


This is our Skynet page. We put a full-height retrieval area on the left side of the page to provide more comprehensive search capabilities, and we put the error trend chart in a prominent place so developers can quickly see how errors trend over a period of time. The largest remaining part of the page is left to the error list. We provide sorting in the table header, including sorting by error count, number of affected users, and time; besides the error count within 24 hours, we also provide sorting by the error count within the selected time range, which prevents long-accumulated errors from occupying the top rows and crowding out recent errors. Finally, each row of the list basically shows the cause of the error, the URL where the error occurred, the total number of errors, and the number of affected users.


This is the details page reached from the list page. The page is roughly a three-column layout with five areas from left to right: a retrieval area, which mainly offers simple conditions such as time and version; an event list, containing each concrete error event of the aggregated error; an event trend chart showing how this kind of error occurs over time; a large central area reserved for the error details that developers care about most, including the stack, user behavior, and the event handling timeline; and finally the feature information beyond the error details, which helps developers grasp the error’s characteristics for easier troubleshooting.


The purpose of the error monitoring platform is to discover and resolve errors in time. We do not want the error handling chain to stop at discovering the problem; it should connect to the rest of the handling process. When an error is found, either through an alarm or by developers actively checking our monitoring platform, it is assigned to someone to follow up, and a requirement is created on our requirement management platform to track it. Once the error is analyzed and fixed, the code is released through the continuous delivery system, and after launch the new code is again monitored by our error monitoring platform. This forms a complete error loop that continuously monitors the quality of the group’s online code.

Monitoring alarm


Here is the more basic alarm mechanism. After the monitoring platform receives errors, if the errors keep being reported for longer than a certain time and the number of errors per minute exceeds a threshold, we trigger a general alarm; if the error has not been marked as handled and the errors per minute exceed a higher threshold, we escalate the alarm. General alarms have a low trigger threshold and are relatively easy to trigger, so we notify only by email, which reaches the developers who care most about the issue and also serves as a record for later review. After an alarm is escalated, we notify the responsible person through increasingly aggressive channels, such as DingTalk, SMS, and phone calls, to make sure the problem gets solved in time.
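A simplified sketch of the two-level alarm decision (the thresholds, minimum duration, and notification channels are placeholders, not our production values):

```typescript
interface AlarmState {
  errorsPerMinute: number;
  minutesAboveThreshold: number; // how long the error has kept being reported
  handled: boolean;              // whether someone has marked it as handled
}

const GENERAL_THRESHOLD = 10;    // assumed: low and easy to trigger
const ESCALATION_THRESHOLD = 50; // assumed: higher bar for escalation
const MIN_DURATION_MINUTES = 3;  // assumed minimum reporting duration before alarming

function decideAlarm(s: AlarmState): "none" | "general" | "escalated" {
  if (s.minutesAboveThreshold < MIN_DURATION_MINUTES) return "none";
  if (!s.handled && s.errorsPerMinute > ESCALATION_THRESHOLD) return "escalated"; // DingTalk / SMS / phone
  if (s.errorsPerMinute > GENERAL_THRESHOLD) return "general";                    // email only
  return "none";
}
```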


The logic above sounds simple: after data cleaning, raise an alarm if the number of errors in the last minute exceeds the threshold, and ignore it otherwise. But have you ever wondered what “the last minute” means here? Is it errors reported within the last minute of server time, or errors that occurred within the last minute of client time? If we count by report time, we find that client-side errors are reported cumulatively: a client with a poor network may report three days of accumulated errors in one go, which inflates the last minute’s count and triggers an alarm even though no error is currently happening; the reported errors all happened earlier. If we count by occurrence time, we find that client-side reporting is delayed, and errors that occurred in the last minute may only be reported several minutes later; counting only errors that occurred in the last minute yields far fewer than actually happened, making the alarm threshold hard to reach, and lowering the threshold hurts alarm accuracy.


How do we solve this? We can define a validity period for an alarm, for example one hour, meaning that only errors that occurred within the last hour are ones we need to care about. In each polling round, errors whose actual occurrence time is more than an hour ago are still cleaned into the database, but they are not counted as valid errors reported in the last minute. We then only check whether the number of valid errors reported in the last minute exceeds the threshold, which reduces alarms triggered by accumulated old errors we no longer care about.

For the scheme based on occurrence time, we also define an alarm validity period first, for example one hour. Each polling task distributes all errors into buckets according to the time they occurred, and then checks whether the error count of any minute within the past hour exceeds the threshold, rather than only checking whether the last minute does. If the threshold is exceeded, we raise an alarm.
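A sketch of the occurrence-time scheme: bucket errors by the minute they occurred, keep only the last hour of buckets, and alarm when any bucket exceeds the threshold (the data structures and threshold are illustrative):

```typescript
const VALIDITY_MS = 60 * 60 * 1000; // one-hour alarm validity window
const THRESHOLD = 30;               // assumed per-minute alarm threshold

// minute bucket (epoch minutes) -> error count
const buckets = new Map<number, number>();

function recordError(occurredAt: number, now: number): void {
  if (now - occurredAt > VALIDITY_MS) return; // too old: persist it, but don't count it
  const minute = Math.floor(occurredAt / 60000);
  buckets.set(minute, (buckets.get(minute) ?? 0) + 1);
}

function shouldAlarm(now: number): boolean {
  const oldestValidMinute = Math.floor((now - VALIDITY_MS) / 60000);
  for (const [minute, count] of buckets) {
    if (minute < oldestValidMinute) buckets.delete(minute); // drop expired buckets
    else if (count > THRESHOLD) return true;                // any minute in the last hour over threshold
  }
  return false;
}
```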


Here is a concrete example. First consider the scheme that counts by report time. In the last minute we pulled five errors, all reported within the past 10 seconds, but some of them actually occurred two or three days ago. Errors whose actual occurrence time is more than one hour ago are cleaned into the database but are not counted as valid errors for the last minute. The remaining two errors occurred 5 minutes ago and 20 seconds ago, both within the one-hour alarm validity window, so we record two valid errors in the last minute.


Say we receive error logs whose actual occurrence time is 8:59 at 9:00, 9:20, 9:40, and 10:20. Under the old logic, at exactly 9:00 the 8:59 log counts as having occurred within the last minute; but at 9:20 we only count errors that occurred between 9:19 and 9:20, so the 8:59 log is ignored. Under the new logic, each time we receive an 8:59 log at 9:00, 9:20, and 9:40, we add 1 to the count of the 8:59 bucket, so by 9:40 we have counted three errors at 8:59. When we receive the last 8:59 error at 10:20, it occurred more than an hour ago, so we no longer care about that point in time and do not keep accumulating or trigger an alarm.


Besides the log-time issues, we also found that some cleaning tasks are relatively slow, such as iOS symbolication, which makes the per-minute error count limited by the processing speed of the task. Suppose only 20 logs can be symbolicated in a minute and the alarm threshold is 30 errors per minute: no matter how many errors actually occur, only 20 can be counted because the cleaning task is slow, and the alarm can never fire. To solve this, we read the raw error count synchronously during the polling task, generate alarms based on that count, and notify the responsible persons; the log data cleaning is then done by asynchronous tasks that fill in the cleaned error information afterwards. This way the alarm is no longer blocked by time-consuming cleaning, and by the time developers come to look, the processed log details have been synchronized to the database for display.
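A hedged sketch of splitting the alarm count from the slow cleaning work (the queue and counter interfaces are placeholders):

```typescript
// Hypothetical shape of a raw, not-yet-symbolicated log.
interface RawLog { id: string; appVersion: string; rawStack: string }

async function pollOnce(
  fetchRawLogs: () => Promise<RawLog[]>,
  bumpAlarmCounter: (n: number) => Promise<void>,
  enqueueForCleaning: (log: RawLog) => Promise<void>,
): Promise<void> {
  const logs = await fetchRawLogs();

  // 1) Count raw errors synchronously so alarms are not blocked by slow symbolication.
  await bumpAlarmCounter(logs.length);

  // 2) Hand the slow work (e.g. iOS symbolication) to an asynchronous task queue;
  //    the cleaned details are filled in later, before developers view them.
  for (const log of logs) {
    await enqueueForCleaning(log);
  }
}
```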

Results


Our Skynet monitoring platform now supports multiple platforms: Web PC, H5, Node, mini programs, Android, iOS, and Weex. It has been adopted by 80+ online projects and has collected more than 200 million error logs to date, which greatly supports quality monitoring of the group’s online projects.

Conclusion


Finally, let’s sum up. We optimized the aggregation algorithm for more accurate aggregation, without being bound by accuracy in the purely theoretical sense; the goal is always to find and solve problems better. When errors occur in large numbers, peak clipping keeps the monitoring platform stable while retaining the information developers most need to troubleshoot and fix problems. We designed the visualizations around developers’ troubleshooting needs, so they can locate and solve problems faster. Through internal systems we formed a complete monitoring loop, so that discovered errors do not sit unowned on the monitoring platform. Finally, we optimized the alarms to balance timeliness and accuracy, so that the responsible people learn about problems sooner.

Thank you for joining us.
