01 Automated crash processing is the trend

A program crash occurs when a program hits a serious error, cannot continue executing, and exits abnormally. Crashes are among the most frequent problems encountered in quality assurance and deserve close attention: an app that frequently closes unexpectedly or crashes can drive away a large number of users.

A program can crash for many reasons. The most common are:

1. Program logic problems, such as array index out of bounds, stack overflow, and null pointer exceptions;

2. Device compatibility problems: devices and operating systems are highly diverse, especially on Android, where the number of device and system combinations may reach the thousands, so complete device compatibility is difficult to achieve;

3. Memory management errors: memory leaks in a long-running program let unreleased objects accumulate until memory is exhausted and the program crashes, or the memory required to run the program exceeds the device limit.

In real production, multiple versions are iterated continuously, so we may face a huge amount of crash data. Handling it by hand means checking each record one by one to screen out duplicates, retain the valid data, and submit bugs for tracking. This is time-consuming and inefficient, and it can itself cause serious problems. It is therefore necessary to establish a convenient, efficient, automated closed-loop process.

02 Deficiencies of common solutions

A common solution is to integrate Tencent’s Bugly SDK to catch crashes in Android or iOS apps. Bugly provides a complete crash monitoring solution: developers integrate Bugly into their mobile applications, and its backend service then displays the crashes, ANRs, and other problems that users run into, so that problems can be located and fixed quickly from the reported crash information. However, this solution does not work for desktop applications (Windows, Mac, etc.), and Bugly does not currently offer a third-party interface for retrieving lists of crash data for automated analysis. Crash information has to be processed and filtered manually, and the final result is unsatisfactory.

03 Cross-platform crash automation closed loop solution

To solve these problems, Agora developed a cross-platform (Android/iOS/Mac/Windows) crash collection and processing solution. When a crash occurs, Agora’s SDK submits the relevant information (version number, platform, compile number, crash offset address, symbol table address, DMP file link, etc.) to our backend system. The backend symbolizes the stack information against the matching symbol table, resolving each address to its symbol, and so restores a crash stack that developers can read.
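As a rough illustration of the address-to-symbol step, the sketch below resolves crash offsets against a sorted symbol table with a binary search. It is not Agora's implementation: the `Symbol` type, the `symbolize` function, and the sample function names are all hypothetical, and a real symbol table would also carry function sizes, file names, and line numbers.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    start: int  # start offset of the function inside the binary (assumed layout)
    name: str   # human-readable function name


def symbolize(offsets, symbols):
    """Map each crash offset to 'function+0xdelta' using the nearest preceding symbol."""
    table = sorted(symbols, key=lambda s: s.start)
    starts = [s.start for s in table]
    frames = []
    for off in offsets:
        i = bisect_right(starts, off) - 1  # last symbol starting at or before off
        if i < 0:
            frames.append(f"<unknown>+0x{off:x}")
        else:
            sym = table[i]
            frames.append(f"{sym.name}+0x{off - sym.start:x}")
    return frames


# Hypothetical symbol table and crash offsets, for illustration only.
symbols = [
    Symbol(0x1000, "AudioEngine::start"),
    Symbol(0x1400, "VideoEncoder::encode"),
    Symbol(0x2000, "NetTransport::send"),
]
print(symbolize([0x1410, 0x2008, 0x0800], symbols))
# ['VideoEncoder::encode+0x10', 'NetTransport::send+0x8', '<unknown>+0x800']
```

Offsets that fall before the first known symbol are kept as `<unknown>` rather than dropped, so the unparsed lines stay visible in the restored stack.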

After symbolization, the system checks whether a JIRA for the same problem has already been submitted for the current SDK version. If not, a new JIRA is created and assigned to the owner of the module that crashed: a crash in the audio module goes to the developer in charge of the audio module, a crash in the video module to the developer in charge of the video module, and a crash in the network module to the developer in charge of the network module. This removes the manual assignment step and greatly improves the efficiency of problem solving. If a JIRA has already been submitted, the crash is automatically linked to it and the crash count of that problem is updated. The figure below shows the whole process.
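The create-or-update logic described above can be sketched as follows. This is a minimal in-memory stand-in, not a real JIRA client: `InMemoryTracker`, `MODULE_OWNERS`, the `triage` fallback, and the `CRASH-n` keys are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical module-to-owner table; the owner names are placeholders.
MODULE_OWNERS = {
    "audio": "audio-owner",
    "video": "video-owner",
    "network": "network-owner",
}
DEFAULT_OWNER = "triage"  # assumed fallback when the module is not recognized


@dataclass
class Issue:
    key: str
    version: str
    signature: str
    assignee: str
    crash_count: int = 1


class InMemoryTracker:
    """Stand-in for an issue tracker: one issue per (version, signature)."""

    def __init__(self):
        self._issues = {}

    def report(self, version, signature, module):
        issue = self._issues.get((version, signature))
        if issue is not None:
            # Same problem already filed for this version: only bump the count.
            issue.crash_count += 1
            return issue
        issue = Issue(
            key=f"CRASH-{len(self._issues) + 1}",
            version=version,
            signature=signature,
            assignee=MODULE_OWNERS.get(module, DEFAULT_OWNER),
        )
        self._issues[(version, signature)] = issue
        return issue


tracker = InMemoryTracker()
first = tracker.report("3.1.0", "sig-a", "network")
again = tracker.report("3.1.0", "sig-a", "network")  # duplicate, same version
other = tracker.report("3.2.0", "sig-a", "audio")    # same signature, new version
print(first.key, first.assignee, first.crash_count)  # CRASH-1 network-owner 2
print(other.key, other.assignee, other.crash_count)  # CRASH-2 audio-owner 1
```

Keying issues on the version as well as the signature mirrors the text: the same problem in a different SDK version gets its own issue.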

How do we tell whether a problem has already been submitted for the same version? We currently use two dimensions. The first is the compile number and crash offset address: if several crash records share the same compile number and crash offset address, we classify them as the same problem and sum their occurrences when submitting the JIRA. After a period of practice, however, we found that crashes in the same version often have different compile numbers and crash offset addresses yet are caused by the same problem. So we introduced a second dimension based on the stack details: we concatenate the stack lines that could be parsed, compute the hash of the concatenated string, and decide whether the same problem has been submitted before by checking whether the hash values match. Filtering on both dimensions effectively removes duplicate JIRA submissions and makes the crash counts more accurate. Different hash generation schemes can be used to control how repeated crashes are filtered: the loosest scheme generates the hash from the class name + method name only, while a strict scheme uses the file name + module name + class name + method name + parameter name to generate the final crash hash. Here is a JIRA submitted by automatically parsing crash data in our practice:

The JIRA contains the current version number, compile number, crash offset address, crash count, system information, and so on. The analysis also extracts key information from the crash stack and places it in the JIRA description, making it easy for developers to locate the issue. Through this screening, we can aggregate tens of thousands of crash records in one version into dozens of JIRAs, which greatly improves the efficiency of handling crashes.
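The two deduplication dimensions and the loose and strict hash schemes can be sketched like this. The `Frame` fields, the dictionary layout of a crash record, and the choice of SHA-256 are assumptions for illustration; the article does not say which hash function or field separators are used in practice.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """One parsed stack line (hypothetical field layout)."""
    file: str
    module: str
    cls: str
    method: str
    params: str


def loose_key(frame):
    # Loosest scheme: class name + method name.
    return f"{frame.cls}.{frame.method}"


def strict_key(frame):
    # Strict scheme: file + module + class + method + parameter names.
    return "|".join((frame.file, frame.module, frame.cls, frame.method, frame.params))


def stack_hash(frames, key=loose_key):
    """Concatenate the parsed stack lines and hash the resulting string."""
    joined = "\n".join(key(f) for f in frames)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def same_problem(a, b, key=loose_key):
    """Dimension 1: compile number + offset. Dimension 2: stack hash."""
    if a["compile_number"] == b["compile_number"] and a["offset"] == b["offset"]:
        return True
    return stack_hash(a["frames"], key) == stack_hash(b["frames"], key)


# Two crashes from different builds whose top frame differs only by file name.
f1 = Frame("audio_engine.cc", "audio", "AudioEngine", "start", "int rate")
f2 = Frame("audio_engine_v2.cc", "audio", "AudioEngine", "start", "int rate")
a = {"compile_number": "1001", "offset": 0x1410, "frames": [f1]}
b = {"compile_number": "1002", "offset": 0x1A20, "frames": [f2]}

print(same_problem(a, b))                  # True: loose keys match
print(same_problem(a, b, key=strict_key))  # False: file names differ
```

The looser the key, the more crash records collapse into one JIRA; the strict key keeps more problems separate at the cost of more duplicates.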

04 Crash Statistics

Managing crashes on a single platform makes it easier for developers, testers, and project managers to find and tally related issues by platform, version number, JIRA status, and so on.

We can also use the hash summary to quickly see which versions share the same problem and how often it occurs, so that it can be better avoided in future development.
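One way such a hash summary might be computed is to group crash counts by stack hash and version, then keep only the hashes seen in more than one version. The function name, the tuple layout, and the sample data below are hypothetical.

```python
from collections import defaultdict


def versions_by_hash(records):
    """records: iterable of (stack_hash, version, crash_count) tuples.

    Returns {stack_hash: {version: total_count}} for hashes that appear
    in more than one version.
    """
    summary = defaultdict(dict)
    for stack_hash, version, count in records:
        per_version = summary[stack_hash]
        per_version[version] = per_version.get(version, 0) + count
    return {h: v for h, v in summary.items() if len(v) > 1}


# Hypothetical records: "h1" recurs across two versions, "h2" does not.
records = [
    ("h1", "3.0.0", 40),
    ("h1", "3.1.0", 15),
    ("h2", "3.1.0", 7),
    ("h1", "3.1.0", 5),
]
print(versions_by_hash(records))
# {'h1': {'3.0.0': 40, '3.1.0': 20}}
```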

The platform can also record crash data every day, compute the daily Top 10 unresolved JIRAs with the largest increase in crashes, and send an alarm reminding the relevant developers to follow up on the highest-priority problems.
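A minimal sketch of the daily Top 10 calculation, assuming the platform keeps a cumulative crash count per JIRA for each day and knows which JIRAs are still unresolved. The function name, the snapshot format, and the tie-breaking rule are assumptions.

```python
def top_increments(yesterday, today, unresolved, limit=10):
    """Rank unresolved issues by how much their crash count grew since yesterday.

    yesterday/today: dict mapping issue key -> cumulative crash count.
    unresolved: set of issue keys that are still open.
    """
    deltas = []
    for key, count in today.items():
        if key not in unresolved:
            continue
        delta = count - yesterday.get(key, 0)  # new issues count from zero
        if delta > 0:
            deltas.append((key, delta))
    # Largest increase first; ties broken by issue key for a stable order.
    deltas.sort(key=lambda item: (-item[1], item[0]))
    return deltas[:limit]


# Hypothetical daily snapshots.
yesterday = {"CRASH-1": 120, "CRASH-2": 40, "CRASH-3": 5}
today = {"CRASH-1": 150, "CRASH-2": 95, "CRASH-3": 5, "CRASH-4": 12}
unresolved = {"CRASH-1", "CRASH-2", "CRASH-4"}

print(top_increments(yesterday, today, unresolved))
# [('CRASH-2', 55), ('CRASH-1', 30), ('CRASH-4', 12)]
```

The returned list is what an alarm job would hand to whatever notification channel the team uses.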

Through this automation practice, we have improved the efficiency of problem solving, reduced task backlog and R&D costs, and sped up version iteration while continuously improving code quality. This has improved the user experience and continues to contribute to the development of the company’s business.

The Dev for Dev column

Dev for Dev (Developer for Developer) is an interactive, innovative practice program jointly initiated by Agora and the RTC Developer community. Through technology sharing, discussion and debate, project building, and other formats, all from an engineer’s perspective, it brings developers together, uncovers and delivers the most valuable technical content and projects, and gives full play to technical creativity.