Author: Xianyu Technology — Yun Cong

Status quo and Problems

Xianyu's public opinion governance (the handling of user feedback and complaints about the app) builds on Alibaba Group's infrastructure and already has the following capabilities:

• Online aggregated queries for crashes, exceptions, and performance data
• Local logs: TLog
• Online logs: event-tracking logs (T+1) and user behavior logs (paths and requests)

However, many problems remain in public opinion governance:

A considerable share of public opinion about crashes to the home screen, black screens, and freezes comes with no crash or ANR logs.

Business problems in technical public opinion are hard to locate:

• Business problems include failures to upload pictures and videos, abnormal content display, failure to exit, server errors, and so on.
• Log content is missing: most business logs are written to the console rather than to local or online logs.
• Some important services use event-tracking logs, but their content is insufficient and the T+1 delay makes them untimely.

Local logs have limited ability to locate problems:

• Local TLog logs contain detailed logs from the Group's second-party SDKs, but lack basic device information, business logs, and user behavior logs.
• Unknown abnormal problems, such as a green screen in video, cannot be diagnosed because console (logcat) logs are not captured.
• Log specifications are incomplete, so viewing logs is inefficient.

Local logs are hard to retrieve:

• Active user feedback does not trigger local log reporting.
• Xianyu is a transaction app whose users do not stay online for long, so the success rate of online command push is low.
• The backend has no caching mechanism for pushed commands.

Feedback deviates from the actual online situation:

• Xianyu's feedback entry is buried deep in the app. Relative to a user base in the hundreds of millions, the volume of feedback is proportionally small, so some online problems receive no feedback at all.

Overall design of public opinion governance scheme

Based on the current situation, the public opinion governance system is reorganized and supplemented as shown in the figure below:

Online public opinion issues governance system

The following sections focus on:

• Improving the ability of local logs to locate problems
• Adding online stutter detection
• Adding the ability to proactively discover problems

Improving the ability of local logs to locate problems

Adding console logs to local logs

In the early stage of governance, a large number of business logs went to the console. Even after governance moved most of them to TLog, some logs still go to the console, for example those from independent modules pulled in through Android Maven and from Flutter plugin modules. In addition, logcat logs make it possible to locate some unknown abnormal problems, such as green or black screens.

Android's logging module provides several log buffer types, as described in Android Logging [1], and the logcat command can read the logs of each type. Here, the LOG_ID_MAIN, LOG_ID_EVENTS, and LOG_ID_CRASH buffers are read and written to a local file when the user submits feedback. The logcat logs and TLog files are then packaged and uploaded to OSS through AUS.

// LOG_ID_MAIN: main application log
adb logcat -d -v threadtime -t 20000
// LOG_ID_EVENTS: system events
adb logcat -d -b events -v threadtime -t <count>
// LOG_ID_CRASH: application crash logs
adb logcat -d -b crash -v threadtime -t 6666

Contents of logcat.txt
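The three dumps above differ only in the buffer name and the line count, so they could be driven from one helper. The following is a minimal sketch under stated assumptions, not the article's implementation: the class and method names are hypothetical, and the code only assembles the argument list shown above and, optionally, runs it with ProcessBuilder. Inside an app the executable would be logcat rather than adb logcat.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds a one-shot logcat dump command. */
final class LogcatCommand {
    private LogcatCommand() {}

    /**
     * @param buffer   logcat buffer name ("events", "crash"); null selects the default buffers
     * @param maxLines value passed to -t (only the most recent N lines)
     */
    static List<String> build(String buffer, int maxLines) {
        List<String> args = new ArrayList<>();
        args.add("logcat");
        args.add("-d");              // dump the buffer and exit instead of blocking
        if (buffer != null) {
            args.add("-b");
            args.add(buffer);
        }
        args.add("-v");
        args.add("threadtime");      // date, time, PID, TID, priority, tag per line
        args.add("-t");
        args.add(Integer.toString(maxLines));
        return args;
    }

    /** Runs the command and appends its output to the given file; returns the exit code. */
    static int dumpTo(File out, List<String> args) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(args)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.appendTo(out))
                .start();
        return process.waitFor();
    }
}
```

Calling build(null, 20000), build("events", n), and build("crash", 6666) in turn and appending each result to the same file would produce a single logcat.txt for packaging.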

Local log retrieval capability

As mentioned above, because online command push has a low success rate and the backend has no command caching mechanism, the local logs of the Xianyu app are hard to retrieve. To address this, the public opinion governance system uses several log retrieval strategies, as shown in the following figure.

Local log retrieval policy

Download and view local logs

• To raise the success rate of log uploads triggered by active user feedback, the log package is uploaded to OSS through AUS, and its URL is written into the feedback content and into the Alibaba Cloud SLS real-time log platform. If packaging and uploading take longer than 5 seconds, the URL is not written into the feedback content.
• To make local logs easier to download, they are also uploaded to the TLog platform through the TLog SDK; more than 50% of these uploads succeed.
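The 5-second budget on packaging and uploading can be expressed as a bounded wait. The sketch below is only an illustration of that idea: the class name, the packageAndUpload supplier (a stand-in for the AUS/OSS upload, whose API is not shown in this article), and the feedback text format are all assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: attach the log URL to feedback only if the upload is fast enough. */
final class FeedbackLogUpload {
    private FeedbackLogUpload() {}

    /**
     * Waits up to timeoutMs for the package-and-upload step.
     * Returns the feedback text, with the log URL appended only if the upload finished in time.
     */
    static String buildFeedback(String userText, Supplier<String> packageAndUpload, long timeoutMs) {
        CompletableFuture<String> upload = CompletableFuture.supplyAsync(packageAndUpload);
        try {
            String url = upload.get(timeoutMs, TimeUnit.MILLISECONDS);
            return userText + "\nlog: " + url;
        } catch (TimeoutException e) {
            // Too slow: submit the feedback without the URL. The upload itself keeps running.
            return userText;
        } catch (ExecutionException e) {
            // Packaging or upload failed: submit the feedback without the URL.
            return userText;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return userText;
        }
    }
}
```

With this shape the user's submission is never delayed by more than the budget, and a late upload can still be recorded elsewhere (for example in the real-time log platform) once it completes.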

Online stutter/ANR detection capability

Online users reported ANRs and provided screenshots as evidence, but without a stutter log the problem was hard to locate.

Online users reported a black screen; without valid logs the problem was hard to locate.

Another report of a stuck page

Current state of technical solutions

In Xianyu's hybrid app, stutter detection on iOS is already implemented on the EMAS platform, and stutter detection on the Flutter side is covered in "Flutter stutter monitoring and thoughts" [2]. This section focuses on stutter/ANR detection on Android.

In offline scenarios, stutter/ANR detection on Android is mature. Common stutter detection schemes include BlockCanary [3], and ANRs can be inspected by running adb bugreport and checking traces.txt. In the online environment, however, these schemes have problems.

traces.txt file permission problems

In online scenarios, to listen for ANRs and read their content, the app has to watch the traces.txt file for changes and try to read it. On Android 6.0 and later this runs into permission problems, and in most cases ANRs cannot be detected:

• The /data/anr/traces.txt file cannot be accessed
• The system property dalvik.vm.stack-trace-file does not return a valid value

For example, when the following code runs on a Redmi phone with Android 11, mSystemTraceFilePath is "" and the path of mSystemTraceFile is "/".

File mSystemTraceFile;
...
this.mSystemTraceFilePath = "/data/anr/traces.txt";
this.mSystemTraceFile = new File(this.mSystemTraceFilePath);
if (!this.mSystemTraceFile.exists()) {
    String propSystemTraceFilePath = SystemPropertiesUtils.get("dalvik.vm.stack-trace-file");
    ...
    this.mSystemTraceFile = new File(propSystemTraceFilePath);
    ...
}
...
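One way to make this failure explicit is to treat an empty property value or an unreadable path as "no trace file" instead of constructing a File from it. The sketch below is an assumption-laden illustration in plain Java: TraceFileResolver is a hypothetical name, and the systemProperty function stands in for the SystemPropertiesUtils.get call above.

```java
import java.io.File;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical resolver that refuses empty or unreadable trace file paths. */
final class TraceFileResolver {
    private TraceFileResolver() {}

    static Optional<File> resolve(String defaultPath, Function<String, String> systemProperty) {
        File candidate = new File(defaultPath);
        if (isUsable(candidate)) {
            return Optional.of(candidate);
        }
        String fromProperty = systemProperty.apply("dalvik.vm.stack-trace-file");
        if (fromProperty == null || fromProperty.trim().isEmpty()) {
            // The case described above: the property yields no usable path.
            return Optional.empty();
        }
        candidate = new File(fromProperty);
        return isUsable(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    /** A usable trace file is a regular file the current process may read. */
    private static boolean isUsable(File file) {
        return file.isFile() && file.canRead();
    }
}
```

An empty result is then the signal to fall back to a detection method that does not depend on traces.txt.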

BlockCanary: detection principle and performance problems

Core principles

BlockCanary detecting a 500 ms stutter

The core principle of BlockCanary is to set the mLogging field of the UI thread's Looper, so that the main thread calls the printer's println method before and after each message it processes.

public void start() {
    if (!mMonitorStarted) {
        mMonitorStarted = true;
        Looper.getMainLooper().setMessageLogging(mBlockCanaryCore.monitor);
    }
}

BlockCanary.java[4]

public void setMessageLogging(@Nullable Printer printer) {
    mLogging = printer;
}

public static void loop() {
    ...
    for (;;) {
        ...
        // This must be in a local variable, in case a UI event sets the logger
        final Printer logging = me.mLogging;
        if (logging != null) {
            logging.println(">>>>> Dispatching to " + msg.target + " " +
                    msg.callback + ": " + msg.what);
        }
        ...
    }
}

Looper.java

Each println call drives the StackSampler: the previous delayed task on the sampling thread is cancelled and a new one is scheduled with a delay of BlockThreshold * 0.8f (to detect stutters over 500 ms, the delay is 400 ms). If a UI Looper task blocks for longer than BlockThreshold, the delayed task runs and captures the main thread stack while the block is still in progress. When the next println call confirms that the block exceeded the threshold, the captured stack is reported as the stutter stack.

@Override
public void println(String x) {
    ...
    if (!mPrintingStarted) {
        mStartTimestamp = System.currentTimeMillis();
        mStartThreadTimestamp = SystemClock.currentThreadTimeMillis();
        mPrintingStarted = true;
        startDump();
    } else {
        final long endTime = System.currentTimeMillis();
        mPrintingStarted = false;
        if (isBlock(endTime)) {
            notifyBlockEvent(endTime);
        }
        stopDump();
    }
}

LooperMonitor.java[5]
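The schedule-then-cancel pattern described above can be reproduced outside Android. The following plain-Java sketch is not BlockCanary's code: MessageBoundaryMonitor is a hypothetical class in which beforeDispatch and afterDispatch play the role of the two println calls around a message, and a ScheduledExecutorService stands in for the sampling thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Plain-Java stand-in for a monitor driven by message start/end callbacks. */
final class MessageBoundaryMonitor {
    private final Thread watched;
    private final long thresholdMs;
    private final ScheduledExecutorService sampler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "stack-sampler");
                t.setDaemon(true);
                return t;
            });
    private ScheduledFuture<?> pending;
    private long startNanos;
    private volatile StackTraceElement[] sampled;
    private StackTraceElement[] lastBlockStack; // null until a block is confirmed

    MessageBoundaryMonitor(Thread watched, long thresholdMs) {
        this.watched = watched;
        this.thresholdMs = thresholdMs;
    }

    /** Before a message: schedule a stack sample at 0.8 * threshold. */
    void beforeDispatch() {
        sampled = null;
        startNanos = System.nanoTime();
        pending = sampler.schedule(() -> { sampled = watched.getStackTrace(); },
                (long) (thresholdMs * 0.8f), TimeUnit.MILLISECONDS);
    }

    /** After a message: cancel the sample, and keep the stack only if the message was slow. */
    void afterDispatch() {
        pending.cancel(false);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        StackTraceElement[] stack = sampled;
        if (elapsedMs > thresholdMs && stack != null) {
            lastBlockStack = stack;
        }
    }

    StackTraceElement[] lastBlockStack() {
        return lastBlockStack;
    }
}
```

Note that every message pays for one schedule and one cancel even when nothing is slow, which is the per-message cost this kind of monitor carries.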

Performance issues

In the online environment, most scenarios involve no stutter, yet the stutter detection SDK runs all the time. Within each frame of the app (16.6 ms), the UI Looper executes multiple tasks, so the monitor performs a large number of useless string concatenations and creates many small short-lived objects. On low-end devices or under high CPU load in particular, this degrades app performance and affects how smooth the app feels to the user.

logging.println(">>>>> Dispatching to " + msg.target + " " +
                    msg.callback + ": " + msg.what);

Stutter detection scheme

Design

For the traces.txt read permission problem, a main thread stall of 5 s can be treated as an ANR. For the performance problems of running BlockCanary online, Looper.mLogging has to be abandoned, both to reduce how often delayed tasks are cancelled and rescheduled and to avoid the frequent creation of string objects. Consider again why setting Looper.mLogging makes stutter detection possible. It satisfies two conditions:

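• If a method on the main thread stutters, the execution time of the Looper task becomes longer.
• The execution time of each task can be monitored through Looper.mLogging.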

In Xianyu's online build, Android frame callbacks are used instead of Looper tasks; they satisfy the same two conditions:

• If a method on the main thread stutters, the interval between frame callbacks becomes longer.
• The duration of each frame can be monitored by registering frame callbacks.

In addition, to avoid BlockCanary's high frequency of scheduling and cancelling delayed tasks, the frame callback only records a timestamp, and the delayed task is never cancelled. The main thread stack is still captured from within the delayed task.

Assuming stutters of more than 500 ms are to be detected, the overall flow is as follows:

No lag scenario

Stutter scenario

• Within 500 ms (the stutter threshold), the delayed task runs at most twice.
• The delayed task is never cancelled.
• When the delayed task runs, it computes the difference between the current time and the timestamp of the last frame, and dumps the main thread stack only if a stutter has occurred.
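The flow above reduces to two operations: a cheap timestamp write on every frame, and a periodic check that compares the current time with that timestamp. The sketch below illustrates this in plain Java with injected clock values so it can be reasoned about deterministically. It is a hypothetical model, not Xianyu's code: FrameStallWatchdog and its method names are invented, and on Android onFrame would presumably be called from a frame callback that re-registers itself each frame, with check posted repeatedly on a background thread.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical watchdog: the frame callback only stores a timestamp. */
final class FrameStallWatchdog {
    private final long thresholdMs;
    private volatile long lastFrameMs;

    FrameStallWatchdog(long thresholdMs, long nowMs) {
        this.thresholdMs = thresholdMs;
        this.lastFrameMs = nowMs;
    }

    /** Called once per frame on the main thread; no allocation, no task scheduling. */
    void onFrame(long nowMs) {
        lastFrameMs = nowMs;
    }

    /** How long the main thread has gone without producing a frame. */
    long stalledForMs(long nowMs) {
        return nowMs - lastFrameMs;
    }

    /**
     * Called by the periodic checker task, which is never cancelled.
     * The stack supplier is invoked only when the stall exceeds the threshold.
     */
    Optional<String> check(long nowMs, Supplier<String> mainThreadStack) {
        if (stalledForMs(nowMs) <= thresholdMs) {
            return Optional.empty();
        }
        return Optional.of(mainThreadStack.get());
    }
}
```

Because the expensive step (dumping the stack) sits behind the threshold test, the steady-state cost is one volatile write per frame plus one comparison per check.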

ANR detection builds on the same 500 ms stutter detection and aggregates the stutter stacks. When the stutter lasts more than 5 s and the stack stays unchanged, an ANR is considered to have occurred, and the CPU load of the device at that moment is recorded.
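The promotion from stutter to ANR can be sketched as a small state machine fed by the stutter samples. The code below is one possible interpretation, with hypothetical names throughout: each sample carries the captured stack (as a string) and how long the main thread has been stalled, and a changed stack restarts the 5 s clock. Recording CPU load is left out.

```java
/** Hypothetical aggregator: an unchanged stack over a long stall is reported as an ANR. */
final class AnrAggregator {
    private final long anrThresholdMs;
    private String stallStack;   // stack observed during the current stall, or null
    private long baselineMs;     // stall duration at which stallStack was first seen
    private boolean reported;

    AnrAggregator(long anrThresholdMs) {
        this.anrThresholdMs = anrThresholdMs;
    }

    /** Call when frames resume, so the next stall starts from a clean state. */
    void onRecovered() {
        stallStack = null;
        baselineMs = 0;
        reported = false;
    }

    /**
     * Feed one stutter sample.
     * Returns true at most once per unchanged stack, when it has persisted beyond the threshold.
     */
    boolean onStallSample(String stack, long stalledForMs) {
        if (!stack.equals(stallStack)) {
            // The first sample counts from the start of the stall; a changed stack restarts the clock.
            baselineMs = (stallStack == null) ? 0 : stalledForMs;
            stallStack = stack;
            reported = false;
        }
        if (!reported && stalledForMs - baselineMs > anrThresholdMs) {
            reported = true;
            return true;
        }
        return false;
    }
}
```

Requiring the stack to stay the same filters out the case where the main thread is slow but still making progress through different code.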

Test results

To check the detection results, 500 ms and 5 s stalls were deliberately added to the click handler of a card on the Xianyu home page.

// CardView61801.java
private void doOnClick(String redirectUrl, Map<String, String> trackParams) {
    if (null == mCardBean) return;
    try {
        Thread.sleep(500);
        // Thread.sleep(5000);
    } catch (Exception e) {
        e.printStackTrace();
    }
    ...
}

View test results

Sleep 500ms

Sleep 5s

Summary

The above is Xianyu's online stutter detection scheme for Android, which has been running online for more than three releases. The logs show that user reports of freezes, unresponsiveness, crashes to the home screen, and black screens may be caused by ANRs. The scheme has the following characteristics:

Online stutter detection causes no performance problems

Stutters of more than 5 s are treated as ANRs

Compared with BlockCanary, which captures the stack of a Looper task at 0.8 × the stutter threshold, this scheme is more likely to capture the wrong stack for a 500 ms stutter. For 5 s stalls, comparing consecutive stacks ensures accuracy; for 500 ms stutters, accuracy is improved through online statistics.

• If the business needs a more accurate stack for a single 500 ms stutter, the scheme can be adjusted slightly: capture a stack whenever the stall exceeds 250 ms, and take the stack confirmed by two consecutive captures (more than 500 ms in total) as the final stutter stack.
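One possible reading of this adjustment is sketched below: stacks are sampled at the shorter 250 ms interval while the main thread is stalled, and a stack is accepted only when two consecutive samples agree. The class name and the string representation of a stack are assumptions made for the example.

```java
import java.util.Optional;

/** Hypothetical confirmer: a stack is final only if two consecutive samples match. */
final class ConsecutiveStackConfirmer {
    private String previous; // stack from the previous sample of this stall, or null

    /** Feed one sample taken while the main thread is stalled. */
    Optional<String> onSample(String stack) {
        boolean confirmed = stack.equals(previous);
        previous = stack;
        return confirmed ? Optional.of(stack) : Optional.empty();
    }

    /** Call when a frame arrives, so samples from different stalls never pair up. */
    void reset() {
        previous = null;
    }
}
```

The trade-off is twice as many stack captures during a stall in exchange for discarding stacks that were only transient.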

Proactive problem discovery

Because the feedback entry of the Xianyu app is buried deep, daily technical public opinion is proportionally small relative to a user base in the hundreds of millions, so it cannot accurately reflect online quality. We therefore added event-tracking monitoring to build dashboards for key online public opinion problems and basic performance, and use these dashboards to proactively discover the key online problems that need fixing. This speeds up the convergence of online public opinion and the improvement of quality, as shown in the flow chart below.

Proactive problem finding flowchart

Monitoring dashboards

Active problem detection sample

Business public opinion dashboard

Statistics on public opinion over a period of time identify the key problems, for which monitoring event-tracking points are added and online reports are built. The dashboard data shows the number of online problems and their main attributions; fixing the top few attributions makes public opinion problems converge quickly.
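Picking "the top few attributions" is a count-and-rank step over the reported events. The sketch below shows that step in isolation; it is illustrative only, and the class name and the attribution labels used in the usage note are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical ranking of problem attributions by how often they were reported. */
final class AttributionRanking {
    private AttributionRanking() {}

    /** Returns the n most frequent attributions, most frequent first; ties are ordered by name. */
    static List<Map.Entry<String, Integer>> top(List<String> reportedAttributions, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String attribution : reportedAttributions) {
            counts.merge(attribution, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }
}
```

For a day of upload-failure events tagged with labels such as upload_timeout or oss_auth, top(events, 3) would give the three attributions to work on first.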

Basic problems dashboard

In addition to the existing online dashboards for crashes, Flutter exceptions, and performance, we built basic dashboards for 5 s stutter monitoring, network request failures, slow requests, and error toasts.

Locating monitored problems

When a dashboard reveals a problem, the corresponding logs are needed to locate it. However, the affected users may not have submitted any feedback, or their feedback may be about something else. We therefore built real-time log query and batch retrieval of local logs.

Real-time log query platform

Self-built public opinion tracking platform

For key monitored problems, the client writes corresponding real-time SLS logs, and users' online logs can be fetched from the self-built public opinion tracking platform. The platform supports queries by user name and time, as well as combined queries over log types: user behavior, user exceptions, and public opinion.

Batch retrieval of local logs

Online logs inevitably lack some content, such as basic logs and the logs of other key modules, while the success rate of retrieving a single user's local logs is very low (see above). We therefore built a public opinion log retrieval platform that retrieves logs in batches, so that for a given problem the complete local logs of at least one affected user can be obtained.

Enter IssueName to query the user ID

Batch retrieval result query

Summary and Outlook

After governance, the overall share of online technical public opinion fell from 10.5% to 4.7%. By proactively discovering and fixing key problems, picture upload failures fell from 100,000+ per day to 7,000+ per day. Other data are not listed here.

The Xianyu public opinion governance system established so far is as follows:

Log types: local logs and online logs.

Online logs: real-time SLS logs, user behavior logs, event-tracking (T+1) logs, and high-availability SDK logs (crashes, exceptions, blank screens, performance).

Log query: Provides log specifications and log filtering documents to improve query efficiency.

Log content: basic logs and business logs.

Basic logs: logcat logs, stutter/ANR logs, high-availability SDK logs (crashes, exceptions, etc.), device information, account information, etc.

Log retrieval: proactive upload during feedback (TLog platform and OSS platform) and retrieval by online command (TLog platform and self-built message channel).

Problem finding: user feedback and active problem finding

Feedback entry: general customer service feedback and grayscale screenshot feedback

In particular, on MIUI, screenshots can be detected by listening for the miui.intent.take_screenshots broadcast.

In addition, there is still considerable room for future evolution:

Log Visualization

Log semantic description display

Visualization of user behavior, memory, CPU, traffic, etc

An example of memory visualization, produced by manually parsing the local logs of a Xianyu user

Intelligent Log Parsing

• Key logs are associated with historical problems to form a knowledge base

Reverse retrieval of user logs from key logs

• User logs can be retrieved based on configuration or on other key online logs

Log association aggregation query

• Aggregates server, front-end, and client logs based on user and time, and aggregates multiple online and local logs.

References

[1] Android Logging: developer.android.com/ndk/referen…
[2] Flutter stutter monitoring and thoughts: zhuanlan.zhihu.com/p/148985175
[3] BlockCanary: github.com/markzhai/An…
[4] BlockCanary.java: github.com/markzhai/An…
[5] LooperMonitor.java: github.com/markzhai/An…