I. Crash monitoring

A crash is an abnormal exit of an app caused by an unhandled exception in its code. The application becomes unusable and all work in progress stops. The application has to be restarted afterwards (in some cases automatically), and no matter how carefully it was built during development, crashes cannot be avoided entirely.

There are two types of crash in Android applications: Java-layer crashes and native-layer crashes. The two differ in how they are monitored and in how stack information is obtained.

1. Java Crash

In Java, Thread defines an interface, UncaughtExceptionHandler, which handles thread termination caused by uncaught exceptions (that is, exceptions that no catch block captured). When the application crashes, UncaughtExceptionHandler.uncaughtException is called, and the exception information is available inside that method. We install a default handler for all threads with Thread.setDefaultUncaughtExceptionHandler. In that handler we can save the exception information locally and upload it to a server later, which makes it faster to locate the problem.

    import android.content.Context;
    import android.content.pm.PackageInfo;
    import android.content.pm.PackageManager;
    import android.os.Build;

    // Or android.support.annotation.NonNull on pre-AndroidX projects
    import androidx.annotation.NonNull;

    import org.json.JSONException;

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.text.SimpleDateFormat;
    import java.util.Arrays;
    import java.util.Date;

    public class CrashHandler implements Thread.UncaughtExceptionHandler {
        private static final String FILE_NAME_SUFFIX = ".trace";
        private static Thread.UncaughtExceptionHandler defaultUncaughtExceptionHandler;
        private static Context context;

        public static void init(Context applicationContext) {
            context = applicationContext;
            // Remember the previous default handler so it still runs after ours
            defaultUncaughtExceptionHandler = Thread.getDefaultUncaughtExceptionHandler();
            Thread.setDefaultUncaughtExceptionHandler(new CrashHandler());
        }

        @Override
        public void uncaughtException(@NonNull Thread t, @NonNull Throwable e) {
            try {
                // The returned file can be uploaded to the server later
                File file = dealException(t, e);
            } catch (Exception exception) {
                // Ignored: a failure while recording must not mask the original crash
            } finally {
                if (defaultUncaughtExceptionHandler != null) {
                    defaultUncaughtExceptionHandler.uncaughtException(t, e);
                }
            }
        }

        private File dealException(Thread thread, Throwable throwable)
                throws JSONException, IOException, PackageManager.NameNotFoundException {
            String time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
            // App-private directory, so no storage permission is required
            File f = new File(context.getExternalCacheDir().getAbsoluteFile(), "crash_info");
            if (!f.exists()) {
                f.mkdirs();
            }
            File crashFile = new File(f, time + FILE_NAME_SUFFIX);
            // try-with-resources flushes and closes the writer even if writing fails
            try (PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(crashFile)))) {
                pw.println(time);
                pw.println("Thread: " + thread.getName());
                pw.println(getPhoneInfo());
                throwable.printStackTrace(pw); // Write the crash stack
            }
            return crashFile;
        }

        private String getPhoneInfo() throws PackageManager.NameNotFoundException {
            PackageManager pm = context.getPackageManager();
            PackageInfo pi = pm.getPackageInfo(context.getPackageName(), PackageManager.GET_ACTIVITIES);
            StringBuilder sb = new StringBuilder();
            // App version
            sb.append("App Version: ");
            sb.append(pi.versionName);
            sb.append("_");
            sb.append(pi.versionCode + "\n");
            // Android version
            sb.append("OS Version: ");
            sb.append(Build.VERSION.RELEASE);
            sb.append("_");
            sb.append(Build.VERSION.SDK_INT + "\n");
            // Phone manufacturer
            sb.append("Vendor: ");
            sb.append(Build.MANUFACTURER + "\n");
            // Phone model
            sb.append("Model: ");
            sb.append(Build.MODEL + "\n");
            // CPU architecture
            sb.append("CPU: ");
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.LOLLIPOP) {
                sb.append(Arrays.toString(Build.SUPPORTED_ABIS));
            } else {
                sb.append(Build.CPU_ABI);
            }
            return sb.toString();
        }
    }
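The default-handler mechanism itself can be tried on a plain JVM without any Android classes. The following is a minimal sketch, not part of the CrashHandler class: the class name UncaughtHandlerDemo and its methods describe and crashAndCapture are hypothetical names chosen for illustration. It installs a default handler, lets a worker thread die from an uncaught exception, and returns the report the handler built.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.concurrent.atomic.AtomicReference;

/** Plain-JVM demo (hypothetical names): capture an uncaught exception via the default handler. */
class UncaughtHandlerDemo {

    /** Builds a small report: the thread name followed by the full stack trace. */
    static String describe(Thread t, Throwable e) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        pw.println("Thread: " + t.getName());
        e.printStackTrace(pw);
        pw.flush();
        return sw.toString();
    }

    /** Starts a thread that throws, and returns the report captured by the default handler. */
    static String crashAndCapture(String threadName, String message) {
        AtomicReference<String> report = new AtomicReference<>();
        Thread.UncaughtExceptionHandler previous = Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> report.set(describe(t, e)));
        try {
            Thread worker = new Thread(() -> {
                throw new IllegalStateException(message);
            }, threadName);
            worker.start();
            worker.join(); // the handler runs on the dying thread before join() returns
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        } finally {
            // Restore whatever handler was installed before the demo
            Thread.setDefaultUncaughtExceptionHandler(previous);
        }
        return report.get();
    }
}
```

Calling UncaughtHandlerDemo.crashAndCapture("worker-1", "boom") returns a string that starts with the thread name and contains the IllegalStateException stack trace, which is the same information the Android handler writes to its .trace file.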

2. NDK Crash

1) Linux signaling mechanism

Signals are an important form of inter-process communication in Linux. On the one hand, they are used for ordinary inter-process communication and synchronization; on the other, they report system exceptions and interrupts. When an application hits a fault, the Linux kernel generates an error signal and delivers it to the current process, which can handle it in one of three ways:

  • Ignore the signal;
  • Catch the signal and run the corresponding signal handler function;
  • Perform the default action for the signal (such as terminating the process).

When a Linux application hits a serious error during execution, it usually crashes. Linux defines a class of crash signals for this case. When a program receives one of them, the default action is to record the crash information in a core file and then terminate the process. Common crash signals:

  • SIGSEGV: invalid memory reference.
  • SIGBUS: access to an undefined portion of a memory object.
  • SIGFPE: arithmetic error, such as division by zero.
  • SIGILL: illegal instruction, such as executing garbage or a privileged instruction.
  • SIGSYS: bad system call.
  • SIGXCPU: CPU time limit exceeded.
  • SIGXFSZ: file size limit exceeded.

When a crash signal arrives, the default action on Android is to terminate our process directly. However, the system lets us register a handler for a particular signal in a process, which replaces the default action for that signal. NDK crash monitoring uses exactly this mechanism: it catches the crash signal and runs our own signal handler to record the native crash.

2) Tombstone files

An Android native application is essentially a Linux application. When it hits a fatal error it crashes, and a tombstone file is produced that records the scene of the crash.

Ordinary applications do not have permission to read the tombstones under the path /data/tombstones/. The addr2line tool is used to parse tombstone files; Breakpad, introduced next, is used later.

3) Breakpad

Google Breakpad is a cross-platform set of frameworks and tools for crash dumping and analysis. It is open source and hosted on GitHub. On Linux, Breakpad is built on the Linux signal-catching mechanism. Because it is written in C++, using it on Android requires the NDK.

II. Introduction to ANR analysis

ANR stands for Application Not Responding. It is detected by AMS (ActivityManagerService) and is divided into four types:

  • KeyDispatchTimeout: an input event is not handled in time; the timeout is 5 s.
  • BroadcastTimeout: a broadcast is not handled in time; the timeout is 10 s in the foreground and 60 s in the background.
  • ServiceTimeout: a service does not respond in time; the timeout is 20 s for a foreground service and 200 s for a background service.
  • ContentProviderTimeout: a content provider times out; it triggers when the work has not been processed within 10 s.

Of the four types above, the input event timeout is special. The other three trigger an ANR whenever the work is not finished within the specified time. An input event that has gone unprocessed for 5 s, however, does not necessarily trigger an ANR: as long as no later input event is waiting to be processed, no ANR is raised. The input event timeout is also the type we run into most often.

There are three main causes of ANR:

  • The main thread frequently performs time-consuming I/O, such as file reads and writes, database operations, and SP (SharedPreferences) operations
  • The main thread is deadlocked with other threads
  • System resources, such as CPU, I/O, and pipes, are exhausted

Main methods for solving ANR problems

When an ANR occurs during development, we can locate the problem by combining Logcat logs with the trace file, and then analyze it against the specifics of the business logic. There are several points to note when analyzing the trace file:

  • If CPU usage is high, the device is busy and CPU starvation may be causing the ANR
  • If CPU usage is low, the main thread may be blocked
  • If the iowait ratio is high, time-consuming I/O was most likely performed on the main thread

Online ANR monitoring methods

  1. Use a watchdog thread
  2. Listen for changes to the /data/anr directory through FileObserver
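The watchdog idea can be sketched in plain Java. This is only an illustration under stated assumptions: on Android the ping would be posted to the main thread through a Handler on the main Looper, whereas here a single-threaded ExecutorService stands in for the main thread so the sketch runs on any JVM. The class name AnrWatchdogSketch and its methods are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Watchdog sketch (hypothetical names): a single-thread executor stands in for the main thread. */
class AnrWatchdogSketch {

    /**
     * Posts a no-op "ping" to the monitored thread and waits up to timeoutMs for it to run.
     * Returns true if the ping did not run in time, i.e. the thread looks blocked.
     */
    static boolean isBlocked(ExecutorService monitored, long timeoutMs) {
        CountDownLatch ping = new CountDownLatch(1);
        monitored.execute(ping::countDown);
        try {
            return !ping.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // the watchdog itself was interrupted, so report nothing
        }
    }

    /** Demo: queue a task that sleeps busyMs on the stand-in main thread, then probe it. */
    static boolean probeWhileBusy(long busyMs, long timeoutMs) {
        ExecutorService main = Executors.newSingleThreadExecutor();
        try {
            main.execute(() -> sleepQuietly(busyMs));
            return isBlocked(main, timeoutMs);
        } finally {
            main.shutdownNow(); // interrupts the sleeping task so the demo thread exits
        }
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A real watchdog would run this probe in a loop on its own thread and dump the main thread's stack when the probe reports a block. The timeout is a design choice: a value well below the 5 s input timeout catches stalls before the system reports them, at the cost of more false alarms.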

After an event is dispatched, whether it is a broadcast, a service call, or an input event, AMS in the system_server process plants a record (a time bomb) and the event is then processed. If processing completes within the specified time, the record is removed and the bomb is defused. If not, the bomb goes off and an ANR is triggered.
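The plant, defuse, and detonate sequence can be modelled with a scheduled timer. The sketch below is a plain-Java analogy and not the actual AMS code: the class TimeoutBombSketch, its methods, and the event id "event-1" are all hypothetical, and event ids are assumed to be unique.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Time-bomb sketch (hypothetical names): a pending record that fires unless removed in time. */
class TimeoutBombSketch {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final Consumer<String> onTimeout;

    TimeoutBombSketch(Consumer<String> onTimeout) {
        this.onTimeout = onTimeout;
    }

    /** Plant the bomb: record the event, then schedule the timeout check. */
    void begin(String eventId, long timeoutMs) {
        pending.add(eventId);
        timer.schedule(() -> {
            // Detonate only if the record is still there when the deadline passes
            if (pending.remove(eventId)) {
                onTimeout.accept(eventId);
            }
        }, timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Defuse the bomb: returns true if the event finished before its deadline. */
    boolean finish(String eventId) {
        return pending.remove(eventId);
    }

    void shutdown() {
        timer.shutdownNow();
    }

    /** Demo: handle one event that takes workMs; returns true if the timeout fired. */
    static boolean timesOut(long workMs, long timeoutMs) {
        AtomicBoolean fired = new AtomicBoolean(false);
        TimeoutBombSketch monitor = new TimeoutBombSketch(id -> fired.set(true));
        try {
            monitor.begin("event-1", timeoutMs);
            Thread.sleep(workMs); // stand-in for the app processing the event
            monitor.finish("event-1");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            monitor.shutdown();
        }
        return fired.get();
    }
}
```

In this sketch a defused timer task simply becomes a no-op when it runs; removing the record is what prevents the timeout callback.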

We should keep good habits during development: do not perform time-consuming operations on the UI thread. Hand them to a worker thread instead, and use a Handler to communicate between the worker thread and the UI thread. When the worker thread finishes, it notifies the UI thread through the Handler.