• Solving crashes, like solving crimes, requires experience. The more skilled we are at analyzing problems, the faster and more accurately we can locate them.
  • The crash site is our “first crime scene,” and the operating system is our “best witness” to the crash.

What information should be collected at the crash site

1. Crash information:
  • Process name, thread name:

Whether the crash occurred in the foreground or background process, and whether the crash occurred in the UI thread.

  • Crash stack and type:

Java crash, Native crash, or ANR

2. System information
  • Logcat: contains the runtime logs of the application and the system.
  • Model, system version, vendor, CPU, ABI, Linux kernel version, etc.
  • Device status: whether the device is rooted or an emulator.
3. Memory information

OOM, ANR, virtual memory exhaustion, and so on. Many crashes are directly related to memory.

  • Remaining system memory:

OOM, frequent GCs, and the system repeatedly killing and restarting processes are all very common when the system has very little available memory (less than 10% of MemTotal).

  • Application memory usage:

Including the Java heap, Resident Set Size (RSS), and Proportional Set Size (PSS), from which we can work out the application's own memory usage and distribution.

  • Virtual memory:

The /proc/self/status file gives the overall virtual memory usage, and /proc/self/maps shows its detailed distribution (see the sketch below). Sometimes we don't pay attention to virtual memory, but many problems such as OOM and tgkill crashes are actually caused by it.
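As a rough illustration (the helper class below is our own sketch, not part of any SDK; VmSize, VmRSS, Threads, and FDSize are standard keys in /proc/self/status), these counters could be sampled at crash time like this:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class ProcStatusSampler {
    // Reads /proc/self/status and keeps the counters we care about:
    // VmSize (virtual memory), VmRSS (resident memory), Threads, FDSize.
    public static Map<String, String> sample() {
        Map<String, String> result = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("/proc/self/status"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int colon = line.indexOf(':');
                if (colon <= 0) continue;
                String key = line.substring(0, colon).trim();
                if (key.equals("VmSize") || key.equals("VmRSS")
                        || key.equals("Threads") || key.equals("FDSize")) {
                    result.put(key, line.substring(colon + 1).trim());
                }
            }
        } catch (IOException ignored) {
            // The file may be unreadable in rare cases; report whatever we got.
        }
        return result;
    }
}
```

The values read here can be attached directly to the crash report alongside the Java heap numbers.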

4. Resource information

Sometimes we find that both the application heap and the device memory are plentiful, yet memory allocation still fails; this may be related to resource leaks.

  • File handle fd:

By default a single process can open at most 1024 file handles, and in practice exceeding 800 is already dangerous. All FDs and their corresponding file names should be written to the log (see the sketch at the end of this section) so we can check further for file or thread leaks.

  • The number of threads:

The current thread count can be obtained from the status file above. A single thread may take up around 2MB of virtual memory, so too many threads put pressure on both virtual memory and file handles. In my experience, more than 400 threads is dangerous. All thread ids and their corresponding names should be written to the log so we can check further whether the problem is thread-related.

  • JNI:

When using JNI, careless code easily leads to crashes such as invalid references and reference table overflows. We can use DumpReferenceTables to inspect the JNI reference tables and analyze further whether there are JNI leaks or similar problems.
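The sketch below shows one way these resource dumps might be collected; the class and method names are our own illustration, and dumpReferenceTables is a hidden API on dalvik.system.VMDebug that may be blocked by hidden-API restrictions on newer Android versions:

```java
import android.system.Os;

import java.io.File;
import java.lang.reflect.Method;
import java.util.Map;

public class ResourceDumper {

    // Lists open file descriptors by reading /proc/self/fd; each entry is a
    // symlink whose target names the file, socket, or pipe behind the fd.
    public static void dumpFds(StringBuilder out) {
        File[] fds = new File("/proc/self/fd").listFiles();
        if (fds == null) return;
        out.append("open fds: ").append(fds.length).append('\n');
        for (File fd : fds) {
            String target;
            try {
                target = Os.readlink(fd.getAbsolutePath());
            } catch (Exception e) {
                target = "?";
            }
            out.append(fd.getName()).append(" -> ").append(target).append('\n');
        }
    }

    // Lists every Java thread's id and name so thread leaks can be spotted.
    public static void dumpThreads(StringBuilder out) {
        Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
        out.append("java threads: ").append(all.size()).append('\n');
        for (Thread t : all.keySet()) {
            out.append(t.getId()).append(' ').append(t.getName()).append('\n');
        }
    }

    // dumpReferenceTables() is a hidden static method on dalvik.system.VMDebug
    // that prints the JNI global/local reference tables to logcat. Reflection
    // may fail where hidden APIs are restricted, so failures are swallowed.
    public static void dumpJniReferenceTables() {
        try {
            Class<?> vmDebug = Class.forName("dalvik.system.VMDebug");
            Method dump = vmDebug.getDeclaredMethod("dumpReferenceTables");
            dump.setAccessible(true);
            dump.invoke(null);
        } catch (Throwable ignored) {
            // Hidden API not available; skip.
        }
    }
}
```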

5. Application information

In addition to what the system knows, our application actually knows much more about itself and can leave behind a lot of relevant information.

  • Crash scene

Which Activity or Fragment the crash occurred in, and in which business module.

  • Critical Operation Path

Instead of a detailed log of everything the user did, we can record key user operation paths, which helps us reproduce crashes (a minimal recorder is sketched after this section).

  • Other customized information

Different applications have different priorities. For example, NetEase Cloud Music cares about the song currently playing, while QQ Browser cares about the URL or video currently open. Information such as running time, whether a patch has been loaded, and whether this is a fresh install or an upgrade is also important.
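As a hedged sketch of the "critical operation path" idea above, a tiny breadcrumb recorder could look like this (the class name and capacity are our own choices, not from any particular crash SDK):

```java
import java.util.ArrayDeque;

// A tiny breadcrumb recorder: the app appends key user actions
// ("enter MainActivity", "click pay button", ...) and the most recent
// entries are attached to the crash report.
public class Breadcrumbs {
    private static final int MAX_ENTRIES = 100;          // keep only recent actions
    private static final ArrayDeque<String> ENTRIES = new ArrayDeque<>();

    public static synchronized void record(String action) {
        if (ENTRIES.size() == MAX_ENTRIES) {
            ENTRIES.removeFirst();                        // drop the oldest entry
        }
        ENTRIES.addLast(System.currentTimeMillis() + " " + action);
    }

    public static synchronized String dump() {
        StringBuilder out = new StringBuilder();
        for (String entry : ENTRIES) {
            out.append(entry).append('\n');
        }
        return out.toString();
    }
}
```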

6. Other information

In addition to the general information above, specific crashes may require specific information such as disk space, battery level, and network status. So a good crash capture tool gathers enough information based on the scenario to give us more clues for analyzing and locating the problem. Of course, data collection must respect user privacy, with sufficiently strong encryption and anonymization.

Crash Analysis Trilogy

Step 1: Identify your priorities

Identify and analyze the key points first: find the important information in the logs and make a rough judgment about the problem. In general, I suggest focusing on the following in this step.

1. Identify the severity

Solving crashes is also about cost-effectiveness. We give priority to Top crashes or crashes that significantly impact the business, such as crashes in the startup or payment flow.

2. Crash basic information

Determine the type of crash and the exception description to get a rough idea of the crash.

  1. Java crash: the type is usually obvious.
  2. Native crash: look at the signal, code, and fault addr, as well as the Java stack at the time of the crash.
  3. ANR: first look at the main thread stack to see whether it is caused by a lock wait; then check the iowait, CPU, GC, and system_server information in the ANR log to determine whether it is an I/O problem, CPU contention, or a freeze caused by a large number of GCs.

3. Logcat

Logcat usually contains valuable clues, especially at the Warning or Error level. From logcat we can see some system behaviors and the state of the phone at the time. For example, when an ANR occurs there will be an “am_anr” event; when an app is killed there will be an “am_kill” event.

4. Resource status

Combined with the basic crash information, we then check whether it is related to the “memory information” and “resource information” above, for example insufficient physical memory, insufficient virtual memory, or a file handle (fd) leak.

Step 2: Look for commonalities

If the above steps do not locate the problem, we can try to find out whether crashes of this type share any commonality. Aggregating the collected system information by dimensions such as model, system version, ROM, manufacturer, and ABI helps us find that commonality; once found, the differences give clearer guidance for reproducing the problem in the next step.

Step 3: Try to reproduce

“As long as I can reproduce it locally, I can solve it” is what many developers and testers like to say. The main reason is that with a stable reproduction path, we can add logs or use tools such as a debugger or GDB for further analysis.

Problem: how to handle system crashes

1. Find the possible causes.

By sorting through the commonalities above, we first check whether it is a problem of a particular system version or a vendor-specific ROM. Although the crash log may not contain any of our own code, the operation path and logs can still point us to some suspicious places.

2. Try to avoid it.

Look for suspicious code calls and inappropriate APIs, and find alternatives to avoid them.

3. Hook it.

This can be done with either a Java hook or a native hook.

Crash attack and defense is a long-term process; we want to prevent crashes as early as possible and nip them in the bud. This may involve the entire lifecycle of the application, including personnel training, compile-time checks, and static scanning, as well as standardized testing, gray (staged) release, and release processes.

Getting logcat and Java stacks:

Get logcat

Log writes go from the application through liblog.so to logd, which stores the data in a ring buffer.

1. Run the logcat command (a minimal sketch follows this list)
  • Advantages: very simple, good compatibility.
  • Disadvantages: the whole link is long and hard to control, with a high failure rate; it will basically fail when the heap is corrupted or heap memory is exhausted.
2. Hook liblog.so implementation

The __android_log_buf_write method in liblog.so is hooked and its content redirected to our own buffer.

  • Advantages: simple, relatively good compatibility.
  • Disadvantages: it has to stay switched on all the time.
3. Custom collection code
  • Port the underlying implementation of logcat and interact with logd directly through sockets.
  • Advantages: more flexible, resources can be pre-allocated, higher success rate.

  • Disadvantages: the implementation is very complex.
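A minimal sketch of approach 1, running the logcat binary from the crashed process; the flags used (-d to dump and exit, -t to limit the line count, -v threadtime for pid/tid/time in each line) are standard logcat options, while the wrapper class itself is only illustrative:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class LogcatGrabber {
    // Shells out to logcat and reads a bounded dump. As noted above, this can
    // fail when the process is already in a bad state (e.g. corrupted heap).
    public static String grab(int maxLines) {
        StringBuilder out = new StringBuilder();
        Process process = null;
        try {
            process = Runtime.getRuntime().exec(new String[]{
                    "logcat", "-d", "-v", "threadtime", "-t", String.valueOf(maxLines)});
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            out.append("logcat failed: ").append(e.getMessage());
        } finally {
            if (process != null) {
                process.destroy();
            }
        }
        return out.toString();
    }
}
```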

Get the Java stack

When a native crash occurs, unwinding gives us only the native stack, but we also want the Java stack of every thread at that moment.

1. Thread.getAllStackTraces() (see the sketch after this list)
  • Advantages: simple, good compatibility.
  • Disadvantages: a. the success rate is not high, since relying on a system interface can fail in extreme cases; b. after Android 7.0 this interface no longer includes the main thread stack; c. using a Java-layer interface requires suspending threads.
2. hook libart.so

The same stack as in ANR logs is obtained by hooking the ThreadList and Thread functions in libart.so. For stability, we do this in a forked process.

  • Advantages: very complete information, basically the same as ANR logs, including native thread states, lock information, and so on.
  • Disadvantages: this “black technology” has compatibility issues; we can fall back to Thread.getAllStackTraces() when it fails.
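A minimal sketch of approach 1 for the Java stack, using Thread.getAllStackTraces(); the formatting and class name are our own illustration:

```java
import java.util.Map;

public class JavaStackDumper {
    // Snapshots every Java thread's stack via Thread.getAllStackTraces().
    // Simple and portable, but as noted above it can fail in extreme cases
    // and (after Android 7.0) may miss the main thread's stack.
    public static String dumpAllJavaStacks() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```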

Fetching the Java stack this way can also be used for jank monitoring; because the work is done in a forked process, the main process is not blocked at all. We will talk about this in more detail later.

Practice after class

A way to “completely resolve” TimeoutException github.com/AndroidAdva…

Those who want to learn more are recommended to purchase the author's original course.

I am Jinyang. If you want to advance and learn more practical content, feel free to follow my public account “jinyang said” to receive my latest articles.