On the Android platform, most of us probably pay little attention to native crashes. When I was doing development in Changsha, I hardly ever used an SO library I had written myself. When integrating third-party features such as maps, I would simply copy several SO files into the project directory, and at the time I did not even know what an SO file was. Later, because of the special requirements of one project, Bugly and QAPM could not be integrated directly, so I was forced to learn how to capture crashes in the native layer. Although this is harder to implement than in the Java layer, it is not very complicated. Drawing on some reference material and third-party open source libraries, we can break the work down into the following parts:

  • Understand the crash handling mechanism of the native layer
  • Capture native crash signals
  • Deal with special situations
  • Parse the crash stack of the native layer

1. Understand the crash handling mechanism of the native layer

Open source libraries include Coffeecatch and Breakpad, among others. In ordinary projects we can integrate Bugly directly, but since Bugly is not open source, it is of little use as a reference. Breakpad, from Google, is the authoritative implementation, but its codebase is large. Coffeecatch has a simple implementation but suffers from compatibility issues. In fact, whether we use Coffeecatch, Bugly, or write our own, the underlying principle is the same: once we understand the crash handling mechanism of the native layer, the rest follows easily.

On UNIX-like systems, crashes are caused by programming errors or hardware errors. When the system hits an unrecoverable error, such as division by zero or an invalid memory access (segmentation fault), it triggers the crash mechanism and terminates the program. When such an exception occurs, the CPU raises an interrupt and enters its exception handling flow. Different processors have different interrupt types and handling methods, so Linux unifies them as signals, for which a process can register handlers. The signal mechanism is also a way of passing messages between processes, and a signal is often called a software interrupt.
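
The crash behaviour described above can be observed directly. The following is a minimal sketch, assuming a Linux or Android-like POSIX environment: a child process performs an invalid write, and the parent reads back which signal terminated it. The function name `signalFromNullWrite` is made up for this illustration.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Performs an invalid memory write in a child process and returns the number
// of the signal that terminated the child, or -1 if it was not killed by a signal.
int signalFromNullWrite() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Make sure the default action (terminate) is in effect in the child.
        signal(SIGSEGV, SIG_DFL);
        // Both the pointer and the pointee are volatile, so the compiler
        // cannot reason about the address or drop the store.
        volatile int *volatile bad = nullptr;
        *bad = 1;   // invalid access: the kernel sends SIGSEGV to this process
        _exit(0);   // not reached
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Running the faulty code in a child keeps the calling process alive, which is also how a test can check the result against `SIGSEGV`.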

User code runs in user mode; when a system call, interrupt, or exception occurs, the program enters kernel mode. Signal handling involves transitions between these two modes.

Receiving the signal is handled by the kernel on the process's behalf. When the kernel receives a signal, it puts it in the signal queue of the target process and interrupts the process so that it traps into kernel mode. Note that at this point the signal is only queued, and the process is not yet aware of it. Once a process is in kernel mode, it checks for signals in two situations:

  • Just before the process returns from kernel mode to user mode
  • When a process sleeping in kernel mode is woken up

When a pending signal is found, signal processing begins. Signal handlers run in user mode. Before calling the handler, the kernel saves the interrupted context from the kernel stack onto the user stack and changes the saved instruction pointer (EIP on x86) to point to the signal handler. The process then returns to user mode and executes the handler. When the handler finishes, control returns to kernel mode to check whether any other signals are still unprocessed. Once all signals have been handled, the kernel restores the saved context (copied back from the user stack), resets the instruction pointer to where execution was interrupted, and returns to user mode to resume the process. That completes one round of signal handling; if several signals arrive at the same time, the kernel keeps detecting and handling them in turn.
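
The gap between a signal being queued and the process noticing it can be made visible by blocking the signal for a moment. The sketch below is only an illustration under the assumption of a single-threaded POSIX process; `pendingThenDelivered` and `onUsr1` are hypothetical names, and SIGUSR1 stands in for a real crash signal.

```cpp
#include <signal.h>

static volatile sig_atomic_t g_delivered = 0;

// User-mode handler: only records that it ran.
static void onUsr1(int) { g_delivered = 1; }

// Returns true if SIGUSR1 stayed pending while blocked and its handler
// ran only after the signal was unblocked.
bool pendingThenDelivered() {
    struct sigaction sa = {};
    sa.sa_handler = onUsr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, nullptr) == -1) return false;

    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, nullptr) == -1) return false;

    g_delivered = 0;
    raise(SIGUSR1);  // the kernel records the signal, but cannot deliver it yet

    sigset_t pending;
    sigemptyset(&pending);
    sigpending(&pending);
    bool wasPending = sigismember(&pending, SIGUSR1) == 1 && g_delivered == 0;

    // Unblocking lets the kernel deliver the pending signal before returning.
    sigprocmask(SIG_UNBLOCK, &block, nullptr);
    return wasPending && g_delivered == 1;
}
```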

2. Capture native crash signals

Now that we understand the crash handling mechanism of the native layer, our solution is to register signal handlers, which the native layer does with sigaction():

    #include <signal.h>
    #include <string.h>  // memset
    #include <map>
    #include <utility>

    // signum: the signal number; any valid signal except SIGKILL and SIGSTOP.
    //         Installing a handler for either of those two signals fails.
    // act:    pointer to a struct sigaction that describes how to handle the signal.
    //         If NULL, the current handling is left unchanged.
    // oldact: like act, but receives the previous handling of the signal;
    //         may also be NULL.
    // int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact);

    void signal_pass(int code, siginfo_t *si, void *sc) {
        LOGD("catch native crash signal.");
    }

    bool installHandlersLocked() {
        if (handlers_installed)
            return false;

        // Fail if unable to store all the old handlers.
        for (int i = 0; i < kNumHandledSignals; ++i) {
            if (sigaction(kExceptionSignals[i], NULL, &old_handlers[i]) == -1) {
                return false;
            } else {
                handlerMaps->insert(
                    std::pair<int, struct sigaction *>(kExceptionSignals[i], &old_handlers[i]));
            }
        }

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sigemptyset(&sa.sa_mask);

        // Mask all exception signals when we're handling one of them.
        for (int i = 0; i < kNumHandledSignals; ++i)
            sigaddset(&sa.sa_mask, kExceptionSignals[i]);

        sa.sa_sigaction = signal_pass;
        sa.sa_flags = SA_ONSTACK | SA_SIGINFO;

        for (int i = 0; i < kNumHandledSignals; ++i) {
            if (sigaction(kExceptionSignals[i], &sa, NULL) == -1) {
                // At this point it is impractical to back out changes, and so failure to
                // install a signal is intentionally ignored.
            }
        }
        handlers_installed = true;
        return true;
    }
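
The snippet above relies on globals it does not show (kExceptionSignals, kNumHandledSignals, old_handlers, handlers_installed). The following self-contained sketch fills them in under stated assumptions: the signal list is a guess at what a crash reporter typically watches, a fixed array replaces handlerMaps, and `installHandlers`, `restoreHandlers` and `isOurHandlerInstalled` are hypothetical helper names. It also shows how the saved actions can be put back.

```cpp
#include <signal.h>
#include <string.h>

// Assumed signal list; adjust it to whatever your project needs to watch.
static const int kExceptionSignals[] = {SIGSEGV, SIGABRT, SIGFPE,
                                        SIGILL,  SIGBUS,  SIGTRAP};
static const int kNumHandledSignals =
    sizeof(kExceptionSignals) / sizeof(kExceptionSignals[0]);
static struct sigaction old_handlers[kNumHandledSignals];
static bool handlers_installed = false;

// Placeholder crash handler; a real one would record the crash here.
static void signal_pass(int, siginfo_t *, void *) {}

bool installHandlers() {
    if (handlers_installed) return false;
    // Save every previous action first; give up if any cannot be read.
    for (int i = 0; i < kNumHandledSignals; ++i) {
        if (sigaction(kExceptionSignals[i], nullptr, &old_handlers[i]) == -1)
            return false;
    }
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    for (int i = 0; i < kNumHandledSignals; ++i)
        sigaddset(&sa.sa_mask, kExceptionSignals[i]);
    sa.sa_sigaction = signal_pass;
    sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
    for (int i = 0; i < kNumHandledSignals; ++i)
        sigaction(kExceptionSignals[i], &sa, nullptr);
    handlers_installed = true;
    return true;
}

// Puts back whatever was registered before installHandlers() ran.
bool restoreHandlers() {
    if (!handlers_installed) return false;
    for (int i = 0; i < kNumHandledSignals; ++i)
        sigaction(kExceptionSignals[i], &old_handlers[i], nullptr);
    handlers_installed = false;
    return true;
}

// Queries the current action for sig without changing it (act == nullptr).
bool isOurHandlerInstalled(int sig) {
    struct sigaction cur;
    if (sigaction(sig, nullptr, &cur) == -1) return false;
    return (cur.sa_flags & SA_SIGINFO) && cur.sa_sigaction == signal_pass;
}
```

A fixed array is used here instead of a map because it needs no allocation, which is convenient for data that a signal handler may later read.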

3. Dealing with special situations

Most of the complexity of capturing crashes in the native layer comes from the special cases that must be handled. Registering a function to receive the crash signal is simple enough, but we also have to guard against a number of other abnormal situations. Let's go through them one by one:

3.1 Setting extra stack space

SIGSEGV is quite often caused by a stack overflow. If the handler runs on the default stack, it may corrupt the crashed program's state so that the correct context cannot be recovered. Moreover, when the stack is full (recursion too deep, or too many objects on the stack), the system would invoke the SIGSEGV handler on that same full stack, which raises the same signal again. We should therefore set aside a separate block of memory as the stack on which signal handlers run. sigaltstack() registers an alternate stack for the calling thread, reserving space for emergencies. (For handlers registered with SA_ONSTACK, the system switches the stack pointer to this area when the signal is delivered, so the handler runs on a fresh stack.)

    /**
     * Create a sigaltstack first, because the signal may come from a stack overflow.
     */
    static void installAlternateStackLocked() {
        if (stack_installed)
            return;

        memset(&old_stack, 0, sizeof(old_stack));
        memset(&new_stack, 0, sizeof(new_stack));

        // SIGSTKSZ may be too small to prevent the signal handlers from overrunning
        // the alternative stack. Ensure that the size of the alternative stack is
        // large enough.
        // std::max<size_t> keeps both arguments the same type, since SIGSTKSZ
        // is not an int on every libc.
        static const size_t kSigStackSize = std::max<size_t>(16384, SIGSTKSZ);

        // Only set an alternative stack if there isn't already one, or if the current
        // one is too small.
        if (sigaltstack(NULL, &old_stack) == -1 || !old_stack.ss_sp ||
            old_stack.ss_size < kSigStackSize) {
            new_stack.ss_sp = calloc(1, kSigStackSize);
            new_stack.ss_size = kSigStackSize;
            if (sigaltstack(&new_stack, NULL) == -1) {
                free(new_stack.ss_sp);
                return;
            }
            stack_installed = true;
        }
    }
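
One way to convince yourself that the alternate stack is really used is to look at the address of a local variable inside the handler. The sketch below is an illustration only: the 64 KB size, SIGUSR1 as the test signal, and the names `handlerRunsOnAlternateStack` and `onSignal` are all assumptions made for this demo.

```cpp
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>

static const size_t kAltStackSize = 64 * 1024;  // assumed size for this demo
static char *g_altStack = nullptr;
static volatile sig_atomic_t g_ranOnAltStack = 0;

static void onSignal(int) {
    char marker = 0;  // a local: its address lies on the stack the handler uses
    uintptr_t p = reinterpret_cast<uintptr_t>(&marker);
    uintptr_t lo = reinterpret_cast<uintptr_t>(g_altStack);
    g_ranOnAltStack = (p >= lo && p < lo + kAltStackSize) ? 1 : 0;
}

// Installs an alternate stack, delivers SIGUSR1 to a handler registered with
// SA_ONSTACK, and reports whether the handler ran inside that stack.
bool handlerRunsOnAlternateStack() {
    g_altStack = static_cast<char *>(calloc(1, kAltStackSize));
    if (!g_altStack) return false;

    stack_t ss = {};
    ss.ss_sp = g_altStack;
    ss.ss_size = kAltStackSize;
    ss.ss_flags = 0;
    stack_t oldStack = {};
    if (sigaltstack(&ss, &oldStack) == -1) {
        free(g_altStack);
        return false;
    }

    struct sigaction sa = {};
    struct sigaction oldAction = {};
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = onSignal;
    sa.sa_flags = SA_ONSTACK;  // without this flag the normal stack is used
    if (sigaction(SIGUSR1, &sa, &oldAction) == -1) return false;

    g_ranOnAltStack = 0;
    raise(SIGUSR1);

    // Put the previous handler and stack settings back before freeing ours.
    sigaction(SIGUSR1, &oldAction, nullptr);
    sigaltstack(&oldStack, nullptr);
    free(g_altStack);
    g_altStack = nullptr;
    return g_ranOnAltStack == 1;
}
```

Note that sigaltstack() applies to the calling thread only, so a thread that may overflow its stack needs its own call.
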
3.2 Staying compatible with other signal handlers

Some signals may already have handlers installed, and sigaction can register only one handler per signal, so installing ours replaces whatever was there before. To stay compatible, save the old handler and, after our own handler has done its work, call the old one.

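    /* Call the old handler. */
    void call_old_signal_handler(const int sig, siginfo_t *const info, void *const sc) {
        LOGD("call old signal handler, sig -> %d", sig);
        // Assumes the saved action was registered with SA_SIGINFO; a saved
        // SIG_DFL or SIG_IGN must not be called through sa_sigaction.
        handlerMaps->at(sig)->sa_sigaction(sig, info, sc);
    }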
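
A saved action is not always a three-argument handler: it can also be SIG_DFL, SIG_IGN, or an old-style one-argument handler. The following sketch shows one possible way to forward in each case. It is not how any particular library does it; `forwardToPrevious`, `chainDemo` and the two demo handlers are hypothetical, and SIGUSR2 stands in for a crash signal.

```cpp
#include <signal.h>

static struct sigaction g_previous;  // action that was installed before ours
static volatile sig_atomic_t g_firstRan = 0;
static volatile sig_atomic_t g_secondRan = 0;

// Stands in for a handler that some other library installed earlier.
static void firstHandler(int) { g_firstRan = 1; }

// Forwards the signal to the saved action, whatever form it takes.
static void forwardToPrevious(int sig, siginfo_t *info, void *ctx) {
    if (g_previous.sa_flags & SA_SIGINFO) {
        g_previous.sa_sigaction(sig, info, ctx);
    } else if (g_previous.sa_handler == SIG_DFL) {
        // No real handler: restore the default action and re-raise, so the
        // default behaviour takes over once our handler returns.
        signal(sig, SIG_DFL);
        raise(sig);
    } else if (g_previous.sa_handler != SIG_IGN) {
        g_previous.sa_handler(sig);
    }
}

// Our handler: do our own work first, then hand over to the old one.
static void secondHandler(int sig, siginfo_t *info, void *ctx) {
    g_secondRan = 1;
    forwardToPrevious(sig, info, ctx);
}

// Returns true if both the new and the previously installed handler ran.
bool chainDemo() {
    struct sigaction original = {};
    struct sigaction first = {};
    sigemptyset(&first.sa_mask);
    first.sa_handler = firstHandler;
    if (sigaction(SIGUSR2, &first, &original) == -1) return false;

    struct sigaction second = {};
    sigemptyset(&second.sa_mask);
    second.sa_sigaction = secondHandler;
    second.sa_flags = SA_SIGINFO;
    if (sigaction(SIGUSR2, &second, &g_previous) == -1) return false;

    g_firstRan = 0;
    g_secondRan = 0;
    raise(SIGUSR2);

    sigaction(SIGUSR2, &original, nullptr);  // leave things as we found them
    return g_firstRan == 1 && g_secondRan == 1;
}
```
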
3.3 Preventing deadlocks or infinite loops
    void signal_pass(int code, siginfo_t *si, void *sc) {
        /* Ensure we do not deadlock. Default of ALRM is to die.
         * (signal() and alarm() are signal-safe) */
        // Consider using non-signal methods to prevent deadlocks
        signal(code, SIG_DFL);
        signal(SIGALRM, SIG_DFL);

        /* Watchdog: if we are still stuck in here after 8 seconds,
         * SIGALRM terminates the process. */
        (void) alarm(8);

        /* Available context ? */
        notifyCaughtSignal();

        call_old_signal_handler(code, si, sc);

        LOGD("at the end of signal_pass");
    }
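
The effect of the alarm() watchdog can be checked in isolation. In this sketch, which is only an illustration, a child process enters a handler that never finishes, and the parent observes that SIGALRM ends it. The one-second timeout, the use of SIGUSR1, and the names `stuckHandler` and `signalThatEndsStuckHandler` are assumptions for the demo.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// A handler that arms the watchdog and then never returns.
static void stuckHandler(int sig, siginfo_t *, void *) {
    signal(sig, SIG_DFL);      // a repeat of this signal now kills the process
    signal(SIGALRM, SIG_DFL);  // default action of SIGALRM is to terminate
    alarm(1);                  // demo timeout; the handler above uses 8 seconds
    for (;;) pause();          // simulate a deadlocked handler
}

// Returns the signal that terminated the stuck child, or -1 on failure.
int signalThatEndsStuckHandler() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Make sure neither signal is blocked in the child.
        sigset_t unblock;
        sigemptyset(&unblock);
        sigaddset(&unblock, SIGUSR1);
        sigaddset(&unblock, SIGALRM);
        sigprocmask(SIG_UNBLOCK, &unblock, nullptr);

        struct sigaction sa = {};
        sigemptyset(&sa.sa_mask);
        sa.sa_sigaction = stuckHandler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGUSR1, &sa, nullptr);
        raise(SIGUSR1);
        _exit(0);  // not reached: the watchdog ends the process first
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The watchdog only works if SIGALRM is left out of the handler's sa_mask; the install code shown earlier masks the exception signals but not SIGALRM.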

4. Parse the crash stack of the native layer

Parsing the native crash stack cannot be explained in a sentence or two, so we plan to cover it in a separate lesson.

Video address: pan.baidu.com/s/1FeZjyrnv… Video password: GR11