This article is an original post from the Open Source Lab blog. If you reproduce it, please credit the source with a link: kymjs.com/code/2018/0…

This article explains the steps for implementing native crash collection on Android and the solutions to the key problems along the way. On Android, native crashes have always been a headache: they are hard to capture, what you capture is often incomplete, complete information is often inaccurate, and even accurate information is hard to act on. They are many times more troublesome than Java crashes.

Today I would like to share what I learned while writing a native crash collection feature, which recently cost me a few hundred hairs (the hair loss keeps getting worse). The log of a native crash looks like this:

I found this picture on the Internet (I did not write a demo, and it was not convenient to take a screenshot of the real project, so I borrowed one). In the figure above, the address that follows pc in each line of the stack information is the offset of that frame's instruction inside the shared library named on the same line. Passing that address to arm-linux-androideabi-addr2line -e together with the library gives the line of code that failed. Collecting native crashes comes down to four key points: knowing that a crash happened; capturing where the crash happened; getting the function call stack at the time of the crash; and sending the data back to the server.

Knowing that a crash happened

Unlike Java, C/C++ has no common exception handling interface. At the C level, the CPU reports errors through exception interrupts, and different processors have different exception types and different ways of handling them. Linux unifies these into signals: each kind of exception has a corresponding signal, and you can register a callback function for the signals you care about. All signals are defined in the <signal.h> header. Here are almost all of them and what they mean:

#define SIGHUP 1 // Issued at end of terminal connection (normal or not)
#define SIGINT 2 // Program terminates (e.g. ctrl-c)
#define SIGQUIT 3 // Program exit (ctrl-\)
#define SIGILL 4 // An invalid instruction was executed, or a stack overflow was attempted
#define SIGTRAP 5 // generated when the breakpoint is used by the debugger
#define SIGABRT 6 // Raised by abort() to signal that the program is terminating abnormally
#define SIGIOT 6 // Same number as SIGABRT (historical name)
#define SIGBUS 7 // Invalid address, including alignment errors, such as accessing a 4-byte integer whose address is not a multiple of 4
#define SIGFPE 8 // Arithmetic error, such as integer division by zero
#define SIGKILL 9 // Forces the program to end; this signal cannot be blocked, handled, or ignored
#define SIGUSR1 10 // User-defined signal 1
#define SIGSEGV 11 // Invalid memory access (segmentation fault)
#define SIGUSR2 12 // User-defined signal 2
#define SIGPIPE 13 // Broken pipe: writing to a pipe or socket that has no reader
#define SIGALRM 14 // The timer set by alarm() expired
#define SIGTERM 15 // Request to terminate; a gentler SIGKILL that can be blocked and handled. SIGKILL is usually tried only when SIGTERM fails to end the program
#define SIGSTKFLT 16 // Coprocessor stack error
#define SIGCHLD 17 // The parent process receives this signal when the child process terminates.
#define SIGCONT 18 // Let a stopped process continue
#define SIGSTOP 19 // Stop the process. This signal cannot be blocked, processed, or ignored
#define SIGTSTP 20 // Stop the process, but the signal can be processed and ignored
#define SIGTTIN 21 // When a background job reads data from the user terminal, all processes in the job receive SIGTTIN signals
#define SIGTTOU 22 // Similar to SIGTTIN, but received when writing terminal
#define SIGURG 23 // Generated when urgent data or out-of-band data reaches the socket
#define SIGXCPU 24 // issued when CPU time resource limit is exceeded
#define SIGXFSZ 25 // When the process attempts to expand the file to exceed the file size resource limit
#define SIGVTALRM 26 // Virtual clock signal. Similar to SIGALRM, but counting the CPU time consumed by the process.
#define SIGPROF 27 // Similar to SIGALRM/SIGVTALRM, but including CPU time used by the process and system call time
#define SIGWINCH 28 // issue when window size changes
#define SIGIO 29 // The file descriptor is ready to start input/output operations
#define SIGPOLL SIGIO
#define SIGPWR 30 // Power supply is abnormal
#define SIGSYS 31 // Invalid system call

Usually, when we do crash collection, we mainly focus on the following signals:

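const int signal_array[] = {SIGILL, SIGABRT, SIGBUS, SIGFPE, SIGSEGV, SIGSTKFLT, SIGSYS};
// Number of entries in signal_array; used by the loops below.
#define SIGNALS_LEN ((int) (sizeof(signal_array) / sizeof(signal_array[0])))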
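When a report is written later, a readable signal name is more useful than a bare number. The following is a minimal sketch of such a lookup for the signals listed above; crash_signal_name() and is_crash_signal() are hypothetical helper names, not part of any system API, and SIGSTKFLT is guarded because not every platform defines it.

```c
#include <signal.h>
#include <string.h>

/* Hypothetical helper: maps the crash signals we subscribe to onto names. */
const char *crash_signal_name(int sig) {
    switch (sig) {
        case SIGILL:  return "SIGILL";
        case SIGABRT: return "SIGABRT";
        case SIGBUS:  return "SIGBUS";
        case SIGFPE:  return "SIGFPE";
        case SIGSEGV: return "SIGSEGV";
#ifdef SIGSTKFLT
        case SIGSTKFLT: return "SIGSTKFLT";
#endif
        case SIGSYS:  return "SIGSYS";
        default:      return "UNKNOWN";
    }
}

/* Hypothetical helper: 1 if `sig` is one of the signals handled above. */
int is_crash_signal(int sig) {
    return strcmp(crash_signal_name(sig), "UNKNOWN") != 0;
}
```

A handler could call crash_signal_name() with the signal number it receives and put the result in the first line of the report.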

Their meanings are listed above. A handler for a signal is registered with sigaction():

extern int sigaction(int, const struct sigaction*, struct sigaction*);

The first argument, of type int, is the signal we care about. The second argument is a pointer to a struct sigaction that specifies what to do when that signal arrives. The third argument is also a pointer to a struct sigaction, but it is an output: sigaction() stores the previously installed action there. When we install our own handling, we keep that previous action so that it can be restored, or invoked after our own processing, later.

Because the previous action is returned through a pointer, the caller has to provide the storage: sigaction() writes into the struct we pass in, so we need one struct sigaction per signal to record what was installed before.

So the simplest way to subscribe to the crash signals is to loop over the array and call sigaction() on each one:

static struct sigaction old_signal_handlers[SIGNALS_LEN]; // static, so the saved actions outlive init()

void init() {
    struct sigaction handler = {0}; // the callback itself is assigned in the next section
    for (int i = 0; i < SIGNALS_LEN; ++i) {
        sigaction(signal_array[i], &handler, &old_signal_handlers[i]);
    }
}

Capture the location of Crash

The sigaction structure has a sa_sigaction member, which is a function pointer of type void (*)(int, siginfo_t *, void *). We can therefore declare a function with that signature and assign its address directly to sa_sigaction. Setting SA_SIGINFO in sa_flags tells the system to call it with all three arguments.

void signal_handle(int code, siginfo_t *si, void *context) {
}

static struct sigaction old_signal_handlers[SIGNALS_LEN];

void init() {
    struct sigaction handler = {0};
    sigemptyset(&handler.sa_mask);
    handler.sa_sigaction = signal_handle;
    handler.sa_flags = SA_SIGINFO;

    for (int i = 0; i < SIGNALS_LEN; ++i) {
        sigaction(signal_array[i], &handler, &old_signal_handlers[i]);
    }
}

This will call signal_handle() when a Crash occurs. In signal_handle(), we have to find a way to get information about the currently executing code.

Set the emergency stack space

If the crash is a stack overflow caused by, say, infinite recursion, it needs special care. By default the signal handler runs on the current stack, which is already full, so handling the overflow there is bound to fail. We therefore need a separate stack for emergency processing. sigaltstack() registers an alternate stack for the calling thread, and a handler installed with the SA_ONSTACK flag runs on it. (When the signal is delivered, the system points the stack pointer at this area so that the signal handler runs on a fresh stack.)

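#include <signal.h>
#include <unistd.h>

// Only async-signal-safe calls (write, _exit) are used inside the handler.
void signal_handle(int sig) {
    write(2, "stack overflow\n", 15);
    _exit(1);
}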
unsigned infinite_recursion(unsigned x) {
    return infinite_recursion(x)+1;
}
int main() {
    static char stack[SIGSTKSZ];
    stack_t ss = {
        .ss_size = SIGSTKSZ,
        .ss_sp = stack,
    };
    struct sigaction sa = {
        .sa_handler = signal_handle,
        .sa_flags = SA_ONSTACK
    };
    sigaltstack(&ss, 0);
    sigfillset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, 0);
    infinite_recursion(0);
}

Catch the offending code

The third parameter of signal_handle() is a void pointer that actually points to a ucontext_t. Its uc_mcontext member holds the CPU-specific context, including the register values of the current thread and the value of the PC at the time of the crash. Knowing that PC tells us which instruction was executing when the crash happened. The register snapshot in the figure at the top of this article can be obtained from the same structure with the following code.

const ucontext_t *t = (const ucontext_t *) context;  // third argument of signal_handle()
char *head_cpu = nullptr;
asprintf(&head_cpu, "r0 %08lx r1 %08lx r2 %08lx r3 %08lx\n"
                 "r4 %08lx r5 %08lx r6 %08lx r7 %08lx\n"
                 "r8 %08lx r9 %08lx sl %08lx fp %08lx\n"
                 "ip %08lx sp %08lx lr %08lx pc %08lx cpsr %08lx\n",
         t->uc_mcontext.arm_r0, t->uc_mcontext.arm_r1, t->uc_mcontext.arm_r2,
         t->uc_mcontext.arm_r3, t->uc_mcontext.arm_r4, t->uc_mcontext.arm_r5,
         t->uc_mcontext.arm_r6, t->uc_mcontext.arm_r7, t->uc_mcontext.arm_r8,
         t->uc_mcontext.arm_r9, t->uc_mcontext.arm_r10, t->uc_mcontext.arm_fp,
         t->uc_mcontext.arm_ip, t->uc_mcontext.arm_sp, t->uc_mcontext.arm_lr,
         t->uc_mcontext.arm_pc, t->uc_mcontext.arm_cpsr);

However, uc_mcontext is platform-dependent: ARM and x86, for example, define the structure differently. The code above only lists the registers of the 32-bit ARM architecture. The PC can be read on each architecture as follows:

uintptr_t pc_from_ucontext(const ucontext_t *uc) {
#if (defined(__arm__))
    return uc->uc_mcontext.arm_pc;
#elif defined(__aarch64__)
    return uc->uc_mcontext.pc;
#elif (defined(__x86_64__))
    return uc->uc_mcontext.gregs[REG_RIP];
#elif (defined(__i386))
    return uc->uc_mcontext.gregs[REG_EIP];
#elif (defined (__ppc__)) || (defined (__powerpc__))
    return uc->uc_mcontext.regs->nip;
#elif (defined(__hppa__))
    return uc->uc_mcontext.sc_iaoq[0] & ~0x3UL;
#elif (defined(__sparc__) && defined (__arch64__))
    return uc->uc_mcontext.mc_gregs[MC_PC];
#elif (defined(__sparc__) && !defined (__arch64__))
    return uc->uc_mcontext.gregs[REG_PC];
#else
#error "Architecture is unknown, please report me!"
#endif
}

Converting the PC value to a relative address

The PC value is an absolute address in the memory of the running process. It cannot be used directly, because a library is not loaded at the same place on every run, so the absolute address differs from run to run. What we need is the offset of the crashing instruction relative to the library that contains it, which is the value addr2line uses to find the line. dladdr() returns the base address at which the shared library was loaded, along with the library's name; subtracting that base address from the PC value gives the relative offset.

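Dl_info info;
if (dladdr((void *) addr, &info) && info.dli_fname) {
    void * const nearest = info.dli_saddr;  // start address of the nearest symbol
    // Offset inside the library: PC minus the library's load address.
    uintptr_t addr_relative = (uintptr_t) addr - (uintptr_t) info.dli_fbase;
}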

Getting the function call stack when the crash occurred

Getting the function call stack is the most troublesome part. So far none of the available approaches works out of the box; each one needs substantial changes. There are four common practices:

  • First: use the system <unwind.h> library directly. It gives you the file and function name of each frame, but you have to resolve the function symbols yourself, and it often captures system frames that need to be filtered out manually.
  • Second: on Android 4.1.1 and above but below 5.0, use the libcorkscrew.so that ships with the system. From 5.0 on, libcorkscrew.so is gone, and you can build libunwind from the system source code instead. libunwind is an open source library; newer versions of the Android source use an optimized version of it to replace libcorkscrew.
  • Third: use the open source library coffeecatch, although this solution is not 100 percent compatible with all device models.
  • Fourth: use Google's breakpad. This is the definitive solution for capturing C/C++ stacks, and most of the industry builds on it. However, the library covers every platform (Android, iOS, Windows, Linux, macOS), so it is very large; remove the code for the platforms you do not need to reduce its size.

The core of the first approach is _Unwind_Backtrace(), a function provided by <unwind.h>. It takes a function pointer as a callback, and that callback receives an important argument: a pointer to a struct _Unwind_Context. The callback is invoked once for each frame in the current call stack, and inside it _Unwind_GetIP() reads the absolute memory address (PC) of that frame from the _Unwind_Context. _Unwind_Word is simply an unsigned integer type the size of a machine word. capture_backtrace() returns the number of frames captured.


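// Layout implied by how capture_backtrace() below initializes and reads it.
typedef struct {
    void **current;  // next free slot in the output buffer
    void **end;      // one past the last slot
} backtrace_state_t;

/**
 * callback used when using <unwind.h> to get the trace for the current context
 */
_Unwind_Reason_Code unwind_callback(struct _Unwind_Context *context, void *arg) {
    backtrace_state_t *state = (backtrace_state_t *) arg;
    _Unwind_Word pc = _Unwind_GetIP(context);
    if (pc) {
        if (state->current == state->end) {
            return _URC_END_OF_STACK;
        } else {
            *state->current++ = (void *) pc;
        }
    }
    return _URC_NO_REASON;
}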

/**
 * uses built in <unwind.h> to get the trace for the current context
 */
size_t capture_backtrace(void **buffer, size_t max) {
    backtrace_state_t state = {buffer, buffer + max};
    _Unwind_Backtrace(unwind_callback, &state);
    return state.current - buffer;
}

Once we have the absolute memory address (PC value) of every frame, we can use the method described above to convert each PC value to a relative offset and obtain the function information and relative address of each frame.

void *buffer[max_line];
int frames_size = capture_backtrace(buffer, max_line);
for (int i = 0; i < frames_size; i++) {
    Dl_info info;
    const void *addr = buffer[i];
    if (dladdr(addr, &info) && info.dli_fname) {
        void * const nearest = info.dli_saddr;
        uintptr_t addr_relative = (uintptr_t) addr - (uintptr_t) info.dli_fbase;
    }
}

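The values returned by dladdr() for each frame can be turned into a report line similar to the one in a system crash log. The following is only a sketch of that arithmetic: format_frame() is a hypothetical helper, and its inputs stand for the frame's PC plus the dli_fbase, dli_fname, dli_sname and dli_saddr values.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes one backtrace line into `out`.
 * Returns the number of characters snprintf() would have written. */
int format_frame(char *out, size_t out_len, int index,
                 uintptr_t pc, uintptr_t lib_base, const char *lib_path,
                 const char *symbol, uintptr_t symbol_addr) {
    /* Offset inside the library: this is the address addr2line expects. */
    uintptr_t relative_pc = pc - lib_base;
    if (symbol != NULL) {
        /* Also show how far into the nearest symbol the PC lies. */
        return snprintf(out, out_len, "#%02d pc %08lx  %s (%s+%lu)", index,
                        (unsigned long) relative_pc, lib_path, symbol,
                        (unsigned long) (pc - symbol_addr));
    }
    return snprintf(out, out_len, "#%02d pc %08lx  %s", index,
                    (unsigned long) relative_pc, lib_path);
}
```

For example, with pc 0x7000a2b4, base 0x70000000, path "/data/app/libdemo.so", symbol "crash_me" and symbol address 0x7000a2a0, the line is "#00 pc 0000a2b4  /data/app/libdemo.so (crash_me+20)", and 0000a2b4 is the value to hand to addr2line.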

Dl_info is a structure that encapsulates information such as the file in which the function resides, the function name, and the base address of the current library:

typedef struct {
    const char *dli_fname;  /* Pathname of shared object that
                               contains address */
    void       *dli_fbase;  /* Address at which shared object
                               is loaded */
    const char *dli_sname;  /* Name of nearest symbol with address
                               lower than addr */
    void       *dli_saddr;  /* Exact address of symbol named
                               in dli_sname */
} Dl_info;

With this structure, we can get all the information we want. The trouble with <unwind.h> is that it returns every frame on the stack, system frames included, so you have to filter out the system frames manually before reporting the data to your own server.
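One simple way to do that filtering is to keep only the frames whose library path (the dli_fname value) ends with the name of your own library. This is a sketch under that assumption; frame_belongs_to() and keep_own_frames() are hypothetical names, and a real project may prefer to keep a few system frames for context.

```c
#include <string.h>

/* Returns 1 when `lib_path` is `lib_name` or ends with "/" followed by `lib_name`. */
int frame_belongs_to(const char *lib_path, const char *lib_name) {
    if (lib_path == NULL || lib_name == NULL) {
        return 0;
    }
    size_t path_len = strlen(lib_path);
    size_t name_len = strlen(lib_name);
    if (name_len == 0 || name_len > path_len) {
        return 0;
    }
    /* Reject partial matches such as "mylibdemo.so" for "libdemo.so". */
    if (path_len > name_len && lib_path[path_len - name_len - 1] != '/') {
        return 0;
    }
    return strcmp(lib_path + (path_len - name_len), lib_name) == 0;
}

/* Compacts `paths` in place, keeping only frames from `lib_name`.
 * Returns the number of frames kept. */
size_t keep_own_frames(const char **paths, size_t count, const char *lib_name) {
    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        if (frame_belongs_to(paths[i], lib_name)) {
            paths[kept++] = paths[i];
        }
    }
    return kept;
}
```

The same test could be applied inside the loop over the captured frames, skipping any frame for which frame_belongs_to() returns 0.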

Sending the data back to the server

There are two ways to send the data back. One is to write the information to a file and have the Java layer report it on the next launch. The other is to call back into Java code and let Java handle it. The advantage of calling back into Java is that the Java layer can add its own state information to the current context and write it into the same file, which makes it easier for developers to track down the bug. Here I simply write the data to a file.

void save(const char *name, char *content) {
    FILE *file = fopen(name, "w+");
    if (file == NULL) return;  // nothing can be saved if the file cannot be opened
    fputs(content, file);
    fflush(file);
    fclose(file);
    // The Java layer can be notified after the file is written.
    // It is easier to pass the file name directly to the Java layer.
    report();
}

If you follow the steps in this article, you should be able to build a working native crash collection library. There are still many details to handle, though. One is data loss: opening the file with w+ truncates it, so the record saved for the previous crash may be lost. Another is a stack overflow caused by infinite recursion, which has to be handled on an alternate stack because a handler running on the already full stack is bound to fail. And the multi-process and multi-threading problems in C are genuinely complicated.
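One possible way around the w+ data loss is to give every crash its own file and refuse to overwrite an existing one. The sketch below assumes a file name built from the process id and a sequence number; save_crash() and that naming scheme are hypothetical. It uses open() and write() for the output, but snprintf() is not guaranteed to be async-signal-safe, so a stricter version would build the path ahead of time, outside the handler.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Writes `content` to "<dir>/crash_<pid>_<seq>.txt" and stores the path in
 * `path_out`. Returns 0 on success, -1 on failure (including "file exists"). */
int save_crash(const char *dir, unsigned seq, const char *content,
               char *path_out, size_t path_len) {
    int n = snprintf(path_out, path_len, "%s/crash_%ld_%u.txt",
                     dir, (long) getpid(), seq);
    if (n < 0 || (size_t) n >= path_len) {
        return -1;
    }
    /* O_EXCL makes open() fail instead of truncating an existing report. */
    int fd = open(path_out, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        return -1;
    }
    size_t remaining = strlen(content);
    while (remaining > 0) {
        ssize_t written = write(fd, content, remaining);
        if (written < 0 && errno == EINTR) {
            continue;  /* interrupted by a signal, retry */
        }
        if (written <= 0) {
            close(fd);
            return -1;
        }
        content += written;
        remaining -= (size_t) written;
    }
    return close(fd);
}
```

On the next launch, the Java layer could then upload every crash_*.txt file it finds in the directory and delete each one after a successful upload.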