Caikelun. IO/POST /2019-0…

Whether it’s Linux’s Coredump or Android’s Tombstone, we’ll eventually need to perform what’s known as a “mortem.” Sometimes we come across the perfect crime scene.

The phenomenon of

The phenomenon of 1

Signal: 7 (SIGBUS), Code: 2 (BUS_ADRERR), fault addr 0xcd596372

r0  0000006e  r1  00000000  r2  00000000  r3  00000000
r4  f35bb740  r5  f35bb748  r6  cd3ff8a8  r7  00000064
r8  f7451e34  r9  f35bb708  r10 cd301000  r11 cd5a3ac5
ip  00000000  sp  cd3ff890  lr  00000000  pc  cd596372

#00 pc 0009a372 /data/data/com.package.name/files/download/libCube.so
#01 pc 000a4f69 /data/data/com.package.name/files/download/libCube.so
#02 pc 000a513f /data/data/com.package.name/files/download/libCube.so
#03 pc 000a7ab9 /data/data/com.package.name/files/download/libCube.so
#04 pc 000a7abf /data/data/com.package.name/files/download/libCube.so
#05 pc 000a7add /data/data/com.package.name/files/download/libCube.so
#06 pc 0004078b /system/lib/libc.so (_ZL15__pthread_startPv+30)
#07 pc 0001a031 /system/lib/libc.so (__start_thread+6)

#00 cd3ff890 cd3ff8dc
     cd3ff894  5d419f43
     cd3ff898  cd3ff8a0
     cd3ff89c  f35bb744  [anon:libc_malloc]
     cd3ff8a0  5d419f43
     cd3ff8a4  000a5952
     cd3ff8a8  5d419f43
     cd3ff8ac  2e62c950  /dev/ashmem/dalvik-main space (deleted)
     cd3ff8b0  00000064
     cd3ff8b4  f35bb700  [anon:libc_malloc]
     cd3ff8b8  f35bb750  [anon:libc_malloc]
     cd3ff8bc  00000064
     cd3ff8c0  cd3ff8dc
     cd3ff8c4  cd5a0f6d  /data/data/com.package.name/files/download/libCube.so
#01 cd3ff8c8 cd5a3ac5 /data/data/com.package.name/files/download/libCube.so
     cd3ff8cc  00000000
     cd3ff8d0  00000001
     cd3ff8d4  262237a1  /dev/ashmem/dalvik-main space (deleted)
     cd3ff8d8 00000000 ...... . memory near pc:cd596350 43591889 60714b11 dd054299 33019b01 .. YC.Kq`.B..... 3cd596360 4b0f9306 607118c9 1c299803 f0251c32 ... K.. q`..) . 2. %.cd596370 286ef8d5 7b23d003 d0f52b00 2601e002 .. n(... # {+... &
    cd596380 e0004276 78232600 d1002b00 1c287323 vB... &#x.+.. #s(.
    cd596390 fd84f024 b0091c30 46c0bdf0 3b9ac9ff  $...0......F...;
    cd5963a0 c4653600 1c0cb510 f8c0f025 d1022800 .6e..... %... (..cd5963b0 20016020 f024e004 6800fcd1 20006020 `. .. $... h `.cd5963c0 b570bd10 26006803 3b0cb09a 1c0c681b .. p.. h.&... ; .h..cd5963d0 d10342b3 600a2202 e0161c18 6800600e  .B...". `... `.h cd5963e0 f0244669 2800fef3 f024d005 6800fcb7 iF$.... (.. $... h cd5963f0 1c306020 9b04e009 021222f0 2380401a `0......". @.#
    cd596400 429a021b 6020d101 b01a2001 b538bd70 ... B.. `... p.8.cd596410 1c0d6800 3b0c1c03 2c00681c 2302d102 .h..... ; .h.,...#
    cd596420 e00d600b feeaf024 d0051e04 602b2300 .`.. $...# + `
    cd596430 fef4f024 e0042001 fc90f024 60286800  $.... ..$....h(`
    cd596440 bd381c20 6803b570 1c0d1c06 681c3b0c .8.p.. h..... ; .h ...... .cd4fc000-cd5e9000 r-x       0  ed000 /data/data/com.package.name/files/download/libCube.so
cd5e9000-cd5ee000 r--   ec000   5000 /data/data/com.package.name/files/download/libCube.so
cd5ee000-cd5ef000 rw-   f1000   1000 /data/data/com.package.name/files/download/libCube.so
......

Copy the code
.text:0009A368      LDR     R0, [SP,#0x38+var_2C]
.text:0009A36A      MOVS    R1, R5
.text:0009A36C      MOVS    R2, R6
.text:0009A36E      BL      j_pthread_cond_timedwait
.text:0009A372      CMP     R0, #0x6E ; SIGBUS occurs at this location
.text:0009A374      BEQ     loc_9A37E
Copy the code

The crash is at libCube at offset 0009a372. According to the arm32 fastCall convention, here, after pthread_cond_timedwait() returns, check whether the return value in r0 is 6E (decimal: Pthread_cond_timedwait () returns 110 (ETIMEDOUT) indicating timeout.

This is a normal call, no problem. So why SIGBUS? Is it an occasional NAND Flash hardware failure? Or a driver bug?

As you can see from the crash information captured by xCrash, the binary instruction code of the current PC location (0009a372) is 286E, which corresponds to CMP R0, #0x6E, where the instruction data is obtained by ptrace(). XCrash gets it but the CPU doesn’t get it when it executes this instruction, right?

The phenomenon of 2

Signal: 11 (SIGSEGV), Code: 1 (SEGV_MAPERR), fault addr 0x11d868

r0  bf206648  r1  00000000  r2  00000000  r3  00000000
r4  bf206640  r5  bf206648  r6  ffffffff  r7  000001b9
r8  d44bf980  r9  bf206608  r10 b1654000  r11 b21bd89d
ip  b240ed4c  sp  b1752890  lr  b21a66b9  pc  0011d868

#00 pc 0011d868 
      
#01 pc 005586b5 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl5Event4WaitEi+160)
#02 pc 0056fe55 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN8BaseHcdn6Thread9ThreadRunEv+122)
#03 pc 0057002b /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN8BaseHcdn10CThreadObj14ThreadWorkFuncEv+6)
#04 pc 0056f891 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl13CThreadObject14ThreadBaseFuncEv+26)
#05 pc 0056f897 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl13CThreadObject3RunEv+2)
#06 pc 0056f8b5 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl10ThreadBaseINS_13CThreadObjectEE10ThreadProcEPv+24)
#07 pc 00040aab /system/lib/libc.so (_ZL15__pthread_startPv+30)
#08 pc 0001a011 /system/lib/libc.so (__start_thread+6)

#00 b1752890 b17528dc. .#01 b1752890 b17528dc
     b1752894  5d419f83
     b1752898  b17528a0
     b175289c  bf206644  [anon:libc_malloc]
     b17528a0  5d419f83
     b17528a4  000cf94e
     b17528a8  5d419f84
     b17528ac  115c2ef0
     b17528b0  000001b9
     b17528b4  bf206600  [anon:libc_malloc]
     b17528b8  bf206650  [anon:libc_malloc]
     b17528bc  000001b9
     b17528c0  b17528dc
     b17528c4  b21bde59  /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN8BaseHcdn6Thread9ThreadRunEv+126)
#02 b17528c8 b21bd89d /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl10ThreadBaseINS_13CThreadObjectEE10ThreadProcEPv)b17528cc 00000000 b17528d0 00000001 b17528d4 0730c6a9 b17528d8 00000000 b17528dc bf20663c [anon:libc_malloc] b17528e0 00000000 b17528e4 bf206600 [anon:libc_malloc] b17528e8 b1752970 b17528ec b1752930 ...... . f73d0000-f7444000 r-x        0   74000 /system/lib/libc.so
f7444000-f7448000 r--    73000    4000 /system/lib/libc.so
f7448000-f744b000 rw- 77000 3000 /system/lib/libc.so ...... b1c4e000-b23ef000 r-x 0 7a1000 /data/data/com.package.name/files/download/libHCDNClientNet.so b23ef000-b23f0000 --- 0 1000 b23f0000-b240f000 r-- 7a1000 1f000 /data/data/com.package.name/files/download/libHCDNClientNet.so b240f000-b241a000  rw- 7c0000 b000 /data/data/com.package.name/files/download/libHCDNClientNet.so ......Copy the code

This is a segment error, PC value is 0011d868, this address is obviously abnormal, we call PC “run away”. In the second line of backtrace, offset 005586b5 for libhcdnClientnet. so corresponds to the following instruction:

.text:005586B2 MOVS R0, R5 .text:005586B4 BL j_pthread_mutex_unlock ; Here. Text :005586B8 MOVS R0, R6Copy the code

This is a call to the pthread_mutex_unlock() function in libc.so, and there should be no problem here. Is it a Linker bug? The absolute address of pthread_mutex_unlock() should be written by Linker to libhCDnClientnet. so’s GOT.

There are more of these bizarre crashes, which are not listed here.

What went wrong?

Analysis of the

The problem characteristics

After summary, it is found that the above problems have the following characteristics:

  • The root cause of the problem cannot be seen from Tombstone.
  • The breakdown point itself is beyond the control of the business logic.
  • Most occur at the next instruction that causes a call to a thread block.
  • All take place in a dynamically downloaded so library. (Note that the so library is presentfilesDirectory)
  • Device model and OS version distribution have no obvious characteristics.
  • More than 99% occur on non-root devices. (There are fewer root phone users now)

Crash the capture mechanism

For android, debuggerd, xCrash, or Breakpad, catching native crashes will go through the following stages:

  • The signal handler of the crashed process is aroused to execute.
  • clone()The child process.
  • In the child processptrace()Attach individual threads to the crashed process.
  • Read the register, memory, ELF and other information of each thread in the crashed process.
  • Write information directly to dump file. Or further analyze and extract backtrace information and write it into the dump file.

The crashed process is still executing until the child attaches to all the threads of the crashed process. If you consider the multiple processes of an Android App, other processes of the app may also be executing during the entire crash capture.

The mechanism for crash capture is also implemented by code, which needs to be executed step by step. We can imagine that when a crime (crash) occurs, it takes a certain amount of time for the police (crash capture mechanism) to arrive at the scene of the crime, which is the “blind spot”.

For the crash problem we described earlier, after the process crashed, but before xCrash started executing and recording the crash information, something else must have happened that we didn’t know about. Some details of the crime scene were tampered with!

Suspicious of

Recall the above “phenomenon 1”, why xCrash can get instruction code 286E, but CPU does not get it, so SIGBUS occurs?

Perhaps the reason is simple: when the CPU pipeline loads the instruction from memory, the offset position of the so library file is indeed unreadable, and thus SIGBUS occurs. After that, xCrash began to capture crash information. When xCrash used ptrace() to read data from the same memory location, the offset position in the so library became readable again.

Most likely, the so library was overwritten while linker had already loaded and started executing. Maybe there is a bug in some aspects of dynamic download, decompression and verification of SO library, which leads to the abnormal internal state.

Look for circumstantial evidence

After careful analysis, we did find some circumstantial evidence:

The last modification time of the so file

We have posted a lot of cases where the “last modification time of the crashed so library file” is later than the “crash time”. Something like the following:

2019-08-01 01:54:51.519 +0800 Process Crash time: 2019-08-01 01:54:53.743 +0800 so last modify time: 2019-08-01 01:54:53.938 +0800Copy the code

So file size and MD5 hash

Since the dynamically downloaded SO library can be reliably obtained from the server, we compared the file size and MD5 hash of the crashed SO library recorded by xCrash. It turns out that one part of the case is actually wrong, and that part of the case should be: “Until xCrash started recording crash information, the so library was not completely covered by errors.”

open files

By analyzing the open Files list, we can see that in some cases, the crashed process still has a valid FD that can be used to access the zip package that the dynamically downloaded so library belongs to:

Open files:
......
fd 57: /data/data/com.package.name/files/download/fullso.zip
......
Copy the code

Some FDS also point to CRC files for zip packages:

Open files:
......
fd 58: /data/data/com.package.name/files/download/fullso.zip.crc
......
Copy the code

conclusion

The conclusion is obvious: there are some bugs in the dynamic download, decompression and verification of the so library, causing the so library file to be overwritten once it has been loaded and executed by Linker.

Follow-up: Our so library dynamic delivery process has been improved with a more defensive design. We can see from the statistics of online crash data of APM that the crashes similar to “Phenomenon 1” and “Phenomenon 2” mentioned above almost disappear, and the overall crash rate of APP native is further reduced.