Abstract: This article identifies a class of OOM (OutOfMemoryError) crashes that occur even though both the Java heap and the device’s physical memory are sufficient, and explores and explains why this class of OOM is thrown.

Keywords: OutOfMemoryError, OOM, pthread_create failed, Could not allocate JNI Env

1. Introduction

Memory is a resource that every mobile developer must handle carefully, and an OOM (OutOfMemoryError) in production can drive developers crazy, because the stack trace we usually rely on is of little help in locating this kind of problem. There is plenty of material on the web about making good use of precious heap memory (e.g. using small images, reusing bitmaps), but:

  • Is it true that every OOM in production is caused by a shortage of heap memory?
  • Can an OOM occur when both the app’s heap memory and the device’s physical memory are abundant?

An OOM crash with plenty of memory? It sounds incredible. However, while investigating an issue recently, I found through our self-developed APM platform that most of the OOMs in one of the company’s products had exactly this characteristic, namely:

  • When the OOM crash occurs, Java heap usage is well below the upper limit set by the Android virtual machine, and there is sufficient physical memory and SD card space

If there is enough memory, why does the OOM crash occur at this moment?

2. Problem description

Before describing the problem in detail, let’s get one thing straight:

What led to OOM?

Here are several APIs related to the memory limits officially declared by Android:

ActivityManager.getMemoryClass(): the heap size limit of the virtual machine; allocating objects beyond this size causes an OOM
ActivityManager.getLargeMemoryClass(): the heap size limit of the virtual machine when largeHeap=true is specified in the manifest
Runtime.getRuntime().maxMemory(): the upper limit of memory usage for the current VM instance, which is one of the two values above
Runtime.getRuntime().totalMemory(): the memory currently requested from the system, including both the used and the unused part
Runtime.getRuntime().freeMemory(): the part of the memory in the previous entry that has been requested but is not yet used, so the memory requested and in use = totalMemory() - freeMemory()
ActivityManager.MemoryInfo.totalMem: the total memory of the device
ActivityManager.MemoryInfo.availMem: the memory currently available on the device
/proc/meminfo: records the memory information of the device
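The arithmetic behind the Runtime entries above can be sketched in a few lines of plain Java. This is only an illustration: the class name HeapStats and its helper methods are hypothetical, and the ActivityManager values are left out because they need an Android Context.

```java
// Minimal sketch: HeapStats and its helpers are illustrative names, not part of any Android API.
final class HeapStats {
    /** Heap memory that has been requested from the system and is actually in use. */
    static long usedBytes(long totalMemory, long freeMemory) {
        return totalMemory - freeMemory;
    }

    /** How much further heap usage can grow before it reaches maxMemory(). */
    static long headroomBytes(long maxMemory, long totalMemory, long freeMemory) {
        return maxMemory - usedBytes(totalMemory, freeMemory);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();
        long total = rt.totalMemory();
        long free = rt.freeMemory();
        System.out.println("max=" + max + " total=" + total + " free=" + free);
        System.out.println("used=" + usedBytes(total, free)
                + " headroom=" + headroomBytes(max, total, free));
    }
}
```

A large headroom at crash time is what distinguishes the OOMs discussed in this article from an ordinary heap OOM.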

Figure 2-1 Android memory specifications

It is commonly believed that an OOM occurs when the Java heap runs out of memory, i.e.

Runtime.getRuntime().maxMemory() cannot satisfy the requested heap memory size

Figure 2-2 Causes of the Java heap OOM

This kind of OOM is easy to verify (for example, use new byte[] to request more heap memory than the maxMemory() threshold). Its error message usually looks like this:

java.lang.OutOfMemoryError: Failed to allocate a XXX byte allocation with XXX free bytes and XXXKB until OOM
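The verification mentioned above can be sketched roughly as follows. This is a desktop-JVM illustration under stated assumptions: HeapOomDemo is a hypothetical class name, the requested length is capped at Integer.MAX_VALUE because Java arrays are int-indexed, and the message text printed by a desktop JVM differs from the ART message shown above.

```java
// Minimal sketch: HeapOomDemo is an illustrative name; message wording depends on the VM.
final class HeapOomDemo {
    /** Returns true if allocating a byte[] of the given length throws OutOfMemoryError. */
    static boolean allocationFails(int length) {
        try {
            byte[] block = new byte[length];
            return block.length != length; // always false when the allocation succeeded
        } catch (OutOfMemoryError e) {
            System.out.println("OutOfMemoryError: " + e.getMessage());
            return true;
        }
    }

    /** A length a single byte[] is not expected to fit: the heap ceiling, capped at the int range. */
    static int oversizedLength() {
        return (int) Math.min((long) Integer.MAX_VALUE, Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("small allocation fails: " + allocationFails(16));
        System.out.println("oversized allocation fails: " + allocationFails(oversizedLength()));
    }
}
```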

In the OOM cases discussed in this article, however, heap memory is sufficient (a large portion of Runtime.getRuntime().maxMemory() remains unused) and the device’s memory is also abundant (ActivityManager.MemoryInfo.availMem is still large). These OOMs come with two kinds of error messages:

  1. The first kind, denoted OOM 1, has the following error message:
java.lang.OutOfMemoryError: Could not allocate JNI Env

Figure 2-4 Error information in OOM 1

  2. The second kind, denoted OOM 2, occurs mainly on Huawei phones running Android 7.0 or later (EmotionUI 5.0 or later), with the following error message:
java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Out of memory

Figure 2-5 Error information in OOM 2

3. Problem analysis and solution

3.1 Code Analysis

How does Android throw an OutOfMemoryError? The following is a brief analysis based on the Android 6.0 source code:

  1. The code where the Android virtual machine finally throws OutOfMemoryError is located in /art/runtime/thread.cc
void Thread::ThrowOutOfMemoryError(const char* msg)
The msg parameter carries the OOM error message

Figure 3-1 Positions thrown by the ART Runtime

  2. Searching the code reveals several places where the above method is called to throw an OutOfMemoryError
  • The first place is during heap operations
/art/runtime/gc/heap.cc
void Heap::ThrowOutOfMemoryError(Thread* self, size_t byte_count, AllocatorType allocator_type)
The error message thrown:
    oss << "Failed to allocate a " << byte_count << " byte allocation with " << total_bytes_free << " free bytes and " << PrettySize(GetFreeMemoryUntilOOME()) << " until OOM";

This is the OOM thrown when heap memory is insufficient, i.e. when the requested heap size exceeds Runtime.getRuntime().maxMemory(), as described earlier.

  • The second place is during thread creation
/art/runtime/thread.cc
void Thread::CreateNativeThread(JNIEnv* env, jobject java_peer, size_t stack_size, bool is_daemon)
The error message thrown:
    "Could not allocate JNI Env"
  or
    StringPrintf("pthread_create (%s stack) failed: %s", PrettySize(stack_size).c_str(), strerror(pthread_create_result))

Figure 3-3 Thread creation OOM

Comparing the error messages shows that the OOM crashes we encountered are thrown at this point, i.e. when a thread is created (Thread::CreateNativeThread).

  • Other error messages, such as “[XXXClassName] of length XXX would overflow”, are thrown when the system limits String/Array length and are not discussed in this article.

So what we care about is the OOM error thrown by Thread::CreateNativeThread: why can creating a thread lead to an OOM?

3.2 Inference

Since an OOM is thrown, some limit that we are not aware of must have been hit during thread creation. Because it is not the heap limit that the ART virtual machine sets for us, it is probably a limit at a lower level. Android is based on Linux, so Linux limits also apply to Android, including:

  1. /proc/pid/limits describes the limits on Linux processes. Here is an example:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             13419                13419                processes 
Max open files            1024                 4096                 files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       13419                13419                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         40                   40                   
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us 

Figure 3-4 Example of Linux process limits

The limits in the example above can be narrowed down by elimination:

  • Max stack size, Max processes: these limits apply to the whole system rather than to a single process, excluded
  • Max locked memory, excluded
  • Max pending signals: the threshold on the number of signals at the C layer, irrelevant, excluded
  • Max msgqueue size: Android’s IPC mechanism does not use message queues, excluded

That leaves Max open files, the maximum number of files each process may have open. Every file a process opens produces a file descriptor, fd (recorded under /proc/pid/fd), and this limit means the number of fds cannot exceed the value of Max open files. As analyzed later, file descriptors are involved in the thread creation process.
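As a rough illustration of how a process could compare its own fd count with this limit, the sketch below reads /proc/self/fd and /proc/self/limits. The class FdUsage and its method names are hypothetical, and the parser assumes the column layout shown in the limits example above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Minimal sketch: FdUsage is an illustrative name; the parsing assumes the layout of /proc/<pid>/limits.
final class FdUsage {
    private static final String LABEL = "Max open files";

    /** Returns the soft "Max open files" limit, or -1 if the row is missing or not numeric. */
    static long parseSoftMaxOpenFiles(List<String> limitsLines) {
        for (String line : limitsLines) {
            if (line.startsWith(LABEL)) {
                String[] cols = line.substring(LABEL.length()).trim().split("\\s+");
                if (cols[0].matches("\\d+")) {
                    return Long.parseLong(cols[0]); // first column after the label is the soft limit
                }
            }
        }
        return -1;
    }

    /** Counts the entries of /proc/self/fd, or returns -1 if the directory cannot be listed. */
    static int countOpenFds() {
        String[] names = new File("/proc/self/fd").list();
        return names == null ? -1 : names.length;
    }

    public static void main(String[] args) throws IOException {
        List<String> limits =
                Files.readAllLines(Paths.get("/proc/self/limits"), StandardCharsets.UTF_8);
        System.out.println("open fds: " + countOpenFds()
                + ", soft limit: " + parseSoftMaxOpenFiles(limits));
    }
}
```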

  2. The limits described in /proc/sys/kernel

The thread-related limit there is /proc/sys/kernel/threads-max, which sets an upper limit on the number of threads that can be created.
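A minimal sketch of reading the two numbers involved: the current thread count (the Threads field of /proc/self/status) and the value of threads-max. ThreadUsage and its methods are hypothetical names, and whether an app process is allowed to read /proc/sys/kernel/threads-max on a given device is an assumption to verify.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Minimal sketch: ThreadUsage is an illustrative name, not an existing API.
final class ThreadUsage {
    /** Extracts the "Threads:" value from the lines of /proc/<pid>/status, or -1 if absent. */
    static int parseThreads(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("Threads:")) {
                return Integer.parseInt(line.substring("Threads:".length()).trim());
            }
        }
        return -1;
    }

    /** Reads the single number stored in /proc/sys/kernel/threads-max. */
    static long readThreadsMax() throws IOException {
        List<String> lines = Files.readAllLines(
                Paths.get("/proc/sys/kernel/threads-max"), StandardCharsets.UTF_8);
        return Long.parseLong(lines.get(0).trim());
    }

    public static void main(String[] args) throws IOException {
        List<String> status =
                Files.readAllLines(Paths.get("/proc/self/status"), StandardCharsets.UTF_8);
        System.out.println("threads: " + parseThreads(status)
                + ", threads-max: " + readThreadsMax());
    }
}
```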

3.3 Validation

The inference above is verified in two steps: local validation and online acceptance.

  • Local validation: try to reproduce locally the OOMs shown in Figure [2-4] (OOM 1) and Figure [2-5] (OOM 2)
  • Online acceptance: deliver a plug-in to online users and confirm that their OOMs are really caused by the reasons inferred above.

Local validation

Experiment 1: Trigger a large number of network connections (each in a separate thread) and keep them open; each open socket adds one fd (under /proc/pid/fd). Note: this is not the only way to increase the number of fds. Other methods also work, such as opening files or creating HandlerThreads

  • Expected result: when the number of fds of the process (which can be checked with ls /proc/pid/fd | wc -l) exceeds the Max open files value specified in /proc/pid/limits, an OOM occurs
  • Actual result: when the number of fds reaches the Max open files value specified in /proc/pid/limits, continuing to create threads does result in an OOM. The error message and stack are as follows:
E/art: ashmem_create_region failed for 'indirect ref table': Too many open files
E/AndroidRuntime: FATAL EXCEPTION: main
                  Process: com.netease.demo.oom, PID: 2435
                  java.lang.OutOfMemoryError: Could not allocate JNI Env
                      at java.lang.Thread.nativeCreate(Native Method)
                      at java.lang.Thread.start(Thread.java:730)
                      ......

Figure 3-5 Details of the OOM caused by too many fds

The error message is indeed consistent with the “Could not allocate JNI Env” of OOM 1 found online, so OOM 1 reported online may be caused by too many fds. In addition, the ART log contains another key piece of information, “art: ashmem_create_region failed for 'indirect ref table': Too many open files”, which will be used later to locate and explain the problem.

Experiment 2: Create a large number of empty threads (sleep without doing anything)

  • Expected result: an OOM crash occurs when the number of threads (shown in the Threads item of /proc/pid/status) exceeds the limit specified in /proc/sys/kernel/threads-max

  • Experimental results:

  1. On Huawei phones running Android 7.0 or later (EmotionUI 5.0 or later) an OOM occurs. These phones have a very low thread limit (apparently 500 threads per process), so the problem is easy to reproduce. The error information during the OOM is as follows:
W libc    : pthread_create failed: clone failed: Out of memory
W art     : Throwing OutOfMemoryError "pthread_create (1040KB stack) failed: Out of memory"
E AndroidRuntime: FATAL EXCEPTION: main
                  Process: com.netease.demo.oom, PID: 4973
                  java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Out of memory
                      at java.lang.Thread.nativeCreate(Native Method)
                      at java.lang.Thread.start(Thread.java:745)
                      ......

Figure 3-6 Details of the OOM caused by too many threads

The error message “pthread_create (1040KB stack) failed: Out of memory” is consistent with OOM 2 found online. In addition, “pthread_create failed: clone failed: Out of memory” is another key log, which will be used later to locate and explain the problem.

  2. On phones with other ROMs the thread limit is relatively high, so the problem above is not easy to reproduce. However, on a 32-bit system an OOM also occurs when the logical address space of the process is insufficient. Each thread usually needs about 1 MB of stack space (the stack size can be set by the caller), while a process on a 32-bit system has a 4 GB logical address space, of which less than 3 GB is user space. When the logical address space is insufficient (the used logical address space can be checked via VmPeak/VmSize in /proc/pid/status), the OOM produced when creating a thread has the following information:
W/libc: pthread_create failed: couldn't allocate 1069056-bytes mapped space: Out of memory
W/art: Throwing OutOfMemoryError "pthread_create (1040KB stack) failed: Try again"
E/AndroidRuntime: FATAL EXCEPTION: main
                  Process: com.netease.demo.oom, PID: 8638
                  java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Try again
                       at java.lang.Thread.nativeCreate(Native Method)
                       at java.lang.Thread.start(Thread.java:1063)
                       ......

Figure 3-7 OOM when the logical address space is used up
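The address-space argument can be turned into a back-of-the-envelope estimate. The sketch below is only an upper bound under stated assumptions: a 3 GB user-space ceiling, the 1040 KB stack size seen in the logs, and no fragmentation. AddressSpaceHeadroom and its methods are hypothetical names.

```java
import java.util.List;

// Minimal sketch: AddressSpaceHeadroom is an illustrative name; the 3 GB ceiling and the
// 1040 KB stack size used in main() are assumptions, not values read from the device.
final class AddressSpaceHeadroom {
    /** Extracts VmSize (in kB) from the lines of /proc/<pid>/status, or -1 if absent. */
    static long parseVmSizeKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmSize:")) {
                String[] cols = line.substring("VmSize:".length()).trim().split("\\s+");
                if (cols[0].matches("\\d+")) {
                    return Long.parseLong(cols[0]);
                }
            }
        }
        return -1;
    }

    /** Upper bound on additional thread stacks that fit in the remaining address space. */
    static long estimateThreads(long vmSizeKb, long userSpaceKb, long stackKb) {
        long remainingKb = userSpaceKb - vmSizeKb;
        return remainingKb <= 0 ? 0 : remainingKb / stackKb;
    }

    public static void main(String[] args) {
        long userSpaceKb = 3L * 1024 * 1024; // assumed 3 GB of user space on a 32-bit system
        long vmSizeKb = 2_000_000L;          // example value, roughly 1.9 GB already mapped
        System.out.println("threads that may still fit: "
                + estimateThreads(vmSizeKb, userSpaceKb, 1040));
    }
}
```

In practice fragmentation and other mappings lower the real number, so the estimate is best read as a warning threshold rather than a guarantee.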

On-line acceptance and problem solving

The error message of the OOM reproduced locally in Figure [3-5] is consistent with OOM 1 online, and that in Figure [3-6] is consistent with OOM 2 online. But is OOM 1 online really caused by the number of fds exceeding the limit? Is OOM 2 really caused by the number of threads exceeding the limit on Huawei phones? A final determination requires verification with data from online devices.

Verification method: deliver a plug-in to online users; when Thread.UncaughtExceptionHandler captures an OutOfMemoryError, record the following information from the /proc/pid directory:

  1. The number of files in /proc/pid/fd (the number of fds)
  2. The Threads item in /proc/pid/status (the number of threads)
  3. The OOM log information (the stack trace plus other warning messages)
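The collection step can be sketched as a plain Java handler. This is not the plug-in described in the article: OomSnapshotHandler, its fields, and the snapshot format are hypothetical, and a real handler would persist or upload the data instead of printing it. Note that allocating strings inside an OOM handler works here only because the heap itself is not the exhausted resource.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: OomSnapshotHandler is an illustrative name, not the author's plug-in.
final class OomSnapshotHandler implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler previous;
    volatile String lastSnapshot; // kept for inspection; a real plug-in would upload it

    OomSnapshotHandler(Thread.UncaughtExceptionHandler previous) {
        this.previous = previous;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        if (error instanceof OutOfMemoryError) {
            lastSnapshot = "message=" + error.getMessage()
                    + ", fdCount=" + countFds()
                    + ", threads=" + readThreads();
            System.err.println(lastSnapshot);
        }
        if (previous != null) {
            previous.uncaughtException(thread, error); // keep the default crash behavior
        }
    }

    /** Number of entries in /proc/self/fd, or -1 if the directory cannot be listed. */
    static int countFds() {
        String[] names = new File("/proc/self/fd").list();
        return names == null ? -1 : names.length;
    }

    /** The Threads value from /proc/self/status, or "unknown" if it cannot be read. */
    static String readThreads() {
        try {
            for (String line : Files.readAllLines(
                    Paths.get("/proc/self/status"), StandardCharsets.UTF_8)) {
                if (line.startsWith("Threads:")) {
                    return line.substring("Threads:".length()).trim();
                }
            }
        } catch (IOException ignored) {
            // fall through: the proc file is unavailable on this platform
        }
        return "unknown";
    }
}
```

It would typically be installed once at startup with Thread.setDefaultUncaughtExceptionHandler(new OomSnapshotHandler(Thread.getDefaultUncaughtExceptionHandler())).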

Verify the information collected from the online device in OOM 1:

  1. The number of files in /proc/pid/fd equals the Max open files value in /proc/pid/limits, indicating that the number of fds has reached the limit
  2. The log information at the time of the crash is basically consistent with Figure [3-5]

This proves that OOM 1 online is indeed caused by too many fds, and the verification succeeds.

Location and solution of OOM 1: the root cause is that the long-connection library used in the app has a bug that sometimes sends a large number of HTTP requests in an instant (causing the number of fds to surge). It has been fixed

The information collected from online devices when OOM 2 crashes (device models include VKY-AL00, TRT-AL00A, BLN-AL20, BLN-AL10, DLI-AL10, TRT-TL10, WAS-AL00, etc.):

  1. The number of threads recorded in /proc/pid/status reached the upper limit: Threads: 500
  2. The log information during the crash is basically consistent with figure [3-6]

Conclusion: the verification succeeds, i.e. clone fails during thread creation because of the limit on the number of threads, which causes OOM 2 online.

Location and solution of OOM 2: the problem in the app’s business code is still being located and fixed

3.4 Interpretation

The following code analysis explains how the OOMs described in this article occur. First, a simplified flow chart of thread creation is shown below:


Figure 3-8 Process for creating a thread

In the figure above, thread creation consists of roughly two key steps:

  • The first column creates the thread-private structure JNIEnv (the JNI execution environment, used by the C layer to call Java-layer code)
  • The second column calls the POSIX C library function pthread_create to create the thread

The key nodes in the flow chart (marked in the figure) are described below:

  1. Node ① in the figure: an excerpt from the function Thread::CreateNativeThread in /art/runtime/thread.cc is as follows:
    std::string msg(child_jni_env_ext.get() == nullptr ?
        "Could not allocate JNI Env" :
        StringPrintf("pthread_create (%s stack) failed: %s", PrettySize(stack_size).c_str(), strerror(pthread_create_result)));
    ScopedObjectAccess soa(env);
    soa.Self()->ThrowOutOfMemoryError(msg.c_str());

Figure 3-9 Thread::CreateNativeThread excerpt

The excerpt shows that:

  • If JNIEnv is not created successfully, the OOM error message is “Could not allocate JNI Env”
  • If pthread_create fails, the OOM error message is “pthread_create (%s stack) failed: %s”, where the detailed error description is derived from the return value (error code) of pthread_create. For the mapping between error codes and descriptions, see bionic/libc/include/sys/_errdefs.h. For example, the description “Out of memory” in Figure [2-5] shows that pthread_create returned 12.
...
__BIONIC_ERRDEF( EAGAIN , 11, "Try again" )
__BIONIC_ERRDEF( ENOMEM , 12, "Out of memory" )
...
__BIONIC_ERRDEF( EMFILE , 24, "Too many open files" )
...

Figure 3-10 System error definition _errdefs.h

  2. Nodes ② and ③ in the figure are the key nodes in the process of creating JNIEnv. Node ②, the function MemMap::MapAnonymous in /art/runtime/mem_map.cc, allocates memory for the JNIEnv structure (the “indirect ref table” named in the log). The memory is allocated with the ashmem_create_region function shown at node ③ (which creates a block of ashmem anonymous shared memory and returns a file descriptor). The code excerpt of node ② is as follows:
  if (fd.get() == -1) {
      *error_msg = StringPrintf("ashmem_create_region failed for '%s': %s", name, strerror(errno));
      return nullptr;
  }

Figure 3-11 MemMap::MapAnonymous

The error message “ashmem_create_region failed for 'indirect ref table': Too many open files” that we saw for OOM 1 matches the message printed here. The error description “Too many open files” shows that errno (the system global error identifier) is 24 (see Figure [3-10] System error definition _errdefs.h). It follows that ashmem_create_region cannot return a new fd because the number of file descriptors has reached the limit.

  3. Nodes ④ and ⑤ in the figure are the steps in which the C library is called to create the thread. Thread creation first calls the __allocate_thread function to request the thread’s private stack memory, and then calls clone to create the thread. The stack is requested with mmap. The code excerpt of node ⑤ is as follows:
  if (space == MAP_FAILED) {
    __libc_format_log(ANDROID_LOG_WARN,
                      "libc",
                      "pthread_create failed: couldn't allocate %zu-bytes mapped space: %s",
                      mmap_size, strerror(errno));
    return NULL;
  }

Figure 3-12 __create_thread_mapped_space

The error message printed here is consistent with the OOM log in Figure [3-7]. The error description “Try again” in Figure [3-7] shows that the global error identifier errno is 11 (see Figure [3-10] System error definition _errdefs.h). In the pthread_create process, the code related to node ④ is as follows:

  int rc = clone(__pthread_start, child_stack, flags, thread, &(thread->tid), tls, &(thread->tid));
  if (rc == -1) {
    int clone_errno = errno;
    // We don't have to unlock the mutex at all because clone(2) failed so there's no child waiting to
    // be unblocked, but we're about to unmap the memory the mutex is stored in, so this serves as a
    // reminder that you can't rewrite this function to use a ScopedPthreadMutexLocker.
    pthread_mutex_unlock(&thread->startup_handshake_mutex);
    if (thread->mmap_size != 0) {
      munmap(thread->attr.stack_base, thread->mmap_size);
    }
    __libc_format_log(ANDROID_LOG_WARN, "libc", "pthread_create failed: clone failed: %s", strerror(errno));
    return clone_errno;
  }

Figure 3-13 pthread_create

The error log “pthread_create failed: clone failed: %s” printed here matches the log of OOM 2 found online. The error description “Out of memory” in Figure [3-6] shows that the global error identifier errno is 12 (see Figure [3-10] System error definition _errdefs.h). Therefore, OOM 2 online is caused by clone failing at node ④ because of the limit on the number of threads.

4. Conclusions and monitoring

4.1 Causes of OOM

In summary, an OOM can be caused by the following:

  1. The number of file descriptors (fds) exceeds the limit, i.e. the number of files in /proc/pid/fd exceeds the Max open files limit in /proc/pid/limits. Possible scenarios: a burst of requests in a short period makes the number of socket fds surge, or a large number of files are opened repeatedly
  2. The number of threads exceeds the limit, i.e. the number of threads recorded in /proc/pid/status (the Threads item) exceeds the maximum specified in /proc/sys/kernel/threads-max. Possible scenarios: improper use of multithreading in the app, such as multiple OkHttpClient instances that do not share a thread pool
  3. Conventional Java heap exhaustion, i.e. the requested heap size exceeds Runtime.getRuntime().maxMemory()
  4. (Low probability) The logical address space of a process on a 32-bit system is used up
  5. Other
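The causes above can be mapped back from the crash message with a simple heuristic. The sketch below only encodes the correspondences observed in this article's experiments; OomCauseClassifier and its Cause enum are hypothetical names, and the same errno text may have other causes on other devices.

```java
// Minimal sketch: a heuristic based on the messages observed in this article, not a definitive mapping.
final class OomCauseClassifier {
    enum Cause { FD_LIMIT, THREAD_LIMIT, ADDRESS_SPACE, JAVA_HEAP, OTHER }

    static Cause classify(String message) {
        if (message == null) {
            return Cause.OTHER;
        }
        if (message.contains("Could not allocate JNI Env")) {
            return Cause.FD_LIMIT;          // OOM 1: ashmem_create_region could not get a new fd
        }
        if (message.startsWith("pthread_create")) {
            if (message.endsWith("Out of memory")) {
                return Cause.THREAD_LIMIT;  // OOM 2: clone failed, errno 12
            }
            if (message.endsWith("Try again")) {
                return Cause.ADDRESS_SPACE; // stack mmap failed, pthread_create returned 11
            }
            return Cause.OTHER;
        }
        if (message.startsWith("Failed to allocate")) {
            return Cause.JAVA_HEAP;         // conventional heap exhaustion
        }
        return Cause.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("pthread_create (1040KB stack) failed: Out of memory"));
    }
}
```

Tagging crash reports this way makes it easier to see which of the limits dominates before inspecting the /proc data.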

4.2 Monitoring Measures

These conditions can be monitored with Linux’s inotify mechanism:

  • Watch /proc/pid/fd to monitor the files opened by the app
  • Watch /proc/pid/task to monitor thread usage.
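Whether inotify events are actually delivered for procfs entries depends on the kernel and should be verified on the target devices. As a simpler alternative, the sketch below uses plain polling instead of inotify: it periodically counts the entries of /proc/self/fd and /proc/self/task and warns when usage approaches a threshold. ProcUsageSampler, the 80% ratio, and the limits 1024 and 500 are illustrative assumptions.

```java
import java.io.File;

// Minimal sketch: a polling sampler swapped in for inotify; names and thresholds are assumptions.
final class ProcUsageSampler {
    /** True when current usage has reached the given fraction of a known, positive limit. */
    static boolean nearLimit(long current, long limit, double ratio) {
        return limit > 0 && current >= limit * ratio;
    }

    /** Counts the entries of a directory, or returns -1 if it cannot be listed. */
    static int countEntries(String dir) {
        String[] names = new File(dir).list();
        return names == null ? -1 : names.length;
    }

    public static void main(String[] args) throws InterruptedException {
        long fdLimit = 1024;    // assumed soft "Max open files" value
        long threadLimit = 500; // assumed per-process thread cap
        for (int i = 0; i < 3; i++) {
            int fds = countEntries("/proc/self/fd");
            int threads = countEntries("/proc/self/task");
            System.out.println("fds=" + fds + " threads=" + threads);
            if (nearLimit(fds, fdLimit, 0.8) || nearLimit(threads, threadLimit, 0.8)) {
                System.out.println("warning: approaching an fd or thread limit");
            }
            Thread.sleep(1000);
        }
    }
}
```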

5. Demo

POC (Proof of Concept) code: see github.com/piece-the-w…