Preface

This article examines a category of OOM (OutOfMemoryError) crashes that occur even though both the Java heap and the device's physical memory are sufficient, and explains why this type of OOM is thrown.

The demo address is at the end of the article.

Key words:

OutOfMemoryError, OOM, pthread_create failed, Could not allocate JNI Env

I. Introduction

Memory is a resource that every mobile developer must use carefully, and an OOM (OutOfMemoryError) in production can drive developers crazy, because the stack trace we normally rely on is usually of little help in locating this kind of problem. There is plenty of material on the web about how to make good use of precious heap memory (e.g. using small images, reusing bitmaps, etc.), but:

1. Are all online OOMs really caused by a shortage of heap memory?

2. Can an OOM occur when the app has plenty of heap memory left and the device also has plenty of physical memory?

An OOM crash with plenty of memory?

It seems incredible, but through the APM platform developed by the author, the author recently found that most OOMs of one of the company's products have exactly these characteristics: when the OOM crash occurs, Java heap usage is well below the upper limit set by the Android virtual machine, and there is sufficient physical memory and SD card space.

If there is enough memory, why does the OOM crash happen at this point?

II. Problem description

Before describing the problem in detail, let’s get one thing straight:

What led to OOM?

Here are a few APIs for the memory limit thresholds officially declared by Android:



It is commonly thought that OOM occurs when the Java heap is out of memory, i.e.



This kind of OOM is easy to verify: keep allocating with new byte[] until the heap memory requested exceeds the maxMemory() threshold.
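As a rough sketch of that check, the snippet below uses only the standard java.lang.Runtime methods to estimate how much heap is left and to attempt an allocation. The class name HeapHeadroom and its helper methods are hypothetical names chosen for illustration, not code from the demo.

```java
final class HeapHeadroom {
    /** Heap bytes currently in use by this process. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Bytes that can still be requested before reaching maxMemory(). */
    static long headroomBytes() {
        return Runtime.getRuntime().maxMemory() - usedBytes();
    }

    /** Tries to allocate a byte[] of the given size; returns false on OutOfMemoryError. */
    static boolean tryAllocate(int sizeBytes) {
        try {
            byte[] block = new byte[sizeBytes];
            return block.length == sizeBytes;
        } catch (OutOfMemoryError e) {
            System.out.println("Allocation of " + sizeBytes + " bytes failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("maxMemory : " + Runtime.getRuntime().maxMemory());
        System.out.println("used      : " + usedBytes());
        System.out.println("headroom  : " + headroomBytes());
        // A small request is expected to succeed; a request larger than the
        // headroom is the case that produces the heap OOM described above.
        System.out.println("1 MiB ok? : " + tryAllocate(1024 * 1024));
    }
}
```

The OOMs discussed in the rest of this article are the ones where headroomBytes() is still large at crash time.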



As mentioned earlier, in the OOM cases discussed in this article there is plenty of heap memory left (well below Runtime.getRuntime().maxMemory()), and the device also has plenty of memory available (ActivityManager.MemoryInfo.availMem is still large). There are two types of OOM error messages:

1. The first kind, found on Android 6.0 and Android 7.0, is referred to below as OOM 1:



2. The second kind, found on Huawei phones running Android 7.0 or later (EmotionUI_5.0 or later), is referred to below as OOM 2. The corresponding error information is as follows:



III. Problem analysis and solution

3.1 Code Analysis

How does Android throw OutOfMemoryError? The following is a brief analysis based on the Android 6.0 source code:

1. The code that ultimately throws OutOfMemoryError is located in /art/runtime/thread.cc:



2. A search of the code reveals several places where the above method was called to throw an OutOfMemoryError

The first is during heap operations:



This is the case mentioned above, where the requested heap memory exceeds Runtime.getRuntime().maxMemory().

The second is when a thread is created:



Comparing the error messages, we can tell that the OOM crashes we encountered are thrown at this point, that is, when a thread is created (Thread::CreateNativeThread).

Other error messages, such as “[XXXClassName] of length XXX would overflow”, are thrown at other points and are not discussed in this article.

So what we care about is the OOM thrown in Thread::CreateNativeThread: why can creating a thread lead to an OOM?

3.2 Inference

Since an OOM is thrown, some limit we are not aware of must be triggered during thread creation. Since it is not the heap limit that the ART virtual machine sets for us, it may be a limit at a lower level. Android is based on Linux, so Linux limits also apply to Android. These limits include:

1. /proc/pid/limits describes the limits on a Linux process. Here is an example:



Narrow down the limits in the above example by elimination:

  • Max stack size and Max processes: according to the author, these limits apply to the entire system rather than to a particular process, so they are excluded;
  • Max locked memory: excluded. As analyzed later, the mmap call that allocates a thread's private stack during thread creation does not set MAP_LOCKED, so this limit is unrelated to thread creation;
  • Max pending signals: a threshold on the number of signals at the C layer; irrelevant, excluded;
  • Max msgqueue size: Android's IPC mechanism does not use message queues; excluded.

That leaves Max open files, the maximum number of open files per process. Each open file produces a file descriptor (fd), recorded under /proc/pid/fd. This limit means that the number of fds cannot exceed the number specified by Max open files.

File descriptors will be involved in the thread creation process later.

2. Limits described in /proc/sys/kernel

The thread-related limit is /proc/sys/kernel/threads-max, which sets an upper limit on the number of threads that can be created for each process.
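The two candidate limits above can be read from inside the process. The following is a minimal sketch, assuming a Linux or Android device where procfs is readable and using /proc/self in place of /proc/pid; the class ProcLimits and its methods are hypothetical helpers written for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class ProcLimits {
    /** Returns the soft limit for limitName from /proc/<pid>/limits lines, or -1 if absent. */
    static long parseSoftLimit(List<String> limitsLines, String limitName) {
        for (String line : limitsLines) {
            if (!line.startsWith(limitName)) continue;
            String[] fields = line.substring(limitName.length()).trim().split("\\s+");
            if (fields[0].isEmpty()) return -1;
            if (fields[0].equals("unlimited")) return Long.MAX_VALUE;
            try {
                return Long.parseLong(fields[0]);
            } catch (NumberFormatException e) {
                return -1;
            }
        }
        return -1;
    }

    /** Returns the value of the "Threads:" line from /proc/<pid>/status lines, or -1 if absent. */
    static int parseThreads(List<String> statusLines) {
        for (String line : statusLines) {
            if (!line.startsWith("Threads:")) continue;
            try {
                return Integer.parseInt(line.substring("Threads:".length()).trim());
            } catch (NumberFormatException e) {
                return -1;
            }
        }
        return -1;
    }

    /** Counts entries in /proc/self/fd (includes the fd used for listing); -1 if unreadable. */
    static int countOpenFds() {
        String[] entries = new File("/proc/self/fd").list();
        return entries == null ? -1 : entries.length;
    }

    public static void main(String[] args) throws IOException {
        Path limits = Paths.get("/proc/self/limits");
        Path status = Paths.get("/proc/self/status");
        Path threadsMax = Paths.get("/proc/sys/kernel/threads-max");
        if (!Files.isReadable(limits) || !Files.isReadable(status)) {
            System.out.println("procfs is not available on this platform");
            return;
        }
        System.out.println("open fds      : " + countOpenFds());
        System.out.println("Max open files: "
                + parseSoftLimit(Files.readAllLines(limits), "Max open files"));
        System.out.println("Threads       : " + parseThreads(Files.readAllLines(status)));
        if (Files.isReadable(threadsMax)) {
            System.out.println("threads-max   : " + Files.readAllLines(threadsMax).get(0).trim());
        }
    }
}
```

Comparing "open fds" with "Max open files", and "Threads" with the thread limit, is the check used in the validation that follows.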

3.3 Validation

The verification of the above inference is carried out in two steps: local verification and online acceptance.

  • Local validation: verify the inference locally by trying to reproduce the OOMs shown in Figure [2-4] (OOM 1) and Figure [2-5] (OOM 2).
  • Online acceptance: deliver a plug-in to online users and confirm that their OOMs are really caused by the reasons inferred above.

Local validation

Experiment 1: Open a large number of network connections (each in a separate thread) and hold them. Each open socket adds one fd (visible under /proc/pid/fd).

Note: This is not the only way to increase the number of fds. Other methods work too, such as opening files or creating HandlerThreads.

  • Expected result: when the number of fds of the process (which can be checked with ls /proc/pid/fd | wc -l) exceeds the Max open files value specified in /proc/pid/limits, an OOM occurs.
  • Actual result: when the number of fds reaches the Max open files value specified in /proc/pid/limits, continuing to create threads does result in an OOM.

The error message and stack are as follows:



The error message is indeed consistent with the “Could not allocate JNI Env” of OOM 1 found online, so the OOM 1 reported online may be caused by too many fds. The final confirmation, however, requires online verification (see the next section). In addition, the ART VM log line “ashmem_create_region failed for ‘indirect ref table’: Too many open files” is a key log that will be used later to locate and explain the problem.

Experiment 2: Create a large number of idle threads (which only sleep and do nothing else)

  • Expected result: an OOM crash occurs when the number of threads (the Threads item in /proc/pid/status) exceeds the upper limit specified in /proc/sys/kernel/threads-max.
  • Actual result: an OOM occurs on Huawei phones running Android 7.0 or later (EmotionUI_5.0 or later). These phones have a very low thread limit (500 threads per process), so the crash is easy to reproduce.
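Experiment 2 can be sketched roughly as follows. This is not the demo code: the class ThreadFlood, its method and the thread cap are hypothetical. The threads block on a latch instead of sleeping so that they can be released cleanly, and the cap is kept small so that running the sketch on a desktop JVM is harmless. Reproducing the crash on a device would require raising the cap above the per-process limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class ThreadFlood {
    /**
     * Starts up to maxThreads idle threads that block until 'release' is counted down.
     * Returns the number of threads started before an OutOfMemoryError, or maxThreads.
     */
    static int startIdleThreads(int maxThreads, CountDownLatch release, List<Thread> started) {
        for (int i = 0; i < maxThreads; i++) {
            Thread t = new Thread(() -> {
                try {
                    release.await(); // do nothing until released
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "idle-" + i);
            t.setDaemon(true);
            try {
                t.start();
            } catch (OutOfMemoryError e) {
                // On ART this is where the pthread_create-related message would surface.
                System.out.println("Thread creation failed after " + i + " threads: " + e.getMessage());
                return i;
            }
            started.add(t);
        }
        return maxThreads;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        List<Thread> started = new ArrayList<>();
        int count = startIdleThreads(50, release, started); // small cap, assumption for safety
        System.out.println("Started " + count + " idle threads");
        release.countDown();
        for (Thread t : started) {
            t.join();
        }
    }
}
```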

The following error information is displayed during OOM:



The OOM error message is “pthread_create (1040KB stack) failed: Out of memory”. In addition, “pthread_create failed: clone failed: Out of memory” is another key log line of the ART VM, which will be used later to locate and explain the problem.

On other ROMs the upper limit on the number of threads is relatively large, so the problem above is not easy to reproduce. However, on a 32-bit system an OOM is also thrown if the logical address space of the process is insufficient. Each thread usually requires about 1MB of stack space (the stack size can be configured), while a process on a 32-bit system has 4GB of logical address space, of which less than 3GB is user space. If the logical address space is insufficient (the address space in use can be checked through VmPeak/VmSize in /proc/pid/status), the OOM thrown when creating a thread carries the following information:



Online acceptance and problem solving

The error message of the OOM reproduced locally in Figure [3-5] is consistent with that of OOM 1 online, and Figure [3-6] is consistent with that of OOM 2 online. But is OOM 1 online really caused by the number of fds exceeding the limit? And is OOM 2 really caused by the number of threads exceeding the limit on Huawei phones? The final determination requires data from online devices.

Verification method:

A plug-in was delivered to online users. When Thread.UncaughtExceptionHandler catches an OutOfMemoryError, it records the following information from the /proc/pid directory:

1. The number of files in /proc/pid/fd (the number of fds)

2. Threads in /proc/pid/status (the current number of threads)

3. OOM log information (warning information other than the stack trace)
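The plug-in itself is not shown in this article. A minimal sketch of such a handler, under stated assumptions, might look like the following. The class OomProcSnapshotHandler is hypothetical. It uses /proc/self instead of /proc/pid, and it counts the entries of /proc/self/task as the thread count rather than parsing the Threads line of /proc/self/status. A real handler would write the snapshot to a crash report instead of System.err.

```java
import java.io.File;

final class OomProcSnapshotHandler implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler previous;

    OomProcSnapshotHandler(Thread.UncaughtExceptionHandler previous) {
        this.previous = previous;
    }

    /** Number of entries in a directory, or -1 if it cannot be listed. */
    static int countEntries(String dir) {
        String[] names = new File(dir).list();
        return names == null ? -1 : names.length;
    }

    /** Builds a one-line record of the OOM message, fd count and thread count. */
    static String snapshot(Throwable error) {
        return "message=" + error.getMessage()
                + " fdCount=" + countEntries("/proc/self/fd")
                + " threadCount=" + countEntries("/proc/self/task");
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        if (error instanceof OutOfMemoryError) {
            // The heap is assumed to still have room here, which is the case
            // for the kind of OOM this article describes.
            System.err.println(snapshot(error));
        }
        if (previous != null) {
            previous.uncaughtException(thread, error); // keep the default crash behaviour
        }
    }

    /** Installs the handler in front of whatever default handler is already set. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(
                new OomProcSnapshotHandler(Thread.getDefaultUncaughtExceptionHandler()));
    }

    public static void main(String[] args) {
        System.out.println(snapshot(new OutOfMemoryError("Could not allocate JNI Env")));
    }
}
```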

Online OOM 1 verification

The following information was collected from online devices when OOM 1 occurred:

1. The number of files in /proc/pid/fd equals the Max open files value in /proc/pid/limits, indicating that the fds are exhausted.

2. The log information during the crash is basically consistent with Figure [3-5].

This proves that OOM 1 online is indeed caused by too many fds, and the verification is successful.

OOM 1 diagnosis and solution:

The root cause was a bug in the persistent-connection library used in the app, which occasionally sent a large number of HTTP requests in an instant (causing a surge in the number of fds). It has been fixed.

The following information was collected from online devices when OOM 2 crashed (device models include VKY-AL00, TRT-AL00A, BLN-AL20, BLN-AL10, DLI-AL10, TRT-TL10, WAS-AL00, etc.):

1. The number of threads recorded in /proc/pid/status reaches the upper limit: Threads: 500

2. The log information during the crash is basically consistent with Figure [3-6].

Conclusion: the verification is successful. clone failed during thread creation because of the thread-count limit, resulting in OOM 2 online.

OOM 2 diagnosis and solution:

The problems in the App business code are still being fixed.

3.4 Interpretation

The following code analysis explains how the OOMs described in this article happen. First, a simplified flow chart of thread creation is as follows:



In the figure above, thread creation consists of roughly two key steps:

  • The first column creates the thread-private structure JNIEnv (the JNI execution environment, used by the C layer to call Java-layer code)
  • The second column calls the POSIX C library function pthread_create to create the thread

The key nodes (marked in the figure) in the flow chart are described below:

1. Node ① in the figure: an excerpt of the function Thread::CreateNativeThread in /art/runtime/thread.cc is as follows:



As can be seen:

  • If JNIEnv is not created successfully, the OOM error message is “Could not allocate JNI Env”.
  • If pthread_create fails, the OOM error message is “pthread_create (%s stack) failed: %s”. The detailed error description is derived from the return value (error code) of pthread_create. For the mapping between error codes and error descriptions, see bionic/libc/include/sys/_errdefs.h. The description “Out of memory” in the error message of OOM 2 means that the return value (error code) of pthread_create is 12.



2. In the figure, nodes ② and ③ are the key nodes in creating JNIEnv. Node ② is the function MemMap::MapAnonymous in /art/runtime/mem_map.cc, which allocates memory for the Indirect_Reference_table in the JNIEnv structure (the table the C layer uses to store JNI local and global references). The memory is allocated with the ashmem_create_region function shown at node ③, which creates a block of ashmem anonymous shared memory and returns a file descriptor. The code excerpt of node ② is as follows:



The error message “ashmem_create_region failed for ‘indirect ref table’: Too many open files” seen in OOM 1 is consistent with the message printed here. The description “Too many open files” indicates that errno (the system global error identifier) is 24 (see Figure [3-10], system error definitions in _errdefs.h). In other words, ashmem_create_region cannot return a new fd because the number of file descriptors has reached the limit.

3. Nodes ④ and ⑤ in the figure are the steps where the C library is called to create a thread. Thread creation first calls the __allocate_thread function to allocate the thread's private stack memory, and then calls clone to create the thread. The stack is allocated with mmap, and the code excerpt of node ⑤ is as follows:



The error information printed here is consistent with the OOM error information in Figure [3-7]. The error description “Try again” in Figure [3-7] indicates that the global error identifier errno is 11 (see Figure [3-10], system error definitions in _errdefs.h). In the pthread_create process, the code related to node ④ is as follows:



The error log “pthread_create failed: clone failed: %s” printed here is consistent with OOM 2 found online. The error description “Out of memory” in Figure [3-6] indicates that the global error identifier errno is 12 (see Figure [3-10], system error definitions in _errdefs.h). Thus OOM 2 online is caused by clone failing at node ⑤ because of the limit on the number of threads.

IV. Conclusions and monitoring

4.1 Causes of OOM Occurrence

In summary, OOM can be caused by the following reasons:

1. The number of file descriptors (fds) exceeds the limit, i.e. the number of files in /proc/pid/fd exceeds the limit in /proc/pid/limits. Possible scenarios: a surge of socket fds from a large number of requests in a short period, a large number of files opened repeatedly, etc.

2. The number of threads exceeds the upper limit, i.e. the number of threads (Threads item) recorded in /proc/pid/status exceeds the maximum specified in /proc/sys/kernel/threads-max. Possible scenarios: improper use of threads in the app, such as multiple OkHttpClient instances that do not share a thread pool.

3. The traditional case: Java heap memory exceeds the threshold, i.e. the requested heap memory exceeds Runtime.getRuntime().maxMemory().

4. (Low probability) On a 32-bit system, the logical address space of the process is used up, resulting in OOM.

5. Others.
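As a rough triage aid, the error messages quoted in this article can be mapped back to the causes listed above. The sketch below is only a heuristic based on those quoted strings. The class OomCause and its enum are hypothetical, and a given message may have other causes on other ROMs or Android versions.

```java
final class OomCause {
    enum Kind { FD_LIMIT, THREAD_LIMIT, ADDRESS_SPACE, OTHER }

    /** Guesses the cause of an OutOfMemoryError from its message text. */
    static Kind classify(String message) {
        if (message == null) {
            return Kind.OTHER;
        }
        // OOM 1: JNIEnv creation failed; in this article the cause was fd exhaustion.
        if (message.contains("Could not allocate JNI Env")) {
            return Kind.FD_LIMIT;
        }
        if (message.contains("pthread_create") && message.contains("failed")) {
            // errno 12, clone failed: the thread-count limit in OOM 2.
            if (message.contains("Out of memory")) {
                return Kind.THREAD_LIMIT;
            }
            // errno 11: stack allocation failed, the address-space case on 32-bit systems.
            if (message.contains("Try again")) {
                return Kind.ADDRESS_SPACE;
            }
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("Could not allocate JNI Env"));
        System.out.println(classify("pthread_create (1040KB stack) failed: Out of memory"));
        System.out.println(classify("pthread_create (1040KB stack) failed: Try again"));
    }
}
```

Heap-related OOMs fall into OTHER here because their message text is not quoted in this article.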

4.2 Monitoring Measures

These can be monitored using Linux's inotify mechanism:

  • Watch /proc/pid/fd to monitor the files the app opens.
  • Watch /proc/pid/task to monitor thread usage.
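The sketch below swaps in a simpler technique than the one named above: it polls the two directories periodically instead of using inotify, and warns when usage approaches a limit. The class ProcUsagePoller, the 80% warning ratio and the limit values are all assumptions made for illustration. The fd limit should be read from /proc/pid/limits on the real device, and 500 is the Huawei thread limit reported earlier in this article.

```java
import java.io.File;

final class ProcUsagePoller {
    /** True when 'current' is valid and has reached 'ratio' of 'limit'. */
    static boolean isNearLimit(long current, long limit, double ratio) {
        return current >= 0 && limit > 0 && current >= limit * ratio;
    }

    /** Number of entries in a directory, or -1 if it cannot be listed. */
    static int countEntries(String dir) {
        String[] names = new File(dir).list();
        return names == null ? -1 : names.length;
    }

    public static void main(String[] args) throws InterruptedException {
        final long assumedFdLimit = 1024;    // hypothetical; read Max open files on a real device
        final long assumedThreadLimit = 500; // limit reported above for some Huawei phones
        final double warnRatio = 0.8;        // hypothetical warning threshold

        for (int round = 0; round < 3; round++) {
            int fds = countEntries("/proc/self/fd");
            int threads = countEntries("/proc/self/task");
            System.out.println("fds=" + fds + " threads=" + threads);
            if (isNearLimit(fds, assumedFdLimit, warnRatio)) {
                System.out.println("warning: fd count is close to the limit");
            }
            if (isNearLimit(threads, assumedThreadLimit, warnRatio)) {
                System.out.println("warning: thread count is close to the limit");
            }
            Thread.sleep(1000); // polling interval, chosen arbitrarily
        }
    }
}
```

Polling is coarser than event-based watching, but it depends only on being able to list the two directories.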

V. Demo

