Preface

Due to a surge in online traffic, the operations team added several new servers to the cluster. However, the new machines would run for a while and then inexplicably go down.

Tracking down the problem

Looking at the service monitoring, I found that every crash occurred during peak hours, when memory usage was high. I suspected a memory leak leading to an OOM, so I checked the application logs, but found no OOM-related errors. Moreover, the logs simply cut off in the middle of handling requests, and this only happened on the few newly added machines. A brief analysis pointed to the OOM Killer: during peak hours the system runs out of memory and kills the running process.

Validation

  • View the system logs with dmesg -T, or check /var/log/messages; what to look for is sketched below (simulated scenario).
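A quick check for OOM Killer activity might look like this (the exact log wording varies with the kernel version):

    # search the kernel ring buffer for OOM Killer entries
    dmesg -T | grep -iE "out of memory|killed process"
    # or search the persistent system log (the path varies by distribution)
    grep -i "killed process" /var/log/messages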

Consider: if the OOM Killer is triggered by a system-wide memory shortage, then every heavily loaded machine should be going down. Why does this happen only on the newly added machines?

  • Check the system configuration with free -m and the Java startup command: the machine has 6 GB of memory, no swap area is configured, and the Java process is started with both -Xms and -Xmx set to 6 GB (see the sketch below).
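For reference, a sketch of these checks; the 6 GB figures follow the article, and the jar name is a placeholder:

    free -m                # Mem total is about 6000 MB, Swap total is 0
    ps -ef | grep java     # shows a startup command along the lines of:
    #   java -Xms6g -Xmx6g -jar service.jar   (heap sized to the whole machine)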

At this point the cause was found. The operations team had not applied a consistent configuration to the new machines: the heap size was set to the machine's entire physical memory, and no swap area was configured. During off-peak hours, even though the heap is set to 6 GB, Linux does not actually hand all 6 GB of physical memory to the JVM, so nothing goes wrong. At peak times, however, as memory usage grows, the JVM touches more and more of the memory it has committed, Linux finds that physical memory is exhausted, and the OOM Killer is triggered.
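One possible remedy, consistent with the analysis above (the 4 GB figure is an assumption; the heap must leave headroom for metaspace, thread stacks, and the operating system):

    # size the heap below physical memory instead of giving it the whole machine
    java -Xms4g -Xmx4g -jar service.jar
    # and/or configure a swap area, as on the older machines (see the sketch further below)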

Extension

Virtual memory allocation versus physical memory

When a process requests a chunk of memory with mmap, Linux merely assigns it a range of virtual address space and only allocates physical memory when the process actually touches the pages. It is therefore possible for the virtual memory of all processes combined to exceed physical memory. Once those processes start using more and more of it, physical memory eventually runs out.
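You can see the gap between reserved virtual memory and the physical memory actually in use from /proc (PID 1234 is a placeholder):

    # VmSize = virtual address space reserved; VmRSS = physical memory actually in use
    grep -E "VmSize|VmRSS" /proc/1234/status
    # system-wide view: committed virtual memory versus the commit limit
    grep -E "CommitLimit|Committed_AS" /proc/meminfo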

We can enable swap at this point, but that only alleviates the problem rather than solving it, because the swap partition also has a finite capacity.
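For completeness, a minimal sketch of adding a swap file (the 4 GB size is an assumption):

    sudo fallocate -l 4G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    free -m        # the Swap row should now show the added capacity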

Therefore, we need to choose a virtual memory overcommit policy appropriate to the scenario. The policy is controlled by /proc/sys/vm/overcommit_memory (see the example after this list):

  • Value 0 (the default): heuristic overcommit, which works somewhat like the banker's algorithm. A request succeeds only if the amount of virtual memory being requested is less than the physical memory currently available.
  • Value 1: no check is performed; every virtual memory request succeeds.
  • Value 2: the total committed virtual memory of all processes cannot exceed swap plus a portion of physical memory (physical memory multiplied by vm.overcommit_ratio, 50% by default), so that the kernel itself still has memory to use.
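A sketch of reading and changing the policy (note that with value 2 the commit limit also depends on vm.overcommit_ratio):

    cat /proc/sys/vm/overcommit_memory      # current policy: 0, 1, or 2
    sudo sysctl -w vm.overcommit_memory=2   # change it temporarily (root required)
    sudo sysctl -w vm.overcommit_ratio=80   # percentage of RAM counted toward the commit limit
    # add the same keys to /etc/sysctl.conf to persist them across reboots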

What happens if the kernel runs out of physical memory

With values 0 and 1, the total committed virtual memory can obviously exceed physical memory. Even with value 2, user-space commits stay under the threshold, but kernel-space allocations can still leave physical memory short. When physical memory runs out, OOM (out-of-memory) handling kicks in: the kernel uses the oom-killer module to kill processes, freeing memory so that the system can keep running. The oom-killer assigns each process a composite score, based mainly on how much memory it uses, and kills the highest-scoring process, repeating until enough memory has been freed. Of course, it may still fail to free enough memory for the system, and in that case the kernel panics.
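The score the oom-killer uses can be inspected and adjusted per process (PID 1234 is a placeholder; adjusting requires root):

    cat /proc/1234/oom_score        # composite score used by the oom-killer
    cat /proc/1234/oom_score_adj    # adjustment in [-1000, 1000]; -1000 exempts the process
    echo -500 | sudo tee /proc/1234/oom_score_adj   # make this process less likely to be chosen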