Analysis and Solution of 9 Common CMS GC Problems in Java (Part 2)

1. Preface

This article summarizes some usage scenarios for the "CMS + ParNew" collector combination in the HotSpot VM. It focuses on root cause analysis, supported by parts of the source code, and on summarizing troubleshooting methods; much of the troubleshooting process itself is omitted. The article uses many technical terms and has a certain reading threshold. Where a term is not explained clearly, please consult the relevant material.

The article is about 20,000 words long (excluding code snippets), with an overall reading time of about 30 minutes. As it is a long article, you can pick the scenarios that interest you.

Analysis and Solution of nine Common CMS GC Problems in Java (Part 1)

4.7 Scenario 7: Memory fragmentation & Collector Degradation

4.7.1 Symptoms

The concurrent CMS GC degrades to the Foreground single-threaded serial GC mode, and the STW time becomes extremely long, sometimes more than ten seconds. After the CMS collector degrades, one of two single-threaded serial algorithms is used:

With compaction, called MSC. As described earlier, it uses a mark-sweep-compact, single-threaded, fully stop-the-world approach to collect the entire heap. This is the real Full GC, and its pause is longer than that of an ordinary CMS cycle.
Without compaction. It collects only the Old area, works much like the ordinary CMS algorithm, and has a shorter pause than MSC.

4.7.2 Causes

Collector degradation of CMS mainly occurs in the following situations:

Promotion Failed

As the name implies, a promotion failure occurs when, during a Young GC, objects do not fit in Survivor and can only go to Old, but Old cannot hold them either. At first glance this looks like something that would happen often, but because of the ConcurrentMarkSweepThread and the guarantee mechanism, the conditions are actually quite strict: the free space in Old has to be filled up rapidly within a short time, for example by the premature promotion caused by dynamic age determination described earlier (see the incremental collection guarantee failure below). Promotion Failed can also be caused by memory fragmentation: Young GC believes Old has enough space, but at allocation time no contiguous region can be found for a large promoted object.

When CMS is used as the collector, an Old area that has been running for a while looks like the figure below: the sweep algorithm leaves memory split into many discontinuous segments, with a large number of fragments.

Fragmentation poses two problems:

Lower allocation efficiency: as mentioned above, when space is contiguous the JVM can allocate by pointer bumping. With a free list full of fragments, it has to visit the entries in the free list one by one to find an address that can hold the new object.
Lower space utilization: when an object promoted from the Young area is larger than the largest contiguous free block, Promotion Failed is triggered. Even if the total capacity of the Old area is sufficient, the object cannot be stored because the space is discontinuous. This is the problem discussed in this section.
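The second problem can be illustrated with a toy model. The sketch below is not how HotSpot implements its CMS free lists; it is a minimal first-fit free list with made-up chunk sizes, and the class name FreeListDemo is hypothetical. It only shows that a request can fail even though the total free space is larger than the request.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a fragmented free list; sizes are in arbitrary units. */
public class FreeListDemo {
    private final List<Integer> freeChunks = new ArrayList<>();

    public FreeListDemo(int... chunkSizes) {
        for (int size : chunkSizes) {
            freeChunks.add(size);
        }
    }

    /** Total free space summed over all chunks. */
    public int totalFree() {
        int sum = 0;
        for (int size : freeChunks) {
            sum += size;
        }
        return sum;
    }

    /**
     * First-fit allocation: walk the list until one chunk is large enough.
     * Returns true on success and shrinks (or removes) the chosen chunk.
     */
    public boolean allocate(int request) {
        for (int i = 0; i < freeChunks.size(); i++) {
            int size = freeChunks.get(i);
            if (size >= request) {
                if (size == request) {
                    freeChunks.remove(i);          // remove by index
                } else {
                    freeChunks.set(i, size - request);
                }
                return true;
            }
        }
        return false; // no single contiguous chunk can hold the request
    }

    public static void main(String[] args) {
        FreeListDemo old = new FreeListDemo(16, 8, 24, 8);
        System.out.println("total free  = " + old.totalFree());
        System.out.println("allocate 32 = " + old.allocate(32));
        System.out.println("allocate 20 = " + old.allocate(20));
    }
}
```

In this model the request for 32 units fails although 56 units are free in total, which is the situation that surfaces as Promotion Failed in a fragmented Old area.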

Incremental collection guarantee failure

After a memory allocation fails, the JVM checks two values against the remaining space in the Old area: the statistical average size promoted to Old per Young GC, and the size currently used in the Young area (that is, the largest amount that could possibly be promoted). As long as the free space in Old is larger than either of these two values, CMS considers promotion safe. Otherwise it is considered unsafe, the Young GC is skipped, and a Full GC is triggered directly.
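The check described above can be written as a single predicate. This is a minimal sketch of the rule as stated in the text, not a copy of the HotSpot source; the class, method, and parameter names are hypothetical.

```java
/** Sketch of the promotion-safety rule described in the text. */
public class PromotionGuarantee {

    /**
     * Promotion is treated as safe when the free space in Old is at least as
     * large as EITHER the average promoted size OR the currently used Young size.
     */
    static boolean promotionAttemptIsSafe(long oldFreeBytes,
                                          long avgPromotedBytes,
                                          long youngUsedBytes) {
        return oldFreeBytes >= avgPromotedBytes || oldFreeBytes >= youngUsedBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Old has 100 MB free, average promotion is 40 MB, Young uses 600 MB.
        System.out.println(promotionAttemptIsSafe(100 * mb, 40 * mb, 600 * mb));
        // Old has 30 MB free: smaller than both values, so the guarantee fails.
        System.out.println(promotionAttemptIsSafe(30 * mb, 40 * mb, 600 * mb));
    }
}
```

Because the two conditions are joined with "or", a small average promotion size is enough to pass the check even when the Young area holds far more data than Old could absorb, which is why a sudden burst of promotion can still end in Promotion Failed.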

Explicit GC

See Scenario 2 for this case.

Concurrent Mode Failure

The last case is also the most likely one: the keyword Concurrent Mode Failure is often seen in GC logs. It occurs when a concurrent Background CMS GC is in progress, a Young GC needs to promote objects into the Old area, and the Old area does not have enough space.

Why does a CMS GC degrade while it is already running? The main reason is that CMS cannot deal with floating garbage. During the concurrent sweep phase the Mutator is still running, so new garbage keeps being produced. This garbage falls outside the marking done for the current cycle and cannot be reclaimed by it; it is floating garbage. Objects whose references are severed before Remark, and which thus escape read/write barrier tracking, also count as floating garbage. For this reason the Old area collection threshold must not be set too high; otherwise the reserved space may be insufficient and a Concurrent Mode Failure results.
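The relationship between the threshold and the reserved space can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an estimate under simplifying assumptions (a constant promotion rate and a known concurrent cycle duration, both of which you would have to measure from your own GC logs); the class and method names are hypothetical.

```java
/** Rough headroom estimate for a concurrent CMS cycle; all inputs are assumptions. */
public class CmsHeadroom {

    /** Free bytes left in Old at the moment the concurrent cycle starts. */
    static long headroomBytes(long oldCapacityBytes, int initiatingOccupancyPercent) {
        return oldCapacityBytes * (100 - initiatingOccupancyPercent) / 100;
    }

    /** Bytes expected to be promoted into Old while the cycle is running. */
    static long promotedDuringCycle(long promotionBytesPerSec, double cycleSeconds) {
        return (long) (promotionBytesPerSec * cycleSeconds);
    }

    /** True when promotion during the cycle is expected to exceed the headroom. */
    static boolean likelyConcurrentModeFailure(long oldCapacityBytes,
                                               int initiatingOccupancyPercent,
                                               long promotionBytesPerSec,
                                               double cycleSeconds) {
        return promotedDuringCycle(promotionBytesPerSec, cycleSeconds)
                > headroomBytes(oldCapacityBytes, initiatingOccupancyPercent);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long oldCapacity = 2048 * mb;   // 2 GB Old area (example value)
        long promotionRate = 50 * mb;   // 50 MB/s promoted (example value)
        double cycleSeconds = 4.0;      // concurrent cycle length (example value)

        System.out.println("threshold 92%: risk = "
                + likelyConcurrentModeFailure(oldCapacity, 92, promotionRate, cycleSeconds));
        System.out.println("threshold 70%: risk = "
                + likelyConcurrentModeFailure(oldCapacity, 70, promotionRate, cycleSeconds));
    }
}
```

With these example numbers, a 92% threshold leaves less headroom than is promoted during one cycle, while a 70% threshold leaves enough. Real workloads are burstier than a constant rate, so the estimate is best treated as a lower bound on the headroom needed.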

4.7.3 Strategy

Once the specific causes are understood, they can be addressed one by one. The approach is still driven by the root cause, and the specific solutions are as follows:

Memory fragmentation: use -XX:+UseCMSCompactAtFullCollection to compact the space during Full GC (enabled by default; note that this applies to Full GC, not to an ordinary CMS GC), and -XX:CMSFullGCsBeforeCompaction=n to control how many Full GCs take place between compactions.
Incremental collection: lower the trigger threshold of CMS GC, that is, the value of -XX:CMSInitiatingOccupancyFraction, so that CMS GC runs as early as possible, enough contiguous space is kept available, and the used portion of the Old area is reduced. This must be combined with -XX:+UseCMSInitiatingOccupancyOnly; otherwise the JVM uses the configured value only the first time and adjusts it automatically afterward.
Floating garbage: depending on the situation, control the size of the objects promoted each time, or shorten the duration of each CMS GC, adjusting NewRatio if necessary. Another option is -XX:+CMSScavengeBeforeRemark, which triggers a Young GC earlier in the process to prevent too many objects from being promoted later.

4.7.4 Summary

Normally a CMS GC that runs in concurrent mode has very short pauses and little impact on the service. Once CMS GC degrades, however, the impact is severe, and it is advisable to fix the problem as soon as degradation is discovered. As long as the cause can be narrowed down to memory fragmentation, floating garbage, incremental collection, and so on, it is relatively easy to resolve. Regarding fragmentation, if it is hard to choose a value for -XX:CMSFullGCsBeforeCompaction, you can use -XX:PrintFLSStatistics to observe the fragmentation rate and then set a specific value.

Finally, during coding, avoid large objects that require contiguous address space, such as excessively long strings, byte arrays used to store attachments, or large buffers for serialization and deserialization. Premature promotion should also be dealt with before it escalates into an incident.

4.8 Scenario 8: Out-of-heap Memory OOM

4.8.1 Symptoms

Memory usage keeps rising, and even SWAP is used. At the same time GC time may surge, threads may be blocked, and the top command may show the RES of the Java process exceeding the -Xmx size. When these symptoms appear, it is almost certain that there is an off-heap memory leak.

4.8.2 Causes

JVM off-heap memory leaks have two main causes:

Off-heap memory is requested explicitly through Unsafe#allocateMemory or ByteBuffer#allocateDirect and is not released. This is common in NIO, Netty, and related components.
Memory requested by native code invoked through JNI is not released by that code.

4.8.3 Strategy

What causes the off-heap memory leak?

First, we need to determine which of the two causes is behind the leak. NMT (NativeMemoryTracking) can be used for this analysis. Add the JVM parameter -XX:NativeMemoryTracking=detail and restart the project (note that turning on NMT carries a 5% to 10% performance penalty). Then run jcmd pid VM.native_memory detail to view the memory distribution. Focus on the total line: the committed memory shown by jcmd includes the heap, the Code area, and memory requested through Unsafe.allocateMemory and DirectByteBuffer, but it does not include off-heap memory requested by other native code.

If the committed value in total is close to the RES shown by top, the leak comes from explicitly requested off-heap memory that was not released. If the two differ significantly, it is almost certain that the leak is caused by JNI calls.
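The comparison above amounts to a simple decision rule. The sketch below encodes it as a function; the 10% tolerance is an arbitrary assumption chosen for illustration (the article does not define what counts as "close"), and the class, enum, and method names are hypothetical. The two inputs are read manually from the jcmd and top output.

```java
/** Decision rule for the NMT-vs-RES comparison; the tolerance is an assumed value. */
public class OffHeapLeakTriage {

    enum Suspect {
        /** Memory the JVM tracks: Unsafe.allocateMemory, DirectByteBuffer, etc. */
        JVM_TRACKED_ALLOCATION,
        /** Memory requested by native code reached through JNI. */
        NATIVE_CODE_VIA_JNI
    }

    /**
     * @param nmtCommittedKb committed value of the "total" line in the NMT report
     * @param resKb          RES of the Java process as shown by top
     * @param tolerance      relative gap still considered "close", e.g. 0.10
     */
    static Suspect classify(long nmtCommittedKb, long resKb, double tolerance) {
        if (resKb <= 0) {
            throw new IllegalArgumentException("resKb must be positive");
        }
        double gap = Math.abs(resKb - nmtCommittedKb) / (double) resKb;
        return gap <= tolerance ? Suspect.JVM_TRACKED_ALLOCATION
                                : Suspect.NATIVE_CODE_VIA_JNI;
    }

    public static void main(String[] args) {
        // Example values in KB, not taken from a real process.
        System.out.println(classify(9_500_000L, 10_000_000L, 0.10));
        System.out.println(classify(6_000_000L, 10_000_000L, 0.10));
    }
}
```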

Cause 1: explicitly requested memory is not released

The JVM uses the -XX:MaxDirectMemorySize=size parameter to control the maximum amount of off-heap memory that can be requested. In Java 8, if this parameter is not configured, it defaults to the value of -Xmx.

Both NIO and Netty read the -XX:MaxDirectMemorySize setting to limit the amount of off-heap memory they request. Each also keeps a counter field that tracks the amount of off-heap memory currently requested: in NIO it is java.nio.Bits#totalCapacity, and in Netty it is io.netty.util.internal.PlatformDependent#DIRECT_MEMORY_COUNTER.

When requesting off-heap memory, NIO and Netty compare the counter with the maximum value and throw an OOM error if the counter would exceed it.

In NIO the error is: OutOfMemoryError: Direct buffer memory.

In Netty the error is: OutOfDirectMemoryError: failed to allocate capacity byte(s) of direct memory (used: usedMemory, max: DIRECT_MEMORY_LIMIT).

We can check how the code uses off-heap memory. For NIO or Netty, the counter field of the corresponding component can be read through reflection and reported as a metric in the project, so that the usage of this part of off-heap memory is monitored accurately.
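As an alternative to reflection for the NIO side, the standard management API exposes the direct buffer pool. The sketch below swaps in java.lang.management.BufferPoolMXBean instead of reading java.nio.Bits reflectively; it is a different technique from the one described above and, as far as this sketch assumes, covers only buffers allocated through ByteBuffer.allocateDirect, not Netty's own counter. The class name DirectMemoryMonitor is hypothetical.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

/** Reads NIO direct buffer usage through the platform BufferPoolMXBean. */
public class DirectMemoryMonitor {

    /** Bytes currently held by direct buffers, or -1 if the pool is not exposed. */
    static long directMemoryUsed() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Keep a reference so the buffer is not reclaimed before the second reading.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        long after = directMemoryUsed();
        System.out.println("direct memory used: " + before + " -> " + after + " bytes");
        System.out.println("buffer capacity: " + buffer.capacity() + " bytes");
    }
}
```

Reporting this value periodically to a metrics system gives a trend line that makes a slow leak visible long before the OOM error appears.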

In this case, you can debug to confirm that the code that frees the memory is actually executed wherever off-heap memory is used. Also check whether the JVM arguments contain -XX:+DisableExplicitGC and remove it if they do, because it turns System.gc into a no-op. (See Scenario 2: whether to keep or remove explicit GC.)
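Checking the flag can be automated at startup. The following is a minimal sketch that inspects the arguments reported by RuntimeMXBean; it assumes the flag is passed on the command line in its usual -XX:+/-XX:- form, and the class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Detects whether -XX:+DisableExplicitGC appears in the JVM input arguments. */
public class ExplicitGcFlagCheck {

    /** Returns true if the last DisableExplicitGC setting in the list enables the flag. */
    static boolean explicitGcDisabled(List<String> jvmArgs) {
        boolean disabled = false;
        for (String arg : jvmArgs) {
            if ("-XX:+DisableExplicitGC".equals(arg)) {
                disabled = true;
            } else if ("-XX:-DisableExplicitGC".equals(arg)) {
                disabled = false;
            }
        }
        return disabled;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (explicitGcDisabled(jvmArgs)) {
            System.out.println("WARNING: System.gc() is disabled; "
                    + "direct buffers may not be reclaimed in time");
        } else {
            System.out.println("explicit GC is not disabled");
        }
    }
}
```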

Cause 2: memory requested by native code invoked through JNI is not released

This case is harder to troubleshoot, but tools such as Google perftools and BTrace can help us find the offending code.

gperftools is a very useful tool set developed by Google. It works by substituting libtcmalloc.so for the default allocator while the Java application is running, so that it can collect statistics on memory allocation whenever malloc is called. We use gperftools to trace the calls that allocate memory. As shown in the figure below, gperftools reveals that Java_java_util_zip_Inflater_init is suspicious.

BTrace can then be used to locate the specific call stack. BTrace is a Java tracing and monitoring tool from Sun that can monitor a Java application online without stopping it. As shown in the figure below, BTrace locates ZipHelper in the project, which frequently calls GZIPInputStream and thereby allocates objects in off-heap memory.

In the end it turned out that the project was using GZIPInputStream incorrectly and was not calling close() properly.
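The fix for this kind of leak is to make sure the stream is always closed, since closing a GZIPInputStream that created its own Inflater also releases the native memory held by that Inflater. The following is a minimal sketch using try-with-resources; the class GzipCloseDemo and its methods are hypothetical and are not the ZipHelper from the project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Compresses and decompresses byte arrays, always closing the GZIP streams. */
public class GzipCloseDemo {

    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // close() finishes the GZIP trailer and frees the native Deflater.
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // try-with-resources closes the stream even if read() throws,
        // which in turn frees the native Inflater.
        try (GZIPInputStream gz =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "off-heap memory leak".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```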

Besides the project's own code, leaks can also come from external dependencies such as Netty and Spring Boot. For details, see these two articles: "Off-heap memory leak: investigation and lessons learned" and "Netty off-heap memory leak investigation".

4.8.4 Summary

Start with NMT + jcmd to determine where the leaked off-heap memory was requested. Once the cause is identified, use the appropriate means for that cause to locate the offending code.

4.9 Scenario 9: GC Problems caused by JNI

4.9.1 Symptoms

In the GC log, the GC Cause is GCLocker Initiated GC.

2020-09-23T16:49:09.727+0800: 504426.742: [GC (GCLocker GC) 504426.742: [ParNew (promotion failed): 209716K->6042K(1887488K), secs] 1449487K->1347626K(3984640K), secs] [Times: sys=0.00, real=0.09 secs]
2020-09-23T16:49:09.812+0800: 504426.827: [Full GC (GCLocker Initiated GC) 504426.827: 1347626K->419699K(3984640K), [Metaspace: [Times: user=1.62 sys=0.20, real=1.85 secs] [Times: user=1.62 sys=0.20, real=1.85 secs]

4.9.2 Causes

JNI stands for Java Native Interface. It allows Java code to interact with native code written in other languages.

When JNI needs to obtain a string or an array from the JVM, there are two ways to do so:

Pass a copy.
Share a reference (pointer), which performs better.

Because the native code works directly with a pointer into the JVM heap, a GC that occurs in the meantime could corrupt the data. Therefore, while such a JNI call is in progress, GC is disabled and other threads are prevented from entering the JNI critical section; a GC is triggered only when the last thread leaves the critical section.

GC Locker experiment:

public class GCLockerTest {

  static final int ITERS = 100;
  static final int ARR_SIZE = 10000;
  static final int WINDOW = 10000000;

  // Implemented in native code: acquire() enters a JNI critical section, release() leaves it.
  static native void acquire(int[] arr);
  static native void release(int[] arr);

  static final Object[] window = new Object[WINDOW];

  public static void main(String... args) throws Throwable {
    System.loadLibrary("GCLockerTest");
    int[] arr = new int[ARR_SIZE];

    for (int i = 0; i < ITERS; i++) {
      acquire(arr);
      System.out.println("Acquired");
      try {
        // Allocate heavily while the critical section is held, so that a GC is requested.
        for (int c = 0; c < WINDOW; c++) {
          window[c] = new Object();
        }
      } catch (Throwable t) {
        // omit
      } finally {
        System.out.println("Releasing");
        release(arr);
      }
    }
  }
}

#include <jni.h>
#include "GCLockerTest.h"

static jbyte* sink;

JNIEXPORT void JNICALL Java_GCLockerTest_acquire(JNIEnv* env, jclass klass, jintArray arr) {
  /* Enter the JNI critical section: GC is held off until the matching release. */
  sink = (*env)->GetPrimitiveArrayCritical(env, arr, 0);
}

JNIEXPORT void JNICALL Java_GCLockerTest_release(JNIEnv* env, jclass klass, jintArray arr) {
  /* Leave the critical section; a pending GC can now run. */
  (*env)->ReleasePrimitiveArrayCritical(env, arr, sink, 0);
}

Running this JNI program shows that every GC that occurs is a GCLocker Initiated GC. Note that no GC can take place between the "Acquired" and "Releasing" messages.

GC Locker can have the following adverse effects:

If the GC is needed because of an Allocation Failure in the Young area, the object is allocated directly in the Old area, since the Young GC cannot run.
If the Old area has no space left either, the thread waits for the lock to be released and is therefore blocked.
Extra, unnecessary Young GCs may be triggered. There is a bug in the JDK: where only a single GCLocker Initiated Young GC should have been triggered, an Allocation Failure GC occurs and is immediately followed by a GCLocker Initiated GC. The reason is that the GCLocker Initiated GC is flagged as full, so the two GCs are not merged into one.

4.9.3 Strategy

Add the -XX:+PrintJNIGCStalls parameter to print the threads involved when a JNI call occurs, then analyze further to find the JNI call that causes the problem.
Use JNI calls with care: they do not necessarily improve performance, and they can cause GC problems.
Upgrade the JDK to version 14 to avoid the repeated GC caused by JDK-8048556.

4.9.4 Summary

GC problems caused by JNI are difficult to troubleshoot, so JNI should be used with caution.

5. Summary

Here we summarize the content of the whole article to make it easier to review as a whole.

5.1 Processing Process (SOP)

The following figure shows the general processing flow for GC problems. The key points are called out separately below; the other steps are standard and are not described again here.

Set standards: this step is very important, yet most systems lack it. Fewer than 10% of the candidates the author has interviewed could state the GC standard of their own system; the rest rely on a uniform metrics template, which lacks foresight. For how to define the indicators, refer to 3.1. Concrete targets should be set based on the application's TP9999 latency, overall latency, throughput, and so on, rather than being driven by incidents.
Preserve the scene: online services today are basically distributed. When a node has a problem, do not restart or roll back directly if conditions permit. Restore service by removing traffic from the node first, so that the heap, thread stacks, GC logs, and other key information are preserved. Otherwise the chance to locate the root cause is lost and later resolution becomes much harder. Beyond these, application logs, middleware logs, kernel logs, and various metrics are also very helpful for analysis.
Causal analysis: to judge the causal relationship between GC anomalies and anomalies in other system indicators, refer to the four causal analysis methods introduced in 3.2 (timing analysis, probability analysis, experimental analysis, and proof by contradiction) to avoid going down the wrong path during the investigation.
Root cause analysis: once it is confirmed that GC really is the problem, use the tools mentioned above together with 5-Why root cause analysis, match against the nine common scenarios described earlier one by one, or refer directly to the root cause fishbone diagram below, to find the root cause and then choose an optimization.
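For the "set standards" step, a percentile such as TP9999 can be computed from recorded latency or pause samples. The sketch below uses the nearest-rank method, which is one common convention and an assumption here (the article does not say which definition it uses); the class name PausePercentiles and the sample values are hypothetical.

```java
import java.util.Arrays;

/** Nearest-rank percentile over a set of latency or pause samples. */
public class PausePercentiles {

    /**
     * Returns the smallest sample such that at least p percent of all samples
     * are less than or equal to it. p must be in (0, 100].
     */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Example samples in milliseconds, not measured data.
        long[] samples = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        System.out.println("TP50    = " + percentile(samples, 50));
        System.out.println("TP9999  = " + percentile(samples, 99.99));
    }
}
```

With only a handful of samples TP9999 is simply the maximum; the metric becomes meaningful only once tens of thousands of samples are collected, which is one reason the standard should be defined in advance rather than after an incident.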

5.2 Root cause fishbone diagram

In general, when dealing with a GC problem, locating the "focus" of the problem gets us 80% of the way to solving it. If the problem is hard to locate in some scenario, this root cause diagram can be used to narrow it down by elimination.

5.3 Tuning Suggestions

Trade-off: just as CAP forces one corner to be given up, GC optimization is a trade-off among Latency, Throughput, and Capacity.
Last resort: a GC problem does not necessarily call for tuning the JVM's GC parameters. Most of the time, GC is the lens through which a problem in the business code is found. Avoid adjusting GC parameters right from the start, except in clearly misconfigured scenarios.
Control variables: apply the control-variable approach when tuning, that is, change one factor at a time while holding the others fixed. As far as possible, adjust only one variable in each tuning round.
Make good use of search: in theory, 99.99% of GC problems have already been encountered and solved by someone. Learn the advanced features of search engines, and focus on StackOverflow, GitHub issues, and the various forums and blogs to see how others solved the problem. If you found this article, your search skills have basically passed the test.
Tuning priorities: the types of problems we encounter during development generally follow a normal distribution; the very simple and the very complex ones are both unlikely to come up. The author has marked the three most important scenarios with "*". After reading this article, it is worth checking whether the systems you are responsible for have these problems.
Keep the GC logs: if the heap or thread stacks cannot be preserved right away, be sure to keep the GC log, so that at least the GC Cause is visible and the investigation has a general direction. As for GC log parameters, the most basic ones such as -XX:+HeapDumpOnOutOfMemoryError are not repeated here; the author recommends adding further GC logging parameters to make analysis more efficient.

Other suggestions: these were not covered in the scenarios above, but they can also improve GC performance.
- Active GC: an alternative approach. The usage of the Old area is watched through monitoring. When it is about to reach the threshold, traffic to the application is removed and a Major GC is triggered manually, which avoids the pauses a CMS GC would cause, at the cost of reduced system robustness.
- Disable biased locking: biased locking is efficient when only one thread uses the lock, but under heavy contention the lock is upgraded to a lightweight lock, which first requires the bias to be revoked. The revocation is a STW operation. If every synchronized resource goes through this upgrade, the cost is very high. So when intense concurrency is known in advance, biased locking is generally disabled with -XX:-UseBiasedLocking to improve performance.
- Virtual memory: some operating systems (such as Linux) do not actually allocate physical memory to the JVM at startup, but allocate physical pages only on first use, which can also lengthen GC times. In this case, add the -XX:+AlwaysPreTouch parameter so that the VM touches every page in a loop at commit time and forces the memory to be physically allocated, avoiding page faults at runtime. In some large-memory scenarios this can reduce the duration of the first few GCs by an order of magnitude, but the parameter also slows down startup.
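The monitoring half of the "Active GC" suggestion can be sketched with the standard memory pool MXBeans. The code below only reads the Old area occupancy; removing traffic and triggering the Major GC are left out because they depend on your infrastructure. The pool-name matching and the 0.80 threshold are assumptions, and the class name OldGenWatcher is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Reads the occupancy of the Old area through MemoryPoolMXBean. */
public class OldGenWatcher {

    /** Heuristic: pool names such as "CMS Old Gen" or "Tenured Gen" are treated as Old. */
    static boolean looksLikeOldGen(String poolName) {
        return poolName.contains("Old Gen") || poolName.contains("Tenured");
    }

    /** Used/max ratio, or -1 when the maximum is undefined. */
    static double occupancy(long usedBytes, long maxBytes) {
        return maxBytes > 0 ? (double) usedBytes / maxBytes : -1;
    }

    /** Occupancy of the first Old pool found, or -1 if none is found. */
    static double oldGenOccupancy() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && looksLikeOldGen(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    return occupancy(usage.getUsed(), usage.getMax());
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double threshold = 0.80; // assumed alert threshold
        double occupancy = oldGenOccupancy();
        System.out.println("old gen occupancy = " + occupancy);
        if (occupancy >= threshold) {
            System.out.println("threshold reached: remove traffic, then trigger a Major GC");
        }
    }
}
```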

6. Closing remarks

Finally, here are some personal suggestions for dealing with GC problems. If you have the energy, be sure to trace things back to the source and find the cause at the deepest level. In this era of information overload, some experience that is treated as dogma may well be wrong, so try to get into the habit of reading the source code. There is a saying, "in front of the source code there are no secrets": when something is not understood, we can look inside the source code for answers, and in some scenarios this works wonders. But reading source code is not the only way to learn. If you grind through the source while ignoring the theory behind it, it is easy to "pick up the sesame seeds and lose the watermelon" or to "see the trees but not the forest", and "no secrets" becomes empty talk. We still need to learn in a targeted way, in combination with real business scenarios.

Where you spend your time is where your accomplishments will be. Over the past two years the author has gradually gone deeper into GC: investigating problems, reading the source code, and writing summaries, forming a small closed loop for each case. The author now has an initial grasp of some methods for handling GC problems and has applied these lessons in production, slowly forming a virtuous circle.

This article mainly analyzes some common CMS GC scenarios. Other problems, such as JIT failures caused by CodeCache issues, long times to reach a SafePoint, or slow Card Table scanning, are less common and are therefore not covered in depth. It took many years for Java GC to break through from the "generational" concept to partitioning. At Meituan, G1 has been used in place of CMS for many years. Although G1 is still slightly inferior to CMS on small heaps, it is the trend, and an upgrade to ZGC is not feasible in the short term, so G1 problems are likely to increase in the future. The author has already collected material on Remembered Set coarsening, Humongous allocation, Ergonomics anomalies, and Evacuation Failure in Mixed GC, along with suggestions for upgrading from CMS to G1, and will continue to organize it into an article. Please stay tuned.

"Fire prevention" is always better than "fire fighting". Do not let any small abnormal indicator slip by (generally speaking, any curve that is not smooth is suspicious); doing so may prevent a failure. As Java programmers we will all run into GC problems, and solving them independently is a hurdle we have to clear. As mentioned at the beginning, GC is a classic technology that is well worth studying. GC learning materials such as The Garbage Collection Handbook and In-depth Understanding of Java Virtual Machine are worth rereading regularly. Start practicing the basic GC skills now.

Finally, the first sentence of nearly every article on GC tuning is "do not optimize prematurely", which makes many engineers wary of GC optimization. The author offers a different view. The law of entropy increase (in an isolated system, if no external work is done, the total disorder, or entropy, keeps increasing) also applies to computer systems: if you do not actively do work to reduce entropy, the system will eventually slip out of your control. Once our understanding of the business system and of GC principles is deep enough, we can optimize boldly, because we can basically predict the result of every operation. Go for it!

Read more:

Analysis and Solution of nine Common CMS GC Problems in Java (Part 1)


8. Author introduction

Xinyu: Joined Meituan in 2015 as a business development engineer for hotel accommodation tickets.
Xiang Ming: Joined Meituan in 2018 as a customer platform development engineer.
Xiang Pu: Joined Meituan in 2018 as a customer platform development engineer.

9. Job postings

The accommodation and ticketing data intelligence team of Meituan's in-store business group is hiring. The team works on supply, control, selection, sales, and other areas to strengthen the competitiveness of the business across the board, handling hundreds of QPS and analyzing hundreds of millions of records within a complete business closed loop. There are many open positions. If you are interested, please send an email to [email protected] and we will contact you as soon as possible.

For more technical articles, follow the official WeChat account of the Meituan technical team.
