1. Foreword

The main motivation for the Shenandoah garbage collection (GC) project in OpenJDK was to reduce garbage collection pauses. The original Shenandoah garbage collector was released in JDK 12. It implemented concurrent heap evacuation, which solved the main problem: cleaning up (potentially large) heaps without stopping the application. This version was eventually backported to JDK 11. In JDK 14 we implemented concurrent class unloading, and in JDK 16 we added concurrent reference processing, both of which further reduced pause times. The remaining garbage collection work still done during pauses was thread stack processing, which we addressed in JDK 17.

This article introduces the new concurrent thread stack processing in Shenandoah GC. Processing thread stacks concurrently in JDK 17 gives us reliable sub-millisecond pauses.

2. Thread processing in Java

What is thread processing, and why do we need to stop the application for it? Java programs execute in threads, and each thread has its own stack: a list of stack frames, each of which holds local variables, monitors, and other information about the method being executed. Most importantly in the context of Java garbage collection, a frame holds references to heap objects (for example, local variables of reference type).

When a garbage collection cycle begins, we first scan the stacks of all threads and seed the marking queue with the references we find there. We do this during a _GC pause_ (safepoint), because we need a consistent state of the stacks at the start of marking, without the threads' execution concurrently changing them. Once that is done, we resume the application and traverse the graph of reachable objects, starting from the references found during the initial thread scan.
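To make the role of the stack scan concrete, here is a minimal toy model in Java. It is only a sketch, not Shenandoah's implementation: the names ToyMarking, Obj, and markFrom are invented for illustration, and real stacks are not lists of lists. References found in stack frames seed the marking queue, and everything reachable from them is then marked.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical names, not HotSpot code.
class ToyMarking {
    // A heap object that may reference other heap objects.
    static final class Obj {
        final String name;
        final List<Obj> fields = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    // stacks: one list of frames per thread; each frame is a list of references (roots).
    static Set<Obj> markFrom(List<List<List<Obj>>> stacks) {
        Set<Obj> marked = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Obj> queue = new ArrayDeque<>();
        // Step 1: seed the marking queue with the references found in every frame.
        for (List<List<Obj>> stack : stacks) {
            for (List<Obj> frame : stack) {
                for (Obj ref : frame) {
                    if (ref != null && marked.add(ref)) queue.add(ref);
                }
            }
        }
        // Step 2: traverse the object graph, starting from the seeded roots.
        while (!queue.isEmpty()) {
            Obj current = queue.poll();
            for (Obj field : current.fields) {
                if (field != null && marked.add(field)) queue.add(field);
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c"), garbage = new Obj("garbage");
        a.fields.add(b);
        b.fields.add(c);
        // One thread with one frame whose only local variable references 'a'.
        List<List<List<Obj>>> stacks = List.of(List.of(List.of(a)));
        Set<Obj> live = markFrom(stacks);
        System.out.println(live.size() + " live objects; garbage marked: " + live.contains(garbage));
    }
}
```

In this model, a, b, and c are reachable from the single stack root, while the object named garbage is never marked because no frame and no live object refers to it.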

Similarly, when evacuating reachable objects to empty regions, we need to update all references on the thread stacks so that they point to the objects' new locations. We need to do this during a pause because the GC load barriers normally act when a reference is loaded from the heap (for example, into a local variable or register). This means that local variables and registers must not hold references to objects in a state that would require GC treatment, because by then it is too late to go through the GC barrier. Invoking a GC barrier on every local variable or register access would cause performance problems.

Scanning and processing thread stacks takes time. Small workloads (a small number of threads with small stacks) may take only milliseconds to scan, but large workloads (application servers, I'm looking at you!) can easily take tens of milliseconds to process. All of this processing is done while the application is stopped, so it affects the overall end-to-end latency of the application.

3. OpenJDK 17: Concurrent thread processing

How can we improve this situation and process thread stacks concurrently? We do this with a mechanism called stack watermarking, originally implemented by the ZGC developers. The central observation is that all thread stack activity takes place in the topmost frame: the currently executing method. All of the frames below it are essentially static and do not change, so they can be safely scanned concurrently by the garbage collector threads. All we need to do is coordinate the GC threads with the executing thread whenever a stack frame is destroyed (for example, by returning to the caller, or by throwing an exception that unwinds the stack). This coordination is achieved through the _stack watermark_, a pointer that tells us which part of the stack can be safely scanned, and a barrier that lets the garbage collector handle the returns.

4. Using stack watermarks during garbage collection

Let's consider an example. At the start of marking, during the initial pause, we set the stack watermark of each thread to its topmost frame and arm the thread. This means that all frames are considered safe for concurrent scanning, but none is (yet) safe to execute. When we leave the safepoint, control goes back to the Java program, so the topmost frame has to be made safe to execute. Here the stack watermark barrier comes into play: it lets the garbage collector process the top frame (and, for practical reasons, its caller). The thread scans these frames, lowers the watermark accordingly, and resumes execution at the point where it stopped before the safepoint. Meanwhile, the GC threads also start scanning the stack, from the bottom up to the watermark, that is, only in the safe zone.

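When the thread later returns from the top frame to its caller (or unwinds the stack because of an exception), the stack watermark barrier kicks in again and:

  1. Lowers the watermark by one frame.
  2. Prevents the GC threads from scanning beyond the watermark.
  3. Processes the frame that is now above the watermark by scanning it for references.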

The end result is that we effectively scan all the same frames and references as we previously did during the initial-mark pause, but we now do it concurrently, while the program is executing.
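The coordination described in this section can be sketched as a small, deterministic toy model in Java. This is an illustration under simplifying assumptions, not the HotSpot implementation: the class ToyStackWatermark and all of its methods are hypothetical, frames are plain array indices (0 is the bottom of the stack), and the interleaving of the application thread and the GC thread is simulated by calling their steps in sequence.

```java
// Toy model of stack watermarking: hypothetical names, not HotSpot code.
class ToyStackWatermark {
    private final boolean[] processed; // processed[i]: references in frame i were scanned
    private int top;                   // index of the currently executing frame
    private int watermark;             // GC may scan frames [0, watermark); the thread owns the rest
    private int gcCursor = 0;          // next frame the GC thread scans, bottom-up
    private int scans = 0;             // number of frame scans actually performed

    // Models the initial pause: every frame is scannable by GC, none is executable yet.
    ToyStackWatermark(int frameCount) {
        processed = new boolean[frameCount];
        top = frameCount - 1;
        watermark = frameCount;
    }

    private void processFrame(int i) {
        if (!processed[i]) { processed[i] = true; scans++; }
    }

    // Barrier work: make the top frame and its caller safe by lowering the watermark.
    private void makeTopExecutable() {
        int target = Math.max(0, top - 1);
        while (watermark > target) {
            watermark--;
            processFrame(watermark);
        }
    }

    // Application thread leaves the safepoint.
    synchronized void onSafepointExit() { makeTopExecutable(); }

    // Application thread returns from the top frame to its caller.
    synchronized void onReturn() {
        if (top > 0) top--;
        makeTopExecutable();
    }

    // GC thread scans one more frame; it never scans at or beyond the watermark.
    synchronized boolean gcScanNextFrame() {
        if (gcCursor >= watermark) return false;
        processFrame(gcCursor++);
        return true;
    }

    synchronized int watermark() { return watermark; }
    synchronized int scans() { return scans; }

    synchronized boolean allLiveFramesProcessed() {
        for (int i = 0; i <= top; i++) if (!processed[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        ToyStackWatermark stack = new ToyStackWatermark(5); // frames 0..4, frame 4 executing
        stack.onSafepointExit();            // thread processes frames 4 and 3
        stack.gcScanNextFrame();            // GC scans frame 0
        stack.onReturn();                   // frame 4 returns; thread processes frame 2
        while (stack.gcScanNextFrame()) { } // GC scans frame 1, then stops at the watermark
        System.out.println("watermark=" + stack.watermark() + ", scans=" + stack.scans()
                + ", all live frames processed=" + stack.allLiveFramesProcessed());
    }
}
```

In this run every frame is scanned exactly once, partly by the application thread (frames 4, 3, and 2) and partly by the GC thread (frames 0 and 1), which is the division of labor the watermark is meant to enforce.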

5. Shenandoah GC benchmark

So what impact do these changes have in practice? I have run a number of benchmarks that measure garbage collection pauses. The following table shows the average pause times across all benchmarks for JDK 11, JDK 16, and JDK 17. The difference between JDK 16 and JDK 17 shows the improvement that comes from concurrent stack processing. The JDK 11 figures are shown for completeness; the difference from JDK 11 also includes various other improvements made in the releases in between.

          Initial mark        Final mark
JDK 11    421 microseconds    1294 microseconds
JDK 16    321 microseconds    704 microseconds
JDK 17    63 microseconds     328 microseconds
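To put the table in relative terms, the following short Java sketch computes the percentage reduction between two releases. The class and method names are made up for this example; the numbers are taken directly from the table above. From JDK 16 to JDK 17, the two pauses shrink by roughly 80% and 53%.

```java
import java.util.Locale;

// Illustrative arithmetic on the table values; names are hypothetical.
class PauseReduction {
    // Percentage by which the pause after the change is shorter than before.
    static double reductionPercent(double beforeMicros, double afterMicros) {
        return 100.0 * (beforeMicros - afterMicros) / beforeMicros;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "First pause,  JDK 16 -> JDK 17: %.1f%% shorter", reductionPercent(321, 63)));
        System.out.println(String.format(Locale.ROOT,
                "Second pause, JDK 16 -> JDK 17: %.1f%% shorter", reductionPercent(704, 328)));
        System.out.println(String.format(Locale.ROOT,
                "First pause,  JDK 11 -> JDK 17: %.1f%% shorter", reductionPercent(421, 63)));
        System.out.println(String.format(Locale.ROOT,
                "Second pause, JDK 11 -> JDK 17: %.1f%% shorter", reductionPercent(1294, 328)));
    }
}
```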

6. Releases

The availability of Shenandoah varies by vendor and JDK version. Builds of OpenJDK 12 and later usually include Shenandoah by default. For OpenJDK 11, it has to be opted into at build time.

The known vendor status is:

  • Red Hat
    • Fedora 24+ ships OpenJDK 8+ builds that include Shenandoah
    • RHEL 7.4+ ships OpenJDK 8+ builds that include Shenandoah as a technology preview
    • Shenandoah is included in the 8u releases of Red Hat OpenJDK for Windows
  • Amazon
    • Shenandoah is available in Amazon Corretto starting with OpenJDK 11.0.9
  • Oracle
    • Does not ship Shenandoah in any of its releases, neither in its OpenJDK builds nor in its proprietary builds
  • Azul
    • Shenandoah ships in Azul Zulu starting with OpenJDK 11.0.9
  • OpenJDK
    • Shenandoah is included in the default binaries starting with OpenJDK 11.0.9
  • Linux distributions
    • Debian ships Shenandoah starting with OpenJDK 11.0.9
    • Gentoo ebuilds for IcedTea have the shenandoah USE flag
    • Distributions based on RHEL/Fedora, or other distributions that use their packages, may also have Shenandoah enabled. Notably, CentOS, Oracle Linux, and Amazon Linux are known to ship it.

7. Enable Shenandoah

Run your Java application with Shenandoah GC by passing the -XX:+UseShenandoahGC JVM option.

java -XX:+UseShenandoahGC <PATH_TO_YOUR_APPLICATION>
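One way to confirm from inside the application that the option took effect is to list the JVM's garbage collector MXBeans. The sketch below uses the standard java.lang.management API; the class name WhichGc and the check for the substring "Shenandoah" in the bean names are assumptions made for this example, so treat the result as a hint rather than a guarantee.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which collectors the running JVM exposes.
class WhichGc {
    // Names of the garbage collector MXBeans of the current JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    // Assumption: Shenandoah's collector beans carry "Shenandoah" in their names.
    static boolean looksLikeShenandoah(List<String> names) {
        for (String name : names) {
            if (name.contains("Shenandoah")) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> names = collectorNames();
        System.out.println("Collectors: " + names);
        System.out.println("Shenandoah active (by name): " + looksLikeShenandoah(names));
    }
}
```

Running this class once with and once without -XX:+UseShenandoahGC should show different collector names if the JVM build includes Shenandoah.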

7.1 Modes

Modes define the principal way Shenandoah works. The mode determines which barriers (if any) Shenandoah uses, and it defines the main performance characteristics. You can select a mode with -XX:ShenandoahGCMode=<name>. The available modes are:

  1. normal/satb (default). This mode runs a concurrent GC with snapshot-at-the-beginning (SATB) marking. This marking mode works similarly to G1: it intercepts writes and marks through the "previous" objects.
  2. iu (experimental). This mode runs a concurrent GC with incremental update (IU) marking. This marking mode is the mirror image of SATB: it intercepts writes and marks through the "new" objects. This may make marking less conservative, especially around accesses to weak references.
  3. passive (diagnostic). This mode runs stop-the-world GCs. It is used for functional testing, but it is sometimes also useful for bisecting performance anomalies caused by GC barriers, or for determining the actual live data size of an application.
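As a rough illustration of how the modes translate into command lines, the following Java sketch assembles the JVM arguments for a given mode. It rests on two assumptions that you should verify against your JDK: that the mode values are spelled satb, iu, and passive, and that the experimental and diagnostic modes must be unlocked with -XX:+UnlockExperimentalVMOptions and -XX:+UnlockDiagnosticVMOptions respectively. The class ShenandoahModeArgs is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the flag spellings and unlock requirements are assumptions.
class ShenandoahModeArgs {
    static List<String> jvmArgs(String mode) {
        List<String> args = new ArrayList<>();
        switch (mode) {
            case "satb":
                break; // default mode, nothing to unlock
            case "iu":
                args.add("-XX:+UnlockExperimentalVMOptions"); // assumed: experimental mode
                break;
            case "passive":
                args.add("-XX:+UnlockDiagnosticVMOptions"); // assumed: diagnostic mode
                break;
            default:
                throw new IllegalArgumentException("unknown Shenandoah mode: " + mode);
        }
        args.add("-XX:+UseShenandoahGC");
        args.add("-XX:ShenandoahGCMode=" + mode);
        return args;
    }

    public static void main(String[] args) {
        for (String mode : new String[] {"satb", "iu", "passive"}) {
            System.out.println("java " + String.join(" ", jvmArgs(mode)) + " <PATH_TO_YOUR_APPLICATION>");
        }
    }
}
```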

7.2 Basic Configuration

Basic configuration and command-line options:

  • -Xlog:gc (since JDK 9) or -verbose:gc (up to JDK 8) prints the individual GC timings.
  • -Xlog:gc+ergo (since JDK 9) or -XX:+PrintGCDetails (up to JDK 8) additionally prints the heuristics decisions, which may shed light on outliers (if any).
  • -Xlog:gc+stats (since JDK 9) or -verbose:gc (up to JDK 8) prints a summary table of Shenandoah's internal timings at the end of the run.
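To see these logging options in action, you need a program that actually allocates. The following small Java class is a hypothetical test workload written for this purpose: it creates many short-lived arrays while keeping only a small rolling window alive, so the collector has garbage to reclaim. The class name and the sizes are arbitrary choices.

```java
// Hypothetical allocation workload for trying out the GC logging options.
class AllocationChurn {
    // Allocates 'iterations' arrays of 'arraySize' bytes, keeping only the last 64 reachable.
    static long churn(int iterations, int arraySize) {
        byte[][] window = new byte[64][];
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] chunk = new byte[arraySize];
            chunk[0] = (byte) i;
            window[i % window.length] = chunk; // the chunk previously stored here becomes garbage
            checksum += chunk[0];
        }
        return checksum;
    }

    public static void main(String[] args) {
        // Roughly 2 GB of allocation in total, almost all of it short-lived.
        System.out.println("checksum: " + churn(2_000_000, 1024));
    }
}
```

Running it with, for example, java -XX:+UseShenandoahGC -Xlog:gc AllocationChurn.java should produce one log line per GC event, provided your JDK build includes Shenandoah.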

Running with logging enabled is almost always a good idea. The summary table conveys important information about GC performance, and we will almost inevitably ask for it in performance bug reports. Heuristics logging is useful for tracking down GC outliers.

Other recommended JVM options are:

  • -XX:+AlwaysPreTouch: Commits heap pages into memory up front, which helps to reduce latency hiccups.
  • -Xms and -Xmx: Setting -Xms = -Xmx makes the heap non-resizable and reduces heap management problems. In combination with AlwaysPreTouch, -Xms = -Xmx commits all memory at startup, which avoids hiccups when the memory is finally used. -Xms also defines the lower bound for memory uncommit, so with -Xms = -Xmx all memory stays committed. That said, if you want to configure Shenandoah for a smaller footprint, a lower -Xms is recommended. You need to decide how low to set it in order to balance the commit/uncommit overhead against the memory footprint. In many cases, you can set -Xms arbitrarily low.
  • Large pages: Using large pages greatly improves performance on large heaps. There are two ways to opt in. -XX:+UseLargePages enables hugetlbfs support on Linux or large page support on Windows (with appropriate privileges). -XX:+UseTransparentHugePages enables large pages transparently. For transparent huge pages, it is recommended to set /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag to "madvise". When running with AlwaysPreTouch, the defragmentation cost is also paid up front at startup.
  • -XX:+UseNUMA: Although Shenandoah does not explicitly support NUMA yet, it is a good idea to enable NUMA interleaving on multi-socket hosts. Combined with AlwaysPreTouch, it provides better performance than the default out-of-the-box configuration.
  • -XX:-UseBiasedLocking: There is a tradeoff between the throughput of uncontended (biased) locking and the safepoints the JVM needs to enable and disable biased locks. For latency-oriented workloads, it makes sense to turn biased locking off.
  • -XX:+DisableExplicitGC: Calling System.gc() from user code forces Shenandoah to perform an additional GC cycle. This usually does no harm, because -XX:+ExplicitGCInvokesConcurrent is enabled by default, so the call triggers a concurrent GC cycle rather than a stop-the-world (STW) Full GC. Disabling explicit GC may still be beneficial to guard against code that abuses System.gc().
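To check whether the heap sizing options above were picked up, the application can query its own heap limits through the standard MemoryMXBean. This is a minimal sketch: the class HeapSizingCheck and its isFixedSize helper are invented for this example, and the exact init and max values reported can vary slightly between collectors and JVM builds.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper that reports the heap sizing the JVM is actually running with.
class HeapSizingCheck {
    // True if the initial and maximum heap sizes are defined and equal (as with -Xms = -Xmx).
    static boolean isFixedSize(long initBytes, long maxBytes) {
        return initBytes > 0 && initBytes == maxBytes;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("init=" + heap.getInit()
                + " committed=" + heap.getCommitted()
                + " max=" + heap.getMax());
        System.out.println("fixed-size heap: " + isFixedSize(heap.getInit(), heap.getMax()));
    }
}
```

With a low -Xms and a higher -Xmx, the committed value is the one to watch over time, since it reflects how much memory the JVM currently holds.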

8. Conclusion

This article explained how concurrent thread stack processing in Shenandoah GC removes the remaining source of garbage collection pause time and provides reliable sub-millisecond garbage collection pauses in JDK 17. For more information, visit the GitHub repository and the OpenJDK Wiki page for the Shenandoah GC project.

9. Translator's note

On top of the original article, we added sections 6 and 7, which cover availability and how to use Shenandoah. The translator's Spring Cloud microservice component, MICA, has also been adapted to Java 17, and a PR has been submitted to MyBatis-Plus. More Java 17 articles are coming soon, so please follow me!

Original article: developers.redhat.com/articles/20…