Elasticsearch garbage collector optimization

Background

ES version: 6.3.2
ES cluster configuration: 16-core CPU, 64 GB memory, and 200 GB disk
JDK version: 1.8
Garbage collector: CMS+ParNew

The services that use this cluster occasionally hit timeouts. Kibana monitoring shows that ES server CPU usage is high whenever the timeouts occur. In ES, young GC is frequent, while old GC is infrequent, occurring about 2-4 times a day.

Looking at the monitoring for the past hour, we find that young GC is quite frequent, and that a large number of objects end up in the old generation and are reclaimed by old GC.

Checking the GC log, 99% of the entries are young GCs caused by Allocation Failure, logged as GC (Allocation Failure). Allocation Failure means that a minor GC was triggered because a new object requested space in the young generation (Eden), but Eden did not have enough suitable space left for the requested size.

Desired survivor size 56688640 bytes, new threshold 6 (max 6)
- age   1:    6717288 bytes,    6717288 total
- age   2:    6025032 bytes,   12742320 total
- age   3:     987872 bytes,   13730192 total
- age   4:        176 bytes,   13730368 total
- age   5:        336 bytes,   13730704 total
- age   6:      93864 bytes,   13824568 total

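Desired survivor size: the target occupancy of a survivor space after a young GC, here 56688640 bytes. It is derived from the survivor space capacity and -XX:TargetSurvivorRatio (50% by default), so it is not the full capacity of the survivor space.
new threshold 6 (max 6): the promotion age threshold for the next GC is 6, and max 6 is the upper bound set by -XX:MaxTenuringThreshold. An object that has survived 6 young GCs is promoted to the old generation.
The age list is the age distribution of the objects in the survivor space after the current GC. If these objects are still alive at the next GC, those that reach the threshold (age 6, or the age at which the cumulative total exceeds 56688640 bytes) are promoted to the old generation.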
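
To make the relationship between the age table and the threshold concrete, here is a minimal sketch in Python. It is modeled loosely on how HotSpot is generally described as deriving the threshold; the function name and the AGE_TABLE_SIZE constant are assumptions made for illustration, not taken from the JVM source.

```python
# Sketch: derive a tenuring threshold from a survivor age table.
# AGE_TABLE_SIZE is an assumption for this illustration (ages 1..15 tracked).
AGE_TABLE_SIZE = 16


def compute_tenuring_threshold(bytes_by_age, desired_survivor_size, max_threshold):
    """Return the first age at which the running total of surviving bytes
    exceeds the desired survivor size, capped at max_threshold."""
    total = 0
    age = 1
    while age < AGE_TABLE_SIZE:
        total += bytes_by_age.get(age, 0)
        if total > desired_survivor_size:
            break
        age += 1
    return min(age, max_threshold)


# Age distribution copied from the GC log above.
log_ages = {1: 6717288, 2: 6025032, 3: 987872, 4: 176, 5: 336, 6: 93864}
threshold = compute_tenuring_threshold(log_ages, 56688640, 6)
print(threshold)  # 6, matching "new threshold 6 (max 6)" in the log
```

In this log the cumulative total (13824568 bytes) stays well below the desired survivor size, so the threshold stays at its maximum of 6.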

JVM garbage collection

Modern mainstream virtual machines (such as the HotSpot VM) use a “generational collection” algorithm for garbage collection. Generational collection rests on the fact that objects have different lifetimes, so objects with different lifetimes can be collected in different ways to improve collection efficiency.

Young generation: divided into three regions, one Eden region and two Survivor regions, with a default memory ratio of 8:1:1. Most objects are created in Eden. When Eden is full, the surviving objects are copied to one of the two Survivor regions. At the next minor GC, objects that survive in this Survivor region and do not meet the “promotion” criteria are copied to the other Survivor region. Each time an object survives a minor GC, its age is incremented by one, and when it reaches the “promotion age threshold”, it is moved to the old generation.
Old generation: objects that survive N garbage collections in the young generation are moved to the old generation.
Metaspace: Java 8 has no permanent generation. Instead, class metadata, such as Class and Method metadata, is stored in Metaspace.
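
As a worked example of the 8:1:1 ratio, the sketch below splits a young generation into Eden and two survivor spaces, and works backwards from the desired survivor size in the GC log to an approximate young generation size. The function names are hypothetical, the 50% target ratio is the default value of -XX:TargetSurvivorRatio, and real JVMs apply alignment, so actual sizes can differ slightly.

```python
def young_gen_layout(young_bytes, survivor_ratio=8):
    """Split a young generation into Eden and two survivor spaces.

    survivor_ratio is Eden size divided by ONE survivor space, so the
    young generation consists of (survivor_ratio + 2) equal parts.
    """
    part = young_bytes // (survivor_ratio + 2)
    return {"eden": part * survivor_ratio, "s0": part, "s1": part}


def survivor_capacity(desired_survivor_size, target_ratio_percent=50):
    """Invert: desired size = survivor capacity * target ratio / 100."""
    return desired_survivor_size * 100 // target_ratio_percent


# Desired survivor size taken from the GC log in the Background section.
one_survivor = survivor_capacity(56688640)      # 113377280 bytes
young_bytes = one_survivor * (8 + 2)            # 1133772800 bytes
layout = young_gen_layout(young_bytes)
print(layout, round(young_bytes / 2**30, 2))    # young generation is about 1.06 GiB
```

Under these assumptions the young generation works out to roughly 1 GB, which is small for a heap of this size.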

Object allocation procedure

If an object is large, exceeding the value set by -XX:PretenureSizeThreshold, it is allocated directly in the old generation.
Otherwise, the JVM requests space in Eden for the new object. If Eden does not have enough suitable space, a minor GC is triggered.
The minor GC processes the surviving objects in Eden and the From survivor space:
- If an object has reached MaxTenuringThreshold, it is promoted directly to the old generation.
- If an object to be copied is too large, it is not copied to the To survivor space but goes directly to the old generation.
- If the To survivor space has insufficient room, or runs out of room during copying, survivor overflow occurs and the object goes directly to the old generation.
- Otherwise, if the To survivor space has enough room, the surviving object is copied to the To survivor space.
After that, whatever remains in Eden and the From survivor space is garbage, so both are cleared and reclaimed, and the freed space becomes available for new allocations.
After the minor GC, if Eden has enough space, the new object is allocated in Eden. If Eden still does not have enough space, the new object is allocated directly in the old generation.
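
The placement rules applied to surviving objects during a minor GC can be sketched as follows. This is a simplified model rather than the JVM's implementation: the Obj class and the minor_gc function are hypothetical, ages are incremented before the comparison, and the "too large" and "overflow" cases are folded into a single capacity check.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    name: str
    size: int
    age: int  # number of minor GCs survived so far


def minor_gc(live_objects, to_capacity, max_tenuring_threshold):
    """Decide where each surviving object goes: To survivor or old generation."""
    to_space, promoted, used = [], [], 0
    for obj in live_objects:
        obj.age += 1
        if obj.age >= max_tenuring_threshold:
            promoted.append(obj)      # reached the age threshold
        elif used + obj.size > to_capacity:
            promoted.append(obj)      # too large, or To survivor overflow
        else:
            to_space.append(obj)      # fits: copy to To survivor
            used += obj.size
    return to_space, promoted


live = [Obj("a", 10, 5), Obj("b", 200, 0), Obj("c", 30, 0), Obj("d", 80, 1)]
to_space, promoted = minor_gc(live, to_capacity=100, max_tenuring_threshold=6)
print([o.name for o in to_space], [o.name for o in promoted])
```

Here "a" is promoted by age, "b" is larger than the whole To survivor space, "c" is copied, and "d" overflows the remaining room. The last two cases are how a small young generation pushes short-lived objects into the old generation early.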

Garbage collector

Young generation collectors: Serial, ParNew, and Parallel Scavenge.
Old generation collectors: CMS (a collector that aims for the shortest collection pause time, implemented with the “mark-sweep” algorithm), Serial Old, and Parallel Old.
The G1 collector covers both the young generation and the old generation (it is the default garbage collector since JDK 9).

ParNew+CMS working mechanism

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75    // start an old generation collection once its occupancy exceeds 75%
-XX:+UseCMSInitiatingOccupancyOnly

ParNew: uses the copying algorithm. In its basic form, this algorithm divides memory into two equally sized halves and, each time one half is used up, copies the surviving objects to the other half.
CMS: uses the mark-sweep algorithm. The whole process is divided into four steps:

Initial mark: STW. Marks the objects directly reachable from GC Roots. Very fast.
Concurrent mark: the GC Roots tracing process. Time consuming. Runs concurrently with the user threads.
Remark: STW. Corrects the marks of objects whose state changed because the program kept running during concurrent marking. Takes longer than the initial mark but much less time than the concurrent mark.
Concurrent sweep: time consuming. Runs concurrently with the user threads.
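
The mark-sweep idea behind these steps can be shown with a small single-threaded sketch. It is a toy model on a hypothetical object graph: it runs entirely "stop the world" and does not model the concurrent phases or the remark step that CMS adds.

```python
def mark(roots, references):
    """Return the set of objects reachable from the GC roots."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(references.get(obj, ()))
    return marked


def sweep(heap, marked):
    """Split the heap into survivors (marked) and garbage (unmarked)."""
    survivors = [o for o in heap if o in marked]
    garbage = [o for o in heap if o not in marked]
    return survivors, garbage


# Hypothetical heap: D and E reference each other but are unreachable.
heap = ["A", "B", "C", "D", "E"]
references = {"A": ["B"], "B": ["C"], "D": ["E"], "E": ["D"]}
marked = mark(["A"], references)
survivors, garbage = sweep(heap, marked)
print(survivors, garbage)  # ['A', 'B', 'C'] ['D', 'E']
```

Note that sweeping frees objects in place without moving the survivors, which is why mark-sweep collectors such as CMS tend to fragment the old generation over time.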

G1

Designed to replace the CMS collector, the G1 collector performs better than the CMS collector in the following ways:

G1 compacts memory as part of collection, so it does not produce much memory fragmentation.
G1’s Stop The World (STW) pauses are more controllable. G1 adds a prediction mechanism for pause times, allowing users to specify a desired pause time, and the number of regions collected in each cycle is chosen based on that target. G1 collects incrementally, a few regions at a time rather than the whole heap.
The G1 collection threads run concurrently with the application threads during the marking phase. When marking is finished, G1 knows which regions are mostly garbage, with very few live objects, and it starts with those regions because they free up a lot of space quickly. That’s why G1 is named garbage-first.

In G1, a generation does not occupy a contiguous address range. Each generation consists of n non-contiguous regions of the same size, and each region occupies a contiguous range of virtual memory addresses (figure: G1 heap divided into regions).

Remembered Set (RSet)

Logically, each region has an RSet. The RSet records references from objects in other regions to objects in this region.

Collection Set (CSet)

The CSet is the set of regions to be collected in a GC. The regions in the set can belong to any generation.
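
A rough sketch of what an RSet records is shown below. It tracks incoming references at region granularity for readability; as far as we know, the real G1 structure works at a finer granularity, so treat the build_rsets function and the data layout as assumptions made for illustration.

```python
from collections import defaultdict


def build_rsets(object_region, references):
    """Map each region to the set of OTHER regions that reference into it.

    object_region: dict of object name -> region name
    references: iterable of (source object, target object) pairs
    """
    rsets = defaultdict(set)
    for src, dst in references:
        src_region, dst_region = object_region[src], object_region[dst]
        if src_region != dst_region:      # same-region references are not recorded
            rsets[dst_region].add(src_region)
    return dict(rsets)


# Hypothetical objects and references spread over three regions.
object_region = {"a": "R1", "b": "R2", "c": "R2", "d": "R3"}
references = [("a", "b"), ("b", "c"), ("d", "b"), ("c", "a")]
rsets = build_rsets(object_region, references)
print(rsets)
```

With this information, collecting R2 only requires scanning R1 and R3 for incoming references, not the whole heap.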

G1 working mode

Young GC: collection of the young generation

A young GC is triggered when an ordinary (non-humongous) object is being allocated and all Eden regions have reached their maximum usage threshold, so there is not enough memory for the allocation. Each young GC reclaims all Eden and Survivor regions and copies the surviving objects to Old regions and to another set of Survivor regions.

Mixed GC

As more and more objects are promoted to old regions, the virtual machine triggers a mixed collection, namely mixed GC, to avoid running out of heap memory. A mixed GC is not an old GC: it reclaims the whole young generation as well as part of the old regions. Note that G1 can select only some of the old regions for collection, rather than all of them, in order to control the collection time.
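
The way a mixed GC can pick only some old regions under a pause-time target can be sketched as a greedy selection. This is a simplified assumption about the heuristic, not G1's actual implementation: the function name, the per-region cost estimates, and the numbers are hypothetical, and every cost is assumed to be greater than zero.

```python
def choose_old_regions(old_regions, pause_budget_ms, young_cost_ms):
    """Pick old regions for a mixed GC within a pause budget.

    old_regions: list of (name, reclaimable_bytes, estimated_cost_ms)
    The young generation is always collected, so its cost is paid first.
    """
    remaining = pause_budget_ms - young_cost_ms
    chosen = []
    # "Garbage first": most reclaimable bytes per millisecond come first.
    by_efficiency = sorted(old_regions, key=lambda r: r[1] / r[2], reverse=True)
    for name, _reclaimable, cost in by_efficiency:
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen


# Hypothetical regions: (name, reclaimable bytes, estimated cost in ms).
regions = [("O1", 900, 10), ("O2", 100, 10), ("O3", 800, 20), ("O4", 500, 5)]
chosen = choose_old_regions(regions, pause_budget_ms=50, young_cost_ms=20)
print(chosen)  # ['O4', 'O1', 'O2']
```

Region O3 is skipped because its estimated cost no longer fits in the remaining budget, which illustrates how the pause target limits the amount of old generation work done per cycle.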

Problem solving

The metrics show that the ES cluster allocates too little memory to the young generation, which causes the frequent young GCs. We checked with the following command:

jstat -gc pid 1000 1000

The output shows that the young generation has only about 1 GB of memory, while the old generation has 29 GB. One solution is to increase the size of the young generation, but the right size has to be found through experience and by iterating against the monitoring metrics.

In addition, our research found that some major Internet companies, such as Meituan and Ctrip, run ES with the G1 garbage collector. In summary, we decided to switch the ES garbage collector directly to G1.
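
To turn the jstat output into generation sizes, the capacity columns can be summed as in the sketch below. The column names (S0C, S1C, EC, OC, reported in KB) are those printed by jstat -gc on JDK 8; the sample row is hypothetical data constructed to resemble this cluster, not a capture from it.

```python
def parse_jstat_gc(text):
    """Parse the header and the last sample row of `jstat -gc` output."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    values = [float(v) for v in lines[-1].split()]
    return dict(zip(header, values))


def generation_capacity_gb(sample):
    """Return (young, old) capacity in GB from the KB capacity columns."""
    young_kb = sample["S0C"] + sample["S1C"] + sample["EC"]
    old_kb = sample["OC"]
    kb_per_gb = 1024 * 1024
    return young_kb / kb_per_gb, old_kb / kb_per_gb


# Hypothetical sample output (values invented for illustration).
SAMPLE = """
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
110720.0 110720.0  0.0  13500.0 885760.0 400000.0 30408704.0 15000000.0 80000.0 75000.0 10000.0 9000.0 120000 1800.000 40 4.000 1804.000
"""

young_gb, old_gb = generation_capacity_gb(parse_jstat_gc(SAMPLE))
print(round(young_gb, 2), round(old_gb, 2))  # 1.06 29.0
```

The YGC and YGCT columns from the same output give the young GC count and total young GC time, which is a quick way to confirm how frequent the young collections are.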

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

We replaced them with:

-XX:+UseG1GC
-XX:MaxGCPauseMillis=50
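
For a cluster with many nodes, the flag swap can be scripted. The sketch below rewrites the text of an options file, commenting out the CMS flags and appending the G1 flags. The switch_to_g1 function is hypothetical, the -Xms30g/-Xmx30g lines in the sample are assumed for illustration, and the "#" comment prefix follows the convention of the ES config/jvm.options file.

```python
CMS_FLAGS = (
    "-XX:+UseConcMarkSweepGC",
    "-XX:CMSInitiatingOccupancyFraction",
    "-XX:+UseCMSInitiatingOccupancyOnly",
)
G1_FLAGS = ["-XX:+UseG1GC", "-XX:MaxGCPauseMillis=50"]


def switch_to_g1(options_text):
    """Comment out the CMS flags and append the G1 flags."""
    out = []
    for line in options_text.splitlines():
        if line.strip().startswith(CMS_FLAGS):
            out.append("# " + line.strip())   # keep the old flag for reference
        else:
            out.append(line)
    out.extend(G1_FLAGS)
    return "\n".join(out) + "\n"


# Hypothetical options file content; heap size lines are assumed.
before = (
    "-Xms30g\n"
    "-Xmx30g\n"
    "-XX:+UseConcMarkSweepGC\n"
    "-XX:CMSInitiatingOccupancyFraction=75\n"
    "-XX:+UseCMSInitiatingOccupancyOnly\n"
)
after = switch_to_g1(before)
print(after)
```

Each node needs a restart for the new flags to take effect, so a rolling restart lets the upgraded and non-upgraded machines be compared side by side.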

Effect comparison (charts for machines not yet upgraded versus machines upgraded to G1):

The average young GC time is about 15 ms before the upgrade and about 1 ms after it.
Young GC is frequent before the upgrade, and the number of young GCs drops significantly after it.
CPU spikes occur occasionally before the upgrade, and CPU usage is stable after it.

With G1, the frequency and duration of young GC in ES are greatly reduced, and old GC almost never occurs.
