1. Background

In February 2021, we received feedback that a core interface of the video app responded slowly during peak hours, affecting the user experience.

Monitoring showed that the slow responses were mainly caused by high P99 latency, which was suspected to be related to the service's GC behavior. The GC performance of a typical instance of the service is shown below:

It can be seen that in the observation period:

  • The average number of Young GC was 66 times per 10 minutes, and the peak was 470 times.

  • The average number of Full GC was 0.25 times per 10 minutes, and the peak value was 5 times.

Full GC was clearly very frequent, and Young GC was also frequent during certain periods, so there was considerable room for optimization. Since reducing GC pauses is an effective way to lower an interface's P99 latency, we decided to tune the JVM of this core service.

2. Optimization objectives

  • Reduce interface P99 latency by 30%;

  • Reduce the Young GC and Full GC counts, cumulative pause time, and single pause time.

GC behavior is related to load: under high concurrency, Young GC will be frequent no matter how the JVM is tuned, and objects that should not be promoted will still end up triggering Full GC. The optimization goals are therefore set according to load:

Goal 1: High load (more than 1000 QPS per machine)

  • Reduce the Young GC count by 20%-30%, without the cumulative Young GC time getting worse;

  • Reduce the Full GC count by more than 50%, and reduce both single and cumulative Full GC time by more than 50%. Service deployment must not trigger Full GC.

Goal 2: Medium load (500-600 QPS per machine)

  • Reduce the Young GC count by 20%-30% and the cumulative Young GC time by 20%;

  • No more than 4 Full GCs per day. Service deployment must not trigger Full GC.

Goal 3: Low load (less than 200 QPS per machine)

  • Reduce the Young GC count by 20%-30% and the cumulative Young GC time by 20%;

  • No more than 1 Full GC per day. Service deployment must not trigger Full GC.

3. Existing problems

The JVM configuration parameters for the current service are as follows:

-Xms4096M -Xmx4096M -Xmn1024M
-XX:PermSize=512M
-XX:MaxPermSize=512M

From the parameters alone, the following problems can be identified:

**No collector is explicitly specified**

The default collector in JDK 8 is ParallelGC, i.e. Parallel Scavenge for the young generation plus Parallel Old for the old generation. This combination puts throughput first and is generally suited to background, task-oriented servers.

Typical examples are batch order processing, scientific computing, and other scenarios that are throughput-sensitive but latency-insensitive. The current service, however, is the entry point for user interaction with videos and is very latency-sensitive, so the default ParallelGC is a poor fit and a more suitable collector should be chosen.
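
As a quick sanity check, which collector a JDK build defaults to can be confirmed from the command line; on a typical JDK 8 server-class machine the printed ergonomics flags include -XX:+UseParallelGC (output abridged to the relevant flag):

# java -XX:+PrintCommandLineFlags -version
... -XX:+UseParallelGC ...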

**The Young generation ratio is unreasonable**

The service mainly provides APIs. It has a small number of resident objects, and most objects are short-lived, dying after one or two Young GCs.

Take a look at the current JVM configuration:

With -Xmn1024M and the default -XX:SurvivorRatio=8, the effective Young size (Eden plus one Survivor space) is only about 0.9 GB, while resident objects in the old generation amount to roughly 400 MB.
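
A minimal sketch of that arithmetic (assuming -Xmn1024M and the default SurvivorRatio of 8, i.e. Eden:S0:S1 = 8:1:1):

public class YoungGenMath {
    public static void main(String[] args) {
        long youngMb = 1024;                     // -Xmn1024M
        long survivorMb = youngMb / 10;          // each Survivor space: ~102 MB
        long edenMb = youngMb - 2 * survivorMb;  // Eden: ~820 MB
        // Only Eden plus one Survivor hold live allocations at any time,
        // so the effective Young size is roughly 0.9 GB.
        System.out.printf("Eden=%dM S0=S1=%dM effective=%dM%n",
                edenMb, survivorMb, edenMb + survivorMb);
    }
}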

This means that when the service load and request concurrency are high, the Eden + S0 portion of the Young generation fills up quickly and Young GC becomes more frequent.

In addition, objects that should have been collected by Young GC get promoted prematurely, which increases the Full GC frequency and the amount collected in a single pass. Since the old generation uses Parallel Old, which cannot run concurrently with user threads, the service suffers long pauses, reduced availability, and higher P99 response times.

**-XX:MetaspaceSize and -XX:MaxMetaspaceSize are not set**

The Perm generation was removed in JDK 8, so the -XX:PermSize=512M -XX:MaxPermSize=512M settings are simply ignored. The parameters that actually govern Metaspace GC are:

  • -XX:MetaspaceSize: the initial Metaspace size, i.e. the first high-water mark that triggers a Metaspace GC; the default is about 21 MB.

  • -XX:MaxMetaspaceSize: the maximum Metaspace size; on 64-bit machines the default is 18446744073709551615 bytes, which can be read as unlimited.

  • -XX:MaxMetaspaceExpansion: the maximum increment when raising the Metaspace GC threshold.

  • -XX:MinMetaspaceExpansion: the minimum increment when raising the Metaspace GC threshold; the default is 340784 bytes.

As a result, during service startup and deployment, a Full GC (Metadata GC Threshold) is triggered as soon as the Metaspace reaches the default 21 MB high-water mark, and several more follow as the Metaspace expands, making deployments less stable and less efficient.

In addition, if the service makes heavy use of dynamic class generation, this mechanism can trigger unnecessary Full GCs (Metadata GC Threshold).
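
To make this concrete, here is a contrived sketch (not the service's code) that keeps generating proxy classes; run with a small -XX:MetaspaceSize and -verbose:gc, and "Full GC (Metadata GC Threshold)" entries appear as the Metaspace fills:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
public class MetaspacePressure {
    interface Api {}
    public static void main(String[] args) {
        InvocationHandler h = (proxy, method, margs) -> null;
        while (true) {
            // A fresh ClassLoader forces Proxy to define a brand-new class,
            // steadily consuming Metaspace until its GC threshold is hit.
            ClassLoader cl = new ClassLoader() {};
            Proxy.newProxyInstance(cl, new Class<?>[]{Api.class}, h);
        }
    }
}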

4. Optimization and verification scheme

The analysis above has identified clear deficiencies in the current configuration. The optimization plan below focuses on fixing them, with further optimization to be decided based on the results.

The mainstream collectors currently available include:

  • Parallel Scavenge + Parallel Old: the JDK 8 default combination, throughput-first;

  • ParNew + CMS: a classic low-pause combination used by most latency-sensitive commercial services;

  • G1: the default collector since JDK 9, offering high throughput and short pauses on large heaps (6 GB-8 GB and above);

  • ZGC: a low-latency collector introduced in JDK 11, still experimental at the time;

Considering the realities of the current service (heap size, maintainability), choosing the ParNew + CMS combination is appropriate.

The principles for selecting parameters are as follows:

1) The Metaspace size must be specified explicitly, with MetaspaceSize and MaxMetaspaceSize set to the same value. The concrete size should be derived from online instances; jstat -gc can be run against a live instance of the service:

# jstat -gc 31247
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
37888.0 37888.0 0.0 32438.5 972800.0 403063.5 3145728.0 2700882.3 167320.0 152285.0 18856.0 16442.4 15189 597.209 65 70.447 667.655

It can be seen that MU (Metaspace used) is around 150 MB, so -XX:MetaspaceSize=256M -XX:MaxMetaspaceSize=256M is reasonable.

2) A larger Young generation is not necessarily better.

With a fixed heap size, the larger the Young generation, the less frequent Young GC becomes, but the smaller the old generation gets. If the old generation is too small, even modest object promotion will trigger Full GC.

If the Young generation is too small, Young GC becomes more frequent, and the larger old generation makes each Full GC pause longer. The Young size therefore has to be compared across several scenarios, based on the service's actual behavior, to find the most suitable configuration.

Based on the above principles, the following four parameter combinations are available:

1. ParNew + CMS, Young generation doubled

-Xms4096M -Xmx4096M -Xmn2048M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSScavengeBeforeRemark

2. ParNew + CMS, Young generation doubled, without CMSScavengeBeforeRemark

This scheme removes -XX:+CMSScavengeBeforeRemark (that flag makes CMS perform a young-generation GC just before the remark phase).

Because old-generation and young-generation objects hold cross-generational references, GC Roots tracing of the old generation also has to scan the young generation. If a young-generation GC runs right before remark, far fewer objects need to be scanned and the remark phase gets faster.

-Xms4096M -Xmx4096M -Xmn2048M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC

3. ParNew + CMS, Young generation increased by 50%

-Xms4096M -Xmx4096M -Xmn1536M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC 
-XX:+CMSScavengeBeforeRemark

4. ParNew + CMS, Young generation unchanged

-Xms4096M -Xmx4096M -Xmn1024M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC 
-XX:+CMSScavengeBeforeRemark

Next, the actual performance of the four schemes under different loads needs to be compared, analyzed, and verified in the stress-test environment.

4.1 Verification and analysis in the stress-test environment

GC performance in the high-load scenario (1100 QPS)

The results show that ParNew + CMS outperforms Parallel Scavenge + Parallel Old in the high-load scenario. Specifically:

  • The scheme with the Young generation increased by 50% (scheme 3) performed best: interface P95 and P99 latency dropped by 50% versus the current configuration, cumulative Full GC time dropped by 88%, the Young GC count dropped by 23%, and cumulative Young GC time dropped by 4%. With a larger Young generation, single Young GC time is likely to rise somewhat, which is in line with expectations.

  • The two schemes that doubled the Young generation (schemes 1 and 2) performed similarly: interface P95 and P99 latency dropped by 40% versus the current configuration, cumulative Full GC time dropped by 81%, the Young GC count dropped by 43%, and cumulative Young GC time dropped by 17%. Their overall performance was good, though slightly behind the 50%-expansion scheme. These two schemes are merged below and no longer distinguished.

Among the new schemes, the one leaving the Young generation unchanged performed worst and was eliminated. So in the medium-load scenario, only the doubled scheme and the 50%-expansion scheme need to be compared.

GC performance in the medium-load scenario (600 QPS)

ParNew + CMS also performs significantly better than Parallel Scavenge + Parallel Old in the medium-load scenario.

  • The doubled-Young scheme performed best: interface P95 and P99 latency dropped by 32% versus the current configuration, cumulative Full GC time dropped by 93%, the Young GC count dropped by 42%, and cumulative Young GC time dropped by 44%.

  • The 50%-expansion scheme was slightly less impressive here.

Overall, the two schemes perform very similarly and either would do in principle, but the 50%-expansion scheme does better during peak business hours. To guarantee stability and performance at the peak, ParNew + CMS with the Young generation increased by 50% is preferred for now.

4.2 Grayscale rollout and analysis

To make sure peak traffic was covered, online instances were randomly selected from the two equipment rooms and observed across Friday, Saturday, and Sunday, with the full rollout to follow once the instance metrics met expectations.

Target group: xx.xxx.60.6

using the target scheme (ParNew + CMS, Young generation increased by 50%):

-Xms4096M -Xmx4096M -Xmn1536M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC 
-XX:+CMSScavengeBeforeRemark

Control group 1: xx.xxx.15.215

using the original configuration:

-Xms4096M -Xmx4096M -Xmn1024M
-XX:PermSize=512M
-XX:MaxPermSize=512M

Control group 2: xx.xxx.40.87

using the candidate scheme (ParNew + CMS, Young generation doubled):

-Xms4096M -Xmx4096M -Xmn2048M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC 
-XX:+CMSScavengeBeforeRemark

Three machines were grayscaled in total.

Let's first analyze the Young GC metrics:

Young GC count

Young GC cumulative time

Single Young GC duration

Compared with the original scheme, the target scheme reduces the Young GC count by 50% and the cumulative time by 47%. Throughput improves and the service pauses far less often, at the cost of about 3 ms more per single Young GC, which is a very favorable trade.

The candidate scheme with the 2 GB Young generation performed slightly worse overall than the target scheme. Next, the Full GC metrics were analyzed.

Old-generation memory growth

Full GC count

Full GC cumulative/single time

Compared with the original scheme, the old generation grows much more slowly under the target scheme: Full GC occurrences within the observation window drop from 155 to 27, a decrease of 82%, and the mean pause drops from 399 ms to 60 ms, a decrease of 85%, with very few spikes.

The candidate scheme with the 2 GB Young generation was again inferior to the target scheme. At this point the target scheme is clearly superior to the original in every dimension, and the tuning goals are essentially met.

A careful observer will notice, however, that although the "Full GC" (actually CMS background GC) times are more stable than under the original scheme, a time-consuming spike still appears after every few "Full GCs", pausing user requests for 2-3 s at that moment. Can this be optimized further to give users an even better experience?

4.3 Further optimization

The first step is to analyze the logic behind this phenomenon.

The CMS collector's algorithm is mark-sweep, with optional compaction.

The CMS collector has two GC types:

CMS Background GC

This is the most common CMS GC type and is periodic: a resident JVM thread regularly scans old-generation usage and triggers a collection once usage exceeds a threshold. It uses mark-sweep, and the appearance of "CMS Initial Mark" in the GC log indicates that a CMS background GC has occurred.

Because background GC uses mark-sweep, it fragments old-generation memory, which is CMS's biggest weakness.

CMS Foreground GC

This type is the true Full GC of the CMS collector. It collects with Serial Old or Parallel Old, occurs rarely, and causes a long pause when it does.

Many scenarios can trigger a CMS foreground GC, including:

  • System.gc();

  • jmap -histo:live <pid>;

  • Insufficient Metaspace;

  • Promotion failure, marked as ParNew (promotion failed) in the GC log;

  • Concurrent mode failure, marked as concurrent mode failure in the GC log.

Since GC logging is not enabled online we cannot confirm it directly, but it is not hard to infer that the spikes in the target scheme come from promotion failures or concurrent mode failures; conveniently, both share the same root cause: old-generation memory fragmentation after several CMS background GCs.
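
Had we wanted to confirm the inference directly, GC logging could have been enabled on a single grayscale instance with the standard JDK 8 flags (the log path here is illustrative):

-Xloggc:/path/to/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime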

We therefore need to minimize the promotion failures and concurrent mode failures caused by old-generation fragmentation.

As noted, CMS background GC is triggered when old-generation usage, scanned periodically by a resident JVM thread, exceeds a threshold. The threshold is controlled by two parameters, -XX:CMSInitiatingOccupancyFraction and -XX:+UseCMSInitiatingOccupancyOnly. If they are not set, the first collection triggers at the default of 92%, and subsequent thresholds are adjusted dynamically based on predictions from past behavior.

If we instead fix the threshold at a value that is low enough to cut the probability of promotion failure and concurrent mode failure, yet not so low that GC becomes overly frequent, the spike frequency should drop sharply.

The heap layout of the target scheme is as follows:

  • Young generation: 1.5 GB

  • Old generation: 2.5 GB

  • Resident objects in the old generation: about 400 MB

Based on empirical data, 75% and 80% are both reasonable compromises, so we chose -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly for grayscale observation (we also ran a control experiment, and 75% performed better than 80%).
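
The rough headroom math behind that choice, sketched under the target scheme's heap layout (numbers are approximate):

public class CmsThresholdMath {
    public static void main(String[] args) {
        long oldGenMb = 2560;   // old generation: 2.5 GB
        long residentMb = 400;  // resident objects: ~400 MB
        int fraction = 75;      // -XX:CMSInitiatingOccupancyFraction=75
        long triggerMb = oldGenMb * fraction / 100;  // background GC starts at 1920 MB
        long runwayMb = triggerMb - residentMb;      // ~1520 MB of promotion before a cycle starts
        long reserveMb = oldGenMb - triggerMb;       // ~640 MB of headroom while CMS runs concurrently
        System.out.printf("trigger=%dM runway=%dM reserve=%dM%n",
                triggerMb, runwayMb, reserveMb);
    }
}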

The configuration of the final target scheme is as follows:

-Xms4096M -Xmx4096M -Xmn1536M 
-XX:MetaspaceSize=256M 
-XX:MaxMetaspaceSize=256M 
-XX:+UseParNewGC 
-XX:+UseConcMarkSweepGC 
-XX:+CMSScavengeBeforeRemark 
-XX:CMSInitiatingOccupancyFraction=75 
-XX:+UseCMSInitiatingOccupancyOnly

With the configuration above, one machine, xx.xxx.60.6, was grayscaled.

After re-optimization, the spikes caused by CMS foreground GC essentially disappear, which matches expectations.

Therefore, the final target configuration for the video service is:

-Xms4096M -Xmx4096M -Xmn1536M 
-XX:MetaspaceSize=256M 
-XX:MaxMetaspaceSize=256M 
-XX:+UseParNewGC 
-XX:+UseConcMarkSweepGC 
-XX:+CMSScavengeBeforeRemark 
-XX:CMSInitiatingOccupancyFraction=75 
-XX:+UseCMSInitiatingOccupancyOnly

5. Acceptance of results

The grayscale lasted about 7 days, covering both weekdays and the weekend, and the results met expectations, satisfying the conditions for a full online rollout. The post-rollout results are evaluated below.

Young GC count

Young GC cumulative time

Single Young GC duration

On the Young GC metrics, after tuning, the average Young GC count drops by 30%, the average cumulative Young GC time drops by 17%, and the average single Young GC time rises by about 7 ms. Young GC performance meets expectations.

Beyond the JVM work, we also made a business-side optimization. Before tuning, instances showed obvious, irregular Young GC spikes, depending on whether a scheduled task happened to be assigned to the instance; the cause was a scheduled business task that loads a large amount of data. During tuning this task was sharded and split across multiple instances, which makes Young GC much smoother.
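
A minimal sketch of that sharding idea (the names and the modulo scheme are hypothetical; the article does not detail the actual implementation):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;
public class ShardedTask {
    public static void main(String[] args) {
        int instanceCount = 4;  // hypothetical: number of service instances
        int instanceIndex = 1;  // hypothetical: this instance's shard index
        // Each instance loads only its slice of the data, so no single instance
        // absorbs the whole allocation burst and Young GC stays smooth.
        List<Long> myShard = LongStream.range(0, 1_000_000)
                .filter(id -> id % instanceCount == instanceIndex)
                .boxed()
                .collect(Collectors.toList());
        System.out.println("records on this instance: " + myShard.size());
    }
}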

Full GC single/cumulative time

On the Full GC metrics, the frequency and pause times of "Full GC" are greatly reduced, and true Full GCs have almost disappeared.

Core interface A (the most downstream dependencies): P99 response time down 19% (from 3457 ms to 2817 ms);

Core interface B (a moderate number of downstream dependencies): P99 response time down 41% (from 1647 ms to 973 ms);

Core interface C (the fewest downstream dependencies): P99 response time down 80% (from 628 ms to 127 ms).

Taken together, the overall results exceeded expectations. Young GC performance matches the goals, true Full GCs are essentially gone, and the P99 improvement of each interface depends on its number of downstream dependencies: the fewer the dependencies, the more pronounced the effect.

Closing thoughts

The complexity of GC algorithms, the sheer number of parameters that affect GC performance, and the fact that the right settings depend on each service's characteristics all make JVM tuning genuinely difficult.

Drawing on this video-service tuning experience, this article has focused on the tuning approach and how it was carried out, and has distilled a general tuning process that we hope offers some useful reference.

Authors: Li Guanyun, Jessica Chen, Internet Technology team, Vivo