1. Background
In February 2021, I received feedback that a core interface of the video app responded slowly during peak periods, affecting the user experience.
Monitoring showed that the slow responses were mainly driven by high P99 latency, which we suspected was related to the service's GC. A typical instance of the service showed GC behavior as in the following figure:
It can be seen that in the observation period:
- Young GC occurred 66 times per 10 minutes on average, peaking at 470 times;
- Full GC occurred 0.25 times per 10 minutes on average, peaking at 5 times;
Full GC was clearly very frequent, and Young GC was also quite frequent during certain periods, leaving plenty of room for optimization. Since reducing GC pauses is an effective way to lower interface P99 latency, we decided to tune the JVM of this core service.
2. Optimization goals
- Reduce interface P99 latency by 30%;
- Reduce the count, cumulative pause time, and single pause time of both Young GC and Full GC.
GC behavior depends on load. For example, under high concurrency, Young GC will always be frequent no matter how you tune, and objects that should not be promoted will occasionally trigger Full GC. The optimization goals are therefore set separately per load level:
Goal 1: High load (above 1000 QPS per machine)
- Reduce Young GC count by 20%-30%, without degrading cumulative Young GC time;
- Reduce Full GC count by more than 50%, reduce single and cumulative Full GC time by more than 50%, and ensure service releases do not trigger Full GC.
Goal 2: Medium load (500-600 QPS per machine)
- Reduce Young GC count by 20%-30% and cumulative Young GC time by 20%;
- No more than 4 Full GCs per day; service releases do not trigger Full GC.
Goal 3: Low load (below 200 QPS per machine)
- Reduce Young GC count by 20%-30% and cumulative Young GC time by 20%;
- No more than 1 Full GC per day; service releases do not trigger Full GC.
3. Current problems
The JVM configuration parameters of the current service are as follows:
-Xms4096M -Xmx4096M -Xmn1024M
-XX:PermSize=512M
-XX:MaxPermSize=512M
Analyzing purely from the parameters, there are the following problems:
No collector is explicitly specified
The default collector on JDK 8 is ParallelGC, i.e. Parallel Scavenge for the Young generation and Parallel Old for the old generation. This combination prioritizes throughput and is generally suited to background task servers.
For example, batch order processing and scientific computing are throughput-sensitive but latency-insensitive. The current service is an entry point for video and user interaction and is very latency-sensitive, so the default ParallelGC is a poor fit; a more suitable collector should be chosen.
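As a quick way to verify which collectors a JVM instance actually ended up with (useful both before and after a change like this), the standard management API can be queried. A minimal sketch; note that the reported names vary by JVM version and flags:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ShowCollectors {
    // Returns the names of the collectors the current JVM is running with,
    // e.g. ["PS Scavenge", "PS MarkSweep"] under the JDK 8 default
    // ParallelGC, or ["ParNew", "ConcurrentMarkSweep"] under ParNew + CMS.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        collectorNames().forEach(System.out::println);
    }
}
```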
The Young generation ratio is unreasonable
The current service mainly exposes APIs. Services of this type have few resident objects; most objects are short-lived and die after one or two Young GCs.
Looking at the current JVM configuration:
The whole heap is 4G and the Young generation is 1G. With the default -XX:SurvivorRatio=8, the effective Young size (Eden plus one Survivor) is about 0.9G, while resident objects in the old generation amount to roughly 400M.
This means that under high load and high request concurrency, Eden plus the active Survivor fills up quickly and Young GC becomes more frequent. In addition, objects that should have been reclaimed by Young GC get promoted prematurely, increasing Full GC frequency and the amount collected per cycle. Since the old generation uses Parallel Old, which cannot run concurrently with user threads, the service suffers long stop-the-world pauses, availability drops, and P99 response time rises.
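The 0.9G figure can be reproduced with a little arithmetic. A minimal sketch, assuming the default -XX:SurvivorRatio=8 (Eden : S0 : S1 = 8 : 1 : 1, with only Eden plus one Survivor usable between collections):

```java
public class YoungGenSizing {
    // Effective Young size = Eden + one Survivor; the other Survivor is
    // always empty and only serves as the copy target during a Young GC.
    static long effectiveYoungMB(long youngMB, long survivorRatio) {
        long survivorMB = youngMB / (survivorRatio + 2); // each Survivor
        long edenMB = youngMB - 2 * survivorMB;
        return edenMB + survivorMB;
    }

    public static void main(String[] args) {
        // -Xmn1024M with SurvivorRatio=8: Eden ~820M, each Survivor ~102M,
        // effective size ~922M, i.e. roughly the 0.9G quoted in the text.
        System.out.println(effectiveYoungMB(1024, 8) + "M"); // prints "922M"
    }
}
```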
-XX:MetaspaceSize and -XX:MaxMetaspaceSize are not set
The Perm generation is obsolete in JDK 1.8 and has been replaced by Metaspace, so the -XX:PermSize=512M -XX:MaxPermSize=512M settings are simply ignored. The parameters that actually govern Metaspace GC are:
- -XX:MetaspaceSize: the initial Metaspace size, about 21M by default on 64-bit machines;
- -XX:MaxMetaspaceSize: the maximum Metaspace size, 18446744073709551615 bytes by default on 64-bit machines, effectively unlimited;
- -XX:MaxMetaspaceExpansion: the maximum increment when raising the Metaspace GC trigger threshold;
- -XX:MinMetaspaceExpansion: the minimum increment when raising the Metaspace GC trigger threshold, 340784 bytes by default.
As a result, during service startup and release, a Full GC (Metadata GC Threshold) is triggered once the metadata area reaches 21M, and several more Full GC (Metadata GC Threshold) occur as the area expands, degrading the stability and efficiency of releases.
Furthermore, if the service makes heavy use of dynamic class generation, this mechanism also causes unnecessary Full GC (Metadata GC Threshold).
4. Optimization plan / verification plan
The analysis above identified the obvious shortcomings of the current configuration. The following plan targets those problems first; whether to optimize further will be decided based on the results.
The current mainstream/well-regarded collectors include:
- Parallel Scavenge + Parallel Old: throughput-first, suitable for background task services;
- ParNew + CMS: the classic low-pause combination, used by most commercial and latency-sensitive services;
- G1: the default collector since JDK 9, offering relatively high throughput and short pauses when the heap is large (above 6G-8G);
- ZGC: a low-latency collector introduced in JDK 11, still experimental at the time.
Given the actual situation of the current service (heap size, maintainability), ParNew + CMS is the most appropriate choice.
The principle of parameter selection is as follows:
1) The Metaspace size must be specified explicitly, with MetaspaceSize and MaxMetaspaceSize set to the same value. The concrete size should be derived from online instances, which can be inspected with jstat -gc:
# jstat -gc 31247
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
37888.0 37888.0 0.0 32438.5 972800.0 403063.5 3145728.0 2700882.3 167320.0 152285.0 18856.0 16442.4 15189 597.209 65 70.447 667.655
It can be seen that MU (Metaspace used) is around 150M, so -XX:MetaspaceSize=256M -XX:MaxMetaspaceSize=256M is a reasonable choice.
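As a side note, the same MU figure that jstat reports can also be read in-process via JMX, which is convenient for dashboards. A minimal sketch; the pool name "Metaspace" is HotSpot-specific:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceUsage {
    // Returns current Metaspace usage in MB, or -1 if the pool is not
    // exposed under this name on the running JVM.
    static long metaspaceUsedMB() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed() / (1024 * 1024);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Metaspace used: " + metaspaceUsedMB() + "M");
    }
}
```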
2) The Young generation is not "the bigger the better".
With a fixed heap size, a larger Young generation lowers Young GC frequency but shrinks the old generation; if the old generation becomes too small, even modest promotion will trigger Full GC.
Conversely, a Young generation that is too small makes Young GC more frequent, and the correspondingly larger old generation makes each Full GC pause longer. The Young size therefore has to be compared across several scenarios against the service's actual behavior to find the most suitable configuration.
Based on the above principles, the following are 4 parameter combinations:
1. ParNew + CMS, the Young region doubled
-Xms4096M -Xmx4096M -Xmn2048M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSScavengeBeforeRemark
2. ParNew + CMS, the Young region doubled, with -XX:+CMSScavengeBeforeRemark removed.
(-XX:+CMSScavengeBeforeRemark performs a Young GC before the remark phase. Because of cross-generational references between old and young objects, tracing GC Roots in the old generation also scans the young generation; running a Young GC first leaves fewer objects to scan, improving the performance of the remark phase.)
-Xms4096M -Xmx4096M -Xmn2048M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
3. ParNew + CMS, the Young region expanded by 0.5 times
-Xms4096M -Xmx4096M -Xmn1536M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSScavengeBeforeRemark
4. ParNew + CMS, the Young region unchanged
-Xms4096M -Xmx4096M -Xmn1024M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSScavengeBeforeRemark
Next, we compare, analyze, and verify the actual performance of the four schemes under different loads in a stress-test environment.
4.1 Stress-test environment verification / analysis
High-load scenario (1100 QPS) GC performance
It can be seen that under high load, all four ParNew + CMS schemes far outperform Parallel Scavenge + Parallel Old on every indicator. Specifically:
- The scheme that expanded the Young region by 0.5 times performed best: interface P95 and P99 latency dropped by 50% versus the current configuration, cumulative Full GC time dropped by 88%, Young GC count dropped by 23%, and cumulative Young GC time dropped by 4%. With a larger Young region the collection count falls, but each Young GC covers more memory and takes longer, which matches expectations.
- The two schemes that doubled the Young region performed similarly: interface P95 and P99 latency dropped by 40% versus the current configuration, cumulative Full GC time dropped by 81%, Young GC count dropped by 43%, and cumulative Young GC time dropped by 17%. Slightly worse than the 0.5x expansion, but still good overall; the two are merged and no longer distinguished below.
- The scheme that left the Young region unchanged performed worst among the new schemes and was eliminated. In the medium-load scenario we therefore only need to compare the doubled scheme with the 0.5x-expanded scheme.
Medium-load scenario (600 QPS) GC performance
It can be seen that under medium load, the two remaining ParNew + CMS schemes (doubled and 0.5x-expanded) again far outperform Parallel Scavenge + Parallel Old on every indicator.
- The doubled scheme performed best: interface P95 and P99 latency dropped by 32% versus the current configuration, cumulative Full GC time dropped by 93%, Young GC count dropped by 42%, and cumulative Young GC time dropped by 44%;
- The 0.5x-expanded scheme was somewhat inferior.
Taken together, the two schemes perform very similarly and either would do in principle, but the 0.5x expansion behaves better during business peaks. To guarantee stability and performance at peak time, we currently prefer ParNew + CMS with the Young region expanded by 0.5 times.
4.2 Gray release scheme / analysis
To make sure business peaks were covered, we randomly selected online instances from the two data centers and ran them in gray release over Friday, Saturday, and Sunday; once the instance metrics met expectations, a full rollout would follow.
Target group: xx.xxx.60.6
Uses the target scheme (Young region expanded by 0.5 times):
-Xms4096M -Xmx4096M -Xmn1536M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSScavengeBeforeRemark
Control group 1: xx.xxx.15.215
Uses the original configuration:
-Xms4096M -Xmx4096M -Xmn1024M
-XX:PermSize=512M
-XX:MaxPermSize=512M
Control group 2: xx.xxx.40.87
Uses the candidate scheme (Young region doubled):
-Xms4096M -Xmx4096M -Xmn2048M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSScavengeBeforeRemark
Three machines in total were put on gray release.
Let's first analyze the relevant indicators of Young GC:
Young GC times
Young GC accumulated time
Young GC single time consumption
It can be seen that, compared with the original configuration, the target scheme reduces Young GC count by 50% and cumulative Young GC time by 47%. Throughput rises and the service pauses far less often, at the cost of a 3ms increase in single Young GC time, a very favorable trade.
The scheme with a 2G Young region performed slightly worse than the target scheme overall. Next, let's analyze the Full GC indicators.
Memory growth in the old age
Full GC times
Full GC cumulative/single time consumption
Compared with the original configuration, the target scheme slows old-generation growth considerably: Full GC occurrences during the observation period drop from 155 to 27 (an 82% reduction), the average pause drops from 399ms to 60ms (an 85% reduction), and glitches are rare.
Control group 2, i.e. the 2G-Young scheme, again performed worse than the target scheme overall. At this point the target scheme is far superior to the original configuration in every dimension, and the tuning goals are essentially met.
However, careful readers will notice that under the target scheme the "Full GC" (actually CMS Background GC) time consumption is more stable than under the original configuration, yet every so often a "Full GC" is followed by a very expensive glitch, meaning user requests pause for 2-3 seconds at that moment. Can we optimize further and give users an even smoother experience?
4.3 Further optimization
First, we need to understand the logic behind this phenomenon.
The CMS collector's collection algorithm is Mark-Sweep (with an optional Compact step in Foreground collections).
Types of CMS collector GC:
CMS Background GC
This is the most common type of CMS GC. It is periodic: a resident JVM thread scans old-generation occupancy at intervals and triggers a collection once occupancy exceeds a threshold. It uses Mark-Sweep without the costly Compact step and runs concurrently with user threads, so pauses are relatively short. A "GC (CMS Initial Mark)" entry in the GC log indicates that a CMS Background GC has occurred.
Because Background GC uses Mark-Sweep, it fragments old-generation memory, which is CMS's biggest weakness.
CMS Foreground GC
This is the real Full GC of the CMS collector. It collects with Serial Old or Parallel Old; it occurs less frequently, but when it does occur it causes a long pause.
There are many scenarios that trigger CMS Foreground GC, the scenarios are as follows:
- System.gc();
- jmap -histo:live pid;
- Insufficient space in the metadata area;
- Promotion failed, the mark in the GC log is ParNew(promotion failed);
- Concurrent mode failure, marked in the GC log as concurrent mode failure.
It is not hard to infer that the glitches under the target scheme are caused by promotion failure or concurrent mode failure. Since GC logs are not printed online we cannot tell which, but it does not matter: both scenarios share the same root cause, namely old-generation memory fragmentation accumulated over several CMS Background GCs.
We therefore only need to minimize the fragmentation-induced promotion failures and concurrent mode failures.
CMS Background GC is triggered when the old-generation occupancy scanned by the resident JVM thread exceeds a threshold. The threshold is controlled by the two parameters -XX:CMSInitiatingOccupancyFraction and -XX:+UseCMSInitiatingOccupancyOnly. When they are not set, the threshold defaults to about 92% for the first collection and is then predicted and adjusted dynamically from history.
If we instead fix the threshold at a reasonably conservative value, GC will not become too frequent, yet the probability of promotion failure or concurrent mode failure drops, which greatly reduces the frequency of glitches.
The heap distribution of the target program is as follows:
- Young region: 1.5G
- Old region: 2.5G
- Resident objects in the Old region: about 400M
Based on empirical data, 75% and 80% are both reasonable compromises, so we chose -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly for gray-release observation (we also ran a control experiment at 80%; 75% performed better).
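The headroom arithmetic behind that choice can be sketched with the numbers from the text (2.5G Old region, about 400M of resident objects):

```java
public class CmsThreshold {
    // Old-generation occupancy (MB) at which CMS Background GC kicks in
    // for a given -XX:CMSInitiatingOccupancyFraction value.
    static long triggerMB(long oldGenMB, int occupancyFraction) {
        return oldGenMB * occupancyFraction / 100;
    }

    public static void main(String[] args) {
        long oldGenMB = 2560; // 4G heap - 1.5G Young = 2.5G Old
        // At 75% the concurrent cycle starts at 1920M, leaving 640M of
        // headroom for objects promoted while the cycle runs -- the buffer
        // that makes promotion/concurrent mode failure unlikely. The default
        // first-time threshold of ~92% would leave only ~205M.
        System.out.println(triggerMB(oldGenMB, 75) + "M"); // prints "1920M"
    }
}
```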
The configuration of the final target plan is:
-Xms4096M -Xmx4096M -Xmn1536M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSScavengeBeforeRemark
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
With the configuration above, one machine (xx.xxx.60.6) was put on gray release.
Judging from the re-optimization results, the glitches caused by CMS Foreground GC have essentially disappeared, which matches expectations.
Therefore, the final target plan of the video service is configured as:
-Xms4096M -Xmx4096M -Xmn1536M
-XX:MetaspaceSize=256M
-XX:MaxMetaspaceSize=256M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSScavengeBeforeRemark
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
5. Results acceptance
The gray release lasted about 7 days, covering both weekdays and the weekend, and the results met expectations, satisfying the conditions for a full online rollout. The post-rollout results are evaluated below.
Young GC times
Young GC accumulated time
Single Young GC takes time
Looking at the Young GC indicators after the change: Young GC count dropped by 30% on average, cumulative Young GC time dropped by 17% on average, and average single Young GC time rose by about 7ms. Young GC behavior met expectations.
Beyond the JVM changes, we also made a business-level optimization. Before tuning, instances showed pronounced but irregular Young GC glitches (irregular because the scheduled task is not always assigned to a given instance): a scheduled task in the business loads a large amount of data. During tuning we sharded this task across multiple instances, which made Young GC much smoother.
Full GC single/cumulative time
Judging from the "Full GC" indicators, the frequency and pauses of "Full GC" drop dramatically; there is essentially no Full GC in the true sense any more.
- Core interface A (most downstream dependencies): P99 response time reduced by 19% (from 3457ms to 2817ms);
- Core interface B (medium downstream dependencies): P99 response time reduced by 41% (from 1647ms to 973ms);
- Core interface C (fewest downstream dependencies): P99 response time reduced by 80% (from 628ms to 127ms);
Taken together, the overall result exceeded expectations: Young GC behavior matches the goals closely, there is essentially no true Full GC, and the P99 improvement of each interface depends on how many downstream dependencies it has. The fewer the dependencies, the more pronounced the effect.
6. Closing remarks
GC algorithms are complex, many parameters affect GC performance, and the right settings depend on the characteristics of the service, all of which makes JVM tuning difficult.
Drawing on the tuning experience of the video service, this article has focused on the ideas and the process of tuning, and summarized some general tuning procedures, in the hope of providing a useful reference.
Authors: Li Guanyun, Jessica Chen, vivo Internet Technology Team