
1. Introduction

  • This article shares the solutions and ideas for a problem encountered at work, along with the troubleshooting process. The key point is the investigation approach; the knowledge involved is actually quite old. If you have questions, or anything is described inaccurately, please let me know.

2. Problem description

  • When the service starts, there is a wave of request timeouts. Monitoring shows large fluctuations in the JVM's GC (G1), in CPU usage, and in every business thread pool, along with increased external IO latency. Callers also see more exceptions (likewise caused by the timeouts).
  • The number of exceptions during a release:

image

3. The conclusion first

  • Due to JIT optimization, compilation of hot code is triggered when the system starts. The C2 compilation in particular drives CPU usage very high, which causes a cascade of problems and ultimately makes some requests time out.

4. Troubleshooting process

The knowledge itself is all out there. What matters is connecting the actual problem you hit with that knowledge, and thereby understanding it more deeply. That is what turns into experience.

4.1 Initial investigation

  • Our project is an algorithm ranking service, with several small models and caches large and small added to it. From the monitoring, the JVM's GC spike is very close in time to the CPU spike (the monitoring platform's timestamps are not precise enough to tell them apart). So in the early stage I spent a lot of time and energy troubleshooting JVM and GC problems.
  • First, a website I recommend to everyone: https://gceasy.io/ . It is genuinely useful for analyzing GC logs. Print GC logs with the following JVM parameters:
 -XX:+PrintGC              print GC logs
-XX:+PrintGCDetails       print detailed GC logs
-XX:+PrintGCTimeStamps    print GC timestamps relative to JVM start (startup counts as time zero; unrelated to wall-clock time)
-XX:+PrintGCDateStamps    print GC timestamps as dates, e.g. 2013-05-04T21:53:59.234+0800
-Xloggc:../logs/gc.log    output path of the GC log file
  • Because YGC looked severe, I tried the following methods in turn:

    • Adjust the JVM heap size, i.e. the -Xms and -Xmx parameters. No effect.
    • Adjust the number of concurrent GC threads, i.e. the -XX:ConcGCThreads parameter. No effect.
    • Adjust the target pause time, i.e. the -XX:MaxGCPauseMillis parameter. No effect, or even worse.
    • Mixed combinations of the above. No effect.
    • A sneaky trick: after loading the model, sleep for a while to let GC stabilize, then let requests in. After this, GC did get better, but the initial requests still timed out. (Naturally, because the problem was never GC.)

4.2 Another way of thinking

  • From the monitoring, the thread pools, external IO, and RT all rise sharply and then fall at startup, with very consistent trends. This pattern is usually caused by something systemic: CPU, GC, the network card, oversold cloud hosts, cross-datacenter latency, and so on. Since GC could not be cured, let's start with the CPU.
  • Because the JVM produces a lot of GC at startup, it was hard to tell whether the problem was traffic arriving before the system had warmed up, or whether traffic would cause problems no matter how long the system had been running. The earlier GC experiment, adding the sleep, happened to settle this question: with the sleep in place, it was clear that the GC fluctuation and the timeouts occurred at very different times. In other words, the fluctuation has nothing to do with GC; no matter how stable GC is, once traffic arrives, the timeouts return.

4.3 Analysis tool Arthas

I have to say, Arthas really is a very handy analysis tool that saves a lot of tedious work.
  • Arthas documentation: https://arthas.aliyun.com/doc/quick-start.html
  • What we really need to analyze is what the CPU is doing when traffic first arrives, so we use Arthas to look at the CPU when traffic lands. This part can also be done with commands such as top -Hp <pid> and jstack, which I won't describe here.
  • CPU situation:
    image

It can be seen from the figure that C2 CompilerThread occupies a lot of CPU resources.

4.4 The heart of the problem

  • So what exactly is this C2 CompilerThread?
  • "Understanding the Java Virtual Machine" actually covers this part; here is a plain-language explanation.
  • When Java code first runs, the JVM simply executes the bytecode you wrote, via the "interpreter". The advantage is fast startup: as soon as the source becomes .class files, it can start running. The drawback is just as obvious: it runs slowly. So the clever JVM developers did one thing: if they find that some of your code is executed frequently, they compile that code to machine code at runtime, so it runs fast. This is just-in-time compilation (JIT). But this has a cost too: the compilation itself consumes CPU. The C2 CompilerThread belongs to the highest tier of JIT's tiered compilation (five tiers in total, numbered 0 through 4; C2 is tier 4). So, the culprit was found.
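The CPU cost of JIT compilation can actually be observed from inside the JVM. Below is a minimal, self-contained sketch (not from the original article) that hammers a hot method and reads the standard CompilationMXBean before and after, showing that total JIT compilation time grows as code warms up. It assumes a normal HotSpot JVM where compilation-time monitoring is supported.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitWarmupDemo {
    // A deliberately hot method: called many times, so the JVM
    // promotes it from interpretation to compiled code.
    static long hot(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            sum += i * 31 % 7;
        }
        return sum;
    }

    public static void main(String[] args) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        long before = jit.getTotalCompilationTime(); // ms spent by JIT so far
        long result = 0;
        for (int i = 0; i < 20_000; i++) {
            result += hot(1_000);
        }
        long after = jit.getTotalCompilationTime();
        System.out.println("result=" + result);
        // Total compilation time is monotonically non-decreasing;
        // the delta is CPU time burned by compiler threads such as C2.
        System.out.println("JIT time grew: " + (after >= before));
    }
}
```

Running this with `-XX:+PrintCompilation` would additionally show `hot` being compiled at rising tiers, which is exactly the startup CPU cost the article is describing.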

5. Try to solve

  • The relationship between the interpreter and the compiler can be illustrated as follows:

  • As mentioned above, the interpreter starts fast but executes slowly. Compilation is divided into the following five tiers:
 Tier 0: the program is interpreted; performance monitoring (Profiling) is enabled by default. If it is not enabled, tier 2 compilation can be triggered.
Tier 1: known as C1 compilation — compiles bytecode to native code with simple, reliable optimizations; Profiling is off.
Tier 2: also C1 compilation, with Profiling enabled, but only for method invocation counts and loop back-edge counts.
Tier 3: also C1 compilation, running C1 with full Profiling.
Tier 4: known as C2 compilation — also compiles bytecode to native code, but enables optimizations that take longer to compile, and even some unreliable aggressive optimizations based on the profiling data.
  • So we can try to solve the problem from the angle of the C1 and C2 compilers.

5.1 Turn off layered compilation

 Added parameters: -XX:-TieredCompilation -client (disable tiered compilation, use C1 only)
  • The effect was poor.
  • CPU usage stayed at a high level (compared with before the change). The C2 thread problem was indeed gone, but presumably because the code was not compiled as well as C2 would have compiled it, overall code performance remained low.
  • CPU screenshot:

5.2 Increase the number of C2 threads

 Added parameter: -XX:CICompilerCount=8; restored parameter: -XX:+TieredCompilation
  • The effect was mediocre: there were still request timeouts, though fewer.
  • CPU screenshot:

5.3 Inference

  • From the analysis above: if C2 cannot be avoided, some jitter is inevitable; if C2 is avoided, overall performance drops a lot, which we do not want. As for disabling both C1 and C2 and running purely in interpreter mode, I did not even try it.

6. Solutions

6.1 Final Scheme

  • Since this jitter cannot be avoided, we can absorb it with some mock traffic, which can also be called warm-up. When the service starts, replay traffic recorded in advance so that the system's hot code completes just-in-time compilation, and only then accept real traffic. That way the real traffic sees no jitter.
  • While the system runs normally, sample some traffic and serialize it into files for storage. At startup, deserialize the files back into request objects and replay them. This triggers JIT's C2 compilation, so the CPU fluctuation happens during the warm-up window and normal online traffic is unaffected.
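The record-and-replay flow above can be sketched as follows. This is a minimal illustration, not the article's actual code: the Request class, file layout, and method names are assumptions, and the real system would replay through its actual RPC entry points.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class WarmupReplayer {
    // Stand-in for the real request type (class and field names are assumptions).
    static class Request implements Serializable {
        private static final long serialVersionUID = 1L;
        final String bizCode;
        final String payload;
        Request(String bizCode, String payload) { this.bizCode = bizCode; this.payload = payload; }
    }

    // During normal operation: serialize a sample of live requests to a file.
    static void record(List<Request> sample, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(sample)); // store a snapshot
        }
    }

    // At startup: deserialize the recorded requests back into objects.
    @SuppressWarnings("unchecked")
    static List<Request> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (List<Request>) in.readObject();
        }
    }

    // Replay with multiple threads to shorten the warm-up window,
    // driving the hot paths so JIT/C2 compilation happens before real traffic.
    static void replay(List<Request> reqs, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Request r : reqs) {
            pool.submit(() -> handle(r)); // call the real entry point here
        }
        pool.shutdown();
        pool.awaitTermination(3, TimeUnit.MINUTES);
    }

    static void handle(Request r) { /* the real ranking logic would run here */ }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("warmup", ".ser");
        record(List.of(new Request("codeA", "q1"), new Request("codeB", "q2")), file);
        List<Request> replayed = load(file);
        replay(replayed, 4);
        System.out.println("replayed=" + replayed.size());
    }
}
```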

6.2 Results first

  • Expected to eliminate about 10,000 exception requests per release (counting only exceptions, excluding timeouts).
  • Reduces the revenue loss in other businesses caused by diverting search traffic.
  • Each related search-diversion operation likewise avoids losing about 10,000 requests per release.
  • Reduction in exceptions:

image

  • Changes in RT:

image

  • The overall change is visible in the monitoring system. Comparing the RT changes across two release processes, the system after the fix is more stable and RT barely fluctuates, while the untreated interfaces show higher RT:

image

image

6.3 Warm-up design

6.3.1 Overall process

  • The figure below shows the collection process, where traffic is sampled in passing during normal online service, and the replay process triggered on restart, release, and similar operations.

image

6.3.2 Details

  • ①: The ranking system receives requests with different codes (which can be understood as requests from different businesses). In the figure, different requests are marked with different colors.
  • ②: The entry points of the ranking system. Although execution is chained internally, externally each business has its own RPC interface.
  • ③: The AOP used here is implemented in an Around advice, with a dedicated annotation designed to minimize the intrusion of the warm-up logic into existing code. The annotation is placed on the entry RPC implementations, so request information is collected automatically.
  • ④: The ranking system's flow-orchestration layer. There are different external RPC interfaces, but internally flowexecutor.run chains together the different stages of each business's pipeline.
  • ⑤: Storage inside the AOP is asynchronous, so collecting traffic for warm-up does not affect the RT of normal requests. One caveat: the asynchronous storage must deep-copy the request object, otherwise very strange exceptions appear later. The ranking pipeline keeps operating on the Request object, while warm-up's asynchronous writing is slightly slower because it touches files; if the Request has already been mutated by the time it is serialized for later use, the original request is corrupted, and the next warm-up fails with exceptions. Therefore a deep copy is also performed in the AOP, so that the normal business request and the object serialized for warm-up are not the same object.
  • ⑥: The AOP was originally designed as a before advice: it ignored the execution result and persisted the traffic as soon as the Request arrived. It was later found that, due to existing bugs in the ranking system, some requests cause exceptions. If we ignore the result and still record such requests, warm-up can generate a large number of exceptions and trigger alarms. So the advice was changed from before to Around, paying attention to the result: only if the result is non-empty is the traffic serialized and persisted.
  • ⑦: The serialized files need to be stored in separate folders, because different codes, i.e. different business RPCs, use different generic types for Request<T>; they must be kept apart so that the correct generic type can be specified at deserialization.
  • ⑧: The original design used a single thread for the whole warm-up. That proved too slow, requiring about 12 minutes, and the ranking system has many machines; adding 12 minutes per release group was unacceptable. Warm-up was therefore parallelized with multiple threads, which shortened it to about 3 minutes.
  • ⑨: The release system publishes by repeatedly calling a check interface; once it returns, the program is considered started, and the release system then calls the online interface to bring components such as RPC and message queues online. So the check interface was modified: instead of meaninglessly returning "ok", it now tests whether warm-up has finished. If not, it throws an exception; otherwise it returns ok. This guarantees warm-up completes before the service goes online, i.e. before traffic arrives, and no traffic comes in before warm-up is over.
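Points ⑤ and ⑥ above can be sketched in plain Java (no Spring AOP, to keep it self-contained): an Around-style wrapper that records the request only when the call produced a result, and deep-copies it before handing it to an asynchronous writer, so later mutation of the live object cannot corrupt the recorded copy. All names here are illustrative assumptions, not the article's actual code.

```java
import java.io.*;
import java.util.concurrent.*;
import java.util.function.Function;

public class TrafficCollector {
    // Stand-in for the real request type (names are assumptions).
    static class Request implements Serializable {
        private static final long serialVersionUID = 1L;
        String query;
        Request(String query) { this.query = query; }
    }

    private final BlockingQueue<Request> buffer = new LinkedBlockingQueue<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    // Deep copy via serialization, so the recorded object is fully
    // detached from the live Request (the "weird exception" pitfall in ⑤).
    static Request deepCopy(Request r) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) { out.writeObject(r); }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (Request) in.readObject();
        }
    }

    // Around-style wrapper: only record when the call returned a result (⑥);
    // copy first, then persist asynchronously off the request path (⑤).
    <R> R around(Request req, Function<Request, R> target) {
        R result = target.apply(req);
        if (result != null) {
            try {
                Request copy = deepCopy(req);       // detach before async hand-off
                writer.submit(() -> buffer.offer(copy));
            } catch (Exception ignored) { /* never fail the business request */ }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        TrafficCollector c = new TrafficCollector();
        Request live = new Request("shoes");
        String resp = c.around(live, r -> "ranked:" + r.query);
        live.query = "mutated-later";               // downstream mutation of the live object
        c.writer.shutdown();
        c.writer.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(resp);
        // The recorded copy still holds the original value despite the mutation.
        System.out.println("recorded=" + c.buffer.take().query);
    }
}
```

The key design choice is that the deep copy happens synchronously in the wrapper (cheap), while only the slow file/queue write is asynchronous; copying asynchronously would reintroduce the race the article warns about.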

7. Finally

  • This article described the cause, the consequences, and the design details of a system warm-up. The final result was quite impressive: it eliminated the storm of alarms and the loss of real traffic on every release. The focus is on sharing the troubleshooting and problem-solving thought process. Readers facing similar problems may be able to implement something similar in combination with their own company's release system.
  • Throughout development and self-testing, focus on the following:

    • Does it really solve the online problem?
    • Has it introduced new problems?
    • Is the warm-up traffic uniquely identified, so the replayed portion does not flow back into downstream data?
    • How to integrate well with the company's existing release system.
    • How to minimize intrusiveness, so that other developers on the project and users of the system are completely unaware of it.
    • Can the whole thing run automatically, without developers needing to pay attention to warm-up at all — ideally they never even know the feature shipped, yet it genuinely solves the problem.
    • If the warm-up system itself fails, can warm-up be switched off directly to keep production stable?
