头图

[Troubleshooting series] CPU jitter caused by C2 compilerthread

羊都是我吃的
中文

1 Introduction

  • This article only shares the solutions and ideas for the problems encountered at work, as well as the process of troubleshooting. The key point is to share the ideas of the investigation, the knowledge points are actually quite old. If you have any questions or the description is incorrect, please enlighten me.

2. Problem appearance

  • When the project is started, the system request will have a wave of timeouts. From the monitoring point of view, the GC (G1) of the JVM fluctuates greatly, the CPU fluctuates greatly, the thread pool used by each business fluctuates, and the external IO time consumption increases.

3. Let's talk about the conclusion first

  • Due to the optimization of JIT, the compilation of hot code is triggered when the system starts, and it is compiled for C2, which causes a high CPU usage, which in turn leads to a series of problems, and eventually causes some requests to time out.

4. Investigation process

In fact, the knowledge points are there. How to connect the actual problems and knowledge points together and understand this part of the knowledge more deeply, this may be the meaning of investigation and analysis, and then precipitate into experience, and then grow.

4.1 Initial investigation

  • In fact, our project is an algorithm sorting project, which more or less adds some small models and large and small caches, and from the monitoring point of view, the JVM GC spike and CPU spike time are very close (this is also a monitoring The reason why the platform time is not accurate enough). So in the early stage, I spent a lot of energy and time to troubleshoot JVM and GC issues.
  • First of all, I recommend a website to everyone: https://gceasy.io/ , really easy to use for analyzing GC logs. Print GC log with the following JVM parameters:
-XX:+PrintGC 输出GC日志
-XX:+PrintGCDetails 输出GC的详细日志
-XX:+PrintGCTimeStamps 输出GC的时间戳(以基准时间的形式,你启动的时候相当于12点,跟真实时间无关)
-XX:+PrintGCDateStamps 输出GC的时间戳(以日期的形式,如 2013-05-04T21:53:59.234+0800)
-Xloggc:../logs/gc.log 日志文件的输出路径
  • Because I saw that YGC was serious, I tried the following methods one after another:

    • Adjust the heap size of the JVM. Namely -Xms, -Xmx parameters. invalid.
    • Adjust the number of recycling threads. That is, the -XX:ConcGCThreads parameter. invalid.
    • Adjust the expected single recovery time. That is, the -XX:MaxGCPauseMillis parameter is invalid or even worse.
    • The above adjustment and mixed test are all invalid.
    • Chicken thief's method. After loading the model, sleep for a period of time to allow the GC to stabilize, and then put the request in. After this operation, the GC does get better, but the initial request still has a timeout. (Of course, because the problem is not GC at all)

4.2 Change your mind

  • According to the monitoring point of view, the thread pool, external IO, and RT have a significant increase and then decrease at startup, and the trend is very consistent. This is generally caused by systemic problems, such as CPU, GC, network card, and cloud host oversold. Computer room delays and so on. So since GC cannot be cured, let's start with the CPU.
  • Because the JVM will generate a large number of GCs when the system is started, it is impossible to distinguish whether it is due to the flow of the system before it has been warmed up or that no matter how long it has been started, the flow will have problems as soon as it comes. And I checked the operation of GC, that is, adding sleep time, which happened to help me see this problem, because it can be clearly seen that the GC fluctuation time and the timeout time are already a lot different in time. That is to say , Has nothing to do with GC, no matter how smooth the GC has been, once the traffic comes, it still has to time out.

4.3 Analysis tool Arthas

I have to say that Arthas is really a very useful analysis tool, saving a lot of complicated operations.
  • Arthas document: https://arthas.aliyun.com/doc/quick-start.html
  • In fact, the core to be analyzed is what our CPU did when the traffic first arrived, so we used Arthas to analyze the CPU situation when the traffic arrived. In fact, this part can also be completed with commands such as top -Hp pid, jstack, etc., and the description will not be expanded.
  • CPU situation, only the important part is shown:

It can be seen from the figure that C2 CompilerThread occupies a lot of CPU resources.

4.4 The core of the problem

  • So what exactly is this C2 CompilerThread.
  • "In-depth understanding of the JAVA virtual machine" actually has a description of this part, here I will explain it to you in vernacular.
  • In fact, when Java runs at the very beginning, you can understand it as stupidly executing the code you wrote, calling it an "interpreter". This has the advantage that it is very fast, and Java becomes .class, very fast It can be started and run, but the problem is also obvious. It is slow. Then smart JVM developers have done one thing. If they find that you have some code that is frequently executed, then they will be running. In the meantime, I will help you compile this code into machine code, so that it will run quickly. This is just-in-time compilation (JIT). But there is also a problem with this, that is, the time required for compilation is CPU-consuming. The C2 CompilerThread is just one layer of optimization in JIT (a total of five layers, C2 is the fifth layer). So, the culprit was found.

5. Try to solve

  • The relationship between the interpreter and the compiler can be as follows:

  • As mentioned above, the interpreter starts fast, but executes slowly. The compiler is divided into the following five levels.
第 0 层:程序解释执行,默认开启性能监控功能(Profiling),如果不开启,可触发第二层编译;
第 1 层:可称为 C1 编译,将字节码编译为本地代码,进行简单、可靠的优化,不开启 Profiling;
第 2 层:也称为 C1 编译,开启 Profiling,仅执行带方法调用次数和循环回边执行次数 profiling 的 C1 编译;
第 3 层:也称为 C1 编译,执行所有带 Profiling 的 C1 编译;
第 4 层:可称为 C2 编译,也是将字节码编译为本地代码,但是会启用一些编译耗时较长的优化,甚至会根据性能监控信息进行一些不可靠的激进优化。
  • So we can try to solve the problem from the perspective of C1 and C2 compilers.

5.1 Turn off hierarchical compilation

增加参数 : -XX:-TieredCompilation -client (关闭分层编译,开启C1编译)
  • The effect is pulpy.
  • CPU usage continues to be high (compared to before adjustment). It is true that there is no C2 thread problem, but it is guessed that because the code is not compiled as good as C2, the performance of the code continues to be low.
  • CPU screenshot:

5.2 Increase the number of C2 threads

增加参数 :-XX:CICompilerCount=8 恢复参数:-XX:+TieredCompilation
  • The effect is average, and there are still requests timed out. But it will be less.
  • CPU screenshot:

5.3 Inference

  • In fact, it can be seen from the above analysis that if C2 is not bypassed, then there will be some jitter. If C2 is bypassed, the overall performance will be much lower. This is what we don’t want to see, so turn off C1 and C2 directly. I did not try to run it in interpreter mode.

5.4 Final plan

  • Since this part of the jitter cannot be circumvented, we can use some mock traffic to withstand this part of the jitter, which can also be called preheating. When the project starts, use the pre-recorded traffic to make the system hotspot code complete instant compilation. Then the real traffic is received, so that the real traffic can be achieved without jitter.

Afterword

  • This article focuses on sharing the process of solution and analysis, and does not focus on the analysis of knowledge points. For more knowledge, please see the "Reference Articles" section.
  • If you have any questions in this article, you are welcome to correct it.

Reference article

阅读 729

vipwhr
将来的你,一定会感激现在拼命的自己。

将来的你,一定会感谢现在拼命努力的自己。

1.4k 声望
2.9k 粉丝
0 条评论
你知道吗?

将来的你,一定会感谢现在拼命努力的自己。

1.4k 声望
2.9k 粉丝
宣传栏