Java application performance optimization is a perennial topic. Typical performance problems include slow page response, interface timeouts, high server load, low concurrency, and frequent database deadlocks. Especially under today's popular "rough and fast" Internet development model, as system traffic grows and code becomes bloated, all kinds of performance problems begin to surface. Java application performance bottlenecks can come from many places: system factors such as disk, memory, and network I/O, as well as Java application code, JVM GC, the database, and the cache. Java performance optimization can be divided into four layers: the application layer, the database layer, the framework layer, and the JVM layer.
The difficulty of optimization increases from layer to layer, and the knowledge involved and the problems to be solved differ at each layer:
- The application layer needs to understand the code logic and locate the problematic code line through the Java thread stack, etc.;
- The database level needs to analyze SQL, locate deadlocks, etc.;
- The framework layer needs to understand the source code and understand the framework mechanism;
- The JVM layer requires an in-depth understanding of GC types and their working mechanisms, as well as a clear understanding of the effects of the various JVM parameters.
There are two basic approaches to Java performance analysis: on-site analysis and post-mortem analysis. On-site analysis preserves the live scene and uses diagnostic tools to locate the problem directly; it has a greater impact on the online service and is unsuitable for some scenarios (especially when users' key online business is involved). Post-mortem analysis collects as much on-site data as possible, restores the service immediately, and then analyzes and reproduces the problem from the collected data. Below, starting from performance diagnostic tools, we share and review some classic cases and practices from the HeapDump performance community.
Performance diagnostic tool
Performance diagnosis falls into two categories: diagnosing systems and code that have already been found to have performance problems, and performance testing a system before launch to determine whether its performance meets the launch requirements. This article focuses on the former; the latter can be covered with various performance stress-testing tools (such as JMeter) and is beyond the scope of this article. For Java applications, performance diagnostic tools fall into two layers: the OS level and the Java application level (including application code diagnosis and GC diagnosis).
OS diagnosis
OS diagnosis mainly focuses on the three aspects of CPU, Memory, and I/O.
CPU diagnostics
For CPU, we mainly focus on load average (Load Average), CPU usage, and the number of context switches (Context Switch).
Through the top command, you can view the average system load and CPU usage.
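For example, a quick non-interactive check (the output values below are purely illustrative):

```bash
# One batch-mode iteration of top; the first few lines show load average and CPU usage
top -b -n 1 | head -5
# top - 10:01:23 up 12 days,  2:10,  1 user,  load average: 3.15, 2.80, 2.60
# %Cpu(s): 51.3 us,  5.0 sy,  0.0 ni, 40.2 id,  2.1 wa,  0.0 hi,  1.4 si,  0.0 st
```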
The XPocket plug-in container, open-sourced by PerfMa, integrates top_x, an enhanced version of the Linux top command. It can display CPU usage/load and per-process CPU and memory usage, splits and reorganizes the cluttered output of the top command so that it is clearer and easier to use, supports pipelines, and in particular can directly obtain the pid of the top process or the tid of the top thread. Its mem_s command adds sorting by process swap size, enhancing the original top functionality.
The figure shows that more than 51% of the system's CPU is in use. When you find that a process has relatively high CPU usage, you can use the cpu_t command of top_x, which automatically fetches the CPU status of the process currently consuming the most CPU; you can also specify a process pid with the -p parameter, or simply run cpu_t directly:
You can view the number of CPU context switches through the vmstat command. XPocket also integrates the vmstat tool.
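A minimal example of observing context switches with vmstat; the cs column is the number of context switches per second (the values below are illustrative):

```bash
# Print system statistics every second, five times
vmstat 1 5
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  2  0      0 812340 120340 934512    0    0     1     8  350 1200 12  3 84  1  0
```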
The main scenarios in which context switches occur are as follows:
- The time slice is used up, and the CPU normally schedules the next task
- Preempted by other higher priority tasks
- The execution task encounters I/O blocking, suspends the current task, and switches to the next task
- User code actively suspends the current task to give up the CPU
- Multiple tasks compete for a resource, and the task that fails to acquire it is suspended
- Hardware interrupt.
Java thread context switching mainly comes from competition for shared resources. Generally, a lock on a single object rarely becomes a system bottleneck unless the lock granularity is too coarse; however, a frequently accessed code block that keeps locking multiple objects in succession can produce a large number of context switches and become a system bottleneck. The author Zhu Jibing's article "CPU context switch causes service avalanche" records a case in which log4j's asynchronous AsyncLogger caused frequent CPU context switches and eventually a service avalanche. AsyncLogger uses the Disruptor framework, which handles the MultiProducer case on its core data structure, the RingBuffer. A Sequence is needed to write a log entry, but when the RingBuffer is full and the Sequence cannot be obtained, Disruptor calls Unsafe.park to suspend the current thread. Simply put, when the consumption speed cannot keep up with the production speed, the producing thread retries endlessly with a retry interval of 1 nanosecond, causing the CPU to park and wake threads frequently; the resulting large number of context switches consumes CPU resources. The problem was solved by upgrading Disruptor to version 3.3.6 and log4j2 to version 2.7.
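As an illustrative sketch (not the code from that case), the following contends a single hot lock across many threads; running it while watching the cs column of vmstat, or pidstat -w, shows context switches rising as threads block and wake on the lock:

```java
// Illustrative only: many threads repeatedly fight over one hot lock, so threads
// keep blocking and waking, which appears as context switches at the OS level.
public class LockContentionDemo {
    private static final Object LOCK = new Object();
    private static long counter = 0;

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors() * 4;
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                while (true) {
                    synchronized (LOCK) {          // hot lock shared by every thread
                        for (int j = 0; j < 1_000; j++) {
                            counter++;             // hold the lock briefly
                        }
                    }
                }
            }, "contender-" + i).start();
        }
        // Observe with: vmstat 1   (cs column)   or   pidstat -w -p <pid> 1
    }
}
```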
Memory
From the operating system's point of view, the concern with memory is whether there is enough of it for the application process. The free -m command shows memory usage, and the top command shows the virtual memory VIRT and physical memory RES used by a process; from the relation VIRT = SWAP + RES you can work out how much swap a given application is using. Heavy use of the swap partition hurts Java application performance and should be kept as small as possible, since disk is far slower than memory.
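A minimal check from the OS side (output values are illustrative):

```bash
# Overall memory and swap usage, in MB
free -m
#               total        used        free      shared  buff/cache   available
# Mem:          15885        9200        1100         310        5585        6000
# Swap:          4095         512        3583

# Per-process view: with VIRT = SWAP + RES, swap used by the process ≈ VIRT - RES
top -p <pid>
```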
I/O
I/O includes disk I/O and network I/O; disks are generally the more likely source of I/O bottlenecks. iostat shows disk read and write activity, and the CPU's I/O wait indicates whether disk I/O is healthy. If disk I/O stays in a consistently high state, the disk is too slow or faulty and has become the performance bottleneck; the application must be optimized or the disk replaced.
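A minimal iostat example (from the sysstat package; columns are abbreviated and values are illustrative, exact output varies by version). Sustained high %util together with a large await usually points to a disk I/O bottleneck:

```bash
# Extended device statistics every 2 seconds, 3 samples
iostat -x 2 3
# Device   r/s    w/s   rkB/s    wkB/s  await  %util
# sda     12.0   85.0   480.0   5400.0   25.3   95.2
```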
In addition to the commonly used top, ps, vmstat, iostat and other commands, there are other Linux tools that can diagnose system problems, such as mpstat, tcpdump, netstat, pidstat, sar, etc. Here is a summary of the performance diagnostic tools for different types of Linux devices, as shown in the figure below, for reference.
Java application diagnostic tool
Application code diagnosis
Application code performance problems are relatively easy to solve. If application-level monitoring and alerts already point to the problematic function, you can locate it directly in the code; otherwise, using top together with jstack to find the problematic thread stack and then locating that thread's code also works. For more complex code with more involved logic, printing performance logs with a Stopwatch can locate most application code performance problems.
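For example, a minimal sketch of Stopwatch-style performance logging, here using Guava's Stopwatch (assuming Guava is on the classpath; plain System.nanoTime() works just as well). The method names and the threshold are hypothetical:

```java
import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;

public class OrderService {
    private static final long SLOW_THRESHOLD_MS = 200;   // hypothetical threshold

    public void handleOrder(String orderId) {
        Stopwatch watch = Stopwatch.createStarted();

        loadOrder(orderId);                               // step 1 (hypothetical)
        long loadMs = watch.elapsed(TimeUnit.MILLISECONDS);

        calcPromotion(orderId);                           // step 2 (hypothetical)
        long totalMs = watch.elapsed(TimeUnit.MILLISECONDS);

        if (totalMs > SLOW_THRESHOLD_MS) {                // only log slow requests
            System.out.printf("slow handleOrder: load=%dms, promo=%dms, total=%dms%n",
                    loadMs, totalMs - loadMs, totalMs);
        }
    }

    private void loadOrder(String orderId) { /* ... */ }
    private void calcPromotion(String orderId) { /* ... */ }
}
```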
Commonly used Java application diagnosis includes the diagnosis of thread, stack, GC, etc.
jstack
The jstack command is usually used together with top. Use top -H -p pid to locate the busy Java threads, then export the thread stacks with jstack -l pid. Because a thread stack is a transient snapshot, multiple dumps are needed, typically three, about 5 seconds apart. Convert the thread id found with top to hexadecimal to get the nid used in the Java thread stack, and you can then find the stack of the problematic thread.
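A typical top + jstack workflow looks like this (the process and thread ids are illustrative):

```bash
top -H -p 12345                     # list the threads of process 12345, sorted by CPU
printf '%x\n' 12367                 # convert the hottest thread id to hex, e.g. 304f
jstack -l 12345 > jstack.1          # dump the stacks (repeat ~3 times, ~5s apart)
grep -A 30 'nid=0x304f' jstack.1    # find that thread's stack by its nid
```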
The jstack_x tool is also integrated in XPocket. You can use the stack -t nid command to view the call stack of the thread waiting on a lock, and locate the business code through that call stack.
XElephant and XSheepdog
XElephant is a free online Java memory dump analysis product provided by the HeapDump performance community. It makes the dependencies between objects in memory clearer, requires no software installation, supports file upload, is not limited by the memory of the local machine, and supports analysis of large dump files.
XSheepdog is a free online thread dump analysis product provided by the HeapDump performance community. It sorts out the relationships between threads, thread pools, stacks, methods, and locks and presents them from multiple perspectives, so that thread problems become clear at a glance.
GC diagnosis
Java GC frees programmers from the risks of managing memory by hand, but the application pauses caused by GC become another problem that needs to be solved. The JDK provides a series of tools for locating GC problems; the more commonly used ones are jstat and jmap, along with the third-party tool MAT.
jstat
The jstat command can print GC detailed information, Young GC and Full GC times, heap information, etc. The command format is
jstat -gc<option> -t pid <interval> <count>
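For example, to print GC utilization percentages with a timestamp every 2 seconds, 10 times (the pid is illustrative):

```bash
jstat -gcutil -t 12345 2000 10
```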
MAT
MAT is a tool for analyzing the Java heap. It provides intuitive diagnostic reports, and its built-in OQL allows SQL-like queries against the heap, which is quite powerful; outgoing references and incoming references can be used to trace where object references come from.
MAT displays object sizes in two columns, Shallow size and Retained size. The former is the memory occupied by the object itself, excluding the objects it references; the latter is the sum of the object's own Shallow size and the Shallow sizes of all objects it references directly or indirectly, i.e. the amount of memory GC can release once the object is reclaimed. Generally, the latter is the one worth paying attention to. For Java applications with a large heap (tens of GB), opening the dump in MAT requires a lot of memory, usually more than a local development machine has; in that case it is recommended to install a graphics environment and MAT on an offline server and view it remotely, or to run the mat command to generate a heap index and copy the index to the local machine, although the heap information visible that way is limited.
In order to diagnose GC problems, it is recommended to add -XX:+PrintGCDateStamps to the JVM parameters.
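A hedged example of GC-logging parameters for JDK 8 (the paths and sizes are placeholders; JDK 9+ replaces these flags with the unified -Xlog:gc* syntax):

```bash
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
     -Xloggc:/var/log/app/gc.log \
     -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M \
     -jar app.jar
```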
For Java applications, top + jstack + jmap + MAT can locate most application and memory problems and are indispensable tools. Sometimes Java application diagnosis needs to refer to OS-level information, in which case more comprehensive diagnostic tools such as Zabbix (which integrates OS and JVM monitoring) can be used. In distributed environments, infrastructure such as distributed tracing systems also provides strong support for application performance diagnosis.
Performance optimization practice
After introducing some commonly used performance diagnosis tools, the following will combine some of our practices in Java application tuning to share cases from the JVM layer, application code layer, and database layer.
JVM tuning: the pain of GC
Author Afei Javaer's article [FullGC Actual Combat: The business lady keeps turning in circles when viewing the picture](https://heapdump.cn/article/249540) records a case in which a time-consuming interface made pictures inaccessible. After ruling out database problems, synchronization blocking problems, and system problems, the author turned to troubleshooting GC. The output of the jstat command was as follows:
```
bash-4.4$ /app/jdk1.8.0_192/bin/jstat -gc 1 2s
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
170496.0 170496.0 0.0 0.0 171008.0 130368.9 1024000.0 590052.8 70016.0 68510.8 8064.0 7669.0 983 13.961 1400 275.606 289.567
170496.0 170496.0 0.0 0.0 171008.0 41717.2 1024000.0 758914.9 70016.0 68510.8 8064.0 7669.0 987 14.011 1401 275.722 289.733
170496.0 170496.0 0.0 0.0 171008.0 126547.2 1024000.0 770587.2 70016.0 68510.8 8064.0 7669.0 990 14.091 1403 275.986 290.077
170496.0 170496.0 0.0 0.0 171008.0 45488.7 1024000.0 650767.0 70016.0 68531.9 8064.0 7669.0 994 14.148 1405 276.222 290.371
170496.0 170496.0 0.0 0.0 171008.0 146029.1 1024000.0 714857.2 70016.0 68531.9 8064.0 7669.0 995 14.166 1406 276.366 290.531
170496.0 170496.0 0.0 0.0 171008.0 118073.5 1024000.0 669163.2 70016.0 68531.9 8064.0 7669.0 998 14.226 1408 276.736 290.962
170496.0 170496.0 0.0 0.0 171008.0 3636.1 1024000.0 687630.0 70016.0 68535.6 8064.0 7669.6 1001 14.342 1409 276.871 291.213
170496.0 170496.0 0.0 0.0 171008.0 87247.2 1024000.0 704977.5 70016.0 68535.6 8064.0 7669.6 1005 14.463 1411 277.099 291.562
```
There is a Full GC almost every second, and the pause time is quite long. The fix was to disable adaptive sizing with the -XX:-UseAdaptiveSizePolicy parameter; after the change and a service restart, access became fast again.
GC tuning is still necessary for applications with high concurrency and large data volumes, especially since the default JVM parameters usually do not meet business needs and require dedicated tuning. There is already plenty of public material on reading GC logs, so this article will not go into detail. There are basically three directions for GC tuning goals: reduce GC frequency, which can be achieved by increasing heap space and reducing unnecessary object creation; reduce GC pause time, which can be achieved by reducing heap space and using the CMS GC algorithm; and avoid Full GC, by adjusting the CMS trigger ratio, avoiding Promotion Failure and Concurrent mode failure (allocating more space to the old generation and increasing the number of GC threads to speed up collection), reducing the creation of large objects, and so on.
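As a hedged illustration of the third direction on JDK 8 with CMS, the flags below adjust the CMS trigger ratio and GC thread counts; the sizes and ratios are placeholders to be adapted to the workload, not recommendations:

```bash
java -Xms4g -Xmx4g -Xmn1g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:ParallelGCThreads=8 -XX:ConcGCThreads=4 \
     -jar app.jar
```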
Application layer tuning: sniffing out bad code
Starting from application layer code tuning, analyzing the root cause of code efficiency decline is undoubtedly one of the good ways to improve the performance of Java applications.
The article "FGC actual combat: Bad code causes frequent service FGC unresponsive problem analysis" records a case in which bad code caused a memory leak, high CPU usage, and a large number of interface timeouts.
Using the MAT tool to analyze the JVM heap, the pie chart shows that most of the heap memory is occupied by the same set of objects. Checking the details of that memory and tracing the references back up, the culprit is quickly found.
After finding the leaked object and searching for its name globally in the project, it turns out to be a Bean object, and the leak is in one of its attributes of type Map. This Map uses an ArrayList per type to store the response of each detection interface call, and every detection result is stuffed into the ArrayList for later analysis. Since the Bean object is never recycled and the attribute has no clean-up logic, and the service had not been restarted online for more than ten days, this Map grew larger and larger until memory was exhausted. Once memory was full, no more memory could be allocated for HTTP response results, so requests were stuck at readLine. Our I/O-heavy interface saw an especially large number of alarms, which is probably related to its large responses needing more memory.
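A minimal sketch of the leak pattern described above (the class and field names are hypothetical, not the code from the case): a long-lived bean accumulates results in a map of lists that is never cleared, so its retained size grows until the heap is exhausted:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Lives for the whole lifetime of the application (e.g. a singleton bean),
// so everything it references is never eligible for GC.
public class DetectResultHolder {

    // type -> all responses ever collected; nothing ever removes entries
    private final Map<String, List<String>> resultsByType = new HashMap<>();

    public void record(String type, String response) {
        resultsByType
                .computeIfAbsent(type, k -> new ArrayList<>())
                .add(response);          // grows without bound -> memory leak
    }

    // Fix sketch: bound the list size, or clear it after each analysis round
    public void clearAfterAnalysis() {
        resultsByType.clear();
    }
}
```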
For locating bad code, in addition to conventional code review, tools such as MAT can to a certain extent quickly locate system performance bottlenecks. However, in situations tied to particular scenarios or business data, auxiliary code walkthroughs, performance testing tools, data simulation, and even diverting live traffic may be needed to finally confirm the source of the performance problem. The following are some possible characteristics of bad code that we have summarized, for your reference:
(1) The code has poor readability and does not follow basic coding conventions;
(2) Too many objects, or overly large objects, are created; memory leaks, etc.;
(3) Too many I/O stream operations, or streams that are never closed;
(4) Too many database operations, or transactions that are too long;
(5) Synchronization is used in the wrong scenarios;
(6) Time-consuming operations inside loop iterations, etc.
Database layer tuning: deadlock nightmare
For most Java applications, the scenario of interacting with the database is very common, especially for applications that require high data consistency such as OLTP, the performance of the database will directly affect the performance of the entire application.
Generally speaking, for the tuning of the database layer, we basically start from the following aspects:
(1) Optimize at the SQL statement level: slow SQL analysis, index analysis and tuning, transaction splitting, etc.;
(2) Optimize at the database configuration level: such as field design, adjustment of cache size, disk I/O and other database parameter optimization, data defragmentation, etc.;
(3) Optimize from the database structure level: consider the vertical split and horizontal split of the database;
(4) Choose the appropriate database engine or type to adapt to different scenarios, such as considering the introduction of NoSQL.
Summary and suggestions
Performance tuning also follows the 80/20 principle: 80% of performance problems are caused by 20% of the code, so optimizing the critical code gives twice the result for half the effort. At the same time, optimize on demand; over-optimization may introduce more problems. For Java performance optimization, you not only need to understand the system architecture and application code, but also pay attention to the JVM layer and even the operating system underneath. In summary, it can be approached from the following angles:
1) Tuning of basic performance
Basic performance here refers to upgrades and optimization at the hardware or operating system level, such as network tuning, operating system version upgrades, and hardware optimization. For example, using an F5 load balancer or introducing SSDs, as well as the I/O (NIO) improvements in newer Linux versions, can all give application performance a significant boost;
2) Database performance optimization
This includes the usual transaction splitting, index tuning, SQL optimization, introduction of NoSQL, and so on. For example, introducing asynchronous processing when splitting transactions and settling for eventual consistency, or introducing various NoSQL databases for specific scenarios, can greatly alleviate the shortcomings of traditional databases under high concurrency;
3) Application architecture optimization
Introduce new computing or storage frameworks and use their new features to solve the performance bottlenecks of the original cluster; or introduce distributed strategies to scale computing and storage horizontally, including pre-computing results ahead of time and other typical space-for-time practices. These can reduce the system load to a certain extent;
4) Optimization at the business level
Technology is not the only way to improve system performance. In many scenarios with performance problems, a large part of the cause is the particular business scenario; if the business can be circumvented or adjusted, that is often the most effective approach.