
Introduction

"Only people who have been beaten by online service issues understand how important logs are!"
Let me say the conclusion first, who is in favor and who is against? If you feel the same way, then congratulations you are a social person :)

The importance of logs to a program is self-evident. Logging is lightweight, simple, and effortless; it appears everywhere in our code and helps us troubleshoot and locate problems. Yet this seemingly unremarkable tool hides all kinds of "pits": used improperly, it not only fails to help but can become a service "killer".
This article focuses on the "pits" caused by improper logging in production environments, especially in high-concurrency systems, and on how to avoid them. It also offers a set of practical solutions that let the program and its logs "coexist in harmony".

A guide to avoiding the pits

In this chapter I will walk through the logging problems we have run into in production and analyze their root causes one by one.

Non-standard log statement formats

Scenario

 // Format 1
log.debug("get user" + uid + " from DB is Empty!");

// Format 2
if (log.isDebugEnabled()) {
    log.debug("get user" + uid + " from DB is Empty!");
}

// Format 3
log.debug("get user {} from DB is Empty!", uid);

I believe everyone has seen all three of these styles in project code at one point or another. So what are the differences between them, and what impact do they have on performance?
The difference shows up once the DEBUG level is switched off: format 1 still performs the string concatenation even though nothing is logged, which is pure waste.

The drawback of format 2 is the extra conditional it requires, which adds boilerplate and is not elegant at all.
Format 3 is therefore recommended: the placeholder is only filled in when the statement actually executes, so once the corresponding log level is switched off there is no performance loss.

Printing large volumes of logs in production hurts performance

Logs are meant to string a user request together and make it easier to pinpoint where the problematic code is. With today's distributed systems and complex business logic, any missing log is a huge obstacle when locating a problem, so programmers who have suffered through production incidents log as much as possible while writing code.
To locate and fix future problems online as quickly as possible, programmers try to record every key log they can during development. After going live, problems can indeed be located quickly, but a new challenge follows: as the business grows rapidly, user traffic keeps climbing and so does the pressure on the system. The huge number of INFO logs now being written, especially during peak periods, means massive log writes to disk that eat into service performance.
It then becomes a trade-off: more logs make troubleshooting easy but "eat" service performance; fewer logs do not hurt stability but make troubleshooting hard, and the programmers "suffer".
Question: why does printing too many INFO logs hurt performance (with CPU usage spiking at the same time)?

Root cause 1: synchronous log writes turn disk I/O into a bottleneck, blocking large numbers of threads

Imagine that all logs are written to the same log file and multiple threads write to it at once: wouldn't the output get scrambled? The fix is to add a lock so the file output stays orderly. During peak periods, however, contention for that lock is extremely expensive: while one thread holds it, the other threads can only block and wait, which seriously drags down the user threads. The symptom is that upstream calls time out and users feel the service is stuck.

The following is the stack of a thread stuck writing to the log file:

 Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.logging.log4j.core.appender.OutputStreamManager.writeBytes(OutputStreamManager.java:352)
- waiting to lock <0x000000063d668298> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.writeEncodedText(TextEncoderHelper.java:96)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(TextEncoderHelper.java:65)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:68)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:228)
.....

So is it enough to just cut down the INFO logs in production? Not quite: the volume of ERROR logs should not be underestimated either. If a flood of bad data arrives or a large number of downstream calls time out, a huge number of ERROR logs is generated in an instant. Disk I/O is still saturated and user threads still block.

Question: assuming you don't care about INFO-level troubleshooting, is there really no performance problem if production only prints ERROR logs?

Root cause 2: printing exception stacks under high concurrency causes threads to block

There was a time when the downstream services timed out en masse and all the exceptions were caught by our service. Fortunately, the disaster-recovery design had anticipated this kind of failure and fell back to default values. Unfortunately, the servers still taught us a lesson: online monitoring started to alarm, CPU usage climbed far too fast and shot straight past 90%. We urgently scaled out to stop the bleeding, then took one machine out of the traffic pool and pulled a thread dump from it.
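(As a side note, the thread dump itself can be captured with the standard JDK tooling; the pid and output path below are just placeholders.)

 # capture a full thread dump, including lock information, from the drained instance
jstack -l <pid> > /tmp/thread-dump.txt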
Looking at the dumped thread stacks together with a flame graph analysis, most threads were stuck at the following position:

 Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
- waiting to lock <0x000000064c514c88> (a java.lang.Object)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.logging.log4j.core.impl.ThrowableProxyHelper.loadClass(ThrowableProxyHelper.java:205)
at org.apache.logging.log4j.core.impl.ThrowableProxyHelper.toExtendedStackTrace(ThrowableProxyHelper.java:112)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:112)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:96)
at org.apache.logging.log4j.core.impl.Log4jLogEvent.getThrownProxy(Log4jLogEvent.java:629)
...

The full stack is quite long; most of the samples were blocked in java.lang.ClassLoader.loadClass, and tracing further down showed that the stack was triggered by this line of code:

 at org.apache.logging.slf4j.Log4jLogger.error(Log4jLogger.java:319)

// the corresponding business code is
log.error("ds fetcher get error", e);

Uh... this is outrageous. Why does printing a log load classes at all? And why does class loading block so many threads?
After some research and analysis, the following conclusions were drawn:

  • When Log4j's Logger.error is used to print an exception stack, a Classloader has to load the classes that appear in the stack in order to print their location information;
  • Classloader loading is thread-safe. Parallel loading improves efficiency when different classes are loaded, but when multiple threads load the same class they still have to wait for one another synchronously. In particular, when different threads print exactly the same exception stack, the risk of thread blocking rises, and when the Classloader tries to load a class that cannot be loaded, efficiency drops sharply, making the blocking even worse;
  • Because reflective calls are slow, the JDK optimizes them by dynamically generating Java classes for the method calls, replacing the original native calls. These generated dynamic classes are loaded by a DelegatingClassLoader and cannot be loaded by any other Classloader. When such reflection-optimized dynamic classes appear in an exception stack, thread blocking is very likely under high concurrency.

Combined with the stack above, the place where things get stuck becomes very clear:

  • A flood of requests comes in and the downstream service times out, so the timeout exception stack is printed very frequently. For every stack frame, loadClass obtains the corresponding class, version, line number and other information via reflection. That loading is synchronized: one thread holds the lock while most of the others block waiting for the class to finish loading, which drags down performance.
  • To be fair, even if most threads wait on one thread's loadClass, that should only be a momentary stall. Why does this error keep going back to loadClass? Combining the conclusions above with the program code: the downstream-call logic in these threads includes Groovy script execution, which generates dynamic classes. As conclusion 3 explains, under high concurrency log4j cannot correctly resolve such dynamic classes via reflection, the load fails, the next occurrence of the same exception reflects over the stack again, and the process effectively becomes a loop. More and more threads can only join the queue, waiting and blocked.

    Best Practices

    1. Remove unnecessary exception stack printing

    If you already know exactly what the exception is, don't print its stack; save the performance. Anything + high concurrency takes on a whole new meaning :)

     try {
      System.out.println(Integer.parseInt(number) + 100);
    } catch (Exception e) {
      // before
      log.error("parse int error : " + number, e);
      // after
      log.error("parse int error : " + number);
    }

    If Integer.parseInt throws an exception, the cause can only be that the incoming string is not a valid number. In that case there is no need to print the exception stack at all, and the stack printing can be removed.

    2. Convert the stack information to a string and print it

     import java.io.PrintWriter;
    import java.io.StringWriter;

    // render the throwable's full stack trace into a plain String
    public static String stacktraceToString(Throwable throwable) {
      StringWriter stringWriter = new StringWriter();
      throwable.printStackTrace(new PrintWriter(stringWriter));
      return stringWriter.toString();
    }

    The stack information obtained by log.error is more complete: the JDK version, the path of each class, and for classes inside jar packages even the jar's name and version are printed. All of it is obtained by loading classes and reflecting over their metadata, at a great cost in performance.
    Calling stacktraceToString converts the exception stack into a string first. Compared with passing the Throwable directly it does lose some of that metadata about versions and jars, so decide for yourself whether that information needs to be printed (it is still useful, for example, for detecting version conflicts between classes).
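    For reference, here is a minimal usage sketch (fetchFromDownstream is a hypothetical downstream call used only for illustration): the pre-rendered stack is logged as an ordinary string parameter, so log4j never builds a ThrowableProxy for the exception and no extra class loading is triggered.

     try {
      fetchFromDownstream();
    } catch (Exception e) {
      log.error("ds fetcher get error, stack: {}", stacktraceToString(e));
    }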

    3. Disable reflection optimization

    When Log4j prints a stack that contains a dynamic proxy class generated by reflection optimization, that proxy class cannot be loaded by any other Classloader, and printing the stack then seriously hurts execution efficiency. Disabling reflection optimization has a side effect of its own, though: reflective calls themselves become slower.
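    As an illustration (this is a standard HotSpot JDK 8 system property, not a Log4j switch, and the jar name is a placeholder): raising the reflection inflation threshold keeps the JVM on its native reflective accessors, so the dynamically generated accessor classes loaded by DelegatingClassLoader are never created in the first place.

     # HotSpot JDK 8: effectively disable reflection inflation
    java -Dsun.reflect.inflationThreshold=2147483647 -jar your-service.jar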

    4. Asynchronously print logs

    In production, especially for services with high QPS, asynchronous logging must be enabled. Of course, with asynchronous logging there is a chance of losing log entries, for example when the server process is forcibly killed; that is a trade-off you have to make.
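    For reference, a minimal Log4j2 sketch of the mixed-async setup (appender name, file paths and levels are placeholders; the fully asynchronous alternative is to set the Log4jContextSelector system property to the AsyncLoggerContextSelector, and both approaches require the LMAX Disruptor on the classpath):

     <!-- log4j2.xml sketch: AsyncRoot hands events to the Disruptor ring buffer
         instead of writing to disk on the caller thread -->
    <Configuration>
      <Appenders>
        <RollingFile name="FILE" fileName="logs/app.log"
                     filePattern="logs/app-%d{yyyy-MM-dd}.log.gz">
          <PatternLayout pattern="[%d{yy-MM-dd.HH:mm:ss.SSS}] [%thread] [%-5p %-22c{0} -] %m%n"/>
          <Policies>
            <TimeBasedTriggeringPolicy/>
          </Policies>
        </RollingFile>
      </Appenders>
      <Loggers>
        <AsyncRoot level="INFO">
          <AppenderRef ref="FILE"/>
        </AsyncRoot>
      </Loggers>
    </Configuration>

    Note that this pattern deliberately avoids the location converters discussed in the next point (%C, %F, %l, %L, %M); async loggers do not capture location information by default, and turning it on would give back much of the performance gain.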

    5. The output format of the log

    Compare the differences between the following two log output patterns:

     // Format 1
    [%d{yyyy/MM/dd HH:mm:ss.SSS}] [%X{traceId}] %t [%p] %C{1} (%F:%M:%L) %msg%n

    // Format 2
    [%d{yy-MM-dd.HH:mm:ss.SSS}] [%thread]  [%-5p %-22c{0} -] %m%n

The Log4j documentation also gives a clear performance warning: if the following conversion patterns appear in the output, performance is lost heavily.

 %C or %class, %F or %file, %l or %location, %L or %line, %M or %method

To obtain the method name and line number, log4j relies on the exception mechanism: it first creates an exception, then captures its stack trace and parses the class, method and line number out of the stack content. In the implementation, acquiring and parsing this information also involves locking, so the performance loss under high concurrency can be imagined.
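A simplified illustration of that mechanism (this is not the actual log4j source, just a sketch of how location information can be recovered at runtime):

 // simplified sketch, not log4j source: materialize a stack trace and read the
// frame where the throwable was created to recover class / method / file / line
StackTraceElement location = new Throwable().getStackTrace()[0];
System.out.println(location.getClassName() + "." + location.getMethodName()
        + "(" + location.getFileName() + ":" + location.getLineNumber() + ")");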

The following conversion patterns all affect performance; configure them as appropriate:

 %C - the caller's class name (slow, not recommended)
%F - the caller's source file name (extremely slow, not recommended)
%l - the caller's method name, file name and line number (strongly discouraged, extremely costly)
%L - the caller's line number (extremely slow, not recommended)
%M - the caller's method name (extremely slow, not recommended)

Solution: dynamically adjusting the log level

Project code needs to print plenty of INFO-level logs to support locating problems and troubleshooting during testing. In production, however, most of these INFO logs are useless, and their sheer volume eats CPU. What is needed is the ability to adjust the log level dynamically: INFO logs can be turned on and inspected whenever required, and turned off dynamically when they are not, without affecting service performance.

Solution: combine Apollo with log4j2's own capabilities to control the log level dynamically and at fine granularity from the API level, either globally or for a single class. The advantage is that it takes effect immediately: when troubleshooting in production you can open up verbose logging for just the class you care about, and close it again as soon as you are done.

Space is limited, so the full implementation code is not included here. It is actually very simple: it cleverly uses Apollo's change-notification mechanism to reset the log level. A rough sketch of the idea follows; if you are interested in the details, leave a comment or send me a private message and I will write a dedicated article explaining exactly how to do it.
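A minimal sketch of the idea (assuming an Apollo key convention such as log.level.<loggerName>, which is purely illustrative):

 // minimal sketch: listen for Apollo config changes and reset Log4j2 levels at runtime
import com.ctrip.framework.apollo.Config;
import com.ctrip.framework.apollo.ConfigService;
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class DynamicLogLevelListener {

    // assumed key convention: "log.level.<loggerName>" -> "DEBUG" / "INFO" / "WARN" ...
    private static final String KEY_PREFIX = "log.level.";

    public void register() {
        Config config = ConfigService.getAppConfig();
        config.addChangeListener(event -> {
            for (String key : event.changedKeys()) {
                if (!key.startsWith(KEY_PREFIX)) {
                    continue;
                }
                String loggerName = key.substring(KEY_PREFIX.length());
                String newLevel = event.getChange(key).getNewValue();
                // takes effect immediately, no restart required
                Configurator.setLevel(loggerName, Level.toLevel(newLevel, Level.INFO));
            }
        });
    }
}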

Summary and Outlook

This article has walked you through the common logging problems in everyday software services and the corresponding solutions. Remember: something simple + high concurrency = not simple! Stay in awe of production!

If you have read all the way to the end, you are a true fan. If you have any questions, leave a comment or send me a private message and I will reply as soon as I see it. If you think the content I share is solid enough, please like, follow and share; that is the greatest encouragement to me. Thank you for your support!
I hope the articles I share can be helpful to every reader!

Welcome to follow my official account: Gugu Chicken Technical Column. Personal technical blog: https://jifuwei.github.io/
