
Preface

Logs are an important tool for developers to analyze programs and troubleshoot problems. As system architecture has evolved from early monolithic applications to today's microservices, logs have only grown in importance. Beyond assisting troubleshooting, logs can be used to analyze the performance of a request across the full microservice call chain, and even to help resolve consistency problems in distributed systems. At the same time, the volume of logs a system produces, and the difficulty of managing them, has grown dramatically. This is why log management tools were born and keep being upgraded: from logging in to a jump server to read files by hand, to building a centralized distributed log platform that manages log streams uniformly, to cloud vendors offering dedicated log management services. Developers only need to integrate an SDK in the application to ship logs to the platform, and can then use its intelligent search, data analysis, and trace analysis capabilities. With an easy-to-use graphical interface and mature data management, such platforms greatly improve development efficiency.

However, a log management platform is not a panacea. There are always scenarios where it is absent: logs produced during local debugging are not shipped to the platform, complex data analysis may not be supported, and, most common and most maddening of all, the data may simply be lost. When that happens we are left face to face with piles of raw log files, and we have to pull the original weapon, Linux commands, out of the toolbox and get to work.

This article combines scenarios encountered in daily development with the corresponding log retrieval commands. You are welcome to add it to your own toolkit, or leave a comment below with the log analysis problems you have run into, and I will incorporate them into the article.

Log structure

Before getting to the log analysis commands, let me briefly introduce log types and structure. The rest of the article uses the content of this section as the background for the concrete examples.

There are two main types of logs. One is business logs, which are generated while implementing business functionality and are usually produced by developers explicitly instrumenting the code. The other is system logs, a broader category that can be further subdivided, such as snapshots of various metrics of the host the system runs on, or logs printed internally by the middleware it depends on.

Business logs and system logs usually live in different directories; in fact, different types of system logs are generally isolated in separate directories as well. Taking a system that uses MySQL and RocketMQ as an example, its log directory structure might look like this:

/log
  /mysql
  /rocketmq
  /app
    /app_name1
    /app_name2

As you can see, the different business systems under app also have their logs isolated from each other for easy retrieval and viewing.

Now let's look at the log file structure inside each directory. There is often no absolute standard: developers usually design the log file layout according to the needs of the system, and may even generate logs for specific purposes (such as the MySQL binlog and relay log). Here is a brief introduction to the log layout I am used to when developing a system.

Usually I distinguish at least three log files: application.log , application-error.log and application-warn.log . As the names indicate, they correspond to different log levels: application.log contains all logs over the application's lifetime, while application-error.log and application-warn.log record only error-level and warn-level logs respectively, making it easy to quickly locate system exceptions. However, if you list all the files in the directory, you will find more than these three.

This is because a large amount of log output is produced while the system runs. If a single file were used for storage, it would grow extremely large and seriously consume disk space. Therefore the operating system or the logging tool, once configured, truncates, compresses, and backs up logs to reduce their impact on the stability of the host. The truncated logs are saved with a suffix according to the configuration, usually a timestamp.
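As an illustration, a rotated log directory might look like the listing below. This is only a hypothetical example; the exact names depend on your logrotate or logging framework configuration.

application.log
application.log.2021-10-01.0.gz
application.log.2021-10-02.0.gz
application-error.log
application-error.log.2021-10-02.0.gz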

In addition to splitting log files by level as above, log files can also be organized along other dimensions, for example printing separate logs at the system's traffic entrances and exits. A traffic entrance can be understood as an RPC server interface, an HTTP server, an MQ message consumer, and so on; a traffic exit, conversely, refers to an RPC client call, a call to a downstream HTTP service, and the like. The full log directory might therefore contain the following files:

application.log
application-error.log
application-warn.log
rpc-client.log
rpc-server.log
mq.log

The specific logging configuration is beyond the scope of this article; you can refer to the documentation of Logback or whichever logging framework you use.

Handy log analysis operations

Next, common log query and analysis scenarios from daily development are listed, together with the corresponding commands.

View log

  1. View a single log file
    cat is the command we use most often to read a file: cat ${filename} prints its content to the console. Take application.log as an example:

    cat application.log

    This command is suitable for viewing all readable files.

  2. View multiple log files
    As mentioned above, because of log rotation (e.g. logrotate), logs are often split into multiple files with different timestamp suffixes, and we may not know which file contains the entry we are looking for. In that case we can pass multiple files to the cat command, cat ${filename} ${filename2} ${filename...} , and cat will read and display them one by one. But what if there are very many files? Fortunately, the shell supports wildcard expansion: the * character matches any sequence of characters in a filename, so a single pattern can cover all the rotated files.

    cat application1.log application2.log application3.log
    cat application.log*

    Of course, using cat to dump the full logs is not a good choice when the files are large; other ways of querying logs are given below.

  3. View the last few lines of log
    The cat command reads the whole log file from beginning to end and prints it to the console, but sometimes we only need the most recent portion of the log. Moreover, when the file is very large, cat is not only slow, it also floods the screen with irrelevant lines. The tail command solves this well: it reads only the last part of a file (by default the last 10 lines) and prints it to the screen.

    tail application.log # print the last part of application.log

    The tail command also accepts multiple files; it reads the last part of each file in order and prints them to the console:

    tail application1.log application2.log application3.log

    If you want to show, say, the last 100 lines, use the -n option:

    tail -n 100 application.log
  4. View incremental log
    Sometimes we want to watch a log file in real time to catch the system's behavior as it happens. tail -f keeps the file open and prints new content as it is appended; press Ctrl+C to stop following:

    tail -f application.log
  5. View log by page
    Sometimes there is so much log content that the console scrolls madly and the key information is buried. We need a command that shows the log page by page to reduce how fast the console refreshes, and the more command supports this scenario very well.

    more application.log

    After executing more, the console displays the file one screen at a time: press Space to show the next screen, Enter to show the next line, and q to quit.

So far we have covered the main commands for viewing a file in full or in part. Next comes another type of query: retrieval by keyword.

Keyword search

  1. Retrieve logs based on keywords
    In a distributed system, dozens or even hundreds of services may participate in a single request. The traffic entrance therefore generates a unique logId that tags the entire request chain. When we need to troubleshoot with upstream or downstream teams, we usually hand them this logId; likewise, when we are given a logId, we need to find every log line associated with it. This is the typical scenario of retrieving logs by keyword.
    The grep command solves this problem very well: it prints the lines in the log that match the keyword.

    grep "logId" application.log

    The command above prints every line in application.log that contains logId.
    grep also supports querying multiple files:

    grep "logId" application1.log application2.log application3.log
    grep "logId" application*.log

    It also supports case-insensitive matching and regular expressions:

    grep -i "logId" application.log # case-insensitive match
    grep -E "logId:[0-9a-f]+" application.log # extended regular expression match

By the way, here is a system design reminder. In a distributed system, the logId is propagated through the current thread's context. Therefore, if the current thread hands part of its work to an asynchronous thread and you still want to trace that work under the same request, remember to copy the logId into the asynchronous thread's context as well.

  2. View logs near a keyword
    You may have written logging like this in a program: log.error("something went wrong keyId:{}", keyId, ex) . Through this line we expect the log to contain both the key id and the exception stack. However, the logging framework prints the message and the stack trace as separate lines, so a plain grep ${keyId} application.log cannot find the stack information on the following lines. We therefore need to see the N lines around the matched line to quickly locate the cause of the exception. grep provides this ability too:

    grep 'something went wrong' -A 100 application.log

    The statement above prints the 100 lines following each line that matches something went wrong, including the matching line itself. Here -A is short for --after-context , which makes it easy to remember.
    In the same way, you can also view the log of the first N lines of the matching line:

    grep 'something went wrong' -B 100 application.log

    Here -B is short for --before-context . Of course, the two options can also be used together, as in the sketch below.
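    For instance, the following (the line counts are arbitrary) prints some context before each match together with the stack trace that follows it; grep -C 100 is the shorthand when you want the same amount of context on both sides:

    grep 'something went wrong' -B 20 -A 100 application.log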

  3. Multi-keyword query
    Sometimes we cannot get hold of the request's logId, only the method being called and a key parameter passed to it. Matching on the method name alone may return many unrelated calls; matching on the parameter alone is no better, since the parameter may appear in calls to many other methods, and those lines get in the way of finding the request we care about. grep does provide a multi-keyword mode. Suppose we have a method helloWorld(String name) that is called with the value rale:

    grep -e "helloWorld" -e "rale" application.log

    But this command does not meet the needs of this scenario, because -e expresses an OR relationship: lines containing either helloWorld or rale are printed. Here we introduce the Linux pipeline. The pipe operator | chains two shell commands, feeding the output of the first command in as the input of the second.

    grep -e "helloWorld" application.log | grep "rale"

    In the command above, the first grep filters out all log lines containing helloWorld and passes them on, and the second grep keeps only those lines that also contain rale. Together they form an AND relationship, which meets our requirement.

  4. Inverted query (forward query + inverted query)
    Sometimes we want to query the log in reverse, that is, find the lines that do not contain certain keywords. This usually suits fuzzier problems. For example, a user complains that some feature is misbehaving, but as a shared mid-platform service it is often hard to tell immediately which method or service failed, so we first print all requests from this user (assuming we have the uid) and then filter out the calls that returned normal results. This can be combined with the pipeline shown above.

    grep "$uid" application.log | grep -v "SUCCESS"

    -v is short for --invert-match . The command first finds all log lines for the uid and then filters out the successful (SUCCESS) calls.

  5. View logs within a time interval
    When troubleshooting, the information we can get hold of is often scarce, and the log query has to be narrowed down step by step. When users report a problem, sometimes all they can give us is a rough time at which it occurred and nothing else. We then need to query all the entrance calls within that time window and, combined with the inverted query above, gradually filter down to what really matters. This assumes each log line carries a timestamp.

There are several ways to do this. The first is the grep regular-expression matching we mentioned earlier:

grep -E '2021-10-02 19:[0-9]{2}:[0-9]{2}' application.log

Here the regular expression matches all logs from 19:00 to 19:59 on October 2, 2021 (note that POSIX extended regular expressions do not support \d, so [0-9] is used). Similarly, if you want to narrow it down to a specific minute or second, replace the corresponding position with a concrete value. This approach has an obvious shortcoming: it is not flexible. If we need the data from 19:58 to 20:02, a regular expression struggles to express that range cleanly, as shown below.
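For completeness, here is a sketch of what such a pattern could look like. It does cover 19:58 through 20:02, but it is clumsy to write and easy to get wrong, which is exactly why a range-based tool is preferable:

grep -E '2021-10-02 (19:5[89]|20:0[0-2]):[0-9]{2}' application.log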

So we need a new command to finish the job: sed .
sed is a stream editor which, compared with grep, supports more complex commands and can even edit files. Here we only use it to query a time range of logs; the other features are left as extended reading:

sed -n '/2021-10-02 19:58:00/,/2021-10-02 20:02:00/p' application.log

-n suppresses sed's default output so that only the selected lines are shown, and the trailing p prints those lines to the console. The part right after -n is a range query: the range can be given as line numbers, regular expressions, or a combination of the two, and the example above uses exact text matches. The rule of the range query is that sed finds every interval that satisfies it and prints them all. For example, with the range /a/,/z/ , sed looks for a line containing a as the start and the next line containing z as the end; each time it closes one such interval it starts looking for the next line containing a, and so on until the end of the file.

This style of search has drawbacks when the logs are out of order. For example, if a 20:00 log line appears after 20:02, it may not be selected. Likewise, if no log line exists at exactly 19:58:00 or 20:02:00, the correct content may be missed as well. So we usually print part of the log first to find timestamps that actually occur on log lines, and then filter using those.
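A minimal sketch of that two-step workflow (the timestamps in the sed call are hypothetical placeholders; substitute the ones the grep commands actually return):

grep -m 1 '2021-10-02 19:58' application.log          # confirm a start timestamp that really exists
grep '2021-10-02 20:02' application.log | tail -n 1   # and an end timestamp near the desired end
sed -n '/2021-10-02 19:58:03/,/2021-10-02 20:02:41/p' application.log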

  6. Highlight keywords
    Here is an option that improves the readability of grep output: highlighting the keyword so the matching lines are easier to spot.

    grep "logId" application.log --color=auto

Counting and sorting

The commands above query individual log lines. Below we discuss some common log statistics and analysis operations. Most of these are provided by log platforms, but sometimes we still have to wrestle with a raw log file by hand.

  1. Count the number of occurrences of a keyword
    I actually ran into this scenario during development. There was an online problem, and we needed to work out what proportion of the traffic was affected. The indicator was a method keyword plus a failed result, which meant the request had been hit by the system exception. Because the data volume exceeded the row limit of a single query on the log platform, we ended up downloading the log files from the machines and counting locally.
    At this time, we must mention a new instruction wc :

    grep "methodName" application.log* | grep "FAIL" | wc -l

    The two grep commands here filter out the failed requests. The wc command then produces statistics over its input, which can be the number of bytes, words, or lines; the -l option here prints only the line count.

If we upgrade the statistical dimension and want to count how many users were affected (the user's uid is included in the log), simply counting lines would double-count, so we need to de-duplicate the log lines by user id. Here we introduce a new command, uniq, which de-duplicates adjacent identical lines (whole lines or a chosen part of them) and can count how many times each one occurs. Note that uniq only works on sorted input, so the key must be cut out and sorted before uniq gives the desired result:

grep "methodName" application.log* | grep "FAIL" | cut -d ' ' -f 1 | sort | uniq -c

This pipeline introduces several new shell commands. First is cut, which copies the value at the specified position of each line to the output stream, a bit like Ctrl+C on your keyboard. -d declares the delimiter used to split the line, here a space; -f specifies which field to take. The fields produced by cut are numbered starting from 1, so the first field after splitting is taken here (assuming the log format is userId timestamp threadId .... ; in most real systems the userId will of course not be at the start of the line). All userIds are then ordered with the sort command. Finally, uniq removes the duplicates, and its -c option prints the number of occurrences of each line next to it.
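If you also want to know which users were hit the most, one possible extension (under the same assumption that the userId is the first space-separated field) is to sort the uniq -c output numerically in descending order and keep the top entries:

grep "methodName" application.log* | grep "FAIL" | cut -d ' ' -f 1 | sort | uniq -c | sort -nr | head -n 10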

  2. Export results
    Suppose we now want to export the statistics above to a file and send it to someone else. The Linux shell also supports redirecting the output stream into a file.

    ...a lot of command ... > result.txt

    The .txt file here is just an example; you can name the output file however you like. Note that > overwrites the target file, while >> appends to it.

Thinking in reverse: best practices for printing logs

Having introduced these log parsing and statistics methods, let me also reflect on how we print logs in our systems in the first place. A unified log format often lets us complete a whole series of log-based statistics and analyses much faster. A mainstream log pattern currently looks roughly like this:

%level %d{yyyy-MM-dd HH:mm:ss.SSS} %F:%L %X{HOST_IP:-_} %X{LOG_ID:-_} %X{CLUSTER:-_} %X{STAGE:-_} [%t] %m%n%wEx

You can see that it contains the log level, the distributed logId, a timestamp, key host information (IP, cluster, stage), the thread name, and so on. Likewise, if the logs we print in the business system follow a consistent format, readability and ease of troubleshooting improve greatly. For example, when printing RPC server logs, I usually emit three kinds of log lines: a Start log, a Finish log, and a Digest log.

   private static final String START_LOG_TEMPLATE = "[rpc_server_start] [method]:{} [request]:{}";

   private static final String FINISH_LOG_TEMPLATE = "[rpc_server_finish] [method]:{} [success]:{} [duration]:{}s [request]:{} [result]:{}";

   private static final String DIGEST_LOG_TEMPLATE = "[rpc_server_digest] [method]:{} [success]:{} [duration]:{}s";

The Start log is printed before the request begins processing and contains the input parameters, the caller, and other information. The Finish log prints the input parameters, the output, whether the call succeeded, and how long it took. The Digest log is special: it carries no request or response payload at all, only the method called, whether it succeeded, and the duration. Digest logs are usually used for quick performance analysis.
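As a small illustration of why digest logs pay off, here is a sketch (assuming the DIGEST_LOG_TEMPLATE format above, with helloWorld as a hypothetical method name and the field order as shown) that computes a method's average duration straight from the log file:

grep "rpc_server_digest" application.log \
  | grep "\[method\]:helloWorld" \
  | sed -E 's/.*\[duration\]:([0-9.]+)s.*/\1/' \
  | awk '{ sum += $1; n++ } END { if (n > 0) printf "avg duration: %.3fs over %d calls\n", sum / n, n }'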

Summary

The log operations described in this article are only the tip of the iceberg of log file processing, but they are surely the most common and most frequently needed ones, and I hope they help. You are also welcome to add them to your own toolkit, or to leave a comment below with the log analysis problems you have encountered, and I will incorporate them into the article.

