
Preface

Recently, our team took over a legacy project: an HTTP service written in Node.js. The project essentially implements a layer of proxy logic in Node.js, forwarding traffic, adding some key tracking points, and displaying request records (there is also some more advanced usage, such as tampering with request-related data, but it is not relevant to this troubleshooting, so I won't go into it).
When we first took it over, what we were told was that the project had to be restarted regularly because of memory leaks and other unknown reasons; otherwise users would experience noticeable delays or the service would become unavailable.
So while doing manual operations work to keep it alive, we started reading the source code, getting familiar with the project, analyzing the cause of the problem, and trying to fix it.

Investigation process

The project already reports key metrics, including memory, CPU, and user traffic statistics. Two of them turned out to be especially important:

  1. Memory trend
  2. Number of Active Handles


Just as we had been told, there is a memory leak: the cliff-like drops in the chart, occurring in under an hour, are caused by the service being restarted.
The Active Handle metric below fluctuates heavily as well. Let me briefly explain what an Active Handle is, which leads to a term that comes up constantly in the Node.js world: the Event Loop.

A brief introduction to the Event Loop

Let me briefly explain how the Event Loop works. Suppose we have a file with the following code:

console.log('hello')

Then we run node XXX, and the process exits as soon as it finishes.

And if we write a file like this:

console.log('hello')

setInterval(() => {}, 1e3)

Run the file with node XXX again and the terminal stays attached to the process, which does not exit, because we created a timer, and a timer can be understood as an Active Handle. The Event Loop works roughly like this: execute all the synchronous code, then check whether there are any Active Handles; if none are detected, the process exits immediately, and if there are, it enters the Event Loop we are all familiar with.
We can get the number of Active Handles through a non-public API:

process._getActiveHandles()

console.log(process._getActiveHandles().length) // 0

If we run node XXX we get 0, whereas if we just run node (the interactive REPL) we find the value is 2. This is because the interactive command line binds stdin and stdout, which is also why it stays in the interactive environment instead of exiting back to the terminal.

Many asynchronous APIs, such as process.nextTick, net.listen, and the fs.XXX methods, can be thought of in a similar way: as long as such pending work exists, the process keeps running.
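
As a rough illustration (not the project's code), here is a small sketch that uses the non-public process._getActiveHandles() mentioned above to watch handles come and go; the exact count it prints depends on the environment:

import { createServer } from 'net'

const timer = setInterval(() => {}, 1e3) // a timer registers an Active Handle
const server = createServer().listen(0)  // a listening server registers another

setImmediate(() => {
  // Non-public API, so the exact number varies, but the timer and the
  // server both show up here and keep the process alive.
  console.log((process as any)._getActiveHandles().length)

  clearInterval(timer)
  server.close() // once no handles remain, the process exits on its own
})

Once the timer is cleared and the server is closed, nothing is left to keep the Event Loop busy, so the process exits just like the first example.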

The above is a brief explanation of what an Active Handle is; later I will come back to why this metric helped us track down the problem.

Analyzing the problem

While reading through the project code, we found that its core logic is to forward traffic, implemented by binding two Sockets to each other. The specific code:

export function pipeSocket(from: Socket, to: Socket) {
  from.on('data', chunk => {
    to.write(chunk)
  })
  from.on('close', () => {
    to.destroy()
  })
}
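
For context, here is a minimal sketch of how a helper like this might be wired up in such a proxy; the listen port, upstream port, and import path below are hypothetical, since the article does not show the calling code:

import { createServer, connect } from 'net'
import { pipeSocket } from './socket' // hypothetical module path

const UPSTREAM_PORT = 8080 // placeholder for the local port traffic is forwarded to

createServer(clientSocket => {
  const upstream = connect(UPSTREAM_PORT, '127.0.0.1')
  // Bind the two sockets to each other, one call per direction.
  pipeSocket(clientSocket, upstream)
  pipeSocket(upstream, clientSocket)
}).listen(3000)

Every incoming request therefore produces a pair of sockets whose lifetimes depend entirely on how pipeSocket cleans them up.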

Nothing looks wrong at first glance, but combined with the abnormal fluctuations in the number of Active Handles we had seen, we had reason to suspect this function.
At the same time, we dug into the server while the metrics were abnormal and found some clues.
We ran netstat -an | awk '/^tcp/ {++y[$NF]} END {for(w in y) print w, y[w]}' to count connections by state, and got a result similar to this:

You can see that the number of ESTABLISHED connections is very large. This was actually different from my initial guess: I expected the usual CLOSE_WAIT or TIME_WAIT problem, not an abnormal number of ESTABLISHED connections.
In TCP terms, ESTABLISHED means the connection has been established successfully and can be understood as the normal data-transfer state:

However, having so many connections transferring data at the same time is obviously not normal, so we checked which IP held the most connections with netstat -nat | grep ESTABLISHED | awk '{print $5}' | awk -F : '{print $1}' | sort | uniq -c. Most of them pointed at 127.0.0.1, because the project's proxy logic forwards requests to a specific local port, where the service that actually does the proxying is listening.

Because the IPs are sensitive, the screenshot is not shown.
Roughly speaking, a few dozen rows came back, and the row for 127.0.0.1 alone accounted for more than 20,000 connections.

Next, we investigated the connections to 127.0.0.1 further: we needed to find out why there were so many connections stuck in the established state and why they were never closed.

So we ran ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | tail -n 10. Here we found a clue: in the last column of the output, many sockets had no timer attached at all.

This was puzzling, because keepalive is configured at the system level: if the peer stops transmitting, a connection should be kept for at most two hours (our system is set to 7200 seconds), so this many connections should never accumulate.
The ss output made us more certain: some sockets were never released and stayed in ESTABLISHED indefinitely. To get exact numbers, we ran the following two commands, splitting the statistics by whether a timer is attached:

ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | grep -v 'timer:' | wc -l # 28636
ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | grep 'timer:' | wc -l    # 115

The gap between the two numbers is alarming. We then sampled a few port numbers from the ss output and looked up the corresponding file descriptors with lsof -p <process ID> | grep <port number>.
Then we checked each descriptor with ll /proc/<process ID>/fd/<FD> and found sockets created more than half a day earlier, far past the keepalive detection time.

With the above information, the problem was all but confirmed.

Fixing the problem

Having located the problem, we started to fix it. The goal was now clear: find the sockets that had no keepalive configured and enable keepalive on them.
This brings us back to the more critical function in the previous section:

export function pipeSocket(from: Socket, to: Socket) {
  from.on('data', chunk => {
    to.write(chunk)
  })
  from.on('close', () => {
    to.destroy()
  })
}

As mentioned earlier, a Socket is one kind of Active Handle. Since this function is where connections are tied together, it was the first place we focused on after the investigation above.
Looking at the net module documentation, we found that a Socket does not enable keepalive by default: https://nodejs.org/api/net.html#net_socket_setkeepalive_enable_initialdelay
So the straightforward fix is to call setKeepAlive on the Socket inside this function.
Since we were already in the net module documentation, we studied Socket more carefully and found that not only is keepalive off by default, there is no timeout by default either.
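
In other words, both settings have to be opted into explicitly on every socket. A minimal sketch of the two calls involved (the 60-second values are simply the ones we ended up using below):

import { Socket } from 'net'

const socket = new Socket()
// Neither of these is enabled by default.
socket.setKeepAlive(true, 60 * 1e3) // start TCP keep-alive probes after 60s of idle
socket.setTimeout(60 * 1e3)         // emit a 'timeout' event after 60s of inactivity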

We added the timeout as well. Because the project acts as a proxy for all traffic, it is inevitable that users occasionally, and unintentionally, request some invalid resources, which leaves the request stuck in a pending state forever, so we have to add a timeout to prevent that.

We also reviewed how the function handles socket events. The existing logic only handles close, but a Socket emits many other events, including error, end, and timeout.
And even if we set a timeout, the Socket is not closed automatically: only the timeout event is emitted, and the actual shutdown has to be performed by us.
Also, the current close handler uses destroy, but the more recommended practice is to use end: end sends a FIN to the peer and closes the connection gracefully, triggering both end and close, whereas destroy tears the connection down immediately and only triggers close.
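
To make the difference concrete, here is a small standalone sketch (again, not the project's code) that connects two throwaway clients to a local server and logs which events fire for end versus destroy:

import { createServer, connect, AddressInfo } from 'net'

const server = createServer(() => {
  // With the default allowHalfOpen: false, Node ends the server side
  // automatically once the peer sends its FIN.
}).listen(0, () => {
  const { port } = server.address() as AddressInfo

  const viaEnd = connect(port, '127.0.0.1', () => viaEnd.end())
  viaEnd.on('end', () => console.log('end(): "end" fired'))    // peer's FIN received
  viaEnd.on('close', () => console.log('end(): "close" fired'))

  const viaDestroy = connect(port, '127.0.0.1', () => viaDestroy.destroy())
  viaDestroy.on('end', () => console.log('destroy(): "end" fired')) // never fires
  viaDestroy.on('close', () => {
    console.log('destroy(): "close" fired')
    server.close() // let the process exit once the sockets are gone
  })
})

Both clients end with a close event, but only the end() path also sees end, which matches the behavior described above.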

So after our above changes, the function becomes like this:

import { Socket } from 'net'

export function pipeSocket(from: Socket, to: Socket) {
  // Neither a timeout nor TCP keep-alive is enabled on a Socket by default.
  from.setTimeout(60 * 1e3)
  from.setKeepAlive(true, 60 * 1e3)
  from.on('data', chunk => {
    to.write(chunk)
  })
  // However this side finishes, close the other side gracefully with end().
  from.on('close', () => {
    to.end()
  })
  from.on('error', () => {
    to.end()
  })
  from.on('end', () => {
    to.end()
  })
  // 'timeout' only notifies us; the socket has to be closed manually.
  from.on('timeout', () => {
    from.destroy()
  })
}

Verifying the fix

After the change above was deployed and had been running for a while, we checked the number of TCP connections again:

ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | grep -v 'timer:' | wc -l # 191
ss -aoen | grep '127.0.0.1' | grep 'ESTAB' | grep 'timer:' | wc -l    # 261

We also verified the number of connections in each TCP state with netstat -an | awk '/^tcp/ {++y[$NF]} END {for(w in y) print w, y[w]}':

Although TIME_WAIT looks high, it is different from the earlier ESTABLISHED pile-up: these connections do not linger. Because the service handles a lot of requests, connections are constantly being closed and recreated, so every time you run the command you will see many connections in TIME_WAIT, which is simply the normal state of a TCP connection being shut down.

Both the memory metric and the Active Handle metric we were concerned about are back in a normal state:


Summary

To sum up the problems encountered in this troubleshooting: they mainly came down to using Sockets incorrectly, so that resources were not released in time.

  1. No timeout was set on the Socket
  2. KeepAlive was not enabled on the Socket
  3. Socket event handling was incomplete
  4. Sockets were closed with destroy instead of a graceful end

After addressing the items above, the problems users had been running into, such as the service being unreachable, slow responses, and the need for scheduled restarts, were all resolved.

