How We Fix Occasional 502 Errors

The ingress monitoring dashboard shows that the failure rate is not high, but it stays persistently at the level of 0.05 to 0.1:

Querying the logs with the following condition shows that most of these errors are 502 errors:

 status>=500 | select status, count(*) a group by status order by a desc
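For readers who do not have the same log query tool at hand, the aggregation above can be sketched in Python. This is only an illustration of what the query computes; the sample status codes are made up and are not our production data.

```python
from collections import Counter

def count_5xx(statuses):
    """Count responses with status >= 500, most frequent first."""
    counts = Counter(s for s in statuses if s >= 500)
    # most_common() sorts by count in descending order,
    # which mirrors "order by a desc" in the query.
    return counts.most_common()

# Hypothetical sample of status codes, for illustration only.
sample = [200, 502, 200, 504, 502, 404, 502, 500]
print(count_5xx(sample))
```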

So what exactly is a 502 error? The explanation given by Baidu Encyclopedia is:

502 Bad Gateway means a bad or invalid gateway; it is a network error on the Internet, shown as a feedback page in the web browser. It usually does not mean that the upstream server is down (no response to the gateway/proxy), but rather that the upstream server and the gateway/proxy disagree on the protocol used to exchange data. Given that Internet protocols are fairly well defined, it often means that one or both machines have been programmed incorrectly or incompletely.

Others say that it is caused by a timeout:

Someone immediately refuted this in the comment area:

Baidu Encyclopedia's explanation of the 504 error:

The 504 error stands for Gateway Timeout, which means that the server, while acting as a gateway or proxy, did not receive a timely response from the upstream server. The server (not necessarily a web server) is acting as a gateway or proxy to fulfill requests from clients (such as your browser or our CheckUpDown bot) to access the desired URL.

Clearly, 504 is the timeout error, and 502 is not.

Our further analysis of the 502 error log agrees: the request time and response time of the requests that failed with 502 are extremely short, so the cause cannot be a timeout.

Of the explanations we found for the difference between 502 and 504, only this one seemed reasonably reliable:

In other words, our backend service does respond, but the response is not one the gateway can accept, so a 502 error occurs. This error does not happen on every request, though. If it did, the whole website would be unavailable and we would have noticed long ago. Precisely because it is intermittent, we need to look at what was happening at the moment a 502 occurred.

To do this, we enable stderr output in nginx's logtail log configuration:

It was previously set to false; we change it to true so that the error log is output and we can find the cause.

As soon as stderr output is enabled, a large number of errors like this appear in the log:

 2022/04/02 16:59:55 [error] 11168#11168: *739601507 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 49.93.83.68, server: www.domain.com, request: "POST /myserver/service HTTP/1.1", upstream: "http://192.108.1.121:8080/myserver/service", host: "www.domain.com"

Literally, it means that the upstream server reset the connection. But why would the upstream server do that? We put the error message into a search engine to investigate further, and many articles pointed us toward keepalive. The two settings to check are keepalive_timeout and keepalive_requests.
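When there are many such lines, it helps to pull out the failing operation and the upstream address automatically. The following is a minimal sketch that assumes every line has the same layout as the one shown above; the pattern and the function name parse_upstream_error are our own illustration, not part of nginx.

```python
import re

# Matches the fields of interest in an nginx upstream error line.
# Assumption: the line layout is the same as in the sample above.
ERROR_RE = re.compile(
    r'\[error\] \d+#\d+: \*\d+ (?P<message>.+?) while (?P<phase>.+?), '
    r'client: (?P<client>[^,]+), server: (?P<server>[^,]+), '
    r'request: "(?P<request>[^"]+)", upstream: "(?P<upstream>[^"]+)"'
)

def parse_upstream_error(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ERROR_RE.search(line)
    return match.groupdict() if match else None

line = (
    '2022/04/02 16:59:55 [error] 11168#11168: *739601507 recv() failed '
    '(104: Connection reset by peer) while reading response header from upstream, '
    'client: 49.93.83.68, server: www.domain.com, '
    'request: "POST /myserver/service HTTP/1.1", '
    'upstream: "http://192.108.1.121:8080/myserver/service", host: "www.domain.com"'
)
info = parse_upstream_error(line)
print(info["message"], "->", info["upstream"])
```

Grouping the parsed lines by upstream or by message is then a quick way to confirm that the errors share one cause.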

What is keepalive? It is the default behavior in HTTP/1.1. In HTTP/1.0, if your web page has 10 pictures, the browser has to open 10 connections to the server, and each one is closed once its picture has been sent. Opening and then closing 10 connections is relatively expensive for the server. HTTP/1.1 therefore makes keepalive the default: to send the 10 pictures, only one connection needs to be established. The connection is not closed immediately after a transfer completes but stays open so that it can be reused for the next one, which is what keepalive means.

But a keepalive connection cannot be kept open forever. Holding it open with nothing to send is a waste, which is where the timeout comes in. keepalive_timeout means that if the connection stays idle for longer than this time, it is closed. keepalive_requests is the maximum number of requests that may be served over one connection; once that number is reached, the connection is also closed.
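The two limits can be illustrated with a toy model. This is a sketch under our own assumptions, not how nginx implements keepalive; the class name and its fields are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class KeepAliveConnection:
    """Toy model of one keepalive connection (not nginx's real implementation)."""

    def __init__(self, idle_timeout, max_requests):
        self.idle_timeout = idle_timeout  # seconds, plays the role of keepalive_timeout
        self.max_requests = max_requests  # plays the role of keepalive_requests
        self.served = 0
        self.last_used = 0.0
        self.open = True

    def request(self, now):
        """Serve a request at time `now`; return True if the connection was reusable."""
        if self.open and now - self.last_used > self.idle_timeout:
            self.open = False  # idle for too long, so the connection was closed
        if not self.open:
            return False
        self.served += 1
        self.last_used = now
        if self.served >= self.max_requests:
            self.open = False  # request limit reached, close after this response
        return True

conn = KeepAliveConnection(idle_timeout=60, max_requests=3)
print([conn.request(t) for t in (10, 50, 100, 101)])
```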

So what does keepalive_timeout have to do with our 502 errors? In an architecture like ours, the browser does not connect directly to the back-end application server; an nginx server sits in the middle as a reverse proxy. One keepalive connection is established between the browser and nginx, and another between nginx and the back-end application server, so there are two separate keepalive connections. We call the one between the browser and nginx ka1, and the one between nginx and the application server ka2.

Suppose the timeout of ka1 is set to 100 seconds: if nothing is transmitted for 100 seconds, the connection between nginx and the browser is closed. Suppose ka2 is set to 50 seconds: if nothing is transmitted between nginx and the application server for 50 seconds, the application server closes its connection to nginx. Now consider what happens if nothing is transmitted for the first 50 seconds and the browser sends a request to nginx at the 51st second. ka1 is still open, because 100 seconds have not yet passed, so that hop is fine. But when nginx tries to forward the request to the application server, ka2 has already been closed, because its 50-second timeout has expired. nginx can no longer obtain a valid response from the application server and has to return a 502 error to the browser!
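The scenario can be written down as a small decision function. This is a simplified sketch of the reasoning above, assuming both connections have been idle for the same length of time; the function name and the returned labels are our own.

```python
def outcome(idle_seconds, ka1_timeout, ka2_timeout):
    """Classify a request that arrives after both connections sat idle."""
    if idle_seconds > ka1_timeout:
        # The browser-side connection is already gone, so the browser
        # has to open a new connection before it can send anything.
        return "ka1 closed"
    if idle_seconds > ka2_timeout:
        # ka1 is still open, but the application server has already
        # closed ka2, so the forwarded request fails.
        return "502"
    return "ok"

# The example from the text: ka1 = 100 seconds, ka2 = 50 seconds.
for idle in (30, 51, 120):
    print(idle, outcome(idle, ka1_timeout=100, ka2_timeout=50))
```

With ka1 shorter than ka2 the "502" branch can never be reached, because any idle period long enough to close ka2 has already closed ka1.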

But we haven't set these parameters at all. How can there be such a problem?

That is not a problem: if nothing has been set, the system uses its defaults. Let's look at the nginx (ingress) side first. Note that the setting that matters here is not the browser-facing keepalive_timeout but the idle timeout that ingress applies to its keepalive connections to the upstream server; for the comparison below, this is the value we use as ka1:

upstream-keepalive-timeout
Sets a timeout during which an idle keepalive connection to an upstream server will stay open. default: 60

So the default on the nginx side, our ka1 for this comparison, is 60 seconds.

Now let's look at the default for ka2, the application server side. The official Tomcat documentation describes the keepAliveTimeout attribute as follows:

The number of milliseconds this Connector will wait for another HTTP request before closing the connection. The default value is to use the value that has been set for the connectionTimeout attribute. Use a value of -1 to indicate no (i.e. infinite) timeout.

So the default equals the value of connectionTimeout. What is the value of connectionTimeout?

The number of milliseconds this Connector will wait, after accepting a connection, for the request URI line to be presented. Use a value of -1 to indicate no (i.e. infinite) timeout. The default value is 60000 (i.e. 60 seconds) but note that the standard server.xml that ships with Tomcat sets this to 20000 (i.e. 20 seconds). Unless disableUploadTimeout is set to false, this timeout will also be used when reading the request body (if any).

The default value of connectionTimeout is 60 seconds, but the standard server.xml that ships with Tomcat sets it to 20 seconds!
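The fallback rule in the two quoted passages can be summarized in a few lines. This is only a restatement of the documentation in Python, not Tomcat code; the function name is hypothetical, and None is used here to mean "attribute not set" on input and "no timeout" on output.

```python
def effective_keep_alive_timeout_ms(keep_alive_timeout=None, connection_timeout=60000):
    """Resolve the keepalive timeout the way the quoted documentation describes.

    keep_alive_timeout: value of keepAliveTimeout, or None if it is not set.
    connection_timeout: value of connectionTimeout (documented default 60000).
    Returns the timeout in milliseconds, or None for an infinite timeout.
    """
    # An unset keepAliveTimeout falls back to connectionTimeout.
    value = connection_timeout if keep_alive_timeout is None else keep_alive_timeout
    # -1 means no (infinite) timeout.
    return None if value == -1 else value

# The standard server.xml sets connectionTimeout to 20000 and leaves
# keepAliveTimeout unset, so the effective keepalive timeout is 20000 ms.
print(effective_keep_alive_timeout_ms(connection_timeout=20000))
```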

So now the problem is clear: our ka1 is 60 seconds and ka2 is 20 seconds, so a 502 error can occur whenever a request arrives after the connection has been idle for more than 20 seconds but less than 60.

Having found the root cause, the fix is easy: we only need to make sure that the timeout of ka1 is smaller than that of ka2. We can change either ka1 or ka2.
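The risky interval follows directly from the two numbers. A minimal sketch, with a function name of our own choosing:

```python
def risky_idle_window(ka1_seconds, ka2_seconds):
    """Return the (start, end) idle interval in which a 502 is possible.

    Returns None when ka1 is smaller than ka2, because nginx then gives up
    the idle connection before the application server closes it.
    """
    if ka1_seconds < ka2_seconds:
        return None
    # Equal values give a zero-width window and leave no safety margin.
    return (ka2_seconds, ka1_seconds)

# Defaults found above: ka1 = 60 seconds, ka2 = 20 seconds.
print(risky_idle_window(60, 20))
```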

Let's change ka1 first and see what happens. With ingress, ka1 is configured in the ingress ConfigMap, so we find where the ConfigMap is defined and add a new key to it:

Here we set upstream-keepalive-timeout to 4, well below the 20 seconds of ka2. Ingress picks up the new setting automatically. Let's take a look at the result:

The 502 errors that had been occurring constantly disappeared completely!

Let's take a look at the error chart again:

Watch the 5XX ratio, drawn in yellow: from the moment we changed the setting, it has stayed flat at zero!
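As a rough illustration of what such a ConfigMap looks like, the sketch below builds the manifest in Python and prints it as JSON, which Kubernetes accepts as well as YAML. The ConfigMap name and namespace are assumptions and differ between installations; only the upstream-keepalive-timeout key and its value of 4 come from our setup.

```python
import json

def ingress_configmap(name, namespace, upstream_keepalive_timeout):
    """Build a ConfigMap manifest that carries only the keepalive setting."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # ConfigMap values must be strings, so the number is converted.
        "data": {"upstream-keepalive-timeout": str(upstream_keepalive_timeout)},
    }

# "nginx-configuration" and "ingress-nginx" are hypothetical names;
# use the ConfigMap your ingress controller actually reads.
manifest = ingress_configmap("nginx-configuration", "ingress-nginx", 4)
print(json.dumps(manifest, indent=2))
```

In practice the key should be added to the existing ConfigMap alongside its other settings rather than replacing them.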


张京
