Table of contents

- Start testing
- Root cause analysis
  - TCP half-open
  - keepalive
  - Retransmission timeout
  - Zero window timeout
  - Timeout settings at the application/socket layer
    - TCP_USER_TIMEOUT
    - SO_RCVTIMEO / SO_SNDTIMEO
    - poll timeout
  - Root cause summary
- Why bother digging this deep?
- Keepalive checks for idle connections
  - As upstream (server)
  - As downstream (client)
  - TCP_USER_TIMEOUT
- Envoy application-layer health checking
  - Health checks and connection pools
  - Health checks and endpoint discovery
  - Active health checking: Health checking
  - Passive health checking: Outlier detection
  - Health checks vs. EDS: which does Envoy trust?
- Envoy application-layer timeouts
  - Connection-level timeouts at the Envoy application layer
  - Request-level timeouts at the Envoy application layer
    - Request read timeout for downstream (client)
    - Timeout waiting for the response from upstream (server)
  - Talking about timeouts, don't forget the impact of retries
- Thoughts
- A short summary
- Main references

If the images are unclear, please view the original post.
Recently I needed to do some HA / Chaos Testing on an environment composed of a k8s cluster + VIP load balancing + Istio. As shown in the figure below, the goal is to observe the impact on external users (Client) when worker node B is shut down abnormally or network-partitioned:
- Request success rate impact
- Performance (TPS/Response Time) impact
Some notes on the figure above:

- The TCP/IP-layer load balancing of the external VIP (virtual IP) distributes traffic with the Modulo-N algorithm of ECMP (Equal-Cost Multi-Path), essentially hashing the 5-tuple (protocol, srcIP, srcPort, dstIP, dstPort). Note that this load balancing is stateless: when the number of targets changes, the result of the algorithm also changes. It is an unstable algorithm (see the toy sketch after this list).
- TCP traffic whose dstIP is the VIP, after arriving at a worker node, is DNAT-ed statefully by ipvs/conntrack rules, mapping the dstIP to the address of one of the Istio Gateway pods. Note that this load balancing is stateful: when the number of targets changes, the load balancing result of existing connections does not change. It is a stable algorithm.
- The Istio Gateway pod in turn load balances HTTP/TCP traffic. The two protocols differ:
  - For HTTP, multiple requests on one connection from the same downstream (traffic sender) may be load balanced to different upstreams (traffic targets).
  - For TCP, all packets of one connection from the same downstream are forwarded to the same upstream.
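Why is Modulo-N hashing unstable? Here is a toy C sketch (the hash function is hypothetical, purely for illustration): when the number of next-hops N changes, h % N changes for most flows, so established connections suddenly land on a different worker node.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy 5-tuple ECMP: hash the flow, pick a next-hop modulo N.
 * Not a real router hash; it only illustrates Modulo-N instability. */
struct flow { uint8_t proto; uint32_t sip, dip; uint16_t sport, dport; };

static uint32_t flow_hash(const struct flow *f) {
    /* hypothetical mix function */
    return f->proto * 2654435761u ^ f->sip ^ f->dip * 40503u
           ^ ((uint32_t)f->sport << 16 | f->dport);
}

int main(void) {
    struct flow f = {6, 0x0a6f0a65, 0xc0de462e, 51092, 8080};
    uint32_t h = flow_hash(&f);
    /* When a node dies (N: 3 -> 2), h % N changes for most flows,
     * so the packets of an established connection are rerouted. */
    printf("N=3 -> node %u, N=2 -> node %u\n",
           (unsigned)(h % 3), (unsigned)(h % 2));
    return 0;
}
```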
Start testing
The Chaos Testing method is to forcibly shut down worker node B. As shown in the figure above, it can be inferred that the red and green connections will be directly affected. The impact seen from the client:

- Request success rate decreased by only 0.01%
- TPS dropped by half, and only recovered half an hour later
- Avg Response Time (average response time) basically unchanged
Note that the resources of a single worker node are not the performance bottleneck of this test. So where is the problem?

The client is a JMeter program. Looking at the test report it generated, we found that after the worker node was shut down, the Avg Response Time did not change much, but the P99 and MAX response times became abnormally large. Clearly, Avg Response Time hides a lot: the test threads were probably blocked somewhere, which is what made TPS drop.

After much fiddling, the problem was solved by setting the JMeter timeout of the external client to 6s. After the worker node is shut down, TPS now recovers quickly.
Root cause analysis
The external client's problem is solved; we could call it a day and go to dinner. But as someone who loves to dig, I want to find out why, and even more, whether this recovery is really as fast as it looks, or whether hidden dangers remain.
Before we start, let's introduce a concept:

TCP half-open

According to RFC 793, a TCP connection is called half-open when the host at one end of the connection has crashed, or the socket was dropped without notifying the other end. If the half-open end is idle (i.e., sends no data or keepalive), the connection may remain half-open for an indefinite period of time.
After worker node B is shut down, from the external client's point of view (as shown in the figure above), its TCP connections to worker node B may be in one of two states:

- The client kernel needs to send packets to the peer, either because it has data to send (or retransmit), or because the keepalive idle time has elapsed. Worker node A receives such a packet. Since the packet belongs to no valid TCP connection there, the possible outcomes are:
  - Worker node A responds with a TCP RST. The client closes the connection on receiving it. The client thread blocked on the socket returns because the connection is closed, continues to run, and closes the socket.
  - Because the DNAT mapping table has no entry for this connection, the packet is dropped without a response. The client thread stays blocked on the socket. This is TCP half-open.
- The client connection has keepalive disabled (or the idle time has not yet reached the keepalive threshold), the kernel has no data to send (or retransmit), and the client thread blocks waiting to read from the socket. This, too, is TCP half-open.

So, for the client, discovering that a connection has failed most likely takes a while. In the worst case, with keepalive disabled, a TCP half-open connection may never be discovered.
keepalive
From [TCP/IP Illustrated Volume 1]:

A keepalive probe is an empty (or 1-byte) segment whose sequence number is one less than the largest ACK seen from the peer so far. Because receiving this segment has no side effects on the peer, the peer simply returns an ACK, which is used to determine whether the peer is alive. Neither the probe segment nor its ACK contains any new data.

If a probe segment is lost, TCP does not retransmit it. [RFC1122] states that, because of this, a single keepalive probe that receives no ACK should not be considered sufficient evidence that the peer is dead; multiple probes at intervals are required.

If the socket has SO_KEEPALIVE set, then keepalive is enabled.
For TCP connections with keepalive enabled, Linux has the following global default configuration:
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html
tcp_keepalive_time - INTEGER
How often TCP sends out keepalive messages when keepalive is enabled. Default: 2 hours.
tcp_keepalive_probes - INTEGER
How many keepalive probes TCP sends out, until it decides that the connection is broken. Default value: 9.
tcp_keepalive_intvl - INTEGER
How frequently the probes are send out. Multiplied by tcp_keepalive_probes it is time to kill not responding connection, after probes started. Default value: 75 sec ie connection will be aborted after ~11 minutes of retries.
At the same time, Linux also provides configuration items independently specified for each socket:
https://man7.org/linux/man-pages/man7/tcp.7.html
TCP_KEEPCNT (since Linux 2.4)
The maximum number of keepalive probes TCP should send
before dropping the connection. This option should not be
used in code intended to be portable.
TCP_KEEPIDLE (since Linux 2.4)
The time (in seconds) the connection needs to remain idle
before TCP starts sending keepalive probes, if the socket
option SO_KEEPALIVE has been set on this socket. This
option should not be used in code intended to be portable.
TCP_KEEPINTVL (since Linux 2.4)
The time (in seconds) between individual keepalive probes.
This option should not be used in code intended to be portable.
With the default configuration, the soonest keepalive can close a dead connection is:

TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT = 2*60*60 + 75*9 = 2 hours + 11 minutes
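For comparison, here is a minimal C sketch that tightens these options on a single socket (the values are illustrative, not recommendations):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustrative values: first probe after 5s idle, then every 3s,
 * give up after 2 failed probes => a dead peer is detected after
 * 5 + 3*2 = 11 seconds of idle time, instead of ~2h11m. */
int enable_fast_keepalive(int fd) {
    int on = 1, idle = 5, intvl = 3, cnt = 2;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt);
}
```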
Retransmission timeout
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
- tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection, when RTO retransmissions remain unacknowledged. Given a value of N, a hypothetical TCP connection following exponential backoff with an initial RTO of TCP_RTO_MIN would retransmit N times before killing the connection at the (N+1)th RTO. The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout. RFC 1122 recommends at least 100 seconds for the timeout, which corresponds to a value of at least 8.
This configuration item controls how many exponentially backed-off retransmissions are sent, while in the retransmission state, before the kernel closes the connection. The default is 15, which works out to about 924.6 seconds, i.e., roughly 15 minutes.
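The 924.6s figure can be reproduced from the kernel constants TCP_RTO_MIN (200ms) and TCP_RTO_MAX (120s): the RTO doubles after each unanswered retransmission and is capped at TCP_RTO_MAX. A small C sketch of the calculation, assuming an ideal connection whose RTO starts at TCP_RTO_MIN:

```c
#include <stdio.h>

/* tcp_retries2 = N kills the connection at the (N+1)th RTO, so we
 * sum N+1 backoff intervals: start at TCP_RTO_MIN (0.2s), double
 * each time, cap at TCP_RTO_MAX (120s). */
int main(void) {
    int retries2 = 15;
    double rto = 0.2, total = 0.0; /* seconds */
    for (int i = 0; i < retries2 + 1; i++) {
        total += rto;
        rto = (rto * 2 > 120.0) ? 120.0 : rto * 2;
    }
    printf("hypothetical timeout: %.1f s\n", total); /* prints 924.6 */
    return 0;
}
```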
Zero window timeout
When the peer advertises a window size of zero, it means the peer's TCP receive buffer is full and it cannot accept more data. This may happen because resource constraints make the peer process data too slowly, eventually filling up its TCP receive buffer.

In theory, once the peer has processed the backlog in its receive buffer, it notifies the sender with an ACK that the window has reopened. But for various reasons, this ACK is sometimes lost.

Therefore, a sender with pending data must probe the window size periodically. The sender takes the first byte of unsent data from its buffer and sends it as a window probe. If the probes go unanswered beyond a certain number of attempts, the connection is closed automatically; on Linux the default is 15 attempts, controlled by the same configuration item, tcp_retries2. (If the peer keeps answering with a zero window, however, the connection can stay in this state indefinitely; that is one of the cases TCP_USER_TIMEOUT below addresses.) The probe retry mechanism is similar to TCP retransmission.

Reference: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ (section "Zero window")
Timeout settings at the application/socket layer
TCP_USER_TIMEOUT
man tcp
TCP_USER_TIMEOUT (since Linux 2.6.37)
This option takes an unsigned int as an argument. When
the value is greater than 0, it specifies the maximum
amount of time in milliseconds that transmitted data may
remain unacknowledged, or bufferred data may remain
untransmitted (due to zero window size) before TCP will
forcibly close the corresponding connection and return
ETIMEDOUT to the application. If the option value is
specified as 0, TCP will use the system default.
Increasing user timeouts allows a TCP connection to
survive extended periods without end-to-end connectivity.
Decreasing user timeouts allows applications to "fail
fast", if so desired. Otherwise, failure may take up to
20 minutes with the current system defaults in a normal
WAN environment.
This option can be set during any state of a TCP
connection, but is effective only during the synchronized
states of a connection (ESTABLISHED, FIN-WAIT-1, FIN-
WAIT-2, CLOSE-WAIT, CLOSING, and LAST-ACK). Moreover,
when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when
to close a connection due to keepalive failure.
The option has no effect on when TCP retransmits a packet,
nor when a keepalive probe is sent.
This option, like many others, will be inherited by the
socket returned by accept(2), if it was set on the
listening socket.
Further details on the user timeout feature can be found
in RFC 793 and RFC 5482 ("TCP User Timeout Option").
In other words, it specifies how long sent data may remain unacknowledged (no ACK received), or how long the peer's receive window may stay at 0, before the kernel closes the connection and returns an error to the application.

Note that TCP_USER_TIMEOUT affects how the TCP_KEEPCNT keepalive setting behaves:
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:

TCP_USER_TIMEOUT < TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT
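As a sketch, setting it from C looks like this. Note that, unlike the keepalive options above (which take seconds), TCP_USER_TIMEOUT takes milliseconds; 10000ms here is an illustrative value:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Abort the connection with ETIMEDOUT if transmitted data stays
 * unacknowledged, or buffered data stays untransmitted (zero
 * window), for longer than 10 seconds. */
int set_user_timeout(int fd) {
    unsigned int timeout_ms = 10000;
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof timeout_ms);
}
```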
SO_RCVTIMEO / SO_SNDTIMEO
https://man7.org/linux/man-pages/man7/socket.7.html
SO_RCVTIMEO and SO_SNDTIMEO
Specify the receiving or sending timeouts until reporting
an error. The argument is a struct timeval. If an input
or output function blocks for this period of time, and
data has been sent or received, the return value of that
function will be the amount of data transferred; if no
data has been transferred and the timeout has been
reached, then -1 is returned with errno set to EAGAIN or
EWOULDBLOCK, or EINPROGRESS (for connect(2)) just as if
the socket was specified to be nonblocking. If the
timeout is set to zero (the default), then the operation
will never timeout. Timeouts only have effect for system
calls that perform socket I/O (e.g., read(2), recvmsg(2),
send(2), sendmsg(2)); timeouts have no effect for
select(2), poll(2), epoll_wait(2), and so on.
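A minimal C sketch of the setsockopt form, reusing the 6s that solved the JMeter problem above:

```c
#include <sys/socket.h>
#include <sys/time.h>

/* After this call, a blocking read()/recv() on fd returns -1 with
 * errno == EAGAIN/EWOULDBLOCK if no data arrives within 6 seconds. */
int set_read_timeout(int fd) {
    struct timeval tv = { .tv_sec = 6, .tv_usec = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}
```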
Note that in this case our client is JMeter, which is implemented in Java. It uses the socket.setSoTimeout method to set the timeout. See:

https://stackoverflow.com/questions/12820874/what-is-the-functionality-of-setsotimeout-and-how-it-works

From the source code I read, the Linux implementation uses the timeout parameter of select/poll described in the next section, not the socket options above:

https://github.com/openjdk/jdk/blob/4c54fa2274ab842dbecf72e201d5d5005eb38069/src/java.base/solaris/native/libnet/solaris_close.c#L96

JMeter closes the socket after catching the SocketTimeoutException and reconnects, so the dead-socket problem is solved at the application layer.
poll timeout
https://man7.org/linux/man-pages/man2/poll.2.html
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
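The timeout argument is in milliseconds. A minimal sketch of the pattern the JDK relies on: wait for readability with poll(2), and treat a zero return as a timeout for the application to handle (for example, by closing the dead socket):

```c
#include <poll.h>
#include <unistd.h>

/* Returns bytes read, 0 on EOF, -1 on error, -2 on timeout. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)  return -1;   /* poll failed */
    if (n == 0) return -2;   /* timed out: caller may close the socket */
    return read(fd, buf, len);
}
```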
Root cause summary
Reference: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/#:~:text=typical%20applications%20sending%20data%20to%20the%20Internet
To ensure that a connection can detect failure reasonably quickly in its various states:

- Enable TCP keepalive and configure reasonable times. This keeps some data flowing even on otherwise idle connections.
- Set TCP_USER_TIMEOUT to TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT.
- Use read/write timeouts at the application layer, and have the application actively close the connection on timeout. (This is what this article's fix does.)
Why do we need both TCP keepalive and TCP_USER_TIMEOUT? Because when a network partition occurs, a connection in the retransmission state does not trigger keepalive probing. I recorded the reasoning in the diagram below:
Why bother digging this deep?
🤔❓ At this point some readers will ask: in the end, you only had to adjust the application-layer read timeout, so why research and verify all these other things?

Let's go back to the "original intention" in the figure below and check whether every hidden danger has actually been addressed:

Obviously, only the red line from the External Client to k8s worker node B has been resolved. The other red and green lines have not been investigated. Were those tcp half-open connections closed quickly by the tcp keepalive, tcp retransmit timeout, or application (Envoy) layer timeout mechanisms, or did they go undetected for a long time and close too late, or even leak?
Keepalive checks for idle connections
As upstream (server)
As can be seen below, the Istio gateway does not enable keepalive by default:
```
$ kubectl exec -it $ISTIO_GATEWAY_POD -- ss -oipn 'sport 15001 or sport 15001 or sport 8080 or sport 8443'
Netid State Recv-Q Send-Q Local Address:Port   Peer Address:Port
tcp   ESTAB 0      0      192.222.46.71:8080   10.111.10.101:51092 users:(("envoy",pid=45,fd=665))
      sack cubic wscale:11,11 rto:200 rtt:0.064/0.032 mss:8960 pmtu:9000 rcvmss:536 advmss:8960 cwnd:10 segs_in:2 send 11200000000bps lastsnd:31580 lastrcv:31580 lastack:31580 pacing_rate 22400000000bps delivered:1 rcv_space:62720 rcv_ssthresh:56576 minrtt:0.064
```
At this point, an EnvoyFilter can be used to add keepalive. References:
https://support.f5.com/csp/article/K00026550
https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/socket_option.proto
https://github.com/istio/istio/issues/28879
https://istio-operation-bible.aeraki.net/docs/common-problem/tcp-keepalive/
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingress-gateway-socket-options
  namespace: istio-system
spec:
  configPatches:
    - applyTo: LISTENER
      match:
        context: GATEWAY
        listener:
          name: 0.0.0.0_8080
          portNumber: 8080
      patch:
        operation: MERGE
        value:
          socket_options:
            - description: enable keep-alive
              int_value: 1
              level: 1 # SOL_SOCKET
              name: 9 # SO_KEEPALIVE
              state: STATE_PREBIND
            - description: idle time before first keep-alive probe is sent
              int_value: 7
              level: 6 # IPPROTO_TCP
              name: 4 # TCP_KEEPIDLE
              state: STATE_PREBIND
            - description: keep-alive interval
              int_value: 5
              level: 6 # IPPROTO_TCP
              name: 5 # TCP_KEEPINTVL
              state: STATE_PREBIND
            - description: keep-alive probes count
              int_value: 2
              level: 6 # IPPROTO_TCP
              name: 6 # TCP_KEEPCNT
              state: STATE_PREBIND
```
The istio-proxy sidecar can also be set up in a similar way.
As downstream (client)
Reference: https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-TCPSettings-TcpKeepalive
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bookinfo-redis
spec:
  host: myredissrv.prod.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 30ms
        tcpKeepalive:
          time: 60s
          interval: 20s
          probes: 4
```
TCP_USER_TIMEOUT
The story should end here, but it does not yet. Review the two earlier diagrams:
At this point the retransmit timer periodically retransmits at the TCP layer. There are two possibilities:

- After worker node B is powered off, Calico quickly discovers the problem, updates worker node A's routing table, and deletes the route to worker node B.
- The route is not updated in time.

Either way, with the defaults the retransmit timer takes about 15 minutes to close the connection and notify the application. How can this be sped up?

You can use the TCP_USER_TIMEOUT mentioned above to detect half-open TCP connections faster while in the retransmission state:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: sampleoptions
  namespace: istio-system
spec:
  configPatches:
    - applyTo: CLUSTER
      match:
        context: SIDECAR_OUTBOUND
        cluster:
          name: "outbound|12345||foo.ns.svc.cluster.local"
      patch:
        operation: MERGE
        value:
          upstream_bind_config:
            source_address:
              address: "0.0.0.0"
              port_value: 0
              protocol: TCP
            socket_options:
              - name: 18 # TCP_USER_TIMEOUT
                int_value: 10000 # milliseconds
                level: 6 # IPPROTO_TCP
```
The above speeds up the discovery of a dead upstream (server crash). For a dead downstream, a similar method may be used to configure the listener.
Envoy application-layer health checking
At this point the story should really be over, but it still is not.

Application-layer health checking can also speed up the discovery of TCP half-open connections or endpoint outlier problems in an upstream cluster. <mark>Note that the health checks here are not k8s liveness/readiness probes. They are pod-to-pod health checks, and therefore also test pod-to-pod connectivity.</mark>
Envoy has two kinds of health checking:

- Active health checking: Health checking
- Passive health checking: Outlier detection
Health checks and connection pools
See: Health checking interactions
If a host is marked unavailable by active or passive health checking, all connection pools to that host are closed. If the host recovers and re-enters load balancing, new connections will be created, which minimizes the problem of dead connections (due to ECMP routing or otherwise).
Health checks and endpoint discovery
See: On eventually consistent service discovery
Active health checking: Health checking
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/health_checking
Passive health checking: Outlier detection
https://istio.io/latest/docs/tasks/traffic-management/circuit-breaking/
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/circuit_breaking
```bash
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1 # max requests queued while waiting for a ready connection pool connection
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF
```
Health checks vs. EDS: which does Envoy trust?

When worker node B is powered off, the pods that were running on it eventually (after about 10 minutes by default) transition to Terminating, and k8s notifies istiod to remove the endpoints. So the question is: which is faster, EDS or health-check failure detection, and which data does Envoy use as the basis for load-balancing decisions?
This issue is discussed in this document:
On eventually consistent service discovery
Envoy was designed from the beginning with the idea that service discovery does not require full consistency. Instead, Envoy assumes that hosts come and go from the mesh in an eventually consistent way. Our recommended way of deploying a service-to-service Envoy mesh configuration uses eventually consistent service discovery along with active health checking (Envoy explicitly health checking upstream cluster members) to determine cluster health. This paradigm has a number of benefits:
- All health decisions are fully distributed. Thus, network partitions are gracefully handled (whether the application gracefully handles the partition is a different story).
- When health checking is configured for an upstream cluster, Envoy uses a 2x2 matrix to determine whether to route to a host:
| Discovery status | Health check OK | Health check failed |
| --- | --- | --- |
| Discovered | Route (participate in load balancing) | Don't route |
| Absent (missing) | <mark>Route (participate in load balancing)</mark> | Don't route / delete |
Host discovered / health check OK
Envoy will route to the target host.
<mark>Host absent / health check OK:</mark>
Envoy will route to the target host. This is very important since the design assumes that the discovery service can fail at any time. If a host continues to pass health check even after becoming absent from the discovery data, Envoy will still route. Although it would be impossible to add new hosts in this scenario, existing hosts will continue to operate normally. When the discovery service is operating normally again the data will eventually re-converge.
Host discovered / health check FAIL
Envoy will not route to the target host. Health check data is assumed to be more accurate than discovery data.
Host absent / health check FAIL
Envoy will not route and will delete the target host. This is the only state in which Envoy will purge host data.
One thing I have not fully figured out: does Absent mean that the EDS service cannot be reached, or that the EDS request succeeds but the original endpoint is missing from the result?
Review the earlier diagram: you can roughly see where health-check configuration could be used to speed up problem discovery.
Envoy application-layer timeouts
Connection-level timeouts at the Envoy application layer
- New connection timeout: connect_timeout. Istio defaults to 10s. This setting affects the time limit of outlier detection.
- Idle connection timeout: idle_timeout, default 1 hour.
- Maximum connection duration: max_connection_duration, default unlimited.
Request-level timeouts at the Envoy application layer
Request read timeout for downstream (client)
Envoy:
- Envoy application layer request reception timeout: request_timeout , infinite by default
- Header read timeout: request_headers_timeout , default infinite
- See more: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts
Timeout waiting for the response from upstream (server)

That is, the time from fully reading downstream's request to fully reading upstream's response. See:
https://istio.io/latest/docs/tasks/traffic-management/request-timeouts/
Talking about timeouts, don't forget the impact of retries
See: Istio retry
Thoughts
- If the powered-off worker node is restarted, will the former peers quickly receive a TCP RST and tear down the failed connections?
- If there is NAT/conntrack on the forwarding path of the powered-off worker node, after its session and port-mapping state is lost, will a TCP RST be returned, or will packets simply be dropped?
A short summary
This article is a bit scattered. To be honest, these configurations and mechanisms are interrelated and influence one another, and sorting them out completely is very difficult:

- The various timeouts of the TCP layer
- The various timeouts of syscalls
- The various timeouts of the application layer
- Health checking and outlier detection
- Retries

Hopefully someone (perhaps future me) can eventually untangle all of this. The goal of this article is to first record all the variables involved and narrow the scope, so that each mechanism can later be examined in detail. I hope it is useful to readers 🥂