Table of contents

- Start testing
- Root cause analysis
  - TCP half-open
  - keepalive
  - Retransmission timeout
  - Zero window timeout
  - Timeout settings at the application/socket layer
    - TCP_USER_TIMEOUT
    - SO_RCVTIMEO / SO_SNDTIMEO
    - poll timeout
  - Root cause summary
- Why bother digging this deep?
- Keepalive checks for idle connections
  - As upstream (server)
  - As downstream (client)
  - TCP_USER_TIMEOUT
- Envoy application-layer health checking
  - Health checks and connection pools
  - Health checks and endpoint discovery
  - Active health checking: Health checking
  - Passive health checking: Outlier detection
  - Health checks vs. EDS: which does Envoy trust?
- Envoy application-layer timeouts
  - Connection-level timeouts at the Envoy application layer
  - Request-level timeouts at the Envoy application layer
    - Request read timeout for downstream (client)
    - Timeout waiting for the response from upstream (server)
  - Talking about timeouts, don't forget the impact of retries
- Thoughts
- A short summary
- Main references

If the images are unclear, please view the original post.
Recently I needed to do some HA / Chaos Testing on an environment composed of a k8s cluster + VIP load balancing + Istio. As shown in the figure below, the goal is to observe the impact on external users (Client) when worker node B is shut down abnormally or network-partitioned:
- Request success rate impact
- Performance (TPS/Response Time) impact
Some notes on the figure above:

- The TCP/IP-layer load balancing of the external VIP (virtual IP) distributes traffic with the Modulo-N algorithm of ECMP (Equal-Cost Multi-Path), essentially hashing the 5-tuple (protocol, srcIP, srcPort, dstIP, dstPort). Note that this load balancing is stateless: when the number of targets changes, the result of the algorithm also changes. It is an unstable algorithm (see the toy sketch after this list).
- TCP traffic whose dstIP is the VIP, after arriving at a worker node, is DNAT-ed statefully by ipvs/conntrack rules, mapping the dstIP to the address of one of the Istio Gateway pods. Note that this load balancing is stateful: when the number of targets changes, the load balancing result of existing connections does not change. It is a stable algorithm.
- The Istio Gateway pod in turn load balances HTTP/TCP traffic. The two protocols differ:
  - For HTTP, multiple requests on one connection from the same downstream (traffic sender) may be load balanced to different upstreams (traffic targets).
  - For TCP, all packets of one connection from the same downstream are forwarded to the same upstream.
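Why is Modulo-N hashing unstable? Here is a toy C sketch (the hash function is hypothetical, purely for illustration): when the number of next-hops N changes, h % N changes for most flows, so established connections suddenly land on a different worker node.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy 5-tuple ECMP: hash the flow, pick a next-hop modulo N.
 * Not a real router hash; it only illustrates Modulo-N instability. */
struct flow { uint8_t proto; uint32_t sip, dip; uint16_t sport, dport; };

static uint32_t flow_hash(const struct flow *f) {
    /* hypothetical mix function */
    return f->proto * 2654435761u ^ f->sip ^ f->dip * 40503u
           ^ ((uint32_t)f->sport << 16 | f->dport);
}

int main(void) {
    struct flow f = {6, 0x0a6f0a65, 0xc0de462e, 51092, 8080};
    uint32_t h = flow_hash(&f);
    /* When a node dies (N: 3 -> 2), h % N changes for most flows,
     * so the packets of an established connection are rerouted. */
    printf("N=3 -> node %u, N=2 -> node %u\n",
           (unsigned)(h % 3), (unsigned)(h % 2));
    return 0;
}
```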
Start testing
The Chaos Testing method is to forcibly shut down worker node B. As shown in the figure above, it can be inferred that the red and green connections will be directly affected. The impact seen from the client:

- Request success rate decreased by only 0.01%
- TPS dropped by half, and only recovered half an hour later
- Avg Response Time (average response time) basically unchanged
Note that the resources of a single worker node are not the performance bottleneck of this test. So where is the problem?

The client is a JMeter program. Looking at the test report it generated, we found that after the worker node was shut down, the Avg Response Time did not change much, but the P99 and MAX response times became abnormally large. Clearly, Avg Response Time hides a lot: the test threads were probably blocked somewhere, which is what made TPS drop.

After much fiddling, the problem was solved by setting the JMeter timeout of the external client to 6s. After the worker node is shut down, TPS now recovers quickly.
Root cause analysis
The external client's problem is solved; we could call it a day and go to dinner. But as someone who loves to dig, I want to find out why, and even more, whether this recovery is really as fast as it looks, or whether hidden dangers remain.
Before we start, let's introduce a concept:

TCP half-open

According to RFC 793, a TCP connection is called half-open when the host at one end of the connection has crashed, or the socket was dropped without notifying the other end. If the half-open end is idle (i.e., sends no data or keepalive), the connection may remain half-open for an indefinite period of time.
After worker node B is shut down, from the external client's point of view (as shown in the figure above), its TCP connections to worker node B may be in one of two states:

- The client kernel needs to send packets to the peer, either because it has data to send (or retransmit), or because the keepalive idle time has elapsed. Worker node A receives such a packet. Since the packet belongs to no valid TCP connection there, the possible outcomes are:
  - Worker node A responds with a TCP RST. The client closes the connection on receiving it. The client thread blocked on the socket returns because the connection is closed, continues to run, and closes the socket.
  - Because the DNAT mapping table has no entry for this connection, the packet is dropped without a response. The client thread stays blocked on the socket. This is TCP half-open.
- The client connection has keepalive disabled (or the idle time has not yet reached the keepalive threshold), the kernel has no data to send (or retransmit), and the client thread blocks waiting to read from the socket. This, too, is TCP half-open.

So, for the client, discovering that a connection has failed most likely takes a while. In the worst case, with keepalive disabled, a TCP half-open connection may never be discovered.
keepalive
From [TCP/IP Illustrated Volume 1]:

A keepalive probe is an empty (or 1-byte) segment whose sequence number is one less than the largest ACK seen from the peer so far. Because receiving this segment has no side effects on the peer, the peer simply returns an ACK, which is used to determine whether the peer is alive. Neither the probe segment nor its ACK contains any new data.

If a probe segment is lost, TCP does not retransmit it. [RFC1122] states that, because of this, a single keepalive probe that receives no ACK should not be considered sufficient evidence that the peer is dead; multiple probes at intervals are required.

If the socket has SO_KEEPALIVE set, then keepalive is enabled.
For TCP connections with keepalive enabled, Linux has the following global default configuration:
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html
tcp_keepalive_time - INTEGER
How often TCP sends out keepalive messages when keepalive is enabled. Default: 2 hours.
tcp_keepalive_probes - INTEGER
How many keepalive probes TCP sends out, until it decides that the connection is broken. Default value: 9.
tcp_keepalive_intvl - INTEGER
How frequently the probes are send out. Multiplied by tcp_keepalive_probes it is time to kill not responding connection, after probes started. Default value: 75 sec ie connection will be aborted after ~11 minutes of retries.
At the same time, Linux also provides configuration items independently specified for each socket:
https://man7.org/linux/man-pages/man7/tcp.7.html
TCP_KEEPCNT (since Linux 2.4)
The maximum number of keepalive probes TCP should send
before dropping the connection. This option should not be
used in code intended to be portable.
TCP_KEEPIDLE (since Linux 2.4)
The time (in seconds) the connection needs to remain idle
before TCP starts sending keepalive probes, if the socket
option SO_KEEPALIVE has been set on this socket. This
option should not be used in code intended to be portable.
TCP_KEEPINTVL (since Linux 2.4)
The time (in seconds) between individual keepalive probes.
This option should not be used in code intended to be portable.
With the default configuration, the soonest keepalive can close a dead connection is:

TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT = 2*60*60 + 75*9 = 2 hours + 11 minutes
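For comparison, here is a minimal C sketch that tightens these options on a single socket (the values are illustrative, not recommendations):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustrative values: first probe after 5s idle, then every 3s,
 * give up after 2 failed probes => a dead peer is detected after
 * 5 + 3*2 = 11 seconds of idle time, instead of ~2h11m. */
int enable_fast_keepalive(int fd) {
    int on = 1, idle = 5, intvl = 3, cnt = 2;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt);
}
```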
Retransmission timeout
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
- tcp_retries2 - INTEGER
This value influences the timeout of an alive TCP connection, when RTO retransmissions remain unacknowledged. Given a value of N, a hypothetical TCP connection following exponential backoff with an initial RTO of TCP_RTO_MIN would retransmit N times before killing the connection at the (N+1)th RTO. The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout. RFC 1122 recommends at least 100 seconds for the timeout, which corresponds to a value of at least 8.
This configuration item controls how many exponentially backed-off retransmissions are sent, while in the retransmission state, before the kernel closes the connection. The default is 15, which works out to about 924.6 seconds, i.e., roughly 15 minutes.
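The 924.6s figure can be reproduced from the kernel constants TCP_RTO_MIN (200ms) and TCP_RTO_MAX (120s): the RTO doubles after each unanswered retransmission and is capped at TCP_RTO_MAX. A small C sketch of the calculation, assuming an ideal connection whose RTO starts at TCP_RTO_MIN:

```c
#include <stdio.h>

/* tcp_retries2 = N kills the connection at the (N+1)th RTO, so we
 * sum N+1 backoff intervals: start at TCP_RTO_MIN (0.2s), double
 * each time, cap at TCP_RTO_MAX (120s). */
int main(void) {
    int retries2 = 15;
    double rto = 0.2, total = 0.0; /* seconds */
    for (int i = 0; i < retries2 + 1; i++) {
        total += rto;
        rto = (rto * 2 > 120.0) ? 120.0 : rto * 2;
    }
    printf("hypothetical timeout: %.1f s\n", total); /* prints 924.6 */
    return 0;
}
```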
Zero window timeout
When the peer advertises a window size of zero, it means the peer's TCP receive buffer is full and it cannot accept more data. This may happen because resource constraints make the peer process data too slowly, eventually filling up its TCP receive buffer.

In theory, once the peer has processed the backlog in its receive buffer, it notifies the sender with an ACK that the window has reopened. But for various reasons, this ACK is sometimes lost.

Therefore, a sender with pending data must probe the window size periodically. The sender takes the first byte of unsent data from its buffer and sends it as a window probe. If the probes go unanswered beyond a certain number of attempts, the connection is closed automatically; on Linux the default is 15 attempts, controlled by the same configuration item, tcp_retries2. (If the peer keeps answering with a zero window, however, the connection can stay in this state indefinitely; that is one of the cases TCP_USER_TIMEOUT below addresses.) The probe retry mechanism is similar to TCP retransmission.

Reference: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ (section "Zero window")
Timeout settings at the application/socket layer
TCP_USER_TIMEOUT
man tcp
TCP_USER_TIMEOUT (since Linux 2.6.37)
This option takes an unsigned int as an argument. When
the value is greater than 0, it specifies the maximum
amount of time in milliseconds that transmitted data may
remain unacknowledged, or bufferred data may remain
untransmitted (due to zero window size) before TCP will
forcibly close the corresponding connection and return
ETIMEDOUT to the application. If the option value is
specified as 0, TCP will use the system default.
Increasing user timeouts allows a TCP connection to
survive extended periods without end-to-end connectivity.
Decreasing user timeouts allows applications to "fail
fast", if so desired. Otherwise, failure may take up to
20 minutes with the current system defaults in a normal
WAN environment.
This option can be set during any state of a TCP
connection, but is effective only during the synchronized
states of a connection (ESTABLISHED, FIN-WAIT-1, FIN-
WAIT-2, CLOSE-WAIT, CLOSING, and LAST-ACK). Moreover,
when used with the TCP keepalive (SO_KEEPALIVE) option,
TCP_USER_TIMEOUT will override keepalive to determine when
to close a connection due to keepalive failure.
The option has no effect on when TCP retransmits a packet,
nor when a keepalive probe is sent.
This option, like many others, will be inherited by the
socket returned by accept(2), if it was set on the
listening socket.
Further details on the user timeout feature can be found
in RFC 793 and RFC 5482 ("TCP User Timeout Option").
In other words, it specifies how long sent data may remain unacknowledged (no ACK received), or how long the peer's receive window may stay at 0, before the kernel closes the connection and returns an error to the application.

Note that TCP_USER_TIMEOUT affects how the TCP_KEEPCNT keepalive setting behaves:
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:

TCP_USER_TIMEOUT < TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT
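As a sketch, setting it from C looks like this. Note that, unlike the keepalive options above (which take seconds), TCP_USER_TIMEOUT takes milliseconds; 10000ms here is an illustrative value:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Abort the connection with ETIMEDOUT if transmitted data stays
 * unacknowledged, or buffered data stays untransmitted (zero
 * window), for longer than 10 seconds. */
int set_user_timeout(int fd) {
    unsigned int timeout_ms = 10000;
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof timeout_ms);
}
```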
SO_RCVTIMEO / SO_SNDTIMEO
https://man7.org/linux/man-pages/man7/socket.7.html
SO_RCVTIMEO and SO_SNDTIMEO
Specify the receiving or sending timeouts until reporting
an error. The argument is a struct timeval. If an input
or output function blocks for this period of time, and
data has been sent or received, the return value of that
function will be the amount of data transferred; if no
data has been transferred and the timeout has been
reached, then -1 is returned with errno set to EAGAIN or
EWOULDBLOCK, or EINPROGRESS (for connect(2)) just as if
the socket was specified to be nonblocking. If the
timeout is set to zero (the default), then the operation
will never timeout. Timeouts only have effect for system
calls that perform socket I/O (e.g., read(2), recvmsg(2),
send(2), sendmsg(2)); timeouts have no effect for
select(2), poll(2), epoll_wait(2), and so on.
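A minimal C sketch of the setsockopt form, reusing the 6s that solved the JMeter problem above:

```c
#include <sys/socket.h>
#include <sys/time.h>

/* After this call, a blocking read()/recv() on fd returns -1 with
 * errno == EAGAIN/EWOULDBLOCK if no data arrives within 6 seconds. */
int set_read_timeout(int fd) {
    struct timeval tv = { .tv_sec = 6, .tv_usec = 0 };
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}
```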
Note that in this case our client is JMeter, which is implemented in Java. It uses the socket.setSoTimeout method to set the timeout. See:

https://stackoverflow.com/questions/12820874/what-is-the-functionality-of-setsotimeout-and-how-it-works

From the source code I read, the Linux implementation uses the timeout parameter of select/poll described in the next section, not the socket options above:

https://github.com/openjdk/jdk/blob/4c54fa2274ab842dbecf72e201d5d5005eb38069/src/java.base/solaris/native/libnet/solaris_close.c#L96

JMeter closes the socket after catching the SocketTimeoutException and reconnects, so the dead-socket problem is solved at the application layer.
poll timeout
https://man7.org/linux/man-pages/man2/poll.2.html
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
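The timeout argument is in milliseconds. A minimal sketch of the pattern the JDK relies on: wait for readability with poll(2), and treat a zero return as a timeout for the application to handle (for example, by closing the dead socket):

```c
#include <poll.h>
#include <unistd.h>

/* Returns bytes read, 0 on EOF, -1 on error, -2 on timeout. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)  return -1;   /* poll failed */
    if (n == 0) return -2;   /* timed out: caller may close the socket */
    return read(fd, buf, len);
}
```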
Root cause summary
Reference: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/#:~:text=typical%20applications%20sending%20data%20to%20the%20Internet
To ensure that a connection can detect failure reasonably quickly in its various states:

- Enable TCP keepalive and configure reasonable times. This keeps some data flowing even on otherwise idle connections.
- Set TCP_USER_TIMEOUT to TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT.
- Use read/write timeouts at the application layer, and have the application actively close the connection on timeout. (This is what this article's fix does.)
Why do we need both TCP keepalive and TCP_USER_TIMEOUT? Because when a network partition occurs, a connection in the retransmission state does not trigger keepalive probing. I recorded the reasoning in the diagram below:
Why bother digging this deep?
🤔❓ At this point some readers will ask: in the end, you only had to adjust the application-layer read timeout, so why research and verify all these other things?

Let's go back to the "original intention" in the figure below and check whether every hidden danger has actually been addressed:

Obviously, only the red line from the External Client to k8s worker node B has been resolved. The other red and green lines have not been investigated. Were those tcp half-open connections closed quickly by the tcp keepalive, tcp retransmit timeout, or application (Envoy) layer timeout mechanisms, or did they go undetected for a long time and close too late, or even leak?
Keepalive checks for idle connections
As upstream (server)
As can be seen below, the Istio gateway does not enable keepalive by default:
```
$ kubectl exec -it $ISTIO_GATEWAY_POD -- ss -oipn 'sport 15001 or sport 15001 or sport 8080 or sport 8443'
Netid State Recv-Q Send-Q Local Address:Port   Peer Address:Port
tcp   ESTAB 0      0      192.222.46.71:8080   10.111.10.101:51092 users:(("envoy",pid=45,fd=665))
      sack cubic wscale:11,11 rto:200 rtt:0.064/0.032 mss:8960 pmtu:9000 rcvmss:536 advmss:8960 cwnd:10 segs_in:2 send 11200000000bps lastsnd:31580 lastrcv:31580 lastack:31580 pacing_rate 22400000000bps delivered:1 rcv_space:62720 rcv_ssthresh:56576 minrtt:0.064
```
At this point, an EnvoyFilter can be used to add keepalive. References:
https://support.f5.com/csp/article/K00026550
https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/socket_option.proto
https://github.com/istio/istio/issues/28879
https://istio-operation-bible.aeraki.net/docs/common-problem/tcp-keepalive/
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingress-gateway-socket-options
  namespace: istio-system
spec:
  configPatches:
    - applyTo: LISTENER
      match:
        context: GATEWAY
        listener:
          name: 0.0.0.0_8080
          portNumber: 8080
      patch:
        operation: MERGE
        value:
          socket_options:
            - description: enable keep-alive
              int_value: 1
              level: 1 # SOL_SOCKET
              name: 9 # SO_KEEPALIVE
              state: STATE_PREBIND
            - description: idle time before first keep-alive probe is sent
              int_value: 7
              level: 6 # IPPROTO_TCP
              name: 4 # TCP_KEEPIDLE
              state: STATE_PREBIND
            - description: keep-alive interval
              int_value: 5
              level: 6 # IPPROTO_TCP
              name: 5 # TCP_KEEPINTVL
              state: STATE_PREBIND
            - description: keep-alive probes count
              int_value: 2
              level: 6 # IPPROTO_TCP
              name: 6 # TCP_KEEPCNT
              state: STATE_PREBIND
```
The istio-proxy sidecar can also be set up in a similar way.
As downstream (client)
Reference: https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-TCPSettings-TcpKeepalive
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bookinfo-redis
spec:
  host: myredissrv.prod.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 30ms
        tcpKeepalive:
          time: 60s
          interval: 20s
          probes: 4
```
TCP_USER_TIMEOUT
The story should end here, but it does not yet. Review the two earlier diagrams:
At this point the retransmit timer periodically retransmits at the TCP layer. There are two possibilities:

- After worker node B is powered off, Calico quickly discovers the problem, updates worker node A's routing table, and deletes the route to worker node B.
- The route is not updated in time.

Either way, with the defaults the retransmit timer takes about 15 minutes to close the connection and notify the application. How can this be sped up?

You can use the TCP_USER_TIMEOUT mentioned above to detect half-open TCP connections faster while in the retransmission state:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: sampleoptions
  namespace: istio-system
spec:
  configPatches:
    - applyTo: CLUSTER
      match:
        context: SIDECAR_OUTBOUND
        cluster:
          name: "outbound|12345||foo.ns.svc.cluster.local"
      patch:
        operation: MERGE
        value:
          upstream_bind_config:
            source_address:
              address: "0.0.0.0"
              port_value: 0
              protocol: TCP
            socket_options:
              - name: 18 # TCP_USER_TIMEOUT
                int_value: 10000 # milliseconds
                level: 6 # IPPROTO_TCP
```
The above speeds up the discovery of a dead upstream (server crash). For a dead downstream, a similar method may be used to configure the listener.
Envoy application-layer health checking
At this point the story should really be over, but it still is not.

Application-layer health checking can also speed up the discovery of TCP half-open connections or endpoint outlier problems in an upstream cluster. <mark>Note that the health checks here are not k8s liveness/readiness probes. They are pod-to-pod health checks, and therefore also test pod-to-pod connectivity.</mark>
Envoy has two kinds of health checking:

- Active health checking: Health checking
- Passive health checking: Outlier detection
Health checks and connection pools
See: Health checking interactions
If a host is marked unavailable by active or passive health checking, all connection pools to that host are closed. If the host recovers and re-enters load balancing, new connections will be created, which minimizes the problem of dead connections (due to ECMP routing or otherwise).
Health checks and endpoint discovery
See: On eventually consistent service discovery
Active health checking: Health checking
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/health_checking
Passive health checking: Outlier detection
https://istio.io/latest/docs/tasks/traffic-management/circuit-breaking/
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/circuit_breaking
```bash
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1 # max requests queued while waiting for a ready connection pool connection
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF
```
Health checks vs. EDS: which does Envoy trust?

When worker node B is powered off, the pods that were running on it eventually (after about 10 minutes by default) transition to Terminating, and k8s notifies istiod to remove the endpoints. So the question is: which is faster, EDS or health-check failure detection, and which data does Envoy use as the basis for load-balancing decisions?
This issue is discussed in this document:
On eventually consistent service discovery
Envoy was designed from the beginning with the idea that service discovery does not require full consistency. Instead, Envoy assumes that hosts come and go from the mesh in an eventually consistent way. Our recommended way of deploying a service-to-service Envoy mesh configuration uses eventually consistent service discovery along with active health checking (Envoy explicitly health checking upstream cluster members) to determine cluster health. This paradigm has a number of benefits:
- All health decisions are fully distributed. Thus, network partitions are gracefully handled (whether the application gracefully handles the partition is a different story).
- When health checking is configured for an upstream cluster, Envoy uses a 2x2 matrix to determine whether to route to a host:
| Discovery status | Health check OK | Health check failed |
| --- | --- | --- |
| Discovered | Route (participate in load balancing) | Don't route |
| Absent (missing) | <mark>Route (participate in load balancing)</mark> | Don't route / delete |
Host discovered / health check OK
Envoy will route to the target host.
<mark>Host absent / health check OK:</mark>
Envoy will route to the target host. This is very important since the design assumes that the discovery service can fail at any time. If a host continues to pass health check even after becoming absent from the discovery data, Envoy will still route. Although it would be impossible to add new hosts in this scenario, existing hosts will continue to operate normally. When the discovery service is operating normally again the data will eventually re-converge.
Host discovered / health check FAIL
Envoy will not route to the target host. Health check data is assumed to be more accurate than discovery data.
Host absent / health check FAIL
Envoy will not route and will delete the target host. This is the only state in which Envoy will purge host data.
One thing I have not fully figured out: does Absent mean that the EDS service cannot be reached, or that the EDS request succeeds but the original endpoint is missing from the result?
Review the earlier diagram: you can roughly see where health-check configuration could be used to speed up problem discovery.
Envoy application-layer timeouts
Connection-level timeouts at the Envoy application layer
- New connection timeout: connect_timeout. Istio defaults to 10s. This setting affects the time limit of outlier detection.
- Idle connection timeout: idle_timeout, default 1 hour.
- Maximum connection duration: max_connection_duration, default unlimited.
Request-level timeouts at the Envoy application layer
Request read timeout for downstream (client)
Envoy:
- Envoy application layer request reception timeout: request_timeout , infinite by default
- Header read timeout: request_headers_timeout , default infinite
- See more: https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts
Timeout waiting for the response from upstream (server)

That is, the time from fully reading downstream's request to fully reading upstream's response. See:
https://istio.io/latest/docs/tasks/traffic-management/request-timeouts/
Talking about timeouts, don't forget the impact of retries
See: Istio retry
Thoughts
- If the powered-off worker node is restarted, will the former peers quickly receive a TCP RST and tear down the failed connections?
- If there is NAT/conntrack on the forwarding path of the powered-off worker node, after its session and port-mapping state is lost, will a TCP RST be returned, or will packets simply be dropped?
A short summary
This article is a bit scattered. To be honest, these configurations and mechanisms are interrelated and influence one another, and sorting them out completely is very difficult:

- The various timeouts of the TCP layer
- The various timeouts of syscalls
- The various timeouts of the application layer
- Health checking and outlier detection
- Retries

Hopefully someone (perhaps future me) can eventually untangle all of this. The goal of this article is to first record all the variables involved and narrow the scope, so that each mechanism can later be examined in detail. I hope it is useful to readers 🥂