TCP is a connection-oriented, unicast protocol. TCP has no notion of multicast or broadcast, because the IP addresses of the sender and the receiver are specified for every TCP segment. Before sending data, the two communicating parties (the sender and the receiver) need to establish a connection, and after the data has been sent they need to tear the connection down. This is TCP connection establishment and termination.
TCP connection establishment and termination
If you have read my earlier article about the network layer, you should know that a TCP connection is determined by four basic elements: the sender's IP address, the sender's port number, the receiver's IP address, and the receiver's port number. Each side's IP + port pair can be regarded as a socket, and a socket uniquely identifies an endpoint. The socket is like a door: once data steps out of that door, transmission is underway.
A TCP connection goes through three stages in total: establishment, data transfer, and termination, and our discussion below focuses on these stages. The figure below shows a very typical TCP connection establishment and closing process, with the data-transfer part omitted.
TCP connection establishment - three-way handshake
- The server process prepares to accept TCP connections from outside, normally by calling the socket, bind, and listen functions. This way of opening is considered a passive open. The server process then sits in the LISTEN state, waiting for client connection requests.
- The client performs an active open by calling connect, which sends a connection request to the server. In the request, the header's synchronization bit is SYN = 1, and an initial sequence number is chosen at the same time, written seq = x. A SYN segment is not allowed to carry data, yet it consumes one sequence number. At this point the client enters the SYN-SENT state.
- After the server receives the client's request, it must acknowledge the client's segment. In the acknowledgment segment, both the SYN and ACK bits are set to 1, the acknowledgment number is ack = x + 1, and the server chooses an initial sequence number of its own, seq = y. This segment likewise carries no data but still consumes one sequence number. The TCP server now enters the SYN-RECEIVED (synchronously received) state.
- After receiving the server's response, the client must confirm the connection in turn. In this confirmation, ACK is set to 1, the sequence number is seq = x + 1, and the acknowledgment number is ack = y + 1. TCP allows this segment to carry data or not; if it carries no data, the next data segment still uses seq = x + 1. At this point the client enters the ESTABLISHED (connected) state.
- After the server receives the client's confirmation, it also enters the ESTABLISHED state.
This is the typical three-way handshake: a TCP connection can be established through the above 3 segments. The purpose of the three-way handshake is not only to let both parties know that a connection is being established, but also to exchange special information, including the initial sequence numbers, through the option fields of the segments.
Generally, the party that sends the first SYN is considered to be performing an active open and is usually called the client. The receiver of that SYN, usually called the server, accepts the SYN and sends the following SYN + ACK, so its open is a passive open.
TCP requires three segments to establish a connection, and four segments to release a connection.
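These socket calls map directly onto the handshake, and the mapping is easy to observe from Python. Below is a minimal sketch (loopback address, ephemeral port, and the helper names are my own) in which one thread performs the passive open with socket/bind/listen/accept while the main thread performs the active open with connect; the kernel carries out the three-way handshake inside connect()/accept().

```python
import socket
import threading

def passive_open(ready, port_box):
    # socket() + bind() + listen(): the passive open; the socket is
    # now in the LISTEN state, waiting for a client SYN.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))          # port 0: let the OS pick one
    srv.listen(1)
    port_box.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()              # returns once the handshake is done
    conn.sendall(b"hello")
    conn.close()
    srv.close()

def demo():
    ready, port_box = threading.Event(), []
    t = threading.Thread(target=passive_open, args=(ready, port_box))
    t.start()
    ready.wait()
    # connect() is the active open: the kernel sends SYN, receives
    # SYN+ACK, answers with ACK, and only then does the call return.
    cli = socket.create_connection(("127.0.0.1", port_box[0]))
    chunks = []
    while True:
        chunk = cli.recv(16)
        if not chunk:                   # empty read: server closed (FIN)
            break
        chunks.append(chunk)
    cli.close()
    t.join()
    return b"".join(chunks)

print(demo())                           # b'hello'
```

Neither side ever builds a SYN by hand: the segment exchange described above is entirely the kernel's job.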
TCP disconnection - the four waves
After data transmission ends, the two communicating parties can release the connection. At that point both the client host and the server host are in the ESTABLISHED state, and the connection-release process begins.
The process of TCP disconnection is as follows:
- The client application actively closes the TCP connection: it stops sending data and sends a connection-release segment. The FIN bit in the segment header is set to 1, the segment carries no data, and its sequence number is seq = u. The client host then enters the FIN-WAIT-1 (termination waiting 1) state.
- After the server host receives the client's segment, it sends an acknowledgment with ACK = 1, its own sequence number seq = v, and ack = u + 1. The server host then enters the CLOSE-WAIT (close waiting) state.
- After the client host receives the server's acknowledgment, it enters the FIN-WAIT-2 (termination waiting 2) state, waiting for the server to send its own connection-release segment.
- When the server is done sending, it sends its own release segment with FIN = 1, ACK = 1, sequence number seq = w (w equals v if the server sent no further data in the meantime), and ack = u + 1 again. After sending it, the server host enters the LAST-ACK (last acknowledgment) state.
- After the client receives the server's release request, it must respond: it sends an acknowledgment segment with ACK = 1, sequence number seq = u + 1 (the client sends no new data after closing its direction), and ack = w + 1, and then enters the TIME-WAIT (time waiting) state. Note that the TCP connection is not yet released at this point: only after the wait of 2MSL has passed does the client enter the CLOSED state, where MSL stands for Maximum Segment Lifetime.
- After the server receives the client's final acknowledgment, it enters the CLOSED state. The server thus finishes the TCP connection earlier than the client does, and since the whole teardown exchanges four segments, the release process is also called the four waves.
Either party to a TCP connection can initiate the close, but usually it is the client. Some servers, however, such as Web servers, also initiate the close after responding to a request. The TCP protocol specifies that a close is initiated by sending a FIN segment.
So, in summary, establishing a TCP connection requires three segments and closing one requires four. TCP also supports a half-open state, although it is rare.
TCP half open
A TCP connection is in a half-open state when one side has closed or aborted the connection without the other side knowing. Think of two people chatting on WeChat: cxuan, you go offline without telling me, and I am still sending you gossip; at that moment the connection is considered half-open. This happens, for example, when one host crashes; if my computer crashes, how am I supposed to tell you? As long as the side still holding the half-open connection does not transmit data, it has no way to detect that the other host has gone offline.
Another cause of the half-open state is that one party cut the host's power instead of shutting down normally. In such cases, many half-open TCP connections can accumulate on the server.
TCP half closed
Since TCP supports half-open connections, we can imagine that it also supports a half-close operation; likewise, half-close is not common. A TCP half-close closes the data stream in one transmission direction only; two half-close operations together close the entire connection. Under normal circumstances, both parties end the connection by having their applications send FIN segments to each other, but with a half-close the application expresses a different idea: "I have finished sending my data and have sent a FIN segment to the other side, but I still want to receive data from it, until it sends a FIN segment to me." The following is a schematic diagram of a TCP half-close.
To explain this process: first, the client and server hosts transmit data for a while; then the client sends a FIN segment, actively requesting to close its sending direction. After receiving the FIN, the server responds with an ACK. Because the party that initiated the half-close, the client, still wants data from the server, the server continues sending. After some time, the server sends its own FIN segment; when the client receives it and responds with an ACK, the connection is disconnected.
In TCP's half-close, one direction of the connection is closed while the other direction keeps transmitting data until it, too, is closed. Only very few applications use this feature.
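The half-close maps onto the sockets API as shutdown(SHUT_WR): send a FIN but keep reading. Here is a minimal sketch (loopback, my own helper names) in which the client half-closes its sending direction and still receives the server's reply:

```python
import socket
import threading

def server(ready, port_box):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port_box.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    buf = b""
    while True:
        chunk = conn.recv(1024)
        if not chunk:              # empty read: the peer's FIN arrived
            break
        buf += chunk
    conn.sendall(buf.upper())      # our send direction is still open
    conn.close()                   # now send our own FIN
    srv.close()

def demo_half_close():
    ready, port_box = threading.Event(), []
    t = threading.Thread(target=server, args=(ready, port_box))
    t.start()
    ready.wait()
    cli = socket.create_connection(("127.0.0.1", port_box[0]))
    cli.sendall(b"ping")
    cli.shutdown(socket.SHUT_WR)   # half-close: FIN out, receiving still allowed
    reply = b""
    while True:
        chunk = cli.recv(1024)
        if not chunk:
            break
        reply += chunk
    cli.close()
    t.join()
    return reply

print(demo_half_close())           # b'PING'
```

Note that close() gives up both directions at once, while shutdown(SHUT_WR) is exactly the "I am done sending, keep talking to me" behavior described above.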
Simultaneous opening and simultaneous closing
There is an even more unconventional scenario: two applications actively open a connection to each other at the same time. This may seem unlikely, but it is possible under certain arrangements. Let's describe the process.
Each party sends a SYN before it receives the other party's SYN. This scenario requires that both parties know the other's IP address + port number in advance.
The following is an example of simultaneous opening
As the figure above shows, both parties send a SYN before receiving the other's segment, and each replies with an ACK after the other's SYN arrives.
A simultaneous open exchanges four segments, one more than the ordinary three-way handshake. Since there is no client and server in a simultaneous open, I simply call the two sides the communicating parties.
Like simultaneous open, simultaneous close means that both parties issue an active close at the same time and each sends a FIN segment. The following figure shows a simultaneous close.
A simultaneous close exchanges the same number of segments as a normal close; the difference is that the segments are not sent strictly in sequence as in the four waves, but cross each other.
About the initial sequence number
Perhaps the descriptions in the pictures or text above were informal; in professional terms, the initial sequence number is called the ISN (Initial Sequence Number), so the seq = v we wrote above really denotes an ISN.
Before sending its SYN, each communicating party chooses an initial sequence number. The ISN is effectively randomly generated, and every TCP connection has a different one. The RFC describes the ISN as a 32-bit counter that increments by 1 every 4 us (microseconds). Because each TCP connection is a distinct instance, the purpose of this arrangement is to prevent sequence numbers of different connections from overlapping.
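As a back-of-the-envelope check (my own arithmetic, not a quote from the RFC), a 32-bit counter ticking once every 4 microseconds takes under five hours to wrap around, which is far longer than any segment survives in the network:

```python
def isn_wrap_hours(tick_us=4):
    # A 32-bit counter has 2**32 values; one tick every `tick_us`
    # microseconds, so a full wrap takes 2**32 * tick_us microseconds.
    return (2 ** 32) * tick_us / 1_000_000 / 3600

print(round(isn_wrap_hours(), 2))   # 4.77 (hours)
```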
When a TCP connection is established, only segments carrying the correct TCP four-tuple and a correct sequence number are accepted by the other party. This also shows that TCP connections can be forged: as long as an attacker forges the same four-tuple and initial sequence number, it can forge a TCP connection and interrupt the normal one. One way to resist this kind of attack is to make the initial sequence number unpredictable; another is to encrypt the sequence numbers.
TCP state transition
We covered the three-way handshake and the four waves above and mentioned some of the state transitions of a TCP connection. Next I will start from the beginning and sort out these transitions.
First, at the beginning both the server and the client are in the CLOSED state. What happens next depends on whether the open is active or passive. If it is an active open, the client sends a SYN segment to the server and enters the SYN-SENT state, which means it has sent a connection request and is waiting for a matching one. A server performing a passive open enters the LISTEN state, in which it watches for SYN segments. If the client calls close, or does nothing for a period of time, it reverts to the CLOSED state. The transition diagram for this step is as follows.
There is a question here: why would a host in the LISTEN state send a SYN and change to the SYN_SENT state?
According to an answer I saw on Zhihu, this situation may appear with FTP. The LISTEN -> SYN_SENT transition exists because the connection may be triggered by the server-side application sending data to the client, with the client passively accepting the connection and the file transfer starting after the connection is established. In other words, a server in the LISTEN state can also send SYN packets, but this situation is very rare.
A host in the SYN_SENT state that receives a SYN will send SYN + ACK and move to the SYN_RCVD state (this is part of the simultaneous-open path). Likewise, a server in the LISTEN state that receives a SYN sends SYN + ACK and moves to the SYN_RCVD state. If a host in the SYN_RCVD state receives an RST, it falls back to the LISTEN state.
It would be better to look at these two pictures together.
Here we need to explain what RST is.
One situation is that a host receives a TCP segment whose IP and port number do not match any of its sockets. Suppose the client host sends a request and the server host, after checking the IP and port number, finds that it is not meant for any endpoint on this server; the server then sends an RST segment to the client.
Therefore, when the server sends this special RST segment to the client, it is telling the client: there is no matching socket connection here, please do not keep sending.
RST (reset the connection) is used to reset a connection that has gone wrong for some reason, and also to reject illegal data and requests. If the RST bit is set in a received segment, some error has usually occurred.
Failure to match the correct IP and port, as above, is one condition that produces an RST. An RST may also appear for other reasons, such as a request timing out or the cancellation of an existing connection.
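An easy way to observe an RST in practice is to connect to a port where nothing is listening: the target's kernel answers the SYN with RST, which Python surfaces as ConnectionRefusedError. A small sketch (the probe helper and its result labels are my own, for illustration):

```python
import socket

def probe(host, port, timeout=1.0):
    # Attempt a full connect; classify the outcome.
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return "open"
    except ConnectionRefusedError:
        return "refused (RST)"            # the SYN was answered with RST
    except OSError:
        return "filtered or unreachable"  # e.g. timeout: no answer at all
```

The "refused" and "filtered" outcomes look very different on the wire: an RST comes back immediately, while a filtered port simply stays silent until the timeout.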
A server in SYN_RCVD that receives the final ACK, and a client in SYN_SENT that receives SYN + ACK and sends its own ACK, both complete the connection: client and server are now connected.
One more thing to note: I have not explicitly drawn the simultaneous-open case. In a simultaneous open, the states change like this. Both hosts send a SYN, and each host that has sent its SYN is in the SYN-SENT state; after each receives the other's SYN and sends SYN + ACK, both hosts are in the SYN-RECEIVED (SYN_RCVD) state; then, once the SYN + ACK segments arrive, both sides enter the ESTABLISHED state and start transmitting data.
Well, up to this point I have described the state transitions of TCP connection establishment. Now you can brew a pot of tea, drink some water, and wait for the data transfer.
All right, water drunk; the data transfer is now complete. Once transfer is done, the TCP connection can be disconnected.
Let's first turn the clock back to the moment when the server is in the SYN_RCVD state: it has just received a SYN and sent a SYN + ACK. If at this moment the server application is closed and sends a FIN segment, the server goes from SYN_RCVD to the FIN_WAIT_1 state.
Now set the clock back to the present: the client and server have finished transmitting data. The client sends a FIN segment to disconnect and moves to the FIN_WAIT_1 state. The server, after receiving the FIN and replying with an ACK, goes from ESTABLISHED to the CLOSE_WAIT state.
A server in the CLOSE_WAIT state sends its own FIN segment and then puts itself in the LAST_ACK state. A client in FIN_WAIT_1 that receives the ACK segment changes to the FIN_WAIT_2 state.
Here we first need to explain the CLOSING state, because the FIN_WAIT_1 -> CLOSING transition is special.
CLOSING is a special state that should be rare in practice; it is a relatively uncommon exceptional state. Normally, after you send a FIN segment, you would expect to receive the other side's ACK first and its FIN afterwards. The CLOSING state, however, means that after sending your FIN you received the other side's FIN without first receiving its ACK.
Under what circumstances can this happen? If you think about it, the conclusion is not hard to reach: both parties close the connection at the same time, so both FIN segments are sent simultaneously and cross in flight. That is the CLOSING state: both parties are closing the connection.
A client in the FIN_WAIT_2 state that receives the server's FIN (many diagrams draw it as FIN + ACK, though descriptions usually just say FIN) and sends an ACK in response moves to the TIME_WAIT state. A server in CLOSE_WAIT moves to LAST_ACK at the moment it sends its FIN. So the direct FIN_WAIT_1 -> TIME_WAIT transition is the state the client reaches when FIN and ACK arrive together and it sends its final ACK.
A host in the CLOSING state that then receives the outstanding ACK also moves to TIME_WAIT. As you can see, TIME_WAIT is the last state of the actively closing side before it fully closes, and LAST_ACK is the last state of the passively closing side before it closes.
Several special states appeared above; let's now explain them one by one.
TIME_WAIT status
After the communicating parties finish with a TCP connection, the party that actively closes it enters the TIME_WAIT state, also called the 2MSL wait state. In this state, TCP waits for twice the Maximum Segment Lifetime (MSL).
MSL needs to be explained here
MSL is the maximum expected lifetime of a TCP segment, that is, the longest time it can exist in the network. This time is bounded because TCP segments travel in IP datagrams, and IP datagrams carry a TTL (hop limit) field that bounds their lifetime. A commonly used MSL is 2 minutes, but the value can be changed, and different operating systems use different values.
Based on this, let's explore the status of TIME_WAIT.
When TCP performs an active close and sends the final ACK, it must linger in TIME_WAIT for 2 * MSL so that it can resend that final ACK if it is lost. The resend happens not because TCP spontaneously retransmits the ACK, but because the other side retransmits its FIN: the peer keeps resending FIN because it needs the ACK in order to close. If the actively closing side disappeared immediately, a retransmitted FIN would be answered with an RST, which the peer would treat as an error.
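TIME_WAIT is also the reason a quickly restarted server can fail to bind with "Address already in use": the old connection's 2MSL wait still holds the local address. The conventional remedy is to set SO_REUSEADDR before bind, as in this sketch (loopback and helper name are my own):

```python
import socket

def make_listener(host="127.0.0.1", port=0):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow bind() to succeed even if a previous connection on this
    # address is still sitting out its 2MSL wait in TIME_WAIT.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))   # port 0: ask the OS for an ephemeral port
    s.listen(5)
    return s
```

The option only relaxes the bind-time check; it does not shorten the TIME_WAIT state itself.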
TCP timeout and retransmission
There is no such thing as error-free communication: no matter how good the external conditions, errors always occur. So errors may also arise during normal TCP communication, caused by packet loss, packet duplication, or even packets arriving out of order.
During TCP communication, the receiving end returns a series of acknowledgments, from which the sender can determine whether an error has occurred. Once packet loss occurs, TCP starts its retransmission machinery to resend the unacknowledged data.
TCP retransmits in two ways: one based on time, the other based on acknowledgment information. Generally, acknowledgment information is more efficient than timers. From this you can see that TCP acknowledgment and retransmission both rest on the question of whether a given packet has been acknowledged.
TCP sets a timer when sending data. If the acknowledgment does not arrive within the time specified by the timer, a timeout-based (timer-based) retransmission is triggered. The timer's timeout value is usually called the Retransmission Timeout (RTO).
But there is another method that does not introduce this delay: fast retransmit.
Each time TCP retransmits a packet, the retransmission interval doubles. This "doubled interval" is called binary exponential backoff. When the accumulated retransmission time reaches about 15.5 minutes, the client gives up and displays:
Connection closed by foreign host.
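The doubling schedule can be sketched numerically. Assuming a hypothetical initial RTO of 0.2 s, a cap of 120 s (the Linux TCP_RTO_MAX value), and 15 retries (tcp_retries2's default), the accumulated wait lands in the quoted low-teens-of-minutes range; the exact figure on a real connection depends on the measured RTO:

```python
def backoff_schedule(rto=0.2, max_rto=120.0, retries=15):
    # Binary exponential backoff: each retransmission waits twice as
    # long as the previous one, capped at max_rto.
    waits, cur = [], rto
    for _ in range(retries):
        waits.append(cur)
        cur = min(cur * 2, max_rto)
    return waits

total = sum(backoff_schedule())
print(round(total / 60, 1))   # 13.4 (minutes before giving up)
```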
TCP has two thresholds for deciding how to retransmit a segment, defined in RFC 1122. The first threshold, R1, indicates the number of retransmission attempts; the second, R2, indicates when TCP should give up on the connection. R1 should be set to at least 3 retransmissions, and R2 to at least 100 seconds.
Note that for the connection-establishment segment (SYN), R2 should be set to at least 3 minutes; different systems configure R1 and R2 in different ways.
On Linux, R1 and R2 can be set by applications or via the variables net.ipv4.tcp_retries1 and net.ipv4.tcp_retries2; both values are retransmission counts, not times. The default for tcp_retries2 is 15, which corresponds to roughly 13-30 minutes; that is only an approximation, since the actual duration depends on the RTO, the retransmission timeout. The default for tcp_retries1 is 3.
For SYN segments, the two values net.ipv4.tcp_syn_retries and net.ipv4.tcp_synack_retries bound the number of SYN retransmissions; the default of 5 corresponds to roughly 180 seconds.
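On Linux these knobs are exposed as files under /proc/sys, so they can be read without any special tooling. A small sketch (my own helper; it returns a default where the file does not exist, for example on non-Linux systems):

```python
def read_sysctl(name, default=None):
    # net.ipv4.tcp_retries2 lives at /proc/sys/net/ipv4/tcp_retries2
    path = "/proc/sys/" + name.replace(".", "/")
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return default

print(read_sysctl("net.ipv4.tcp_retries2", default="unavailable"))
```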
Windows also has R1 and R2 variables; their values are defined under the following registry keys:
HKLM\System\CurrentControlSet\Services\Tcpip\Parameters
HKLM\System\CurrentControlSet\Services\Tcpip6\Parameters
A particularly important variable here is TcpMaxDataRetransmissions, which corresponds to tcp_retries2 on Linux. Its default value is 5, and it means the number of times TCP retransmits an unacknowledged data segment on an existing connection.
Fast retransmission
We mentioned fast retransmit above. The fast retransmit mechanism is triggered by feedback from the receiving end rather than by the retransmission timer, so compared with timeout retransmission it can repair packet loss much more promptly. When out-of-order segments (say, in the order 2-4-3) arrive during a TCP connection, TCP must immediately generate an acknowledgment, known as a duplicate ACK.
When an out-of-order segment arrives, the duplicate ACK must be returned at once; delaying it is not allowed. The purpose of this behavior is to tell the sender that a segment arrived out of order, and which sequence number the receiver actually expects.
There is another situation that produces duplicate ACKs: segments after the current one reach the receiver, from which it follows that the current segment was lost or delayed. The two cases look the same to the receiver (the expected segment has not arrived), and it cannot tell loss from delay. The TCP sender therefore waits for a certain number of duplicate ACKs before concluding that data was lost and triggering fast retransmit; that number is generally 3. If this description is not clear, let's take an example.
As the figure above shows, segment 1 is received successfully and acknowledged with ACK 2, so the receiver now expects sequence number 2. When segment 2 is lost, segment 3 arrives out of order; it does not match the receiver's expectation, so the receiver repeatedly sends duplicate ACK 2.
In this way, after receiving three identical ACKs in a row, and before the timeout retransmission timer expires, the sender knows which segment was lost and retransmits it. There is no need to wait for the retransmission timer to expire, which greatly improves efficiency.
SACK
Under the standard TCP acknowledgment mechanism, if the sender sends data with sequence numbers 0-10000 but the receiver only gets 0-1000 and 3000-10000 (the data between 1000 and 3000 never arrives), the sender will retransmit everything between 1000 and 10000. This is unnecessary, because the data after 3000 has already been received; but the sender has no way to know that.
How to avoid or solve this kind of problem?
To optimize this situation, we need to give the sender more information. The TCP segment has a SACK option field, a selective acknowledgment mechanism. In everyday terms, it lets the receiver tell the sender: "my cumulative acknowledgment only covers up to 1000, but I have also received the segments between 3000 and 10000; please send me only the data between 1000 and 3000."
Whether this selective acknowledgment mechanism is enabled depends on another option, the SACK-Permitted option. Each party adds the SACK-Permitted option to its SYN or SYN + ACK segment to tell the peer whether it supports SACK; if both sides support it, the SACK option can then be used in later segments.
Note here: it is the SACK-Permitted option that may appear only in SYN segments; the SACK option itself appears in ordinary segments afterwards.
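The sender's bookkeeping under SACK can be sketched like this: given the cumulative ACK point and the SACKed blocks, compute only the holes that need resending (the function and its tuple representation are my own illustration):

```python
def missing_ranges(cum_ack, sack_blocks, highest_sent):
    # Each SACK block is a (start, end) pair of bytes received above
    # the cumulative ACK point; anything not covered is a hole.
    gaps, cursor = [], cum_ack
    for start, end in sorted(sack_blocks):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < highest_sent:
        gaps.append((cursor, highest_sent))
    return gaps

# The example above: 0-1000 acked, 3000-10000 SACKed -> resend only 1000-3000.
print(missing_ranges(1000, [(3000, 10000)], 10000))   # [(1000, 3000)]
```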
Spurious timeout and retransmission
In some cases, segments are retransmitted even though none were lost. Such unnecessary retransmission is called spurious retransmission, and its cause may be a spurious timeout, meaning a timeout that was declared too early. Many factors can produce spurious timeouts, such as segments arriving out of order, segment duplication, or ACK loss.
There are many ways to detect and deal with spurious timeouts, collectively referred to as detection algorithms and response algorithms. A detection algorithm determines whether a timeout or timer-based retransmission was spurious; once one is found, a response algorithm is run to undo or reduce the effect of the timeout. Several such algorithms are listed below; this article will not go into their implementation details for now.
- Duplicate SACK extension - DSACK
- Eifel detection algorithm
- Forward RTO Recovery-F-RTO
- Eifel response algorithm
Out of order and duplicate packets
We discussed above how TCP handles packet loss. Now let's discuss out-of-order and duplicated packets.
Packet out of order
Out-of-order arrival of packets is extremely common on the Internet. Since the IP layer cannot guarantee ordering, and each datagram may take whatever path is currently fastest, it is entirely possible that three packets sent as A -> B -> C arrive at the receiver as C -> A -> B, or B -> C -> A, and so on. This is packet reordering.
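What the receiver does with such an arrival order can be sketched as a tiny reassembly buffer. This is a toy model, one "packet" per sequence number, where real TCP tracks byte ranges:

```python
def deliver_in_order(arrivals, first_seq=0):
    # Buffer out-of-order packets; hand data up only once the next
    # expected sequence number is present.
    expected, buffered, delivered = first_seq, {}, []
    for seq, data in arrivals:
        buffered[seq] = data
        while expected in buffered:
            delivered.append(buffered.pop(expected))
            expected += 1
    return delivered

# Sent as A -> B -> C, arrives as C -> A -> B: delivered in order anyway.
print(deliver_in_order([(2, "C"), (0, "A"), (1, "B")]))   # ['A', 'B', 'C']
```

Until the hole is filled, the buffered data cannot be delivered to the application; this gap is exactly what the following paragraphs are about.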
In packet transmission there are two directions involved: the forward path, carrying data segments, and the reverse path, carrying ACKs.
If reordering occurs on the forward path, TCP cannot reliably tell loss from reordering: either way the receiver sees out-of-order data with a gap in it. If the gap is small, the impact is minor; if it is large, it may cause spurious retransmissions.
If reordering occurs on the reverse path, the sender may first receive an ACK that moves the TCP window far forward and then receive older, duplicate ACKs that should be discarded; this can cause unnecessary traffic bursts that waste available network bandwidth.
Going back to the fast retransmit we discussed above: since fast retransmit infers loss from duplicate ACKs, it does not have to wait for the retransmission timer to expire. And since a TCP receiver immediately ACKs every out-of-order segment, any reordering in the network can generate duplicate ACKs. If fast retransmit were triggered by the very first duplicate ACK, a surge of reordering would produce large numbers of unnecessary retransmissions. That is why fast retransmit waits until the duplicate threshold (dupthresh) is reached. Severe reordering is uncommon on the Internet, so dupthresh can be kept small; the usual value of 3 handles most cases.
Packet duplication
Packet duplication is also rare on the Internet; it refers to a packet being delivered more than once during transmission. Combined with retransmission, duplication can confuse TCP, because duplicates make the receiver generate a series of duplicate ACKs. This situation can be sorted out with SACK negotiation.
TCP data flow and window management
From my earlier article, "Understand TCP and UDP in 40 pictures", you know that flow control can be achieved with a sliding window: the client and server exchange flow-related information, mainly the segment sequence number, the ACK number, and the window size.
The two arrows in the figure indicate the directions of data flow, which are also the transmission directions of the TCP segments. Each TCP segment carries sequence number, ACK, and window information, and possibly user data. The window size in a TCP segment indicates, in bytes, how much buffer space the receiver still has free. This window size is dynamic, because segments are constantly arriving and being consumed; the dynamically adjusted window is called the sliding window. Let's take a closer look at it.
Sliding window
Either end of a TCP connection can send data, but sending is not unrestricted. In fact, both ends maintain a send window structure and a receive window structure, and these two windows bound how much data may be in flight.
Sender window
The figure below is an example of a sender window.
This picture involves four sliding-window concepts:
- Segments that have been sent and acknowledged: the receiver has replied with an ACK for them. These are the segments marked green in the figure.
- Segments that have been sent but not yet acknowledged: the light blue segments, which have been sent but for which no ACK has arrived yet.
- Segments waiting to be sent: the dark blue area. Together with the sent-but-unacknowledged segments, these make up the send window.
- Segments that can be sent only after the window slides: once the segments in the range [4, 9] in the figure have been sent and acknowledged, the whole window moves to the right, and the orange segments become sendable.
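The four categories can be sketched as a classification over sequence numbers. This is a minimal illustration assuming the usual sender variables (here named `snd_una` for the oldest unacknowledged byte and `snd_nxt` for the next byte to send); the function itself is made up for the example:

```python
# Sketch of the four sender-window categories from the figure above.

def classify(seq, snd_una, snd_nxt, window):
    """Classify a byte sequence number against the sender window."""
    if seq < snd_una:
        return "sent and acknowledged"           # green region
    if seq < snd_nxt:
        return "sent but not yet acknowledged"   # light blue region
    if seq < snd_una + window:
        return "can be sent, not yet sent"       # dark blue region
    return "cannot be sent until window slides"  # outside the window

# With snd_una = 4, snd_nxt = 7 and a 6-byte window covering [4, 9]:
print(classify(2, 4, 7, 6))   # sent and acknowledged
print(classify(5, 4, 7, 6))   # sent but not yet acknowledged
print(classify(8, 4, 7, 6))   # can be sent, not yet sent
print(classify(12, 4, 7, 6))  # cannot be sent until window slides
```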
The sliding window also has boundaries: the Left edge and the Right edge. The Left edge is the left boundary of the window, and the Right edge is the right boundary.
When the Left edge moves to the right while the Right edge stays fixed, the window is said to close. This happens as sent data is gradually acknowledged, shrinking the window.
When the Right edge moves to the right, the window opens, allowing more data to be sent. This happens when the receiving process reads data out of the buffer, freeing up more receive space.
It may also happen that the Right edge moves to the left, shrinking the window of sendable segments. This situation is associated with the Silly Window Syndrome, which is something we don't want to see. When Silly Window Syndrome occurs, the data segments exchanged by the two parties become small while the fixed protocol overhead stays unchanged; the ratio of useful data to header information in each segment becomes small, making transmission very inefficient.
It is as if you used to be able to build a complicated page in a day, but now you spend a whole day fixing a typo in the title: a lot of overhead for very little useful work.
Every TCP segment contains an ACK number and a window advertisement, so whenever a response arrives, the recipient of that segment adjusts its window structure according to these two fields.
The Left edge of the TCP sliding window can never move to the left, because a sent-and-acknowledged segment can never be un-acknowledged; there is no medicine for regret in this world. This edge is controlled by the ACK number sent by the other end. When an ACK moves the window to the right but the window size stays the same, the window is said to slide forward.
If ACKs keep arriving but the advertised window shrinks with them, the Left edge approaches the Right edge. When the Left edge and the Right edge coincide, the sender may not transmit any more data; this situation is called a zero window. In that case the TCP sender initiates window probes and waits for an appropriate time to resume sending data.
Receiver window
The receiver also maintains a window structure, which is much simpler than the sender's. This window records the data that has been received and acknowledged, as well as the maximum sequence number it can accept. With it, the receiver can avoid storing duplicate segments and avoid acknowledging segments that should not be received. The following is the structure of the TCP receiver window.
Like the sender's window, the receiver's window structure also maintains a Left edge and a Right edge. Segments to the left of the Left edge are called received and acknowledged; segments to the right of the Right edge are called unreceivable.
For the receiving end, arriving data with a sequence number below the Left edge is considered duplicate and is discarded; data beyond the Right edge is considered outside the processing range. Only when the arriving segment starts exactly at the Left edge is the data accepted immediately, allowing the window to slide forward.
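The acceptance rule just described can be sketched as follows. This is an illustrative simplification (real TCP also buffers in-window out-of-order data and merges it later); the function and its return values are made up for the example:

```python
# Sketch of the receiver's acceptance rule: data below the left edge is a
# duplicate, data beyond the right edge is out of range, and only a segment
# starting exactly at the left edge advances the window.

def receive(left_edge, right_edge, seg_start, seg_len):
    """Return (action, new_left_edge) for an arriving segment."""
    if seg_start < left_edge:
        return "discard: duplicate", left_edge
    if seg_start + seg_len - 1 > right_edge:
        return "discard: beyond window", left_edge
    if seg_start == left_edge:
        return "accept", left_edge + seg_len  # window slides forward
    return "buffer: out of order", left_edge  # in-window, not at left edge

print(receive(100, 200, 90, 5))    # duplicate data, discarded
print(receive(100, 200, 100, 50))  # accepted, left edge moves to 150
print(receive(100, 200, 160, 10))  # in-window but out of order
```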
The receiver's window can also reach zero. If the application process consumes data very slowly while the TCP sender pushes large amounts of data, the receive buffer fills up and the receiver tells the sender to stop sending. If the application then drains the buffer at a very slow rate (say, one byte at a time), the receiver tells the sender it can accept only one byte at a time. This process creeps along, producing high network overhead and very low efficiency.
We mentioned above the case where Left edge = Right edge, which is called a zero window. Let's take a closer look at it.
Zero window
TCP implements flow control through the window advertisement from the receiver, which tells the sender how much data the receiver can accept. When the receiver's window becomes 0, it effectively stops the sender from sending more data. When the receiver regains buffer space, it transmits a window update to the sender, announcing that it can receive data again. A window update is usually a pure ACK, carrying no data. But a pure ACK is not guaranteed to reach the sender, so a mechanism is needed to handle this kind of loss.
If that pure ACK is lost, both parties could end up waiting forever: the sender wonders why the receiver never lets it send data, and the receiver wonders why the sender still hasn't sent anything. To prevent this deadlock, the sender uses a persist timer to periodically query the receiver and check whether its window has grown. When the persist timer fires, it triggers a window probe, forcing the receiver to return an ACK carrying its updated window.
A window probe carries one byte of data and uses TCP's normal loss-retransmission mechanism; it is sent whenever the persist timer expires. Whether that one byte can be accepted by the receiver depends on the current size of its buffer.
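The persist timer typically backs off between probes rather than probing at a fixed interval. A minimal sketch of such a schedule, assuming illustrative interval values (the 1.5 s start and 60 s ceiling are common textbook figures, not mandated constants):

```python
# Sketch of a persist-timer backoff schedule: while the peer advertises a
# zero window, the sender sends a 1-byte window probe at growing intervals
# instead of waiting forever for a (possibly lost) window update.

def persist_probe_schedule(max_probes=5, initial=1.5, ceiling=60.0):
    """Return the backoff schedule, in seconds, for the first probes."""
    delays = []
    delay = initial
    for _ in range(max_probes):
        delays.append(delay)
        delay = min(delay * 2, ceiling)  # exponential backoff with a ceiling
    return delays

# Probe after 1.5 s, then 3 s, 6 s, 12 s, 24 s ... until the window reopens.
print(persist_probe_schedule())  # -> [1.5, 3.0, 6.0, 12.0, 24.0]
```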
Congestion control
With TCP's window control, two hosts no longer exchange data one segment at a time; a sender can transmit many packets continuously. But large bursts of packets bring problems of their own, such as network load and network congestion. To prevent these problems, TCP uses a congestion control mechanism, which throttles the sender's transmission when the network is congested.
There are two main methods for congestion control
End-to-end congestion control: the network layer provides no explicit support for transport-layer congestion control, so even when the network is congested, the end systems must infer it by observing network behavior. TCP uses end-to-end congestion control: the IP layer gives the end systems no feedback about congestion, so TCP infers it itself. A timeout or three redundant ACKs is taken as a sign of congestion, and TCP reduces its window size to back off.
Network-assisted congestion control: here, routers provide the sender with explicit feedback about the congestion state of the network. This feedback can be as little as a single bit indicating congestion on a link.
The following figure describes these two congestion control methods
TCP congestion control
If you have read this far, I will tentatively assume you understand the first basis of TCP's reliability, namely the use of sequence numbers and acknowledgment numbers. The other foundation of TCP's reliability is congestion control.
TCP's approach is to have each sender limit the rate at which it sends segments according to the network congestion it perceives. If a TCP sender perceives no congestion along the path, it increases its sending rate; if it perceives congestion, it reduces its sending rate.
But this approach raises three questions:
- How does a TCP sender limit the rate at which it sends segments into its connection?
- How does a TCP sender perceive network congestion?
- When the sender perceives end-to-end congestion, what algorithm should be used to change its sending rate?
Let's discuss the first question: how does a TCP sender limit the rate at which it sends segments into its connection?
We know that a TCP connection consists of a receive buffer, a send buffer, and variables (LastByteRead, rwnd, and so on). The sender's congestion control mechanism tracks one additional variable: the congestion window, written cwnd, which limits the amount of data TCP can send into the network before receiving an ACK. The receive window (rwnd), by contrast, tells the sender how much data the receiver can accept.
Generally speaking, the amount of unacknowledged data at the sender must not exceed the minimum of cwnd and rwnd, that is
LastByteSent - LastByteAcked <= min(cwnd,rwnd)
Assume the receiver has enough buffer space, so we can ignore rwnd and focus on cwnd. Since each data packet takes a round-trip time RTT, the sender's rate is roughly cwnd/RTT bytes/sec. By adjusting cwnd, the sender can therefore adjust the rate at which it sends data into the connection.
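The constraint above can be sketched directly; variable names mirror the formula, and the numeric values are illustrative:

```python
# Sketch of the sending constraint: in-flight (sent-but-unacknowledged)
# data may not exceed min(cwnd, rwnd).

def usable_window(cwnd, rwnd, last_byte_sent, last_byte_acked):
    """Bytes the sender may still put on the wire right now."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, min(cwnd, rwnd) - in_flight)

# cwnd = 8000, rwnd = 6000, and 4000 bytes already in flight:
print(usable_window(8000, 6000, 14000, 10000))  # -> 2000
# The receiver advertises a zero window: nothing may be sent.
print(usable_window(8000, 0, 14000, 14000))     # -> 0
```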
How does a TCP sender perceive network congestion ?
As discussed above, TCP perceives congestion through a timeout or three redundant ACKs.
When the sender perceives end-to-end congestion, what algorithm is used to change its sending rate ?
This question is more involved. Generally speaking, TCP follows these guiding principles:
- If a segment is lost in transit, the network is congested, and the TCP sender's rate should be reduced accordingly.
- An acknowledgment indicates that the network is delivering the sender's segments to the receiver, so when an ACK arrives for a previously unacknowledged segment, the sender's rate can be increased. Why? Because that segment reaching the receiver means the path is not congested; the sender's congestion window can therefore grow, and the sending rate with it.
- Bandwidth probing: TCP adjusts its transmission rate based on the arrival of ACKs. It increases its rate to probe for the rate at which congestion begins, backs off when a loss event occurs, and then begins probing again to see whether the congestion onset rate has changed.
Having understood TCP congestion control, let's look at the TCP congestion control algorithm, which has three main components: slow start, congestion avoidance, and fast recovery. Let's take a look at each in turn.
Slow start
When a TCP connection is established, cwnd is initialized to a small value of one MSS, giving an initial transmission rate of roughly MSS/RTT bytes/sec. For example, with an MSS of 1000 bytes and an RTT of 200 ms, the initial rate is about 40 kbit/s. In practice the available bandwidth is usually much larger than MSS/RTT, so TCP uses slow start to find the best sending rate. In slow start, cwnd starts at 1 MSS and increases by one MSS for every segment acknowledged: after the first segment is acknowledged, cwnd becomes 2 MSS; after those two segments are acknowledged, it becomes 4 MSS; and so on, doubling every round trip, as shown below.
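Slow start's exponential growth can be simulated in a few lines. This is a sketch under the stated assumptions (cwnd measured in MSS units, every segment in the window acknowledged each round trip), not a real TCP implementation:

```python
# Sketch of slow start: cwnd starts at 1 MSS and gains one MSS per
# acknowledged segment, which doubles it every round trip.

MSS = 1  # measure cwnd in MSS units for clarity

def slow_start_growth(rtts):
    """cwnd (in MSS) at the start of each of the first `rtts` round trips."""
    history = []
    cwnd = 1 * MSS
    for _ in range(rtts):
        history.append(cwnd)
        cwnd += cwnd * MSS  # one MSS per ACK, and cwnd segments get ACKed
    return history

print(slow_start_growth(5))  # -> [1, 2, 4, 8, 16], doubling each RTT
```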
The sending rate cannot increase forever; the growth must end somewhere. So when does it end? Slow start usually ends its growth in one of the following ways.
- If packet loss occurs during slow start, TCP sets the sender's cwnd to 1 and restarts the slow start process. Here a concept called ssthresh (slow start threshold) is introduced: its value is set to half the cwnd at the moment the loss was detected.
- The second way is tied directly to ssthresh: since ssthresh is half the window value when congestion was last detected, doubling cwnd beyond ssthresh risks packet loss. So when cwnd reaches ssthresh, TCP switches to congestion avoidance mode and ends slow start.
- The last way slow start ends is when three redundant ACKs are detected: TCP performs a fast retransmission and enters the fast recovery state.
Congestion avoidance
When TCP enters the congestion avoidance state, cwnd equals half its value at the time congestion was detected, namely ssthresh. So cwnd can no longer double every round trip; instead a more conservative approach is adopted: cwnd grows by only one MSS per round trip. For example, even when acknowledgments for 10 segments arrive, cwnd grows by only one MSS. This is linear growth, and it too comes to an end, in the same ways as slow start: if a timeout-based loss occurs, cwnd is reset to one MSS and ssthresh is set to half of cwnd; if instead three redundant ACKs arrive, TCP halves cwnd, records ssthresh as half of the previous cwnd, and enters the fast recovery state.
Fast recovery
In fast recovery, for the missing segment that caused TCP to enter this state, cwnd is increased by one MSS for each redundant ACK received. When the ACK for the lost segment finally arrives, TCP lowers cwnd and enters the congestion avoidance state. If a timeout occurs instead, TCP migrates to the slow start state: cwnd is set to 1 MSS and ssthresh to half of cwnd.
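The three states and the transitions described above can be tied together in a compact state machine. This is a sketch of the logic in this section (cwnd and ssthresh in MSS units, the initial ssthresh chosen arbitrarily), not a faithful RFC implementation:

```python
# Sketch of the three-state congestion control loop: slow start,
# congestion avoidance, and fast recovery, reacting to new ACKs,
# three redundant ACKs, and timeouts.

class CongestionControl:
    def __init__(self):
        self.cwnd = 1          # in MSS units
        self.ssthresh = 64     # arbitrary initial threshold for the sketch
        self.state = "slow_start"

    def on_timeout(self):
        # Any timeout: halve the threshold, restart slow start from 1 MSS.
        self.ssthresh = max(self.cwnd // 2, 1)
        self.cwnd = 1
        self.state = "slow_start"

    def on_triple_dup_ack(self):
        # Three redundant ACKs: fast retransmit, then fast recovery.
        self.ssthresh = max(self.cwnd // 2, 1)
        self.cwnd = self.ssthresh
        self.state = "fast_recovery"

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd *= 2  # exponential growth, one MSS per ACKed segment
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += 1  # linear growth: one MSS per round trip
        else:  # fast_recovery: the ACK for the lost segment arrived
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"

cc = CongestionControl()
cc.ssthresh = 8
for _ in range(3):
    cc.on_new_ack()           # 1 -> 2 -> 4 -> 8, then leave slow start
print(cc.state, cc.cwnd)      # congestion_avoidance 8
cc.on_triple_dup_ack()
print(cc.state, cc.cwnd)      # fast_recovery 4
cc.on_timeout()
print(cc.state, cc.cwnd)      # slow_start 1
```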