TCP is a connection-oriented unicast protocol. TCP has no notion of multicast or broadcast, because every connection is identified by exactly one sender and one receiver.

Before sending data, the two communicating parties (the sender and the receiver) need to establish a connection, and after the data has been sent they need to release it. These are the establishment and termination of a TCP connection.

Establishment and termination of a TCP connection

If you’ve read my previous article on the network layer, you know that a TCP connection is identified by four basic elements: the sender’s IP address, the sender’s port number, the receiver’s IP address, and the receiver’s port number. Each party’s IP address + port number forms a socket, and the pair of sockets uniquely identifies the connection. A socket is like a door out of which data is transferred.

A TCP connection goes through three phases: connection establishment, data transfer, and connection termination.

Our discussion below focuses on these phases.

The following figure shows a very typical TCP connection setup and closure process, not including the data transfer part.

TCP connection establishment – the three-way handshake

  1. The server process prepares to accept external TCP connections, usually by calling the functions socket, bind, and listen. This is considered a passive open. The server process is then in the LISTEN state, waiting for client connection requests.
  2. The client performs an active open by calling connect. It makes a connection request to the server with the synchronization bit SYN = 1 and selects an initial sequence number (seq = x). A SYN segment is not allowed to carry data, but it consumes one sequence number. At this point, the client enters the SYN-SENT state.
  3. After the server receives the client's request, it must acknowledge the client's segment. In the acknowledgement segment, both the SYN and ACK bits are set to 1, the acknowledgement number is ack = x + 1, and the server chooses an initial sequence number of its own, seq = y. This segment also cannot carry data, but it likewise consumes one sequence number. At this point, the server enters the SYN-RECEIVED state.
  4. After the client receives the response from the server, it in turn acknowledges the connection: it sets ACK = 1, the sequence number seq = x + 1, and the acknowledgement number ack = y + 1. TCP specifies that this segment may or may not carry data; if it carries no data, the sequence number of the next data segment is still seq = x + 1. At this point, the client enters the ESTABLISHED state.
  5. The server also enters the ESTABLISHED state after receiving the client's acknowledgement.

This is the typical three-way handshake: the three segments above establish a TCP connection. The purpose of the handshake is not only to let both parties know that a connection is being established, but also to exchange special information carried in the segments, such as the initial sequence numbers and the option fields.
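The passive and active opens above map directly onto the standard socket API. Here is a minimal loopback sketch in Python (port 0 asks the OS for any free port; the names are illustrative): the kernel performs the three-way handshake inside listen()/accept() and connect().

```python
import socket
import threading

# Passive open: socket() -> bind() -> listen(); the server then waits in LISTEN.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve():
    conn, _ = srv.accept()      # handshake is completed by the kernel
    conn.sendall(b"hello")
    conn.close()

t = threading.Thread(target=serve)
t.start()

# Active open: connect() sends SYN, receives SYN+ACK, sends ACK.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
data = b""
while len(data) < 5:            # read the 5-byte greeting
    chunk = cli.recv(5 - len(data))
    if not chunk:
        break
    data += chunk
cli.close()
t.join()
srv.close()
```

By the time connect() returns, both sockets are in the ESTABLISHED state.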

The first party to send a SYN is considered to initiate the connection, and is usually referred to as the client. The recipient of that SYN is usually called the server side; it receives this SYN and sends the following SYN, so its open is passive.

TCP requires three message segments to establish a connection and four to release a connection.

TCP Disconnect – Four waves

At the end of the data transmission, either communicating party can release the connection. After the transfer, the client and server hosts are both in the ESTABLISHED state, and the connection is then released.

The process of TCP disconnection is as follows:

  1. The client application actively closes the TCP connection by issuing a segment that releases the connection and by stopping its own sending. In this segment, the FIN bit is set to 1, it carries no data, and its sequence number is seq = u. At this point, the client host enters the FIN-WAIT-1 (terminate wait 1) phase.
  2. After receiving the segment sent by the client, the server host issues an acknowledgement with ACK = 1, its own sequence number seq = v, and acknowledgement number ack = u + 1. The server host then enters the CLOSE-WAIT state.
  3. The client host enters the FIN-WAIT-2 (terminate wait 2) state when it receives the acknowledgement from the server host, and waits for the server to issue its own connection-release segment.
  4. When the server has no more data to send, it issues its own release segment, with FIN = 1, ACK = 1, sequence number seq = w (the server may have sent more data in the meantime), and acknowledgement number ack = u + 1. After sending it, the server host enters the LAST-ACK (final acknowledgement) phase.
  5. After the client receives the release segment from the server, it must respond: it issues an acknowledgement segment with ACK = 1, sequence number seq = u + 1 (the client has sent no new data since its own FIN), and acknowledgement number ack = w + 1. It then enters the TIME-WAIT state. Note that the TCP connection has not been released at this point; the client must wait for the set time, i.e. 2MSL, before it enters the CLOSED state. MSL stands for Maximum Segment Lifetime.
  6. The server enters the CLOSED state as soon as it receives the client's final acknowledgement, so the server terminates the TCP connection sooner than the client. Because the disconnection requires four segments to be exchanged, the process of releasing the connection is also known as the four waves.

Either side of a TCP connection can initiate the close, but it is usually the client that does so. However, some servers, such as Web servers, initiate the close after responding to a request. TCP specifies that a shutdown is initiated by sending a FIN segment.

So, establishing a TCP connection requires three segments, and closing one requires four. TCP also supports a half-open state, although it is not common.

TCP half open

A TCP connection is half-open when one side has closed or terminated the connection without informing the other side. Think of two people chatting on WeChat: CXuan goes offline without telling you, and the connection is half-open. This typically happens when one host crashes in the middle of communicating: my computer is down, so how could I possibly tell you? As long as the party left holding the half-open connection does not transmit data, it has no way to detect that the other host has gone offline.

Another cause of the half-open state is that one party power-cycles its host rather than shutting its connections down properly. This situation leaves many half-open TCP connections on the server.

TCP half closed

Since TCP supports half-open operation, we might guess that TCP also supports a half-close operation. Likewise, TCP half-close is not common. TCP's half-close closes only one direction of the data flow; two half-close operations together close the entire connection. Normally, the communicating parties end the connection by having their applications send FIN segments to each other, but with a TCP half-close an application effectively says: "I have finished sending my data and have sent the other party a FIN segment, but I still want to receive data from the other party until it sends me a FIN segment." Here is a schematic of a TCP half-close.

Explain the process:

First the client and server hosts transfer data. After a while, the client sends a FIN segment, actively requesting disconnection. The server receives the FIN and responds with an ACK. Because the half-close was initiated by the client, which still wants data from the server, the server keeps sending. Some time later, the server sends its own FIN segment, and the connection is fully closed after the client acknowledges that FIN with an ACK.

In TCP’s half-close operation, one direction of the connection is closed while the other continues to transmit data until it too is closed. Few applications use this feature, however.
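The half-close maps onto the sockets API as shutdown(SHUT_WR). Here is a hedged loopback sketch in Python: the client sends its FIN with shutdown but keeps reading until the server closes its own direction.

```python
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))      # OS-chosen port, loopback only
srv.listen(1)
port = srv.getsockname()[1]

def serve():
    conn, _ = srv.accept()
    chunks = []
    while True:                  # read until the client's FIN arrives
        b = conn.recv(1024)
        if not b:
            break
        chunks.append(b)
    conn.sendall(b"reply:" + b"".join(chunks))  # still free to send back
    conn.close()                 # now close the other direction too

t = threading.Thread(target=serve)
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
cli.sendall(b"request")
cli.shutdown(socket.SHUT_WR)     # half-close: "I'm done sending"
reply = b""
while True:                      # but we can still receive
    b = cli.recv(1024)
    if not b:
        break
    reply += b
cli.close()
t.join()
srv.close()
```

The client's shutdown emits a FIN for its direction only; the reply still flows back until the server's close sends the second FIN.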

Open and close simultaneously

A more unusual operation is when two applications actively open a connection at the same time. While this may seem unlikely, it can happen with certain arrangements. We’re going to focus on this process.

Each communicator will first send a SYN before receiving a SYN from the other. This scenario also requires each communicator to know the IP address + port number of the other.

Here is an example of opening simultaneously

As shown in the figure above, both communicating parties actively sent SYN segments before receiving each other's, and both responded with an ACK after receiving the other side's segment.

A simultaneous open requires the exchange of four segments, one more than the normal three-way handshake. Since there is no client or server in a simultaneous open, I refer to the two sides simply as the communicating parties.

Like simultaneous opening, simultaneous closing means that both parties of the communication make an active closing request at the same time and send FIN message. The following figure shows a process of simultaneous closing.

Simultaneous closing requires the same number of segments to be exchanged as a normal close, except that the exchanges do not occur sequentially as in the four waves, but interleaved.

Talk about the initial sequence number

The initial sequence number has a technical name: Initial Sequence Number (ISN). So the initial sequence numbers written above, such as seq = x and seq = y, are really ISNs.

Before sending a SYN, each communicating party chooses an initial sequence number. The initial sequence number is generated pseudo-randomly, and every TCP connection gets a different one. The RFC documents describe the initial sequence number as a 32-bit counter that increments by 1 every 4 µs (microseconds). Since each TCP connection is a distinct instance, the purpose of this arrangement is to prevent sequence-number overlap between connections.

Once a TCP connection is established, only segments carrying the correct TCP four-tuple and an acceptable sequence number are accepted. This also shows how vulnerable TCP segments are to forgery: as long as I forge the same four-tuple and a valid sequence number, I can forge a TCP segment and interrupt a normal TCP connection. One defence against this attack is to randomize the initial sequence number; another is to encrypt the sequence numbers.
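The randomization idea can be sketched roughly as follows (in the spirit of RFC 6528, not any particular kernel's implementation): derive the ISN from a clock plus a keyed hash of the four-tuple. The SECRET value and the 4 µs clock granularity below are illustrative assumptions.

```python
import hashlib
import struct
import time

SECRET = b"per-boot-random-secret"   # assumption: chosen randomly at boot

def initial_seq(src_ip, src_port, dst_ip, dst_port):
    """Sketch of a modern ISN: clock + keyed hash of the 4-tuple."""
    clock = int(time.time() * 1_000_000 // 4)          # ticks every 4 us
    material = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(SECRET + material).digest()
    offset = struct.unpack(">I", digest[:4])[0]        # per-connection offset
    return (clock + offset) % (1 << 32)                # 32-bit sequence space

isn = initial_seq("10.0.0.1", 12345, "10.0.0.2", 80)
```

The clock component preserves the monotonic 4 µs behaviour the RFCs describe, while the keyed hash makes the ISN of one connection useless for predicting another's.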

TCP state transition

We’ve talked about three handshakes and four waves, and we’ve talked about some state transitions between TCP connections, so I’m going to start at the beginning and walk you through these state transitions.

At the beginning, both the server and the client are in the CLOSED state. Each then decides whether to open actively or passively. If it opens actively, the client sends a SYN segment to the server and enters the SYN-SENT state; SYN-SENT means it has sent a connection request and is waiting for a matching one. The server opens passively and sits in the LISTEN state, listening for SYN segments. If the client calls close, or performs no operations for a period of time, it reverts to the CLOSED state, as shown in the diagram below.

Why would an endpoint in the LISTEN state send a SYN and move to SYN_SENT?

The reason for the LISTEN -> SYN_SENT transition is that the connection may be triggered by the server-side application sending data to the client; the client passively accepts the connection and starts transferring files once it is established. That is, it is possible for a server in the LISTEN state to send a SYN segment, but this is very rare.

An endpoint in SYN_SENT that receives a SYN sends a SYN + ACK and moves to SYN_RCVD (this is the simultaneous-open case), just as an endpoint in LISTEN that receives a SYN replies with SYN + ACK and moves to SYN_RCVD. If an endpoint in SYN_RCVD receives an RST, it falls back to LISTEN.

It’s better to look at these two pictures together.

So what is an RST?

Consider the case where a host receives a TCP segment whose IP and port numbers do not match any of its sockets. If the client host sends a request and the server host determines that the request is not destined for it, the server sends a special RST segment to the client.

Therefore, when a server sends a special RST segment to a client, it tells the client that there is no matching socket connection and that it should not continue sending it.

RST (Reset the Connection) is used to reset a connection that has gone wrong for some reason, and also to reject illegal data and requests. Receiving an RST bit usually indicates that some error has occurred.

Failing to find a matching IP and port, as above, is one condition that causes an RST. RSTs can also occur because of request timeouts, cancellation of an existing connection, and so on.

The server in SYN_RCVD receives the ACK segment, and the client in SYN_SENT receives the SYN + ACK segment and sends its own ACK; with that, the connection between client and server is established.
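The transitions described so far can be summarized as a toy transition table. This is a simplification for illustration, not the full RFC 793 state machine; the event names (e.g. "recv SYN/send SYN+ACK") fold the reply into the event.

```python
# Handshake portion of the TCP state machine, as a lookup table.
TRANSITIONS = {
    ("CLOSED",   "active_open/send SYN"):   "SYN_SENT",
    ("CLOSED",   "passive_open"):           "LISTEN",
    ("LISTEN",   "recv SYN/send SYN+ACK"):  "SYN_RCVD",
    ("SYN_SENT", "recv SYN+ACK/send ACK"):  "ESTABLISHED",
    ("SYN_SENT", "recv SYN/send SYN+ACK"):  "SYN_RCVD",   # simultaneous open
    ("SYN_RCVD", "recv ACK"):               "ESTABLISHED",
    ("SYN_RCVD", "recv RST"):               "LISTEN",
}

def walk(start, events):
    """Follow a sequence of events through the table."""
    state = start
    for e in events:
        state = TRANSITIONS[(state, e)]
    return state

client = walk("CLOSED", ["active_open/send SYN", "recv SYN+ACK/send ACK"])
server = walk("CLOSED", ["passive_open", "recv SYN/send SYN+ACK", "recv ACK"])
```

Both walks end in ESTABLISHED, matching the handshake narrative above; the SYN_RCVD -> LISTEN entry captures the RST fallback.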

One more thing to note, which I did not make explicit above: with a simultaneous open it actually looks like this.

Why is this so? When both hosts send a SYN at roughly the same time, each receives the other's SYN and replies with a SYN + ACK. Both parties are then in the SYN-RECEIVED (SYN-RCVD) state, and once each has received the SYN + ACK it is waiting for, both enter the ESTABLISHED state and start transferring data.

So far, I’ve given you a description of the state transitions during TCP connection establishment, and now you can make a pot of tea and drink some water while you wait for the data to arrive.

Okay, now that we’ve had enough water, the data transfer is complete, and when the data transfer is complete, the TCP connection can be disconnected.

Now let’s wind the clock back a little, to the point where the server is in SYN_RCVD because it has just received a SYN and sent a SYN + ACK. If the server application then shuts down and sends a FIN segment, the server goes from SYN_RCVD to the FIN_WAIT_1 state.

Back to the normal case: the client sends a FIN segment because it wants to disconnect, and enters the FIN_WAIT_1 state. The server receives the FIN segment, replies with an ACK, and goes from ESTABLISHED to the CLOSE_WAIT state.

The server in the CLOSE_WAIT state sends the FIN message and then puts itself in the LAST_ACK state. A client in FIN_WAIT_1 becomes FIN_WAIT_2 when it receives an ACK message.

We need to explain the CLOSING state first. The transition FIN_WAIT_1 -> CLOSING is a special one.

CLOSING is a special state that should be very rare in practice. Normally, after you send a FIN, you should receive the ACK first (or at the same time) and only then receive the other party's FIN. The CLOSING state means that after sending your FIN, you received the other party's FIN without having received its ACK.

Under what circumstances does this happen? Thinking it through, the conclusion is not hard to reach: if both sides close the connection at the same time, both send a FIN simultaneously, and both end up in the CLOSING state, each closing the connection.

A client in the FIN_WAIT_2 state changes to TIME_WAIT after receiving the FIN + ACK segment sent by the server host and sending its ACK response. An endpoint in CLOSE_WAIT that sends its FIN moves to the LAST_ACK state.

Many diagrams and blog posts draw the transition to LAST_ACK on a FIN + ACK segment, but when describing it usually speak only of the FIN. Either way: CLOSE_WAIT sends the FIN and moves to the LAST_ACK state.

So the FIN_WAIT_1 -> TIME_WAIT transition is the state the client reaches after receiving both the FIN and the ACK and sending its own ACK.

A client in CLOSING then moves to TIME_WAIT once it receives the outstanding ACK. As you can see, TIME_WAIT is the last state the client passes through before closing; it belongs to an active close. LAST_ACK is the server's last state before it closes, and belongs to a passive close.

There are a couple of special states above that deserve a closer explanation.

TIME_WAIT state

When a TCP connection is terminated, the party that initiates the close enters the TIME_WAIT state. The TIME_WAIT state is also known as the 2MSL wait state: in it, TCP waits for twice the Maximum Segment Lifetime (MSL).

First, MSL needs explaining.

MSL is the maximum expected lifetime of a TCP segment: the longest it can live on the network. This time is bounded because TCP segments are carried in IP datagrams, and an IP datagram's TTL (hop limit) field bounds its lifetime. Under the original specification, the MSL is 2 minutes, but this value can be changed and depends on the operating system.

With this in mind, let’s explore the state of TIME_WAIT.

When TCP performs an active close and sends the final ACK, the connection must stay in TIME_WAIT for 2MSL so that TCP can resend that final ACK if it is lost. The resend is not TCP retransmitting the ACK on its own, but a response to the other side retransmitting its FIN: the peer keeps resending the FIN because it needs the ACK in order to close the connection. If the closing side did not wait, a retransmitted FIN would arrive at a closed socket, the client would answer with an RST, and the server would interpret that as an error.
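One practical consequence of TIME_WAIT: a restarted server can fail to bind its port while old connections linger in that state. The SO_REUSEADDR socket option relaxes this check; a minimal Python sketch:

```python
import socket

# Without SO_REUSEADDR, bind() can fail with EADDRINUSE while a previous
# incarnation of the server's connections sit in TIME_WAIT.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
opt = s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)  # confirm it's set
s.bind(("127.0.0.1", 0))   # port 0 here only so the example always binds
s.listen(1)
s.close()
```

Servers routinely set this option right after socket() and before bind(); it does not disable TIME_WAIT itself, it only allows the new socket to reuse the address.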

TCP timeout and retransmission

There is no communication that never goes wrong. No matter how favourable the external conditions, errors always remain possible. So in normal TCP communication there will be errors, caused by packet loss, packet duplication, or even packets arriving out of order.

In the course of TCP communication, the receiving end returns a series of acknowledgements, which are used to determine whether an error has occurred. When packets are lost, TCP starts a retransmission operation, retransmitting the data that has not been acknowledged.

TCP has two ways of retransmitting: one based on time, the other based on acknowledgement information. Retransmitting based on acknowledgements is generally more efficient than retransmitting based on time.

From this we can see that TCP's acknowledgements and retransmissions are both premised on whether a packet has been acknowledged.

TCP sets a timer when it sends data. If no acknowledgement arrives within the time the timer specifies, a timeout, or timer-based retransmission, is triggered. The timer's expiry interval is called the retransmission timeout (RTO).

But there is another way to do this without causing delay, and that is to do a quick retransmission.

TCP doubles the retransmission interval each time a packet is retransmitted. This doubling is known as binary exponential backoff. When the accumulated intervals amount to about 15.5 minutes, the client gives up and displays:

Connection closed by foreign host.
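The doubling can be sketched as follows; the initial RTO, the cap, and the retry count are illustrative values, not a kernel implementation (Linux, for instance, caps the RTO at TCP_RTO_MAX, 120 seconds).

```python
def backoff_schedule(initial_rto=1.0, cap=64.0, retries=10):
    """Binary exponential backoff: each retry doubles the RTO up to a cap."""
    rto, schedule = initial_rto, []
    for _ in range(retries):
        schedule.append(rto)
        rto = min(rto * 2, cap)   # double, but never beyond the cap
    return schedule

sched = backoff_schedule()
# First intervals: 1, 2, 4, 8, 16, 32, 64, then pinned at the cap.
```

Summing a long enough schedule is what produces totals on the order of many minutes before the connection is declared dead.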

TCP has two thresholds that determine how to retransmit a segment, defined in RFC 1122. The first, R1, indicates the number of retransmission attempts TCP is willing to make; the second, R2, indicates when TCP should abandon the connection. R1 and R2 should be set to at least three retransmissions and 100 seconds, respectively.

It is important to note that for the connection-establishment segment (SYN), R2 should be set to at least 3 minutes; but R1 and R2 values are set differently on different systems.

On Linux, the values of R1 and R2 can be set by the application, or by modifying the values of net.ipv4.tcp_retries1 and net.ipv4.tcp_retries2. The value of the variable is the number of retransmissions.

The default value of tcp_retries2 is 15, which corresponds to roughly 13-30 minutes; this is only a rough approximation, since the total depends on the RTO, the retransmission timeout. The default value of tcp_retries1 is 3.

net.ipv4.tcp_syn_retries and net.ipv4.tcp_synack_retries limit the number of times a SYN is retried. The default is 5, which amounts to roughly 180 seconds.

Windows also has R1 and R2 variables; their values are defined in the registry keys below:

HKLM\System\CurrentControlSet\Services\Tcpip\Parameters
HKLM\System\CurrentControlSet\Services\Tcpip6\Parameters

One important variable is TcpMaxDataRetransmissions, which corresponds to tcp_retries2 on Linux; its default value is 5. It specifies how many times TCP retransmits an unacknowledged segment on an established connection.

Fast retransmission

We mentioned fast retransmission above. The fast-retransmission mechanism is triggered by feedback from the receiver rather than by the retransmission timer, so compared with timeout retransmission it can repair packet loss more promptly. When an out-of-order segment (say, 2-4-3) arrives during a TCP connection, TCP must generate an acknowledgement immediately. This acknowledgement is called a duplicate ACK.

When an out-of-order segment arrives, the duplicate ACK must be returned immediately, with no delay allowed. Its purpose is to tell the sender that a segment arrived out of order, and which sequence number the receiver is expecting.

Duplicate ACKs also reach the sender when a segment after the missing one arrives at the receiver, which suggests that the sender's earlier segment was lost or delayed. In either case the receiver has not received the expected segment, but it cannot tell whether that segment was lost or merely delayed. Therefore the TCP sender waits for a certain number of duplicate ACKs before concluding that data was lost and triggering a fast retransmission. Usually that number is 3. Let's take an example.

As shown in the figure above, segment 1 is received successfully and acknowledged with ACK 2; the receiver now expects sequence number 2. Segment 2 is lost, and segment 3 arrives out of order, not matching the receiver's expectation, so the receiver repeatedly sends the redundant ACK 2.

In this way, before the retransmission timer expires, the sender learns which segment is missing after receiving three duplicate ACKs and resends the lost segment. There is no need to wait for the timer to expire, which greatly improves efficiency.
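The duplicate-ACK counting can be sketched as a small simulation (DUPTHRESH = 3, per the usual default; this is a toy model, not a real TCP sender):

```python
DUPTHRESH = 3   # duplicate ACKs required before fast retransmit

def fast_retransmit(acks):
    """Return the sequence number to retransmit, or None.

    `acks` is the stream of cumulative ACK numbers the sender receives.
    """
    last_ack, dup = None, 0
    for a in acks:
        if a == last_ack:
            dup += 1
            if dup == DUPTHRESH:
                return a        # fast-retransmit the segment starting at `a`
        else:
            last_ack, dup = a, 0
    return None

# Scenario above: segment 2 is lost; segments 3, 4, 5 each elicit "ACK 2",
# so the sender sees the original ACK 2 plus three duplicates.
retransmitted = fast_retransmit([2, 2, 2, 2])
```

After the third duplicate of ACK 2, the sender resends segment 2 immediately instead of waiting for the RTO.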

SACK

In standard TCP acknowledgement, if the sender sends the bytes between 0 and 10000 but the receiver has only received the bytes between 0 and 1000 and between 3000 and 10000, with the data between 1000 and 3000 never arriving, the sender will retransmit everything between 1000 and 10000. This is in fact unnecessary, because the data after 3000 has already been received; the sender simply has no way of knowing it.

How can we avoid or fix this problem?

To improve this situation, we need to give the sender more information. TCP has a SACK option field, a selective acknowledgement mechanism. It lets the receiver tell the TCP sender, in effect: "I have received everything up to 1000, and I have also received the bytes between 3000 and 10000; please resend me the bytes between 1000 and 3000."

Whether SACK is supported is negotiated with the SACK-Permitted option, which each party carries in its SYN or SYN + ACK segment to inform the other host. If both sides support SACK, SACK blocks can later be carried in ordinary segments to report the ranges that have been received.

Note that the SACK-Permitted option itself may only appear in SYN segments.
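Given the cumulative ACK and the SACK blocks, the sender can compute exactly which ranges are still missing. A hedged sketch of that bookkeeping (not a real TCP implementation; byte ranges are half-open `(start, end)` pairs):

```python
def missing_ranges(cum_ack, sack_blocks, send_high):
    """Ranges the sender still needs to retransmit.

    cum_ack:     highest cumulatively acknowledged byte
    sack_blocks: list of (start, end) ranges the receiver reports holding
    send_high:   highest byte sent so far
    """
    missing, expected = [], cum_ack
    for start, end in sorted(sack_blocks):
        if start > expected:
            missing.append((expected, start))  # a hole before this block
        expected = max(expected, end)
    if expected < send_high:
        missing.append((expected, send_high))  # tail not yet covered
    return missing

# The scenario above: 0-10000 sent, receiver holds 0-1000 and 3000-10000.
gaps = missing_ranges(1000, [(3000, 10000)], 10000)
```

Only the hole between 1000 and 3000 is retransmitted, which is exactly the saving SACK provides over cumulative ACKs alone.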

Spurious timeouts and retransmissions

In some cases, a retransmission occurs even though no segment was lost. Such an unnecessary retransmission is called a spurious retransmission, and it is usually caused by a spurious timeout, i.e. a timeout judged too early. Many factors can cause spurious timeouts: segments arriving out of order, segment duplication, ACK loss, and so on.

There are many methods for detecting and handling spurious timeouts, collectively called detection algorithms and response algorithms. A detection algorithm judges whether a timeout or timer-based retransmission was spurious; once one has occurred, a response algorithm is executed to undo or mitigate its effects. This article does not cover the following algorithms in detail:

  • The duplicate SACK extension – DSACK
  • The Eifel detection algorithm
  • Forward RTO recovery – F-RTO
  • The Eifel response algorithm

Packet disorder and packet duplication

All we have discussed above is how TCP handles packet loss. Let’s discuss packet disorder and packet duplication.

Packet disorder

Packets arriving out of order is an extremely common situation on the Internet. Since the IP layer does not guarantee ordering, and each packet may travel over whichever link is fastest at the moment, three packets sent in the order A -> B -> C may well arrive as C -> A -> B, or B -> C -> A, and so on. This is packet reordering.

In packet transmission there are two directions to consider: the forward path (carrying data segments) and the reverse path (carrying ACKs).

If reordering happens on the forward path, TCP cannot tell whether a packet was lost or merely reordered; both loss and reordering leave the receiver with out-of-order packets and gaps in the data. If the gap is small enough, it matters little; if the gap is too large, however, it may lead to spurious retransmissions.

If reordering happens on the reverse path, it can make the sender's window jump forward and deliver old ACKs that should simply be discarded, producing unnecessary bursts of traffic at the sender and hurting the available network bandwidth.

Back to our discussion of fast retransmission: because it infers loss from duplicate ACKs, it does not have to wait for the retransmission timer to expire. But since a TCP receiver immediately ACKs any out-of-order packet, any reordering in the network produces duplicate ACKs. If fast retransmission fired as soon as a single duplicate ACK arrived, heavy reordering would cause a flood of unnecessary retransmissions. Therefore fast retransmission only triggers once the duplicate threshold (dupthresh) is reached. Severe reordering is uncommon on the Internet, so dupthresh can be kept small; a value of 3 handles most cases.

Packet duplication

Packet duplication is also a rare occurrence on the Internet. It means that during network transmission a packet may be delivered more than once, and when retransmissions also occur, TCP may become confused.

Duplication of a packet can make the receiver generate a series of duplicate ACKs, a situation that SACK negotiation can resolve.

TCP data flow and window management

Flow control can be achieved using sliding windows. In other words, the client and server exchange data-flow information with every segment, such as sequence numbers, ACK numbers, and window sizes.

The two arrows in the figure show the direction of data flow, i.e. the direction TCP segments travel. As you can see, each TCP segment includes a sequence number, an ACK, window information, and possibly user data. The window size in a TCP segment represents the amount of buffer space, in bytes, that the receiver can still accept. This window size is dynamic, because segments are constantly being received and consumed; this dynamically adjusted window is called the sliding window. Let's look at it in detail.

The sliding window

Each end of a TCP connection can send data, but transmission is not unlimited. In fact, each end of the TCP connection maintains a send window structure and a receive window structure, and these structures bound how much data may be sent.

Sender window

Below is an example of a sender window.

This picture involves four sliding-window concepts:

  • Sent and acknowledged segments: after they are sent to the receiver, the receiver replies with an ACK for them. The segments marked green in the figure have been acknowledged by the receiver.
  • Sent but not yet acknowledged segments: the light blue region holds segments that have been sent but not yet acknowledged by the receiver.
  • Segments waiting to be sent: the dark blue region holds segments waiting to be sent; they are part of the send window. In other words, the send window consists of the sent-but-unacknowledged segments plus the segments waiting to be sent.
  • Segments that can only be sent once the window slides: if the segments in the range [4,9] in the figure are sent, the whole window moves to the right. The orange region holds segments that can only be sent after the window has moved right.

The sliding window also has boundaries: the left edge and the right edge. The left edge is the left boundary of the window, and the right edge is the right boundary.

The window closes when the left edge moves to the right while the right edge stays put. This happens as sent data is gradually acknowledged and the window shrinks.

When the right edge moves to the right, the window opens, allowing more data to be sent. This happens when the receiving process reads data out of its buffer, freeing buffer space to receive more.

It is also possible for the right edge to move to the left, shrinking the range of data that may be sent; this is undesirable and senders avoid it. A related pathology is silly window syndrome: when it occurs, the data segments the two parties exchange become very small, while the fixed per-segment overhead of the network stays the same. The useful payload in each segment is then tiny relative to its headers, and transmission efficiency becomes very low.

It is like dispatching a full freight truck to deliver a single envelope: the fixed cost of the trip is the same no matter how little it carries.

Each TCP segment carries an ACK number and a window advertisement, so every time a response arrives, the TCP sender adjusts its window structure based on these two fields.

The TCP sliding window's Left Edge can never move to the left, because a segment that has been sent and acknowledged can never be un-acknowledged; there is no such thing as regret. The Left Edge is controlled by the ACK numbers arriving from the other end. When an ACK moves the window to the right without changing its size, the window is said to slide forward.

If the ACK number increases but the advertised window shrinks as further ACKs arrive, the Left Edge approaches the Right Edge. When the Left Edge and the Right Edge coincide, the sender stops transmitting; this situation is called a zero window. At this point, the TCP sender starts issuing window probes, waiting for a suitable moment to resume sending data.

Receiver window

The receiver also maintains a window structure, which is much simpler than the sender's. This window records the data that has been received and acknowledged, as well as the highest sequence number it is willing to accept. The receiver's window does not need to store duplicate segments or duplicate ACKs, nor does it record segments that fall outside the window. Below is the window structure of the TCP receiver.

Like the sender's window, the receiver's window structure maintains a Left Edge and a Right Edge. Segments to the left of the Left Edge have already been received and acknowledged, while segments beyond the Right Edge fall outside the window and cannot yet be received.

For the receiver, arriving data with a sequence number smaller than the Left Edge is considered duplicate and is discarded; anything beyond the Right Edge is out of range and also discarded. Only when the arriving segment's sequence number equals the Left Edge is the data accepted in order and the window able to slide forward (implementations that support out-of-order buffering may hold on to in-window data that arrives early rather than discard it).
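This accept/discard decision can be sketched as a small function. The function name and the "buffer" branch (for stacks that keep out-of-order in-window data) are illustrative assumptions, not the behavior of any particular implementation.

```python
# Sketch of the receiver's accept/discard rule for an arriving segment.
def classify(seq, left_edge, right_edge):
    """Decide what the receiver does with a segment starting at `seq`."""
    if seq < left_edge:
        return "duplicate: discard and re-send ACK"
    if seq >= right_edge:
        return "out of range: discard"
    if seq == left_edge:
        return "in order: accept and slide the window forward"
    return "in window but out of order: buffer until the gap fills"

print(classify(3, 5, 10))   # duplicate: discard and re-send ACK
print(classify(5, 5, 10))   # in order: accept and slide the window forward
```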

The receiver's window can also become a zero window. If the application process consumes data slowly while the TCP sender pushes large amounts of data, the receive buffer fills up and the receiver advertises a zero window to stop the sender. If the application then drains the buffer very slowly (say, one byte at a time), the receiver tells the sender it can accept only one byte, the sender sends just that one byte, and this crawl continues, producing high network overhead and very low efficiency.

We mentioned above that when a window's Left Edge equals its Right Edge, it is called a zero window. Now let's explore the zero window in detail.

Zero window

TCP implements flow control through the receiver's window advertisement, which tells the sender how much data the receiver can accept. When the receiver's window drops to 0, the sender is effectively prevented from sending more data. When the receiver regains free space, it sends a window update to the sender to signal that it is ready to receive again. Window updates are typically pure ACKs, carrying no data. However, a pure ACK is not guaranteed to reach the sender, so some measure is needed to cope with its loss.

If the pure ACK is lost, both parties end up waiting forever: the sender wonders why the receiver still won't let it send data, while the receiver wonders why on earth the sender hasn't sent anything! To prevent this deadlock, the sender uses a persist timer to query the receiver intermittently and find out whether its window has grown. The persist timer triggers a window probe, forcing the receiver to return an ACK carrying its updated window.

A window probe contains one byte of data and, like any segment, is retransmitted if lost. The probe is sent whenever the TCP persist timer expires. Whether that single byte is accepted by the receiver depends on how much buffer space it has.
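The persist timer typically backs off exponentially between probes, up to some cap. A minimal sketch of that schedule, with the initial interval, cap, and probe count all as assumed illustrative values rather than values from any particular TCP stack:

```python
# Sketch of exponential backoff between zero-window probes.
# first/cap/count are illustrative; real stacks clamp between a few seconds and ~60 s.
def probe_intervals(first=1.5, cap=60.0, count=6):
    intervals, t = [], first
    for _ in range(count):
        intervals.append(min(t, cap))   # never wait longer than the cap
        t *= 2                          # back off: double the wait each time
    return intervals

print(probe_intervals())  # [1.5, 3.0, 6.0, 12.0, 24.0, 48.0]
```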

Congestion control

With TCP's window control, two hosts no longer exchange data one segment at a time; a sender can transmit many packets continuously. But large numbers of packets bring problems of their own, such as network load and network congestion. To address this, TCP uses a congestion control mechanism that throttles the sender when the network is congested.

There are two main methods of congestion control

  • End-to-end congestion control: the network layer provides no explicit support for transport-layer congestion control, so even when the network is congested, the end systems must infer it by observing network behavior. TCP uses end-to-end congestion control: the IP layer gives the end systems no feedback about congestion, so how does TCP infer it? TCP treats a timeout or three duplicate acknowledgements as a sign of congestion and shrinks its window to back off.
  • Network-assisted congestion control: here, routers provide the sender with explicit feedback about congestion in the network, typically a single bit indicating that a link is congested.

The figure below depicts the two types of congestion control

TCP Congestion Control

If you have read this far, I will assume you understand one basis of TCP reliability: sequence numbers and acknowledgements. The other pillar of TCP reliability is congestion control.

The TCP approach is to have each sender limit its sending rate according to the perceived level of congestion in the network: if the sender senses little congestion along the path, it increases its transmission rate; if it senses congestion, it slows down.

But there are three problems with this approach

  1. How does a TCP sender limit the rate at which it sends traffic into its connection?
  2. How does a TCP sender perceive network congestion?
  3. When the sender senses end-to-end congestion, what algorithm is used to change its transmission rate?

Let’s start with the first question. How does a TCP sender limit the rate at which it sends traffic into its connection?

We know that a TCP connection consists of a receive buffer, a send buffer, and variables (LastByteRead, rwnd, and so on). The sender's congestion control mechanism tracks one additional variable, the congestion window (cwnd), which limits the amount of data TCP can push into the network before receiving an ACK. The receive window (rwnd), by contrast, tells the sender how much data the receiver is able to accept.

In general, the amount of unacknowledged data by the sender must not exceed the minimum values of CWND and RWND, i.e

LastByteSent – LastByteAcked <= min(cwnd,rwnd)

Since roughly one window of data is sent per round-trip time (RTT), and assuming the receiver has enough buffer space that we can ignore rwnd and focus on cwnd alone, the sender transmits at a rate of roughly cwnd/RTT bytes per second. By tuning cwnd, the sender can therefore adjust the rate at which it sends data into the connection.
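A quick worked example of the cwnd/RTT approximation, with the window size and RTT chosen purely for illustration:

```python
# Worked example of rate ≈ cwnd / RTT (values assumed for illustration).
cwnd_bytes = 14_600          # e.g. a window of ten 1460-byte MSS segments
rtt_seconds = 0.1            # 100 ms round-trip time
rate = cwnd_bytes / rtt_seconds
print(f"{rate:.0f} bytes/s ≈ {rate * 8 / 1e6:.2f} Mbit/s")
# 146000 bytes/s ≈ 1.17 Mbit/s
```

Doubling cwnd doubles this rate, which is exactly the lever the congestion control algorithm pulls.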

How does a TCP sender perceive network congestion?

As discussed above, TCP perceives congestion from timeouts or three duplicate ACKs.

When the sender senses end-to-end congestion, what algorithm is used to change its transmission rate?

This problem is more involved, but as I'll explain, TCP generally follows a few guiding principles:

  • If a segment is lost during transmission, the network is congested and the TCP sender needs to slow down appropriately.
  • An acknowledged segment indicates that the network delivered that segment to the receiver, so when an ACK arrives for a previously unacknowledged segment, the sender can increase its rate. Why? Because successful delivery of a previously unacknowledged segment means the path is not congested, so the sender's congestion window can grow, and the transmission rate with it.
  • Bandwidth probing: TCP adjusts its transmission rate based on arriving ACKs and reduces it whenever a loss event occurs. To discover how much bandwidth is actually available, the TCP sender keeps increasing its rate until congestion appears, then backs off, and then slowly increases again to probe whether the point at which congestion begins has changed.

The TCP congestion control algorithm has three main parts: slow start, congestion avoidance, and fast recovery. Let's take a look at each.

Slow start

When a TCP connection starts, cwnd is initialized to a small value, typically one MSS, giving an initial transmission rate of roughly MSS/RTT bytes per second. For example, with an MSS of 1000 bytes and an RTT of 200 ms, the initial rate is only about 5 KB/s (40 kbit/s). In practice the available bandwidth is much larger than MSS/RTT, so TCP uses slow start to find a better rate: cwnd begins at 1 MSS and grows by one MSS for every acknowledged segment. After the first segment is acknowledged, cwnd becomes 2 MSS; after those two segments are acknowledged, it becomes 4 MSS; and so on, doubling every RTT. As shown in the figure below.
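The doubling can be sketched in a few lines. This is a toy model measuring cwnd in units of MSS, with an assumed ssthresh of 16 as the stopping point; it is not a real implementation.

```python
# Toy model of slow start: cwnd (in MSS units) doubles every RTT
# until it reaches ssthresh. Values are illustrative.
def slow_start(mss=1, ssthresh=16):
    cwnd, history = mss, []
    while cwnd < ssthresh:
        history.append(cwnd)
        cwnd *= 2                 # one MSS added per ACK => doubling per RTT
    history.append(min(cwnd, ssthresh))
    return history                # cwnd at the start of each successive RTT

print(slow_start())  # [1, 2, 4, 8, 16]
```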

The transmission rate cannot keep growing forever; the growth has to end somewhere. So when does it end? Slow start ends its exponential growth in one of three ways.

  • If packet loss occurs during slow start, TCP sets cwnd back to 1 MSS and restarts the slow start process. At this point a variable called ssthresh (slow start threshold) is introduced, initialized to half the cwnd at which the loss occurred; that is, when congestion is detected, ssthresh is set to half the window value.
  • The second way relates directly to ssthresh: since ssthresh records half the window value at which congestion was last detected, doubling cwnd beyond it risks another loss. So when cwnd reaches ssthresh, TCP stops doubling and transitions into congestion-avoidance mode, ending slow start.
  • The final way slow start ends is when three duplicate ACKs are detected: TCP performs a fast retransmit and enters the fast recovery state.

Congestion avoidance

When TCP enters congestion avoidance, cwnd equals roughly half its value at the time congestion was detected, i.e., ssthresh. It would be reckless to keep doubling cwnd every RTT, so TCP adopts a more conservative approach: cwnd grows by only one MSS per RTT. In practice this is done by adding a fraction of an MSS for each arriving ACK; for example, when ten segments are in flight, each of the ten ACKs grows cwnd by one tenth of an MSS. This linear growth also has its stopping conditions, the same ones slow start has: if packet loss is detected by a timeout, cwnd is reset to one MSS and ssthresh is set to half the previous cwnd; if three duplicate ACKs arrive instead, ssthresh is recorded as half of cwnd and TCP enters the fast recovery state.

Fast recovery

In fast recovery, cwnd is increased by one MSS for each duplicate ACK received for the missing segment that put TCP into the fast recovery state. When an ACK finally arrives for the missing segment, TCP lowers cwnd (to ssthresh) and enters the congestion avoidance state. If a timeout occurs instead, TCP migrates back to the slow start state: cwnd is set to 1 MSS and ssthresh to half the previous cwnd.
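The three states and their transitions can be summarized in a single event handler. This is a minimal Reno-style sketch of the rules described above, with invented names and cwnd measured in MSS units; it omits many details of a real implementation.

```python
# Minimal sketch of the Reno-style reaction to loss events (cwnd in MSS units).
def on_event(state, cwnd, ssthresh, event, mss=1):
    """Return the (state, cwnd, ssthresh) that follows `event`."""
    if event == "timeout":                       # severe loss: restart slow start
        return "slow_start", mss, max(cwnd // 2, 2 * mss)
    if event == "3_dup_acks":                    # mild loss: fast recovery
        ssthresh = max(cwnd // 2, 2 * mss)
        return "fast_recovery", ssthresh + 3 * mss, ssthresh
    if event == "new_ack" and state == "fast_recovery":
        # ACK for the missing segment: deflate cwnd to ssthresh
        return "congestion_avoidance", ssthresh, ssthresh
    return state, cwnd, ssthresh

state, cwnd, ssthresh = "congestion_avoidance", 16, 32
state, cwnd, ssthresh = on_event(state, cwnd, ssthresh, "3_dup_acks")
print(state, cwnd, ssthresh)   # fast_recovery 11 8
state, cwnd, ssthresh = on_event(state, cwnd, ssthresh, "new_ack")
print(state, cwnd, ssthresh)   # congestion_avoidance 8 8
```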
