The principle of TCP reliable transmission

TCP segments are carried as the data portion of IP datagrams. The IP layer provides a best-effort service and does not guarantee reliable delivery. To offer reliable transmission over this unreliable channel, TCP has to take measures of its own. For example, when something goes wrong it has the sender retransmit the data, and when the receiver cannot process data fast enough it has the sender slow down.

1. Data stream transmission mode

Data transmitted over TCP is generally divided into two types: interactive data and block data. Interactive data is usually small. For example, sending 1 byte of interactive data takes at least 41 bytes once the TCP header and the IP header are included. In a WAN, a high volume of such small segments increases the likelihood of congestion. The segments that carry block data, by contrast, are almost always full length. The two kinds of data flow therefore use different transmission modes.

1. Interactive data stream transmission

In the early days of the Internet, communication links were unreliable, so reliable transmission protocols were used at the link layer. The simplest of these is the “stop-and-wait protocol”. Stop-and-wait is not used in the transport layer, but the Nagle algorithm used for interactive data flow works on a very similar principle.

(1) Stop-and-wait ARQ protocol

“Stop and wait” means that the sender stops after sending each packet and waits for the other party's acknowledgement. Only after the acknowledgement arrives is the next packet sent. If the sender does not receive an acknowledgement within a certain period of time, it retransmits the packet.

An acknowledgement can fail to arrive in two cases:

1. The receiver receives a corrupted packet and discards it.
2. The packet is lost during transmission.

In both cases the receiver sends nothing back. If the sender does not receive an acknowledgement within a certain period of time, it considers the packet lost and retransmits it. This is called timeout retransmission. Through this mechanism of acknowledgement and retransmission, the stop-and-wait ARQ protocol achieves reliable communication over an unreliable network.
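The acknowledge-or-retransmit loop described above can be sketched in a few lines. This is a minimal simulation rather than a real protocol implementation: the `lost_attempts` parameter is a hypothetical way to make the loss pattern deterministic, and the timer is modelled simply as "no acknowledgement means try again".

```python
def stop_and_wait(packets, lost_attempts):
    """Simulate stop-and-wait ARQ.

    packets: payloads to deliver in order.
    lost_attempts: set of (packet_index, attempt_number) pairs whose
        transmission is lost (hypothetical input for this sketch).
    Returns (delivered_payloads, total_transmissions).
    """
    delivered = []
    transmissions = 0
    for index, payload in enumerate(packets):
        attempt = 0
        while True:
            attempt += 1
            transmissions += 1
            if (index, attempt) in lost_attempts:
                # No acknowledgement arrives: the timer expires and the
                # sender retransmits the same packet.
                continue
            # Acknowledgement received: move on to the next packet.
            delivered.append(payload)
            break
    return delivered, transmissions


delivered, transmissions = stop_and_wait(["a", "b", "c"], {(1, 1), (1, 2)})
print(delivered, transmissions)  # ['a', 'b', 'c'] 5
```

Packet "b" is lost twice, so it is transmitted three times, while the sender never has more than one packet outstanding.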

(2) Nagle algorithm

The Nagle algorithm is generally used when transmitting interactive data streams. It requires that a TCP connection have at most one small unacknowledged segment outstanding, and that no other small segment be sent until the acknowledgement for that segment arrives. Note that until the acknowledgement arrives, further small writes are held back on the connection rather than sent immediately. The specific rules of the Nagle algorithm are as follows:

1. When the data in the send buffer reaches the maximum segment length, it can be sent.
2. When the data in the send buffer reaches half the size of the send window, it can be sent.
3. If the segment carries the FIN flag, it can be sent.
4. If the TCP_NODELAY option is set on the connection, the data can be sent.
5. If a timeout occurs (generally 200 ms), the data is sent immediately.
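The five rules above can be read as a single send decision. The following is a minimal sketch of that decision only; the function name and parameters are hypothetical, and a real TCP stack tracks this state inside the kernel.

```python
def nagle_can_send(buffered, mss, send_window, fin=False,
                   nodelay=False, timer_expired=False):
    """Return True if any of the five rules listed above allows a send."""
    if buffered == 0 and not fin:
        return False                  # nothing to send
    if buffered >= mss:               # rule 1: a full-length segment is ready
        return True
    if buffered >= send_window / 2:   # rule 2: half the send window is buffered
        return True
    if fin:                           # rule 3: the segment carries FIN
        return True
    if nodelay:                       # rule 4: TCP_NODELAY is set
        return True
    if timer_expired:                 # rule 5: the timeout has fired
        return True
    return False


print(nagle_can_send(10, mss=1460, send_window=65535))                # False
print(nagle_can_send(1460, mss=1460, send_window=65535))              # True
print(nagle_can_send(10, mss=1460, send_window=65535, nodelay=True))  # True
```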

While no acknowledgement has arrived, TCP collects the small pieces of data in its buffer. When the acknowledgement arrives, TCP sends them together in one segment. The faster the acknowledgements arrive, the faster the data is sent. The Nagle algorithm effectively addresses the problem of too many small segments in interactive traffic and reduces the likelihood of network congestion.

However, because buffered data is not sent out immediately, there is a certain delay. In addition, the receiver generally delays its acknowledgement so that it can be combined with data it wants to send; this delay is typically 200 ms. For some real-time applications, the delay caused by the Nagle algorithm is unacceptable. The TCP standard states that the Nagle algorithm should be implemented, but that there must also be a way for an application to turn it off. The TCP_NODELAY option in the fourth rule above is the switch that turns the Nagle algorithm off.
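As an illustration of turning the Nagle algorithm off, the sketch below sets TCP_NODELAY on a socket using Python's standard `socket` module. The socket is never connected; the example only shows where the option is set and how to read it back.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable the Nagle algorithm so that small writes are sent without waiting
# for outstanding acknowledgements.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# Read the option back; a non-zero value means Nagle is disabled.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)
sock.close()
```

Whether disabling Nagle is worthwhile depends on the application: it trades more small segments on the network for lower latency.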

2. Block data stream transmission

Block data usually exceeds the maximum segment length, so it is split into full-length segments and the small-segment problem of interactive data does not arise. Block data is transmitted with the continuous ARQ protocol, which is based on the sliding window protocol.

(1) Sliding window protocol

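The drawback of the Nagle algorithm, which is based on the stop-and-wait principle, is low channel utilization: the sender sits idle while it waits for each acknowledgement. The sliding window protocol lets the sender transmit several segments in a row without waiting for each one to be acknowledged.

[Figure: Sliding window protocol]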

The send window can change in three ways:

1. The back edge moves to the right. This happens when data has been sent and acknowledged.
2. The front edge moves to the right, allowing more data to be sent. This happens when the receive window grows or network congestion eases.
3. The front edge moves to the left. This happens when the receiver wants the send window to shrink, which the TCP standard strongly discourages: by the time the sender learns that the window has shrunk, it may already have sent some of the data in the part that was removed, which easily leads to errors.

The back edge of the window cannot move to the left, because TCP discards from its buffer any data outside the window that has already been acknowledged.

TCP requires the receiver to support cumulative acknowledgement. The receiver does not need to acknowledge each segment immediately, which reduces transmission overhead, and it can piggyback an acknowledgement on data that it wants to send. According to the TCP standard, an acknowledgement must not be delayed by more than 0.5 seconds, and when a stream of full-length segments is being received, an acknowledgement should be sent at least for every second segment.

With cumulative acknowledgement, the receiver acknowledges only the last segment that arrived in order, which indicates that everything before it has arrived. The receiver therefore cannot tell the sender exactly which out-of-order data it has received. For example, suppose the packets numbered 5, 6, 7, 9, and 10 arrive at the receiver. The receiver sends acknowledgement number 8, meaning that it has not received packet 8 and expects it next. The sender does not know that packets 9 and 10 have already arrived. This raises a question: when a timeout retransmission occurs, does the sender have to resend the data that has already arrived? If not, how does it learn exactly which data has arrived?

The continuous ARQ protocol combines the sliding window protocol with automatic repeat request (ARQ). Depending on how data is retransmitted after a timeout, continuous ARQ is divided into the Go-Back-N ARQ protocol and the selective repeat ARQ protocol.
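The cumulative acknowledgement example with packets 5, 6, 7, 9, and 10 can be reproduced with a short sketch. It works on packet numbers for simplicity, whereas real TCP acknowledges byte sequence numbers; the function name is hypothetical.

```python
def cumulative_ack(next_expected, received):
    """Return the acknowledgement number: the first packet that has not
    yet arrived in order, starting from next_expected."""
    received = set(received)
    ack = next_expected
    while ack in received:
        ack += 1
    return ack


print(cumulative_ack(5, [5, 6, 7, 9, 10]))  # 8
```

The result is 8 even though packets 9 and 10 are already in the receiver's buffer, which is exactly the information the sender is missing.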

(2) Go-Back-N ARQ protocol

On a timeout retransmission, the Go-Back-N ARQ protocol goes back and resends every packet starting from the acknowledgement number, regardless of whether the packets after it have already reached the receiver.

(3) Selective repeat ARQ protocol

With the selective repeat ARQ protocol, when the receiver gets data out of order, it tells the sender which data is missing so that only the missing data is retransmitted rather than everything. In TCP this is done by adding the selective acknowledgement (SACK) option to the TCP segment header.
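The difference between the two retransmission strategies can be made concrete with a small sketch. It uses packet numbers rather than byte ranges, and the `sacked` argument stands in for the information a SACK option would carry; both function names are hypothetical.

```python
def go_back_n_retransmit(ack, next_seq):
    """Resend every packet from the acknowledgement number up to (but not
    including) the next unsent packet number."""
    return list(range(ack, next_seq))


def selective_retransmit(ack, next_seq, sacked):
    """Resend only the packets the receiver has not reported as received."""
    sacked = set(sacked)
    return [seq for seq in range(ack, next_seq) if seq not in sacked]


# Packets up to 10 were sent; the receiver acknowledged 8 and holds 9 and 10.
print(go_back_n_retransmit(8, 11))           # [8, 9, 10]
print(selective_retransmit(8, 11, {9, 10}))  # [8]
```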

2. Timeout retransmission time

TCP manages four different timers for each connection:

1. Retransmission timer: determines when to retransmit unacknowledged data.
2. Persist timer: keeps window size information flowing.
3. Keepalive timer: detects whether the other end of an idle connection has crashed or restarted.
4. 2MSL timer: measures the time a connection spends in the TIME_WAIT state.

If the timeout retransmission time is set too short, there will be many unnecessary retransmissions, which increases the network load. If it is set too long, the network sits idle for longer and transmission efficiency drops. TCP therefore uses an adaptive algorithm to calculate the timeout retransmission time dynamically.

1. Round-trip time of the packet segment

The round-trip time (RTT) of a segment is the difference between the time the segment is sent and the time its acknowledgement is received. The smoothed round-trip time RTT_S is a weighted average of the RTT samples. For the first measurement, RTT_S is set to the measured RTT; after that, a new value of RTT_S is calculated with the following formula:

New RTT_S = (1 – α) × (old RTT_S) + α × (new RTT sample)

The TCP standard recommends a value of 0.125 for α. RTT_D, the weighted average of the RTT deviation, depends on the difference between RTT_S and the new RTT sample. For the first measurement, RTT_D is set to half of the measured RTT; subsequent measurements use the following formula:

New RTT_D = (1 – β) × (old RTT_D) + β × |RTT_S – new RTT sample|

The TCP standard recommends a value of 0.25 for β. The timeout retransmission time RTO is then calculated with the following formula:

RTO = RTT_S + 4 × RTT_D
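The three formulas above can be combined into a small estimator. This is a sketch under stated assumptions: the class name is hypothetical, RTT_D is updated with the old RTT_S before RTT_S itself is updated (the order used in RFC 6298), and clock granularity and any minimum RTO are not modelled.

```python
class RtoEstimator:
    """Track RTT_S and RTT_D and derive RTO from RTT samples."""

    def __init__(self, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.rtt_s = None
        self.rtt_d = None

    def update(self, sample):
        if self.rtt_s is None:
            # First measurement: RTT_S = RTT, RTT_D = RTT / 2.
            self.rtt_s = sample
            self.rtt_d = sample / 2
        else:
            # Assumption: use the old RTT_S when updating RTT_D.
            self.rtt_d = ((1 - self.beta) * self.rtt_d
                          + self.beta * abs(self.rtt_s - sample))
            self.rtt_s = (1 - self.alpha) * self.rtt_s + self.alpha * sample
        return self.rtt_s + 4 * self.rtt_d


estimator = RtoEstimator()
print(estimator.update(100))  # 300.0
print(estimator.update(200))  # 362.5
```

With samples of 100 ms and 200 ms, RTT_S moves only from 100 to 112.5, while the larger deviation pushes RTO from 300 to 362.5.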

2. Karn algorithm

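Measuring the RTT of a segment runs into a problem in practice. After a segment has been retransmitted, the sender cannot tell whether an arriving acknowledgement belongs to the original transmission or to the retransmission, so the measured RTT is ambiguous.

The Karn algorithm addresses this: when calculating the weighted average RTT_S, the round-trip time sample of any segment that has been retransmitted is not used.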
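The Karn rule amounts to skipping samples from retransmitted segments when RTT_S is updated. The sketch below shows only that filtering step; the function name and the `(sample, was_retransmitted)` input format are assumptions made for illustration.

```python
def smoothed_rtt(measurements, alpha=0.125):
    """Compute RTT_S from (rtt_sample, was_retransmitted) pairs, ignoring
    samples that belong to retransmitted segments."""
    rtt_s = None
    for sample, was_retransmitted in measurements:
        if was_retransmitted:
            continue  # ambiguous sample, do not use it
        if rtt_s is None:
            rtt_s = sample
        else:
            rtt_s = (1 - alpha) * rtt_s + alpha * sample
    return rtt_s


# The 900 ms sample comes from a retransmitted segment and is skipped.
print(smoothed_rtt([(100, False), (900, True), (200, False)]))  # 112.5
```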

3. Flow control

If the sender sends data too slowly over a TCP connection, resources are wasted. If it sends data too fast, the receiver cannot keep up. Flow control means sending data as fast as is reasonable within the range that the receiver can accept.

1. Flow control based on sliding window

Flow control of the sender can be implemented with the sliding window mechanism. When a TCP connection is established, the receiver states the size of its receive window in the acknowledgement segment. The receive window can be adjusted dynamically, and the sender is told its current size in every acknowledgement segment.

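When the receiver advertises a window of zero, the sender stops sending data and uses the persist timer to send zero window probe segments until the window reopens. Even with a zero window, the receiver must still accept the following segments:

1. Zero window probe segments
2. Acknowledgement segments
3. Segments carrying urgent data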
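The window-based flow control described in this section can be sketched from the sender's point of view. The helper below is hypothetical and ignores the congestion window; it only shows how the advertised receive window limits what may be sent.

```python
def sendable_bytes(advertised_window, bytes_in_flight, bytes_queued):
    """Return how many queued bytes the sender may transmit right now.

    advertised_window: receive window from the latest acknowledgement.
    bytes_in_flight: bytes sent but not yet acknowledged.
    bytes_queued: bytes the application has handed to TCP and not yet sent.
    """
    usable_window = max(advertised_window - bytes_in_flight, 0)
    return min(usable_window, bytes_queued)


print(sendable_bytes(400, 200, 1000))  # 200
print(sendable_bytes(0, 0, 1000))      # 0
```

A result of 0 with a zero advertised window is the situation in which the sender has to wait and probe rather than send data.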

2. Silly window syndrome

Silly window syndrome is a condition in which only small amounts of data are exchanged over a connection instead of full-length segments, which makes network transmission inefficient. If the receive buffer is full and the receiving application reads only a small amount of data from it at a time, the receive window stays small, so the sender can send only a small amount of data at a time. If the sending application writes a small amount of data to the send buffer at a time and TCP sends it as soon as it arrives, the same syndrome results. To avoid silly window syndrome, measures are taken at both ends:

1. The receiver does not advertise small windows. The usual algorithm is that the receiver does not advertise a window larger than the one it is currently advertising (which can be 0) until the window can be increased by one full-length segment (that is, the MSS being received) or by half of the receiver's buffer space, whichever is smaller.
2. The sender sends data only when it can send a full-length segment or at least half of the largest window the receiver has ever advertised.
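Both avoidance rules reduce to simple threshold checks. The sketch below assumes the common formulation in which the receiver's threshold is the smaller of one MSS and half the buffer; the function names and parameters are hypothetical.

```python
def window_to_advertise(current_advert, free_space, mss, buffer_size):
    """Receiver side: keep advertising the current window until it can grow
    by min(MSS, half the receive buffer)."""
    threshold = min(mss, buffer_size // 2)
    if free_space - current_advert >= threshold:
        return free_space
    return current_advert


def sender_may_send(segment_len, mss, max_advertised_window):
    """Sender side: send only a full-length segment or at least half of the
    largest window the receiver has advertised."""
    return segment_len >= mss or segment_len >= max_advertised_window / 2


print(window_to_advertise(0, 500, mss=1460, buffer_size=8192))   # 0
print(window_to_advertise(0, 1460, mss=1460, buffer_size=8192))  # 1460
print(sender_may_send(100, mss=1460, max_advertised_window=4096))  # False
```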

4. MSS

The maximum segment size (MSS) is the maximum length of the data field in each TCP segment. The length of the data field plus the length of the TCP header equals the length of the TCP segment. The MSS is negotiated by the two parties while the TCP connection is being established. In the first handshake, the sender can include the MSS option in the header options; if no MSS option is present, the default MSS is 536 bytes. In the second handshake, the receiver can also include the MSS option, and the final MSS value is the smaller of the two values declared.

If Ethernet is used at the data link layer, the MTU is 1500 bytes. The IP header is at least 20 bytes and the TCP header is at least 20 bytes, so the MSS is at most 1460 bytes. For a destination reached across the Internet rather than on the local network, a datagram size of 576 bytes is commonly assumed, which gives an MSS of at most 536 bytes.

At the network layer, if the data to be transmitted is larger than the MTU, it is fragmented at the sending end and reassembled at the receiving end. If any fragment is lost or damaged, the entire TCP segment has to be retransmitted. TCP therefore segments the data itself, so that each segment handed down does not exceed the MTU and fragmentation at the network layer is avoided.

At the transport layer, UDP does not segment data as TCP does. UDP encapsulates all the data delivered by an application into a single datagram. If the datagram is larger than the MTU, it is fragmented by the network layer.
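The MSS arithmetic above is simple enough to check directly. The sketch assumes minimum-size IP and TCP headers with no options, and models negotiation as taking the smaller declared value, as the text describes; the names are hypothetical.

```python
IP_HEADER_MIN = 20   # bytes, IP header without options
TCP_HEADER_MIN = 20  # bytes, TCP header without options


def mss_from_mtu(mtu):
    """Largest TCP payload that fits in one datagram of size mtu."""
    return mtu - IP_HEADER_MIN - TCP_HEADER_MIN


def negotiated_mss(local_mss, peer_mss):
    """Final MSS: the smaller of the two values declared in the handshake."""
    return min(local_mss, peer_mss)


print(mss_from_mtu(1500))          # 1460
print(mss_from_mtu(576))           # 536
print(negotiated_mss(1460, 1400))  # 1400
```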

5. Congestion control

Congestion control prevents too much data from being injected into the network, so that routers and links are not overloaded. Today the transmission quality of communication lines is generally very good, and the probability of packets being discarded because of transmission errors is very small. A timeout is therefore taken as a sign of network congestion. TCP uses four congestion control algorithms: slow start, congestion avoidance, fast retransmission, and fast recovery.

1. Slow start

TCP maintains a congestion window for the sender, denoted CWND. The congestion window is the limit the sender imposes on itself to control congestion, while the receive window advertised by the receiver is the limit the receiver imposes for flow control. The sender's send window is the smaller of the two windows. The value of the congestion window is expressed in terms of SMSS, the sender maximum segment size. The old rule set the initial congestion window to 1 to 2 SMSS, while RFC 5681 allows an initial congestion window of 2 to 4 SMSS, depending on the SMSS. The specific provisions are as follows:

1. If SMSS > 2190 bytes, CWND = 2 × SMSS bytes and must not exceed 2 segments.
2. If 2190 bytes ≥ SMSS > 1095 bytes, CWND = 3 × SMSS bytes and must not exceed 3 segments.
3. If SMSS ≤ 1095 bytes, CWND = 4 × SMSS bytes and must not exceed 4 segments.
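The initial window rules can be written as one function. This sketch follows the thresholds of RFC 5681 (2190 and 1095 bytes) and returns the upper bound in bytes; the function name is hypothetical.

```python
def initial_window(smss):
    """Upper bound on the initial congestion window in bytes, given SMSS."""
    if smss > 2190:
        return 2 * smss
    if smss > 1095:
        return 3 * smss
    return 4 * smss


print(initial_window(4000))  # 8000
print(initial_window(1460))  # 4380
print(initial_window(536))   # 2144
```

For the common Ethernet case of SMSS = 1460 bytes, the second rule applies and the initial window is three segments.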

According to the slow start algorithm, after the congestion window is initialized, it is increased by at most one SMSS each time an acknowledgement for new data is received. The congestion window is measured in bytes, but slow start grows it in units of SMSS. As a result, the congestion window doubles after each round of transmission, which is exponential growth.

2. Congestion avoidance

In addition to the congestion window CWND, TCP maintains another variable, the slow start threshold ssthresh. When CWND has grown exponentially to a value greater than or equal to ssthresh, the congestion avoidance algorithm takes over from slow start. Congestion avoidance increases CWND by 1/CWND of an SMSS (with CWND counted in segments) each time an acknowledgement is received. That is, instead of doubling CWND after each round of transmission as slow start does, it adds one SMSS per round. This is additive growth.

When congestion is detected (a timeout occurs or, when fast recovery is not in use, duplicate acknowledgements are received), ssthresh is set to half the size of the current window, but at least 2 segments, and CWND is set to 1 SMSS.

For example, assume that the initial value of ssthresh is 8 SMSS and that a timeout occurs when the congestion window has risen to 12 SMSS. CWND grows by slow start up to 8 SMSS and then by congestion avoidance up to 12 SMSS. At the timeout, ssthresh is set to 6 SMSS and CWND drops back to 1 SMSS, after which slow start begins again.
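The example with an initial threshold of 8 SMSS and a timeout at 12 SMSS can be traced round by round. The simulation below is a simplified sketch: it counts the window in whole SMSS units, updates once per transmission round rather than per acknowledgement, and caps slow start at the threshold. The function name and the `timeout_rounds` parameter are hypothetical.

```python
def simulate_cwnd(initial_ssthresh, rounds, timeout_rounds=()):
    """Return the congestion window (in SMSS) at the start of each round."""
    cwnd, ssthresh = 1, initial_ssthresh
    history = []
    for rnd in range(rounds):
        history.append(cwnd)
        if rnd in timeout_rounds:
            # Timeout: halve the threshold (at least 2) and restart from 1.
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: double per round, capped at the threshold.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: add one SMSS per round.
            cwnd += 1
    return history


print(simulate_cwnd(8, 14, timeout_rounds={7}))
# [1, 2, 4, 8, 9, 10, 11, 12, 1, 2, 4, 6, 7, 8]
```

After the timeout at 12 SMSS the threshold becomes 6, so the second slow start phase stops at 6 instead of 8 before additive growth resumes.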

3. Fast retransmission

If individual segments are lost while the network is not actually congested, the sender receives no acknowledgement for them and retransmits only after the timeout. The sender then mistakenly concludes that the network is congested and starts slow start unnecessarily, which reduces transmission efficiency. With the fast retransmission algorithm, the sender learns about the loss of individual segments as early as possible. Fast retransmission requires the receiver not to delay its acknowledgement when a segment arrives out of order, but to send a duplicate acknowledgement for the last in-order segment immediately. When the sender receives three duplicate acknowledgements in a row, it retransmits the missing segment right away instead of waiting for the retransmission timer to expire.

[Figure: Fast retransmission]
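On the sender side, fast retransmission comes down to counting duplicate acknowledgements. The sketch below assumes the usual threshold of three duplicates (the value given in RFC 5681) and works on a plain list of acknowledgement numbers; the function name is hypothetical.

```python
def fast_retransmit_triggers(acks, threshold=3):
    """Return the acknowledgement numbers at which fast retransmission
    fires, i.e. when the same number has been repeated `threshold` times
    after its first arrival."""
    triggers = []
    last_ack = None
    duplicates = 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == threshold:
                triggers.append(ack)  # retransmit the segment starting at ack
        else:
            last_ack = ack
            duplicates = 0
    return triggers


# The segment starting at 3 is lost; later segments each produce a duplicate.
print(fast_retransmit_triggers([2, 3, 3, 3, 3, 6]))  # [3]
```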

4. Fast recovery

After fast retransmission has been triggered, the sender knows that only individual segments were lost and that the network is not congested. Instead of slow start, it runs the fast recovery algorithm: set the threshold ssthresh = CWND / 2, then set CWND = ssthresh + 3 SMSS. Setting the threshold to half of the current congestion window and adjusting the congestion window from that threshold is called multiplicative decrease.

Why is the congestion window set to the threshold plus 3 segments rather than to the threshold itself? Having received three duplicate acknowledgements, the sender concludes that three segments have left the network and reached the receiver's buffer. Those three segments no longer occupy network resources, so the congestion window can be enlarged accordingly.
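The fast recovery adjustment is a two-line calculation. The sketch below works in bytes and applies the lower bound of two segments on the threshold mentioned in the congestion avoidance section; the function name is hypothetical.

```python
def enter_fast_recovery(cwnd, smss):
    """Return (ssthresh, new_cwnd) in bytes after three duplicate
    acknowledgements."""
    ssthresh = max(cwnd // 2, 2 * smss)  # multiplicative decrease
    new_cwnd = ssthresh + 3 * smss       # three segments have left the network
    return ssthresh, new_cwnd


print(enter_fast_recovery(cwnd=16000, smss=1000))  # (8000, 11000)
```

Compared with a timeout, which restarts from 1 SMSS, the window here falls from 16 segments to 11, so the sender keeps most of its sending rate.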

6. Summary

TCP data is transmitted in two modes: interactive data flow and block data flow. The Nagle algorithm controls the flow of interactive data, and block data is transmitted with the continuous ARQ protocol based on the sliding window protocol. Continuous ARQ is divided into the Go-Back-N ARQ protocol and the selective repeat ARQ protocol, depending on whether all data or only selected data is retransmitted on a timeout.

The timeout retransmission time is determined dynamically: it is calculated from the measured round-trip time of segments, following the Karn algorithm.

Flow control means sending data as fast as is reasonable within the range that the receiver can accept. Both sender and receiver should take measures to prevent silly window syndrome.

The maximum segment size (MSS) is negotiated by the two parties during connection establishment. The MSS of TCP is limited by the maximum transmission unit (MTU) of the data link layer. TCP divides the data delivered by applications into segments so that each segment plus the TCP header and IP header does not exceed the MTU. In this way, the network layer does not need to fragment TCP segments. UDP packages the data delivered by the application into a single UDP datagram and leaves fragmentation to the network layer.

TCP uses four congestion control algorithms: slow start, congestion avoidance, fast retransmission, and fast recovery. These four algorithms are generally not used in isolation. Switching between slow start and congestion avoidance depends on whether the congestion window has reached ssthresh and whether a retransmission has occurred. The fast retransmission algorithm is followed by the fast recovery algorithm.