review

UDP is unreliable, whereas TCP provides reliable delivery, congestion control, and stateful connections. These guarantees come at the cost of extra complexity, reflected in the contrast between the simple UDP header and the much richer TCP header. Note that TCP's "reliability guarantee" does not mean the network itself is reliable: the underlying IP layer makes no delivery guarantee when conditions are poor, so the TCP layer above it can only keep retransmitting and rely on its algorithms to achieve reliability. This article covers the following points:

  • TCP Header Format
  • How the TCP sequence number is generated, and the three-way handshake that carries it
  • The four-way wave
  • Data structures used for flow control and congestion control
  • Flow control
  • Congestion control
  • Extra: ordering and packet-loss problems – timeout retransmission and fast retransmission

The key issues TCP addresses are:

  • Ordering
  • Packet loss
  • Connection maintenance
  • Flow control
  • Congestion control

TCP Header Format

The components of the TCP header are shown below:

  • Source and destination port numbers. These identify which application the data carried by TCP should be delivered to
  • Sequence number. Packets are numbered so that out-of-order arrival can be corrected, letting TCP ensure that everything is delivered in order
  • Acknowledgement number. A reply confirming that an outgoing packet has been received; if it has not, the packet is resent until it is delivered. Even when the receiver misses a packet, the sender learns which one was lost and needs to be retransmitted. This field addresses the packet-loss problem

    For ordering, each packet carries an ID. When the connection is established, the starting ID is agreed upon, and IDs are then assigned consecutively. To guard against loss, every packet sent must be acknowledged, but acknowledgements are not one-for-one; instead, one acknowledgement covers everything up to a given ID. This mode is called cumulative acknowledgement. For example, an acknowledgement number of 20 indicates that the first 19 packets have been received and packet 20 is expected next

  • Header length field. The TCP header length is variable because of the options field that follows
  • Status bits. SYN initiates a connection; ACK acknowledges; RST resets the connection; FIN ends it. When PSH is set, the receiver should hand the data to the upper layer immediately. URG indicates that the segment contains data marked "urgent" by the upper-layer entity at the sending end; the urgent pointer field that follows marks the last byte of the urgent data. Because TCP is connection-oriented, these status bits drive the connection's state changes.

In practice, PSH, URG, and the urgent pointer are rarely used; they are mentioned here only for completeness

  • Window size. TCP provides flow control: each side of the connection declares a window size indicating its own processing capacity. Beyond flow control, TCP also uses congestion control to adapt its sending rate to the network environment
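The fixed-format fields above can be pulled out with a short parser. Below is a minimal sketch in Python using the standard `struct` module; the field layout follows RFC 793, but the function and dictionary key names are illustrative, not from the article:

```python
import struct

def parse_tcp_header(data: bytes) -> dict:
    """Parse the fixed 20-byte portion of a TCP header (RFC 793 layout)."""
    (src_port, dst_port, seq, ack,
     off_flags, window, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", data[:20])
    header_len = (off_flags >> 12) * 4   # header length in bytes (options may follow)
    flags = off_flags & 0x3F             # low 6 bits: URG/ACK/PSH/RST/SYN/FIN
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_len": header_len,
        "SYN": bool(flags & 0x02), "ACK": bool(flags & 0x10),
        "RST": bool(flags & 0x04), "FIN": bool(flags & 0x01),
        "window": window,
    }

# Build a sample SYN segment header: ports 12345 -> 80, seq 1000, window 65535
hdr = struct.pack("!HHIIHHHH", 12345, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(hdr))
```

The 4-bit data-offset field counts 32-bit words, which is why the header length is that value times 4 and why options make it variable.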

Connection management

Next up is TCP connection management, divided into the three-way handshake and the four-way wave. Congestion control and flow control are described at the end.

Three-way handshake

Establishing a TCP connection between two hosts takes three steps: request -> reply -> reply to the reply. Besides establishing the connection, the three-way handshake also negotiates the initial sequence numbers carried in the TCP header.

An obvious question: why does TCP use three handshakes rather than two or four? Suppose the path between host A and server B is unreliable. When A wants to initiate a connection, it sends a packet to B, but that packet may be lost or delayed, or B may receive it yet decline to connect with A. So a single exchange between A and B is not enough.

On the other hand, if each side had to keep confirming that its last packet was received, even four exchanges would not suffice, because any packet can be lost in the channel. The protocol has to stop somewhere, declare the connection established, and let subsequent data handle the rest.

With three handshakes, each of A and B has made a request and received a reply, so the connection is most likely fine and is considered established. Moreover, the connection exists to carry data, and even if the final acknowledgement packet is lost, later traffic will resolve the problem.

In other words, three handshakes establish the connection with the least additional resource consumption.

  1. The client first sends a TCP segment to the server. This segment carries no application-layer data, but the SYN flag in its header is set to 1. In addition, the client randomly selects an initial sequence number, "client\_isn", which is sent to the server in this first handshake
  2. When the server receives the segment, it allocates buffers and variables for the TCP connection and sends back a segment granting the connection. This acknowledgement also carries no application-layer data, but its header contains: the SYN bit set to 1; the acknowledgement number field set to "client\_isn + 1"; and the server's own initial sequence number, "server\_isn", placed in the sequence number field
  3. Upon receiving the server's acknowledgement, the client likewise allocates buffers and variables for the connection. It then sends the "reply to the reply" segment, whose acknowledgement number is "server\_isn + 1". The connection is now established, so SYN is set to 0. This segment's data section may already carry the data the client wants to send to the server

The following figure shows the client and server state changes during the three-way handshake:

  1. Both sides start in CLOSED. The server then starts up, actively listens on a port, and enters LISTEN
  2. The client initiates the connection and enters SYN\_SENT
  3. The server receives the client's SYN, returns its own SYN, and acknowledges (ACK) the client's SYN. It then enters SYN\_RCVD
  4. After the client receives the server's SYN and ACK, it sends the ACK of the ACK and enters ESTABLISHED: from the client's point of view, its request has received a reply, so the connection is considered established
  5. When the server receives the ACK of its ACK, it too enters ESTABLISHED: from the server's point of view, its own request has now been answered. With both sides in ESTABLISHED, data can flow in either direction
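The five steps above can be sketched as a toy state machine. This is a sketch of the transitions only, not a real TCP implementation; the event names are made up, while the state names are the standard TCP ones:

```python
# Toy model of the three-way-handshake state transitions:
# (state, event) -> next state
TRANSITIONS = {
    ("CLOSED",   "listen"):       "LISTEN",       # server opens a port
    ("CLOSED",   "send_syn"):     "SYN_SENT",     # client initiates
    ("LISTEN",   "recv_syn"):     "SYN_RCVD",     # server got SYN, replies SYN+ACK
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",  # client got SYN+ACK, sends ACK of ACK
    ("SYN_RCVD", "recv_ack"):     "ESTABLISHED",  # server got the final ACK
}

def step(state, event):
    return TRANSITIONS[(state, event)]

# Walk both sides through the handshake
server = step("CLOSED", "listen")          # LISTEN
client = step("CLOSED", "send_syn")        # SYN_SENT
server = step(server, "recv_syn")          # SYN_RCVD
client = step(client, "recv_syn_ack")      # ESTABLISHED
server = step(server, "recv_ack")          # ESTABLISHED
print(client, server)
```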

Rules for sequence number generation

In the description above, the sequence numbers were described in words, but in practice they are numbers, and there are rules governing them:

First, the sequence number cannot always start at 1, because collisions would occur. Suppose A connects to B and sends packets 1, 2, and 3, but packet 3 is delayed in transit for some reason. A then crashes, re-establishes the connection with B, starts numbering from 1 again, and sends packets 1 and 2. Now the old packet 3 finishes its detour and arrives at B, which naturally takes it for the next packet of the new connection, and an error occurs.

So each connection uses a different, varying starting sequence number. Think of it as a 32-bit counter that increments every 4 microseconds: it takes 2^32 × 4 µs ≈ 4.77 hours to wrap around, far longer than any stray packet survives on the network (the IP header at the layer below carries a TTL that bounds a packet's lifetime).
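The wrap-around figure can be checked with a line of arithmetic:

```python
# Sanity-check the wrap-around time of a 32-bit ISN counter
# that increments once every 4 microseconds.
tick_us = 4
wrap_seconds = (2 ** 32) * tick_us / 1_000_000
wrap_hours = wrap_seconds / 3600
print(f"{wrap_hours:.2f} hours")
```

The result is about 4.77 hours, so a sequence number cannot collide with a leftover packet from the same port unless that packet somehow survived for hours, which the IP TTL rules out.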

Four-way wave

As part of TCP connection management, besides the three-way handshake that establishes the connection, there is a four-way wave that tears it down

Disconnecting under ideal conditions

The process of the client and server disconnecting from each other is shown in the following figure:

  1. Before disconnecting, both sides are in the ESTABLISHED state, exchanging data. The client then sends a segment with the FIN flag set to 1, enters FIN\_WAIT\_1, and waits for the server's acknowledgement
  2. When the client in FIN\_WAIT\_1 receives the server's acknowledgement, it enters FIN\_WAIT\_2 and waits for the server's own disconnect segment (one whose FIN bit is 1, indicating the server is also ready to close)
  3. When the client in FIN\_WAIT\_2 receives that segment, it sends an acknowledgement and enters TIME\_WAIT. In practice, this final ACK may be lost, so the client stays in TIME\_WAIT for a period, typically 30 seconds, 1 minute, or 2 minutes. After the wait, the connection is officially closed and the client releases all resources (including the port number)

Disconnection in the real world

The disconnection process above assumes good network conditions, with every packet reaching its destination. In practice, many surprises can happen.

When A is ready to disconnect, it sends a packet with FIN set to 1 and enters FIN\_WAIT\_1. B, after receiving A's packet, replies with an acknowledgement and enters CLOSE\_WAIT.

After receiving B's acknowledgement, A enters FIN\_WAIT\_2. If B releases its resources directly at this point, A would remain in this state forever. The TCP specification does not handle this case, but Linux does: the tcp\_fin\_timeout parameter sets a timeout for it.

If instead B sends a packet with FIN set to 1 rather than silently releasing its resources, A sends an ACK and leaves FIN\_WAIT\_2. In theory A could now release its resources, but if that last ACK is lost, B will resend its FIN packet. If A has already released everything, B will never receive an ACK. So TCP requires A to wait in TIME\_WAIT for a period long enough that, if B does not receive the ACK and resends its FIN, A can still receive it and send another ACK that reaches B.

A has to wait long enough for two reasons: port reuse and packets still in flight:

If A released its resources immediately, the port it was using would be freed along with them. But B does not know that the peer on that port is no longer A, and many packets B sent earlier may still be on their way. If a new application takes over A's port, it would receive those packets from B. Regenerating the sequence number helps, but a second layer of safety is needed to prevent confusion: wait long enough for every packet B sent to be discarded from the network before the port is reused.

The wait time is set to 2 MSL (Maximum Segment Lifetime), where 1 MSL is the longest any packet can survive on the network; beyond that, it is discarded. Since TCP rides on IP, the IP header carries a TTL value: the maximum number of router hops a datagram may traverse. Each router that forwards the packet decrements it by 1; when it reaches 0, the datagram is discarded and an ICMP message is sent to the source host. The protocol specifies an MSL of 2 minutes; in practice, 30 seconds, 1 minute, and 2 minutes are all common.

Another abnormal case: if B still has not received the ACK of its FIN after the 2 MSL wait, it resends the FIN. By then A, having waited out its 2 MSL and done its best, no longer recognizes packets on the connection; instead of an ACK it answers with RST, from which B learns that A has released its resources.

The control principle

With connection establishment and teardown covered, we turn to the controls applied while the connection is active. TCP applies two mechanisms to data transfer between client and server: flow control and congestion control.

The purpose of flow control is to eliminate the possibility of the sender overrunning the receiver's cache. It is a speed-matching service: the sender's sending rate is matched to the rate at which the receiving application reads.

Congestion control throttles the sender because of congestion in the IP network. The actions are similar, but they respond to different causes.

Because the two mechanisms act similarly, both use windows. The following describes the window structure and related concepts; the numbers in a window are the sequence numbers of outgoing packets.

In the send cache, the packets can be divided into four categories according to how they have been handled:

  • Sent, and the acknowledgement has been received
  • Sent, but no acknowledgement received yet
  • Not yet sent, but waiting to be sent
  • Not yet sent, and not yet allowed to be sent

The receiver tells the sender the size of its window, called the advertised window. This window covers the sum of parts 2 and 3 above: bytes sent but awaiting acknowledgement, plus bytes ready to send. Beyond this size the receiver's cache may overflow, so the sender maintains the following data structures:

  • LastByteAcked: the boundary between part 1 and part 2
  • LastByteSent: the boundary between part 2 and part 3
  • LastByteAcked + AdvertisedWindow: the boundary between part 3 and part 4

In the receiver’s cache, the following contents are recorded:

  • Received and acknowledged
  • Not yet received, but able to receive; this is the most the receiver can accept
  • Not yet received, and unable to receive; this is the part that would overflow the cache

The corresponding data structure is as follows:

  • MaxRcvBuffer: the maximum cache size
  • LastByteRead: the leftmost arrow in the figure. Everything before it has been received and read by the application layer; after it come bytes received but not yet read
  • NextByteExpected: to the left of this arrow is what has been received and acknowledged; to the right is what is still awaited

The difference between NextByteExpected and LastByteRead is the portion of MaxRcvBuffer not yet read by the application layer, denoted A. As mentioned earlier, the window size the receiver passes to the sender is AdvertisedWindow: MaxRcvBuffer minus A. That is, AdvertisedWindow = MaxRcvBuffer – (NextByteExpected – LastByteRead).

NextByteExpected + AdvertisedWindow marks the boundary between parts 2 and 3, which works out to LastByteRead plus MaxRcvBuffer. Within part 2, the packets received may not be contiguous, so there can be gaps; only those contiguous with part 1 can be acknowledged immediately. The gaps in the middle must wait, even if later packets have already arrived, because the packets must be kept in sequence.
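The advertised-window formula can be checked numerically. The values below are illustrative, chosen to match the window of 9 used in the flow-control example later:

```python
# AdvertisedWindow = MaxRcvBuffer - (NextByteExpected - LastByteRead):
# the buffer space not yet occupied by received-but-unread data.
def advertised_window(max_rcv_buffer, next_byte_expected, last_byte_read):
    return max_rcv_buffer - (next_byte_expected - last_byte_read)

# A 14-byte buffer, app has read through byte 1, bytes through 5 are in
# (next expected is 6): 9 bytes of room remain.
print(advertised_window(14, 6, 1))   # 9

# If the app reads nothing and the buffer fills, the window hits zero.
print(advertised_window(14, 14, 0))  # 0
```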

With the data structures introduced, the following sections show how they are used in flow control and congestion control.

Flow control

TCP provides flow control by having the sender maintain a variable called the receive window (RWND). This variable tells the sender how much cache space the receiver has available. TCP is full-duplex, so both the client and the server maintain a receive window.

Next, a worked example illustrates the principle of flow control, reusing the figure from the data-structure section above:

The red box in the figure is the window. Assume for now that the window is fixed at 9. When the ACK for packet 4 arrives, the window slides one slot to the right, so packet 13 may now be sent. If the sender transmits packets 10 through 13 all at once, it must then stop, and the not-yet-sendable portion shrinks to 0:

When the ACK for packet 5 arrives, the window slides one more slot to the right, so packet 14 may also be sent:

A window steadily sliding right is the normal situation. If the receiver processes data so slowly that its cache has no room for the sender's packets, it can resize the sender's window via the ACK message, even setting it to 0 to stop the sender entirely. Here is how the window can change:

Suppose the receiver's application never reads the cached data. When the sender receives the ACK for packet 6, the window must shrink to 8. So when that ACK arrives, the window does not slide right as a whole; only its left edge advances, making the window smaller:

If the receiver still does not process the data, the window keeps shrinking until it reaches zero. As shown in the figure below, when the sender receives the ACK for packet 14, the window has shrunk all the way to 0 and sending stops.
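The shrinking-to-zero behavior can be simulated in a small loop, a sketch under the same assumption that the receiving application never reads:

```python
# Toy flow-control loop: the receiver's app never reads, so each ACK
# advertises a smaller window until it reaches zero and the sender stops.
buf_size = 9
unread = 0              # received-but-unread bytes in the receiver's cache
sent_total = 0
while True:
    window = buf_size - unread    # advertised window in the next ACK
    if window == 0:
        break                     # zero window: sender must stop
    sent_total += window          # sender fills the whole window
    unread += window              # ...and the app reads none of it
print(sent_total, buf_size - unread)
```

The sender manages to send exactly one buffer's worth of data (9 bytes) before the advertised window reaches 0.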

When this happens, the sender periodically sends window-probe packets to see whether the window has reopened. On the receiver side, to prevent silly window syndrome, it should not announce every freed byte only to have the sender immediately fill it again: when the window is too small, the receiver withholds window updates until the window reaches a reasonable size, or the buffer is half empty.

This is the mechanism of TCP flow control

The silly window syndrome mentioned above works like this (A is the sender, B the receiver). It occurs when B's receive cache is full and B has no data of its own to send to A. Even after B's application reads from the cache and frees space, B sends A no segment announcing a non-zero RWND, because B has nothing to send. A therefore keeps assuming B's receive cache is full, and remains blocked, unable to send new data.

The solution is that TCP specifies: when B's receive window becomes 0, host A keeps sending packets carrying just one byte of data. The receiver acknowledges these packets, and eventually, once the cache begins to empty, an ACK will carry a non-zero RWND value.

Congestion control

Congestion control, like flow control, uses a window mechanism. Flow control uses the sliding window, RWND, to keep the sender from overflowing the receiver's cache; congestion control uses the congestion window, CWND, to keep the sender from overfilling the network. The relation is: LastByteSent – LastByteAcked <= min{CWND, RWND}; the congestion window and the sliding window together bound the sending rate. The capacity of a network channel = bandwidth × round-trip latency.

Here is an example of TCP in flight. Normally, setting the send window (the portion sent but not yet acknowledged) to the channel's capacity fills the whole channel. Enlarging the window beyond that sends more packets per unit time and easily causes packet loss; enlarging the caches of the routers along the path instead increases the delay and easily triggers timeout retransmission.

As shown in the figure, suppose the round-trip time is 8 s (4 s each way), one packet is sent per second, and each packet is 1024 bytes. After 8 s, all 8 packets have been sent: the first 4 have reached the receiver, but their ACKs have not returned, so they cannot yet be counted as delivered, and the last 4 are still in transit. At this point the whole pipe is full, and the 8 unacknowledged packets at the sender equal the bandwidth (1 packet per second) times the round-trip time (8 s).

Now increase the sender's window to send more packets per unit time. Originally, a packet travels through, say, four devices end to end, each taking 1 s to process it, so reaching the other side takes 4 s. If the sender speeds up, more packets reach the intermediate devices per unit time; each device can still only process one packet per second, so the extra packets are discarded.

If instead the intermediate devices' caches are enlarged, the unprocessable packets are queued first rather than dropped. The downside is added delay: queued packets no longer reach the receiver within 4 s, and once the delay grows far enough, timeout retransmissions kick in.

Thus TCP's congestion control mainly tries to avoid two phenomena: packet loss and timeout retransmission. Either one means the sending rate is too fast and should be slowed. TCP therefore starts by sending a little and grows the amount gradually, a process called slow start.

When TCP first establishes a connection, it sets CWND to one segment, so only one can be in flight. When that acknowledgement arrives, CWND increases by one, allowing two at a time; when those two acknowledgements arrive, each adds one to CWND, two in total, allowing four at a time; when those four arrive, CWND grows by four, allowing eight at a time. The growth is exponential.

But the exponential growth has an end. There is a threshold, ssthresh, with a value of 65535 bytes; beyond it, exponential growth stops, because the whole path may be nearly full. Past ssthresh, each acknowledgement received increases CWND by 1/CWND. With CWND at 8, eight segments are sent at once; when their eight acknowledgements arrive, each adds 1/8, so the eight together add 1 to CWND. Growth becomes linear.
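The two growth phases can be simulated per round-trip, a sketch counting CWND in whole segments rather than bytes:

```python
# Toy cwnd growth per RTT: exponential below ssthresh (slow start),
# then +1 per RTT above it (congestion avoidance).
def grow(cwnd, ssthresh, rtts):
    history = [cwnd]
    for _ in range(rtts):
        if cwnd < ssthresh:
            cwnd *= 2    # each ACK adds 1 segment -> doubles every RTT
        else:
            cwnd += 1    # each ACK adds 1/cwnd -> +1 segment per RTT
        history.append(cwnd)
    return history

print(grow(cwnd=1, ssthresh=8, rtts=6))  # [1, 2, 4, 8, 9, 10, 11]
```

The printed history shows the doubling phase (1, 2, 4, 8) handing over to linear growth (9, 10, 11) once the threshold is reached.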

But linear growth also continues until congestion sets in, at which point the sending rate must be reduced.

One sign of congestion is packet loss requiring timeout retransmission. The traditional response is to set ssthresh to CWND/2, reset CWND to 1, and restart slow start. But this approach is too drastic: a high transmission rate stops dead, causing the connection to stall.

The newer approach uses the fast retransmission algorithm, described shortly below. When the receiver notices a missing packet, it sends the ACK of the last in-order packet three times, so the sender can retransmit quickly without waiting for the timeout. TCP does not treat this as a serious situation, since most packets arrived and only a few were lost, so instead of collapsing: ssthresh is set to CWND/2, and when the three duplicate ACKs return, CWND = ssthresh + 3. The speed drops only slightly, leaving room for linear growth to resume. Below is a graph of the sending rate under the two methods:
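The two loss reactions can be written side by side. This is a sketch of the adjustments the text describes, counting CWND in segments; the function names are my own:

```python
# The two reactions to loss described above (segment counts, not bytes).
def on_timeout(cwnd, ssthresh):
    # classic reaction: ssthresh = cwnd/2, restart slow start from 1
    return 1, cwnd // 2            # (new cwnd, new ssthresh)

def on_triple_dup_ack(cwnd, ssthresh):
    # fast-recovery style: ssthresh = cwnd/2, cwnd = ssthresh + 3
    ssthresh = cwnd // 2
    return ssthresh + 3, ssthresh  # (new cwnd, new ssthresh)

print(on_timeout(16, 8))           # cwnd collapses to 1
print(on_triple_dup_ack(16, 8))    # cwnd only drops to 11
```

With CWND at 16, a timeout drops it to 1 while three duplicate ACKs only drop it to 11, which is why the second curve in the graph recovers so much faster.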

Congestion Control – TCP BBR Congestion Algorithm

TCP's congestion control slows the rate down, which matters when delay matters. But both phenomena it is designed to avoid are problematic as signals:

  1. Packet loss. A lost packet does not necessarily mean the network channel is full; loss can occur even when bandwidth is still available
  2. Timing. Classic congestion control sees no loss until the intermediate devices' caches are full, by which point slowing down is too late. TCP should only fill the network path itself; it should not keep pushing until even the caches are filled.

To address these two problems, the TCP BBR algorithm was developed. It tries to find an equilibrium: continually probing the sending rate so as to fill the pipe without filling the intermediate devices' caches, thereby avoiding added latency. At that equilibrium point, both high bandwidth and low delay can be achieved.

Sequence problems and packet loss problems

What follows describes the problems TCP faces in guaranteeing ordering. In short, a packet loss changes the order in which packets are received, so the packets that did not arrive must be resent. There are two ways to decide when to resend: timeout retransmission and fast retransmission.

This section reuses the figures from the data-structure discussion above: the top one shows the sender, the bottom one the receiver.

At the sender: packets 1, 2, and 3 have been sent and acknowledged; 4 through 9 have been sent but not acknowledged; 10, 11, and 12 are ready to send but not yet sent; the packets after that cannot be sent because the receiver lacks the space. At the receiver: 1 through 5 have been ACKed but not yet read by the application; 6 and 7 are awaited; 8 and 9 have arrived but have not been ACKed.

Comparing the two sides: packets 1, 2, and 3 are settled, both sides agree; for 4 and 5 the receiver has sent ACKs that the sender has not yet received, which may be lost or still in transit; 6 through 9 have all been sent, but only 8 and 9 arrived while 6 and 7 did not. They are out of order: 8 and 9 are cached but cannot be ACKed.



Suppose the ACK for 4 arrives, the ACK for 5 is lost, and packets 6 and 7 are lost in transit. The following measures apply:

The first approach is timeout retransmission: set a timer for every packet sent but not yet ACKed, and resend after a certain time. The timeout must not be too short; it must exceed the round-trip time (RTT), or it causes needless retransmissions. Nor should it be too long, or recovery becomes sluggish and access slows down.

Estimating the round-trip time requires TCP to sample RTTs and compute a weighted average, a value that keeps changing with network conditions. Besides sampling the RTT itself, its fluctuation range is also sampled to compute an estimated timeout. Because the retransmission time adapts continually, this is called the adaptive retransmission algorithm.

If after some time 5, 6, and 7 all time out, they are resent. The receiver finds that 5 has already been received and discards the duplicate; on receiving 6, it sends an ACK asking for 7; but unfortunately 7 is lost again. When 7 times out once more and needs retransmitting, TCP's policy is to double the timeout interval: each timeout retransmission sets the next interval to twice the previous one. Two timeouts indicate a poor network environment, which should not be hammered with frequent resends.
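The adaptive estimate plus the doubling rule can be sketched together. The smoothing below follows the standard scheme (RFC 6298 style: smoothed RTT plus four times the RTT variance); the starting values are illustrative:

```python
# Sketch of adaptive RTO estimation: smoothed RTT plus 4x the RTT
# variance (RFC 6298 style), with the doubling-on-timeout rule.
ALPHA, BETA = 1 / 8, 1 / 4   # standard smoothing gains

def update_rto(srtt, rttvar, sample):
    rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
    srtt = (1 - ALPHA) * srtt + ALPHA * sample
    return srtt, rttvar, srtt + 4 * rttvar   # RTO = SRTT + 4*RTTVAR

def backoff(rto):
    return rto * 2       # each timeout doubles the next interval

srtt, rttvar = 100.0, 50.0   # ms, illustrative starting estimates
srtt, rttvar, rto = update_rto(srtt, rttvar, sample=120.0)
print(round(rto, 1))
print(backoff(backoff(rto)))  # two consecutive timeouts quadruple it
```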

The problem with timeout-triggered retransmission is that the timeout can be relatively long, and tracking the RTT and its fluctuation range adds work. Hence the second option: fast retransmission.

The mechanism of fast retransmission: when the receiver gets a segment with a sequence number larger than the next one it expects, it detects a gap in the data stream and sends a duplicate ACK whose value is the expected segment. When the sender receives three such duplicate ACKs, it retransmits the missing segment before its timer expires.

For example, if segments 6 and 8 arrive at the receiver but 7 does not, 7 is likely lost. The receiver sends an ACK for 6, asking for 7 next; each subsequent segment still triggers an ACK for 6, still asking for 7. When the sender receives three duplicate ACKs, it concludes that 7 is indeed lost and resends it immediately, without waiting for the timeout.
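The exchange in this example can be sketched from both ends. This is a toy model of the duplicate-ACK logic only, with made-up function names:

```python
# Toy fast-retransmit exchange: the receiver keeps re-ACKing the last
# in-order point, and three duplicates trigger a resend at the sender.
def acks_for(received_seqs, expected):
    """Receiver side: for each arriving segment, ACK the next expected one."""
    got = set()
    acks = []
    for seq in received_seqs:
        got.add(seq)
        while expected in got:      # advance past any contiguous run
            expected += 1
        acks.append(expected)       # cumulative ACK: next segment wanted
    return acks

def needs_fast_retransmit(acks):
    """Sender side: 3 duplicate ACKs (4 identical in a row) trigger a resend."""
    for i in range(len(acks) - 3):
        if acks[i] == acks[i + 1] == acks[i + 2] == acks[i + 3]:
            return acks[i]          # the segment to retransmit
    return None

acks = acks_for([6, 8, 9, 10], expected=6)   # 7 was lost
print(acks)                                  # [7, 7, 7, 7]
print(needs_fast_retransmit(acks))           # 7
```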

There is also a method called Selective Acknowledgment (SACK). It adds a SACK option to the TCP header, sending the sender a map of what is cached: for example, ACK 6, SACK 8, SACK 9. With the map, the sender can see immediately that 7 is missing.
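The map a SACK option conveys can be sketched as contiguous ranges of out-of-order data. A toy version, with an invented helper name:

```python
# Sketch of SACK reporting: alongside the cumulative ACK, the receiver
# reports contiguous ranges it has cached beyond the next expected byte.
def sack_blocks(received, next_expected):
    """Return contiguous runs of out-of-order segments above next_expected."""
    above = sorted(s for s in received if s > next_expected)
    blocks = []
    for s in above:
        if blocks and s == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], s)   # extend the current run
        else:
            blocks.append((s, s))             # start a new run
    return blocks

# Receiver holds 6, 8, 9; the cumulative ACK asks for 7, SACK reports 8..9,
# so the sender knows only 7 is missing.
print(sack_blocks({6, 8, 9}, next_expected=7))  # [(8, 9)]
```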