I recently reread W. Richard Stevens's TCP/IP Illustrated, sorted out my notes, and decided to write down what I understood. TCP (Transmission Control Protocol) is a connection-oriented, reliable, byte-stream-based transport protocol that ensures data is neither lost nor delivered out of order. It sits at the transport layer, the fourth layer of the OSI network model.

The TCP header

  • Port: Each TCP segment contains a source port number and a destination port number, which identify the application process on the sending and receiving ends. These two values, together with the source and destination IP addresses in the IP header, are sometimes called the socket four-tuple (source IP address, destination IP address, source port, destination port).
  • Sequence Number: Identifies the byte stream sent from the TCP sender to the TCP receiver. It is the sequence number of the first data byte in the segment. The sequence number is an unsigned 32-bit number that wraps around to 0 after reaching 2^32 - 1. TCP provides a full-duplex service to the application layer, which means data can flow independently in both directions. Each end of the connection must therefore maintain a sequence number for the data flowing in each direction.
  • Acknowledgment Number: Similar to the sequence number, but used to acknowledge what has been received: it holds the sequence number of the next byte the receiver expects. Together, the sequence and acknowledgment numbers let TCP detect out-of-order and lost data.
  • TCP Flags:
      • NS: ECN-nonce concealment protection.
      • CWR: Congestion Window Reduced. The sending host sets this flag to indicate that it received a TCP segment with the ECE flag set and has responded through its congestion control mechanism.
      • ECE: ECN-Echo has two roles, depending on the value of the SYN flag. If the SYN flag is set (1), it indicates that the TCP peer is ECN capable. If the SYN flag is clear (0), it indicates that a packet with the Congestion Experienced mark (ECN = 11) in the IP header was received during normal transmission. This tells the TCP sender that the network is congested (or about to be).
      • URG: indicates that the urgent pointer field is significant.
      • ACK: indicates that the acknowledgment field is significant. This flag should be set on every packet after the initial SYN sent by the client.
      • PSH: push function. Asks that buffered data be pushed to the receiving application.
      • RST: resets the connection.
      • SYN: synchronizes sequence numbers. Only the first packet sent from each end should have this flag set. Some other flags and fields change meaning based on this flag: some are valid only when it is set, others only when it is clear.
      • FIN: indicates the last packet from the sender.
    For details on SYN and FIN, see my article "TCP three-way handshake and four-way wave".
  • Window size: TCP flow control is provided by each end of the connection advertising a window size. The field is 16 bits wide, so the window is at most 65535 bytes.
  • Checksum: Calculated by the sender and verified by the receiver. Its purpose is to detect errors introduced during transmission; if verification fails, TCP simply discards the segment. The calculation also covers a pseudo-header built from fields of the IP header. The pseudo-header exists only for the checksum and lets TCP detect segments that were delivered to the wrong destination.
  • Urgent Pointer: Only takes effect when the URG flag is set to 1.
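To make the field layout above concrete, here is a minimal Python sketch that decodes the fixed 20-byte part of a TCP header with the standard `struct` module. The function name `parse_tcp_header`, the dictionary keys, and the example SYN segment are hypothetical, chosen only for illustration; options and checksum verification are left out.

```python
import struct

# Flag names ordered from the least significant bit (FIN) upward.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR", "NS"]


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte portion of a TCP header (options are ignored)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    # Network byte order: two 16-bit ports, 32-bit sequence and acknowledgment
    # numbers, 16 bits of data offset + flags, then window, checksum, urgent pointer.
    (src_port, dst_port, seq, ack,
     offset_and_flags, window, checksum, urgent) = struct.unpack(
        "!HHIIHHHH", segment[:20])
    flag_bits = offset_and_flags & 0x01FF  # low 9 bits hold NS..FIN
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_and_flags >> 12) * 4,  # data offset is in 32-bit words
        "flags": [name for bit, name in enumerate(FLAG_NAMES)
                  if flag_bits & (1 << bit)],
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }


# Hypothetical SYN segment from port 50000 to port 80 (checksum left as 0).
example_syn = struct.pack("!HHIIHHHH", 50000, 80, 1000, 0,
                          (5 << 12) | 0x002, 65535, 0, 0)
header = parse_tcp_header(example_syn)
print(header["flags"], header["header_len"], header["window"])
```

The 16-bit window field is why the largest value that can be decoded here is 65535.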

TCP data transmission

I have already written an article about how a TCP connection is established, so I won't go into detail here. Let's go straight to how TCP data transmission guarantees ordering and copes with packet loss, and how it improves throughput.

  • Delayed ACK: In general, TCP does not send an ACK immediately upon receiving data. It delays the ACK so that it can be sent together with data going in the same direction (the ACK is piggybacked on the data), or so that a single ACK can cover two segments once the second one arrives. A typical implementation uses a delay of up to 200 ms (that is, it waits up to 200 ms for data that the ACK can travel with).
  • Nagle algorithm: Data transmission often involves small packets (for example a 41-byte packet, of which 40 bytes are the TCP and IP headers and only 1 byte is actual data). Many small packets on the network increase the chance of congestion. The Nagle algorithm was proposed to improve transmission efficiency. It requires that a TCP connection have at most one unacknowledged small segment outstanding, and no other small segment may be sent until that acknowledgment arrives. In the meantime TCP collects the small pieces of data and sends them as a single segment when the acknowledgment arrives, which effectively reduces the number of small packets. In scenarios with strict real-time requirements, the Nagle algorithm can cause delays that users notice, so it can be turned off. With the socket API this is done with the TCP_NODELAY option, and the tcp_nodelay directive in Nginx sets the same option.
  • Retransmission: TCP retransmits data to recover from loss. A timeout retransmission is the more serious case: the timer expired and no acknowledgment of the data arrived at all. That is why a timeout retransmission enters slow start, while fast retransmission does not.
      • Timeout retransmission: The TCP sender maintains a retransmission timer whose timeout value is called the RTO. The RTO is calculated from the round trip time (RTT); see RFC 6298 for the algorithm. After a timeout, TCP retransmits the data and then enters slow start in congestion control (more on congestion control later).
      • Fast retransmission: Triggered by receiving three duplicate ACKs. (When the receiver gets data out of order, it re-sends the ACK for the last in-order data it received.) Duplicate ACKs mean that later data is still getting through and that one segment may be missing for some other reason, such as a route change far along the path or a simple loss. This is less serious, so TCP does not enter slow start but enters fast recovery instead. After receiving the duplicate ACKs, TCP retransmits the segment that the ACK is asking for. The sender cannot tell which later segments have already arrived, so segments that were delivered correctly may be sent again, which degrades TCP performance.
      • Selective Acknowledgement (SACK): Developed to improve this situation. With the SACK option the receiver tells the sender which data it has received, so the sender knows exactly what was lost and can immediately retransmit only the missing part.
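The RTO calculation mentioned above can be sketched as follows. This is a minimal illustration of the RFC 6298 update rules (smoothed RTT, RTT variance, and exponential backoff on timeout), not a full implementation: the class name `RtoEstimator`, the default clock granularity, and the 60-second upper bound are assumptions made for the example.

```python
class RtoEstimator:
    """Sketch of the RFC 6298 retransmission timeout computation (seconds)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, granularity=0.001, min_rto=1.0, max_rto=60.0):
        self.granularity = granularity  # assumed clock granularity G
        self.min_rto = min_rto          # RFC 6298 rounds the RTO up to 1 second
        self.max_rto = max_rto          # assumed upper bound for this sketch
        self.srtt = None                # smoothed round trip time
        self.rttvar = None              # round trip time variation
        self.rto = 1.0                  # initial RTO before any measurement

    def on_rtt_sample(self, r):
        """Feed one RTT measurement and return the updated RTO."""
        if self.srtt is None:
            # First measurement: initialise both estimators from it.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR must be updated with the old SRTT, so it comes first.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto

    def on_timeout(self):
        """Back off: double the RTO after the retransmission timer expires."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto


estimator = RtoEstimator()
print(estimator.on_rtt_sample(0.5))  # first sample
print(estimator.on_rtt_sample(0.5))  # a steady RTT shrinks the variance term
print(estimator.on_timeout())        # a timeout doubles the RTO
```

Because the variance term dominates when the RTT is jittery, a stable path ends up with an RTO close to the RTT, while an unstable one keeps a wider safety margin.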

TCP sliding window

  • The sliding window

    During data transfer, each end of a TCP connection maintains a window that tells the peer how much data it can currently accept. If the receiver's window size is 0, the sender stops sending. It is called a sliding window because it is not fixed: it changes dynamically (it can open, close, and shrink). It ensures reliable delivery of data, it ensures in-order delivery of data, and it enforces flow control between sender and receiver.


    In the picture above, we can see:

    On the sender, LastByteAcked points to the last byte that the receiver has acknowledged in sequence, and LastByteSent points to the last byte that has been sent but not yet acknowledged.

    On the receiver, NextByteExpected marks the end of the last contiguous run of data received, and LastByteRcvd points to the last byte received. The white gaps between them represent data that has not arrived yet.

    Here’s a schematic of a sliding window:

    SND.UNA: Sequence number of the first byte of data that has been sent but not yet acknowledged. This marks the first byte of category #2; all earlier sequence numbers belong to category #1.

    SND.NXT: Sequence number of the next byte of data to be sent to the other device (in this case, the server). This marks the first byte of category #3.

    SND.WND: Size of the send window. Recall that the window specifies the total number of bytes a device may have outstanding (unacknowledged) at any time. Therefore, adding the send window size (SND.WND) to the sequence number of the first unacknowledged byte (SND.UNA) gives the first byte of category #4.

    #1 indicates data that has been sent and acknowledged, so the window has moved to the right past it. The black box represents the window.

    #2 indicates data that has been sent but not yet acknowledged.

    #3 indicates data that has not been sent yet but that the receiver is ready to accept.

    #4 indicates data that cannot be sent yet because the receiver has no room to accept it.
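The four categories follow directly from the three send-side variables, which a short Python sketch can show. The function names and the example values (SND.UNA = 32, SND.NXT = 46, SND.WND = 20) are made up for illustration, and sequence number wraparound is deliberately ignored to keep the comparison simple.

```python
def classify_byte(seq, snd_una, snd_nxt, snd_wnd):
    """Return the category (1-4) of sequence number `seq` on the sender.

    Assumes sequence numbers do not wrap around within the range examined.
    """
    if seq < snd_una:
        return 1  # sent and acknowledged
    if seq < snd_nxt:
        return 2  # sent, not yet acknowledged
    if seq < snd_una + snd_wnd:
        return 3  # not sent, but the receiver is ready to accept it
    return 4      # not sent, and outside the window


def usable_window(snd_una, snd_nxt, snd_wnd):
    """Bytes the sender may still send right now (category #3)."""
    return max(0, snd_una + snd_wnd - snd_nxt)


# Hypothetical state: bytes 32..45 are in flight and the window is 20 bytes.
snd_una, snd_nxt, snd_wnd = 32, 46, 20
for seq in (31, 32, 45, 46, 51, 52):
    print(seq, classify_byte(seq, snd_una, snd_nxt, snd_wnd))
print("usable:", usable_window(snd_una, snd_nxt, snd_wnd))
```

When an ACK arrives, SND.UNA advances, so the right edge SND.UNA + SND.WND moves right as well; that movement is the "sliding".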

Here is a schematic of TCP window sliding:

  • TCP implements flow control by letting the receiver advertise the window. By advertising a zero window the receiver can stop the sender, which then holds its data until the window becomes non-zero. One problem, however, is that the window update sent by the receiver may be lost, which would leave the sender waiting forever for the window to open. To solve this, TCP uses the persist timer: the sender periodically probes the receiver to find out whether the window has been updated.

    Window-based flow control can also lead to a condition called Silly Window Syndrome (SWS). When it occurs, small amounts of data are exchanged over the connection instead of full-size segments. Either end can cause it: the receiver can advertise a small window (instead of waiting until a larger window can be advertised), or the sender can send a small amount of data (instead of waiting for more data so that it can send a large segment). Either end can also take steps to avoid it:

    1. The receiver does not advertise small windows. The usual algorithm is that the receiver does not advertise a window larger than the one it is currently advertising (which can be 0) unless the window can be increased by one full-size segment (that is, the MSS it receives) or by half of the receiver's buffer space, whichever is smaller.

    2. The sender avoids Silly Window Syndrome by sending data only when one of the following conditions is met: (a) a full-size segment can be sent; (b) a segment at least half the size of the largest window the receiver has ever advertised can be sent; (c) everything we have can be sent, and either no ACK is expected (that is, we have no unacknowledged data outstanding) or the Nagle algorithm is disabled on the connection.
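The sender-side conditions (a) to (c) can be written as a small decision function. This is a sketch under stated assumptions: the function name `sender_may_send` and its parameters are hypothetical, and `max_peer_window` stands for the largest window the peer has advertised so far.

```python
def sender_may_send(pending_bytes, usable_window, mss, max_peer_window,
                    unacked_bytes, nagle_enabled=True):
    """Decide whether the sender should transmit now (sender-side SWS avoidance).

    pending_bytes   -- data queued by the application but not yet sent
    usable_window   -- bytes the receiver's window still allows us to send
    mss             -- size of a full-size segment
    max_peer_window -- largest window the peer has advertised so far
    unacked_bytes   -- data sent but not yet acknowledged
    """
    sendable = min(pending_bytes, usable_window)
    if sendable <= 0:
        return False
    if sendable >= mss:
        return True   # (a) a full-size segment can be sent
    if sendable >= max_peer_window / 2:
        return True   # (b) at least half of the peer's largest window
    if sendable == pending_bytes and (unacked_bytes == 0 or not nagle_enabled):
        return True   # (c) everything fits, and no ACK is pending or Nagle is off
    return False


# 100 bytes queued while 1000 bytes are still unacknowledged: Nagle holds them back.
print(sender_may_send(100, 2000, 1460, 65535, unacked_bytes=1000))
# The same 100 bytes go out at once when nothing is in flight.
print(sender_may_send(100, 2000, 1460, 65535, unacked_bytes=0))
```

Condition (c) is where the Nagle algorithm and SWS avoidance meet: with Nagle disabled, small writes are sent as soon as the window allows.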

TCP congestion control

TCP not only controls end-to-end data transmission but also reacts to conditions on the network. This makes TCP quite intelligent: it adjusts its sending rate to the state of the network. When the network is clear it sends faster, and when the network is congested it slows down. There are four main congestion control algorithms: slow start, congestion avoidance, fast retransmission, and fast recovery.

  • Slow start and congestion avoidance

    The slow start and congestion avoidance algorithms must be used by the TCP sender to control the amount of data it injects into the network. To implement them, two variables are added to the state of each TCP connection. The congestion window (cwnd) is a sender-side limit on the amount of data that can be sent into the network before an acknowledgment (ACK) is received, and the receiver's advertised window (rwnd) is a receiver-side limit on the amount of outstanding data. The minimum of cwnd and rwnd governs data transmission. A third state variable, the slow start threshold (ssthresh), determines whether slow start or congestion avoidance is used to control data transmission.

    When TCP starts sending into a network whose conditions are unknown, it has to probe the network slowly to determine the available capacity, so that it does not cause congestion by suddenly sending a large burst of data. At the beginning of slow start cwnd is 1 segment, and each ACK that acknowledges new data increases cwnd by at most SMSS (sender maximum segment size) bytes. Slow start is used when cwnd < ssthresh, and congestion avoidance is used when cwnd > ssthresh. When cwnd and ssthresh are equal, the sender can use either.

    When congestion occurs, ssthresh is set to half the current window size (the minimum of cwnd and the receiver's advertised window, but at least 2 segments). If the congestion was detected by a timeout retransmission, cwnd is also set to 1 segment, which is slow start. Slow start is not actually slow: it grows exponentially, it just starts from a low value. Once cwnd reaches ssthresh, TCP switches to congestion avoidance, which grows linearly.

In this figure we can clearly see that ssthresh initially equals 8 MSS. The congestion window grows exponentially during slow start and reaches ssthresh on the third transmission round. It then climbs linearly until a loss (timeout) occurs, just after transmission round 7. When the loss occurs, the congestion window is 12 MSS, so ssthresh is set to 6 MSS and cwnd is set to 1 MSS, and the process continues.
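The numbers in this walkthrough can be reproduced with a small round-by-round simulation. This is a simplified sketch, not how a real TCP stack is structured: it counts the window in whole segments, advances one step per round trip instead of per ACK, assumes the receiver's window never limits the sender, and treats every loss as a timeout. The function name `simulate_cwnd` and its parameters are hypothetical.

```python
def simulate_cwnd(initial_ssthresh, loss_rounds, rounds):
    """Return the congestion window (in segments) used in each transmission round.

    loss_rounds is the set of round indexes (starting at 0) in which a
    timeout occurs after the window for that round has been sent.
    """
    cwnd, ssthresh = 1, initial_ssthresh
    history = []
    for current_round in range(rounds):
        history.append(cwnd)
        if current_round in loss_rounds:
            # Timeout: remember half the window (at least 2 segments)
            # and restart from one segment in slow start.
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: the window doubles every round trip,
            # capped at ssthresh where congestion avoidance takes over.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: one extra segment per round trip.
            cwnd += 1
    return history


print(simulate_cwnd(initial_ssthresh=8, loss_rounds={7}, rounds=12))
# -> [1, 2, 4, 8, 9, 10, 11, 12, 1, 2, 4, 6]
```

The output shows the exponential climb to 8, the linear climb to 12, and the restart from 1 with the new threshold of 6.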

  • Fast retransmission and fast recovery

    When a receiver gets data out of order, it should immediately reply with a duplicate ACK. The purpose of this ACK is to tell the sender that an out-of-order segment has arrived and which sequence number the receiver is expecting. The sender may receive duplicate ACKs for several reasons, such as a lost segment or reordering of data by the network. After receiving three duplicate ACKs (four identical ACKs including the first one), TCP retransmits the segment that appears to have been lost without waiting for the retransmission timer to expire. Because the situation on the network is not as bad as with a timeout retransmission, TCP does not enter slow start and enters fast recovery instead. Fast recovery first sets ssthresh to half of the current window (usually rounded to a multiple of the segment size), and then sets cwnd = ssthresh + 3 segments, one segment for each duplicate ACK received.

    In this figure, we can see that after three duplicate ACKs, cwnd does not drop back to slow start; TCP performs fast retransmission and fast recovery instead. At the second loss, which is a timeout retransmission, TCP enters slow start with cwnd set to 1.
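The duplicate-ACK handling described above can be sketched as a tiny state machine, loosely following the fast retransmit and fast recovery steps in RFC 5681. It is an illustration under simplifying assumptions: an ACK counts as a duplicate merely because it repeats the previous acknowledgment number, normal window growth on new ACKs is omitted, and the class name `FastRecoverySender`, the `retransmitted` list, and the SMSS value of 1460 bytes are made up for the example.

```python
SMSS = 1460  # assumed sender maximum segment size in bytes


class FastRecoverySender:
    """Tracks duplicate ACKs and adjusts cwnd and ssthresh (both in bytes)."""

    def __init__(self, cwnd, ssthresh):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.last_ack = None
        self.dup_acks = 0
        self.in_fast_recovery = False
        self.retransmitted = []  # sequence numbers resent by fast retransmission

    def on_ack(self, ack_no, flight_size):
        """Process one ACK; flight_size is the amount of unacknowledged data."""
        if ack_no != self.last_ack:
            # New data was acknowledged: leave fast recovery and deflate cwnd.
            if self.in_fast_recovery:
                self.cwnd = self.ssthresh
                self.in_fast_recovery = False
            self.last_ack = ack_no
            self.dup_acks = 0
            return
        self.dup_acks += 1
        if self.in_fast_recovery:
            # Each further duplicate ACK means one more segment left the network.
            self.cwnd += SMSS
        elif self.dup_acks == 3:
            # Third duplicate ACK: halve, retransmit the missing segment, inflate.
            self.ssthresh = max(flight_size // 2, 2 * SMSS)
            self.retransmitted.append(ack_no)
            self.cwnd = self.ssthresh + 3 * SMSS
            self.in_fast_recovery = True


sender = FastRecoverySender(cwnd=10 * SMSS, ssthresh=64 * SMSS)
sender.on_ack(1000, flight_size=10 * SMSS)      # ordinary ACK
for _ in range(3):
    sender.on_ack(1000, flight_size=10 * SMSS)  # three duplicates
print(sender.retransmitted, sender.ssthresh // SMSS, sender.cwnd // SMSS)
```

Compared with a timeout, cwnd here drops only to about half of its previous value rather than to one segment, which is why fast recovery keeps throughput much higher after an isolated loss.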

Conclusion

The attempt to explain TCP in minimal text was not very successful. TCP has been around for decades, and several books could be written about it. You can treat this article as an index and skim through it quickly. Here is a list of the documents I referred to while writing it, all of which are well worth reading:

  • TCP Congestion Control
  • Transmission Control Protocol
  • TCP Sliding Window
  • TCP/IP Guide
  • RFC 5681