TCP is a complex protocol, and the next two articles will briefly introduce some of the key points in TCP.

This article introduces the TCP state machine and the retransmission mechanism; flow control and congestion control are described in the next article.

Most of this article is based on "The TCP thing (Part 1)" (see the references). Some of the ideas differ from the original, and explanations have been added for the important points.

Prerequisites

Some network basics

In the seven-layer OSI model, TCP sits at layer 4 (the transport layer), IP at layer 3 (the network layer), and ARP at layer 2 (the data link layer). Data at layer 2 is called a frame, data at layer 3 a packet, and data at layer 4 a segment.

Application-layer data is first encapsulated in a TCP segment, the TCP segment is encapsulated in an IP packet, and the IP packet in an Ethernet frame. After the frame reaches the peer, each layer parses its own protocol and hands the payload to the next higher-layer protocol for processing.

TCP header format

[Figure: TCP header format]

Note:

  • A TCP segment carries no IP addresses; those belong to the IP layer. It does carry a source port and a destination port.
  • A TCP connection is identified by a four-tuple (src_ip, src_port, dst_ip, dst_port). Strictly speaking it is a five-tuple that also includes the protocol, but since this article only discusses TCP, the four-tuple is used here.
  • Notice four very important fields in the picture above:
    • Sequence Number (Seq): the sequence number of the segment, used to deal with out-of-order delivery (reordering).
    • Acknowledgement Number (Ack): confirms receipt of data. For a segment that consumes one sequence number, such as a SYN, Ack = Seq + 1, meaning that everything up to and including Seq has been received and Seq + 1 is expected next. Used to deal with packet loss.
    • Window, also called the Advertised Window: can be roughly understood as the sliding window, used for flow control.
    • TCP Flags: distinguish the types of segments, such as SYN, FIN, and RST, and drive the TCP state machine.
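To make the fields above concrete, here is a minimal Python sketch that unpacks the fixed 20-byte part of a TCP header. The function name `parse_tcp_header` and the sample values are made up for illustration, and TCP options are ignored.

```python
import struct

FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]

def parse_tcp_header(raw: bytes) -> dict:
    """Parse the fixed 20-byte part of a TCP header (options are ignored)."""
    (src_port, dst_port, seq, ack,
     offset_flags, window, _checksum, _urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    flag_bits = offset_flags & 0xFF            # low 8 bits hold the flags
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_flags >> 12) * 4,  # data offset is counted in 32-bit words
        "flags": [n for i, n in enumerate(FLAG_NAMES) if flag_bits & (1 << i)],
        "window": window,
    }

# A hand-built SYN header; all values are invented for this example.
syn = struct.pack("!HHIIHHHH", 54321, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
header = parse_tcp_header(syn)
print(header["flags"], header["seq"], header["window"])
```

Note that the header contains ports but no IP addresses, which matches the first point in the list above.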

Refer to the following figure for other fields:

TCP state machine

In fact, network transmission itself has no connections. What TCP calls a "connection" is simply connection state maintained on both sides of the communication, which makes it look as if a connection exists. TCP state transitions are therefore very important.

The following is a simplified comparison of “TCP state machine” and “TCP three-way handshake establishment, data transfer, and four-way handshake disconnection”. Both figures essentially depict the TCP state machine, but the scenario is slightly different. These two diagrams are very important to remember.

TCP state machine, no distinction between client and server:

The following figure shows the classic "TCP three-way handshake establishment + data transfer + four-way wave disconnection". The client initiates the handshake, transmits data to the server (the server does not transmit data to the client), and finally initiates the disconnect:

Three-way handshake and four-way wave

Many people ask: why does it take a three-way handshake to establish a connection and a four-way wave to close it?

Three-way handshake to establish a connection

Its main job is to initialize the Sequence Number on both sides.

Each side tells the other its ISN (Initial Sequence Number), which is why the packet is called SYN (Synchronize Sequence Numbers). These are x and y in the figure above. During later data transfer, sequence numbers increase on the sending side and the receiving side reassembles the data in ascending order, so that the data delivered to the application layer is not out of order because of network problems.

Four-way wave to close a connection

It is actually two waves on each side.

Because TCP is full-duplex, the client and the server each use their own resources to send segments (over the same channel, with Seq and Ack flowing in both directions). Each side therefore has to close its own sending direction (by sending a FIN to the other side) and confirm that the other side has closed its direction (by replying with an ACK). Both sides can close actively at the same time, or one side can close actively and drive the other side to close passively. Usually one side is active and the other passive (in the figure, the client is active and the server is passive), so it looks like four waves.

If both sides actively close at the same time, both enter the CLOSING state, then TIME_WAIT, and finally CLOSED. The diagram below shows both sides closing actively at the same time (the Simultaneous Close branch of the TCP state machine):
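In code form, the close paths described above can be traced with a small transition table. This is only a sketch of the teardown part of the state machine; the event names (`close`, `recv_fin`, `recv_ack`, `timeout_2msl`) are labels chosen for this example, not names from any TCP implementation.

```python
# (state, event) -> next state, teardown part only.
CLOSE_TRANSITIONS = {
    ("ESTABLISHED", "close"): "FIN_WAIT_1",      # active side sends FIN
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",     # active side replies with ACK
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",       # simultaneous close
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",   # passive side replies with ACK
    ("CLOSE_WAIT", "close"): "LAST_ACK",         # passive side sends its own FIN
    ("LAST_ACK", "recv_ack"): "CLOSED",
}

def trace(events, state="ESTABLISHED"):
    """Return the list of states visited while applying the events in order."""
    path = [state]
    for event in events:
        state = CLOSE_TRANSITIONS[(state, event)]
        path.append(state)
    return path

active = trace(["close", "recv_ack", "recv_fin", "timeout_2msl"])
passive = trace(["recv_fin", "close", "recv_ack"])
simultaneous = trace(["close", "recv_fin", "recv_ack", "timeout_2msl"])
print(" -> ".join(simultaneous))
```

Only the side that closes actively passes through TIME_WAIT. In a simultaneous close both sides are active, so both go through CLOSING and TIME_WAIT.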

Other problems with the handshake

SYN timeout during connection establishment

After receiving a SYN from a client, the server replies with a SYN-ACK (Ack1). If the client goes offline (or the network times out), the server never receives the client's Ack of that SYN-ACK (Ack2), and the connection is left in an intermediate state (neither established nor failed).

To resolve this intermediate state, the server resends Ack1 if it does not receive Ack2 within a certain time (this is different from the retransmission mechanism used during data transfer). On Linux, Ack1 is by default sent at most six times: the first transmission plus five retries. The retry interval starts at 1s and doubles each time (exponential backoff), so the five retries are sent after waits of 1s, 2s, 4s, 8s, and 16s, and after the fifth retry the server waits another 32s before deciding that it, too, has timed out. TCP therefore only declares the SYN timed out and drops the connection after 1s + 2s + 4s + 8s + 16s + 32s = 2^6 - 1 = 63s.
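The 63-second figure can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function name and its parameters are invented for this example, with defaults taken from the paragraph above.

```python
def synack_timeline(retries: int = 5, first_interval: float = 1.0):
    """Return (list of waits, total seconds) before the server gives up.

    There is one wait after the initial transmission and one after each retry,
    and every wait is twice as long as the previous one.
    """
    waits = [first_interval * 2 ** i for i in range(retries + 1)]
    return waits, sum(waits)

waits, total = synack_timeline()
print(waits, total)
```

With fewer retries the total drops quickly, for example to 1s + 2s + 4s = 7s for two retries.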

SYN Flood attack

The SYN timeout mechanism can be abused to launch a SYN Flood attack: the attacker sends a SYN to the server and then goes offline immediately, so the server ties up resources for 63 seconds by default. SYNs can be sent very quickly, so an attacker can easily exhaust the server's SYN queue and leave it unable to handle normal new connections.

Linux provides the tcp_syncookies parameter to deal with this problem. When the SYN queue is full, TCP builds a special Sequence Number, called a SYN Cookie, from the source address and port, the destination address and port, and a timestamp, and sends it back. If the connection is a normal one, the client sends the SYN Cookie back, and the server can then establish the connection from the SYN Cookie (even though the connection is not in the SYN queue). Connections already in the SYN queue are left alone until they time out and are closed. Note that tcp_syncookies should not be used to cope with normal heavy load, because SYN cookies essentially bypass the SYN timeout mechanism of connection establishment and are a compromised version of TCP.
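The idea behind a SYN Cookie can be illustrated with a toy example. This is not the algorithm Linux uses; it only shows how a server could derive a 32-bit value from the four-tuple and a coarse timestamp, and later verify it without keeping any per-connection state. `SECRET`, `make_cookie`, `check_cookie`, and the notion of a "time slot" are all assumptions made for this sketch.

```python
import hashlib
import hmac
import struct

SECRET = b"hypothetical-server-secret"  # assumption: some key known only to the server

def make_cookie(src_ip, src_port, dst_ip, dst_port, time_slot):
    """Derive a 32-bit value from the four-tuple and a coarse timestamp."""
    msg = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}@{time_slot}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return struct.unpack("!I", digest[:4])[0]

def check_cookie(cookie, src_ip, src_port, dst_ip, dst_port, now_slot, max_age=1):
    """Accept the cookie if it matches the current or a recent time slot."""
    return any(
        cookie == make_cookie(src_ip, src_port, dst_ip, dst_port, now_slot - age)
        for age in range(max_age + 1)
    )

cookie = make_cookie("10.0.0.1", 40000, "10.0.0.2", 80, time_slot=100)
print(check_cookie(cookie, "10.0.0.1", 40000, "10.0.0.2", 80, now_slot=101))
```

Because the server can recompute the value from the returning packet alone, it does not need an entry in the SYN queue, which is exactly why a full queue no longer blocks legitimate clients.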

For normal connection requests, there are three other parameters to choose from:

  • tcp_synack_retries: sets the number of retries after a SYN timeout
  • tcp_max_syn_backlog: sets the maximum number of SYN connections (the SYN queue capacity)
  • tcp_abort_on_overflow: rejects the connection when the SYN request cannot be handled

ISN synchronization

  • First, a static ISN cannot be used. Suppose every connection uses 1 as the ISN. The client sends 30 segments (say one byte per segment), but the network goes down; the client reconnects and again uses 1 as the ISN, and then old segments from the previous connection (so-called "stray duplicates") arrive. Since the four-tuple that identifies the connection is the same (the new connection is called an "incarnation" of the old one), the server treats them as segments of the new connection.
  • It follows from this example that the ISN must grow dynamically, so that the ISN of a new connection is larger than that of the old one.
  • Finally, for security reasons, the growth of the ISN must not be predictable (for example, simply following a clock). If the growth pattern is too simple, it is easy to forge an ISN and attack either end of the connection.

As a result, a variety of ISN generation algorithms have been designed; in general they make the ISN grow over time with a certain amount of randomness. RFC793 describes a simple one: the ISN is tied to a fictitious clock that increments it every 4 microseconds until it exceeds 2^32 and wraps around to 0. RFC793 gives the resulting ISN cycle as about 4.55 hours (computing 2^32 × 4 µs directly gives about 4.77 hours). Define the Maximum Segment Lifetime (MSL) as the longest time a segment may survive in the network; a segment that has lived longer than MSL is discarded. So if the ISN algorithm from RFC793 is used, MSL must be less than 4.55 hours to ensure that ISNs are not reused by adjacent connections (TIME_WAIT also serves this purpose). This also indirectly limits the size of the network (although an MSL of 4.55 hours already allows very large networks).
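The cycle length of the clock-driven ISN can be checked directly. The sketch below models only the simple 4-microsecond clock described above; `isn_at` is a name invented for this example, and real implementations add randomness on top.

```python
TICK_MICROSECONDS = 4      # the ISN clock advances once every 4 microseconds
SEQ_SPACE = 2 ** 32        # sequence numbers are 32 bits wide

def isn_at(microseconds: int) -> int:
    """ISN produced by the clock `microseconds` after it started at 0."""
    return (microseconds // TICK_MICROSECONDS) % SEQ_SPACE

# Time for the ISN to wrap around once, in hours.
period_hours = SEQ_SPACE * TICK_MICROSECONDS / 1e6 / 3600
print(round(period_hours, 2))
```

Straight arithmetic gives roughly 4.77 hours, slightly more than the approximate figure quoted from RFC793.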

MSL should be at least as long as the time equivalent of the IP TTL. RFC793 sets MSL to 2 minutes; Linux follows the Berkeley convention and uses 30 seconds.

Other problems with the wave

About TIME_WAIT

Why TIME_WAIT

In the TCP state machine, the transition from TIME_WAIT to CLOSED happens after a timeout of 2 * MSL. Why is a TIME_WAIT state with a 2 * MSL timeout needed? There are two main reasons:

  • 2 * MSL ensures there is enough time for the passive side to receive the Ack, or for the active side to receive the FIN that the passive side retransmits on timeout. That is, if the passive side does not receive the Ack, it retransmits the FIN; the Ack travelling one way plus the retransmitted FIN travelling back take at most exactly 2 MSL. When a connection in TIME_WAIT receives a retransmitted FIN, it resends the Ack and waits another 2 MSL.
  • It ensures there is enough time for "stray duplicate segments" to expire and be discarded. This only takes 1 * MSL: any segment older than MSL is discarded. Otherwise such segments could easily get mixed up with the data of a new connection (relying on the ISN alone is not enough).

TIME_WAIT appears in large numbers

A common problem is the large scale occurrence of TIME_WAIT, usually in high-concurrency short connection scenarios, which can consume a lot of resources.

Most articles on the web tell you to turn on one of two parameters, tcp_tw_reuse or tcp_tw_recycle. Both are off by default, and tcp_tw_recycle is more aggressive than tcp_tw_reuse. To use either, tcp_timestamps must also be on (it is on by default), otherwise they have no effect. However, turning these parameters on can cause strange problems with TCP connections: as mentioned above, if a connection is reused without waiting out the timeout, data from the old and new connections can get mixed up. For example, the new connection may be reset if a FIN from the old connection arrives during the handshake. These two parameters should therefore be used with care.

Parameters are detailed as follows:

  • tcp_tw_reuse: the official documentation says that tcp_tw_reuse combined with tcp_timestamps is safe from the protocol's point of view, but only on the client side, and tcp_timestamps must be enabled on both ends.
  • tcp_tw_recycle: if tcp_tw_recycle is enabled, the peer is assumed to have tcp_timestamps enabled. The timestamps are then compared, and if the timestamp has increased, the connection can be reused.

One more parameter:

  • tcp_max_tw_buckets: limits the number of concurrent TIME_WAIT connections (default 180,000). If the limit is exceeded, the system destroys the excess TIME_WAIT connections and writes a warning to the log (e.g. "Time wait bucket table overflow"). The official documentation states that this parameter is meant as a defense against DDoS attacks, so whether to change it depends on the actual situation.

TIME_WAIT advice

In summary, TIME_WAIT occurs on the active side of the wave: whoever initiates the wave sacrifices resources to maintain connections that are waiting to move from TIME_WAIT to CLOSED. TIME_WAIT exists for good reasons, so it is better to optimize for it per service than to avoid it by bending the protocol as described above.

For HTTP servers, you can enable HTTP KeepAlive so that one TCP connection is reused at the application layer for multiple HTTP requests (this requires cooperation from the browser), and let the client (i.e. the browser) initiate the wave, so that TIME_WAIT only occurs on the client side.

An example

The following is a Wireshark graph that I captured while accessing coolshell.cn; it shows how Seq and Ack change:

As you can see, the growth of Seq and Ack follows the number of bytes transferred. In the figure above, after the three-way handshake, two packets with Len: 1440 arrive, so the first packet has Seq 1 and the second has Seq 1441. Then the first Ack (1441) is received, indicating that bytes 1 to 1440 have been received and Seq 1441 is expected next. You can also see that a single packet can carry both Seq and Ack, that is, data and an acknowledgement in one transfer.
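The numbers in this capture follow a simple rule: the Ack for a data segment is its Seq plus its length. A minimal sketch, using the relative sequence numbers from the description above (the function name is made up for this example):

```python
SEQ_SPACE = 2 ** 32  # sequence numbers wrap around after 32 bits

def expected_ack(seq: int, length: int) -> int:
    """Ack for a data segment: the number of the next byte the receiver expects."""
    return (seq + length) % SEQ_SPACE

first_ack = expected_ack(1, 1440)           # Ack for the first packet
second_ack = expected_ack(first_ack, 1440)  # the second packet starts at Seq 1441
print(first_ack, second_ack)
```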

If you watch the three-way handshake in Wireshark, the ISN always appears to be 0. It is not really 0: Wireshark displays relative sequence numbers because they are easier to read. To see the absolute sequence numbers, uncheck the relative sequence numbers option under Protocol Preferences in the right-click menu.

TCP retransmission mechanism

TCP uses retransmission to ensure that all segments reach the peer, and the sliding window tolerates a certain amount of reordering and packet loss (the sliding window is also used for flow control, which is not discussed for now). Note that the retransmission mechanism here refers specifically to the data transfer phase; retransmission during the handshake and wave phases works differently.

TCP is byte-stream oriented, and Seq and Ack grow in units of bytes. In the most naive implementation, to reduce network traffic, the receiver only acknowledges the last contiguous segment and slides the window accordingly. For example, suppose the sender sends segments 1, 2, 3, 4, and 5 (one byte each). The receiver quickly receives Seq 1 and Seq 2, replies with Ack 3, and slides the window. Then Seq 4 arrives. Because Seq 3 has not been received yet, the receiver only stores Seq 4 in the window if it still fits, without sending Ack 5; if it does not fit, Seq 4 is discarded (the same effect as packet loss). If Seq 4 was stored, then when Seq 3 finally arrives the receiver sees that Seq 4 and everything before it have been received, replies with Ack 5, and slides the window.
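The naive receiver described above can be sketched as follows. This is an illustrative model with one-byte segments, not an excerpt from any TCP stack; the class name `NaiveReceiver` and the `window` parameter are assumptions made for this example.

```python
class NaiveReceiver:
    """Buffers out-of-order data and only acknowledges contiguous data."""

    def __init__(self, first_seq: int = 1, window: int = 4):
        self.expected = first_seq  # next in-order Seq
        self.window = window       # how far ahead of `expected` data may be buffered
        self.buffer = set()        # out-of-order Seqs held in the window

    def receive(self, seq: int):
        """Process one 1-byte segment; return the Ack to send, or None."""
        if seq == self.expected:
            self.expected += 1
            while self.expected in self.buffer:  # drain data that is now contiguous
                self.buffer.remove(self.expected)
                self.expected += 1
            return self.expected
        if self.expected < seq < self.expected + self.window:
            self.buffer.add(seq)   # fill the window, but stay silent
        return None                # duplicate or outside the window: dropped

receiver = NaiveReceiver()
acks = [receiver.receive(seq) for seq in (1, 2, 4, 3)]
print(acks)
```

Seq 4 produces no Ack when it arrives early, and the later arrival of Seq 3 acknowledges both at once with Ack 5.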

Timeout retransmission mechanism

If the sender times out waiting for the Ack of Seq 3 (Ack 4), it considers Seq 3 to have failed and retransmits it. As soon as the receiver gets Seq 3, it returns Ack 4.

The sender cannot tell whether Seq 3 was lost, the receiver failed, or Ack 4 was lost; this article refers to all of these cases as "Seq sending failure".

This approach has a problem. Assume Seq 4 has already been received but Seq 3 has not, so the sender retransmits Seq 3. Until the retransmitted Seq 3 arrives, the receiver cannot acknowledge anything, including the already received Seq 4 and a newly received Seq 5. The sender is therefore likely to retransmit Seq 4 and Seq 5 as well. The receiver already holds Seq 4 and Seq 5 in the window, so retransmitting them is clearly wasteful.

In other words, timeout retransmission faces the choice of "retransmit one or retransmit all":

  • Retransmit one: retransmit only the segment that timed out (Seq 3); later segments are retransmitted only when they time out themselves. Saves resources, but is somewhat less efficient.
  • Retransmit all: retransmit the timed-out segment and all later data each time (i.e. Seq 3, 4, 5). More efficient (if bandwidth is not saturated), but wastes resources.

It can be seen that both methods are timeout retransmission mechanisms and have their own advantages and disadvantages. However, both methods need to wait for timeout and are time-driven, and their performance is closely related to the length of timeout. If timeout is long (the common case), the performance of both methods will suffer significantly.
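The two policies differ only in which segments are selected when a timer fires. A minimal sketch, where the function name and the policy labels "one" and "all" are chosen for this example:

```python
def to_retransmit(unacked, timed_out, policy):
    """Pick the segments to resend when the timer for `timed_out` expires.

    unacked   -- Seq numbers that were sent but not yet acknowledged
    timed_out -- the Seq whose timer fired
    policy    -- "one" or "all"
    """
    if policy == "one":
        return [timed_out]
    if policy == "all":
        return sorted(s for s in unacked if s >= timed_out)
    raise ValueError(f"unknown policy: {policy}")

print(to_retransmit({3, 4, 5}, 3, "one"))
print(to_retransmit({3, 4, 5}, 3, "all"))
```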

Fast retransmission mechanism

A better solution is for some mechanism to make the sender retransmit the lost segment (Seq 3) as soon as possible; Fast Retransmit is such a mechanism. This scheme may waste resources (depending on whether one or all segments are retransmitted, see below), but it is very efficient because there is no need to wait for a timeout.

The fast retransmission mechanism is not time-driven but data-driven: when segments arrive out of order, the receiver acknowledges the earliest segment that may have been lost; when the sender receives the same Ack three times in a row, it retransmits the corresponding Seq.
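The sender-side rule can be sketched with a small counter. The threshold follows this article's wording (the same Ack seen three times in a row); the class name `DupAckCounter` is invented for this example, and real stacks track considerably more state.

```python
class DupAckCounter:
    """Counts identical consecutive Acks and signals a fast retransmit."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.last_ack = None
        self.count = 0

    def on_ack(self, ack: int):
        """Return the Seq to retransmit now, or None."""
        if ack == self.last_ack:
            self.count += 1
        else:
            self.last_ack, self.count = ack, 1
        if self.count >= self.threshold:
            self.count = 0   # clear the counter once the retransmit is triggered
            return ack       # Ack n means the receiver is still waiting for Seq n
        return None

sender = DupAckCounter()
decisions = [sender.on_ack(ack) for ack in (2, 2, 2, 2, 6)]
print(decisions)
```

The retransmit of Seq 2 fires on the third Ack 2, without waiting for any timer.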

For example, suppose the sender again sends five segments: 1, 2, 3, 4, 5. The receiver receives Seq 1 and replies with Ack 2. Seq 2 is then lost in the network. Seq 3 arrives normally and the receiver returns Ack 2 again, since Seq 2 is the earliest segment that may have been lost; Seq 4 and Seq 5 arrive and each triggers another Ack 2. The sender has now received the same Ack (Ack 2) four times in a row (at least three), knows that the earliest segment not received is Seq 2, retransmits Seq 2, and clears the counter for Ack 2. Finally the receiver receives Seq 2, checks the window, finds that Seq 3, 4, and 5 have already arrived, and replies with Ack 6. The schematic diagram is as follows:

Fast retransmission solves the timeout problem, but still faces the choice of "retransmit one or retransmit all". In the example above, should only Seq 2 be retransmitted, or Seq 2, 3, 4, and 5?

If only fast retransmission is used, then everything must be retransmitted, because the sender does not know which segments caused the four consecutive Ack 2s. Suppose the sender sends 20 segments, Seq 1 to Seq 20, and only Seq 1, 6, 10, and 20 reach the receiver, each triggering Ack 2. The sender then retransmits only Seq 2; the receiver receives it and replies with Ack 3. After that, neither side sends anything more, and both ends are stuck waiting. The sender therefore has no choice but to retransmit all, which is what some TCP implementations actually do. When bandwidth is not saturated, this greatly improves retransmission efficiency.

A better design combines timeout retransmission with fast retransmission: when fast retransmission is triggered, only a small range of Seq is retransmitted (following the locality principle, possibly just one Seq), and the remaining Seq are retransmitted on timeout.


Reference:

  • The TCP thing (PART 1)
  • Lots of TIME_WAIT solutions

TCP (1): State machines and retransmission, by monkeysayhi.github.io. This article is published under the Creative Commons Attribution-ShareAlike 4.0 International License.