TCP Protocol Description

This article condenses the essentials of TCP so that readers can build a reasonably complete mental framework of the protocol and know where to start looking when they run into problems in a project.

For more detail on TCP, TCP/IP Illustrated is recommended reading.

Introduction

In contrast to UDP’s lack of reliability, TCP provides a connection-oriented, reliable byte stream service.

It has the following characteristics:

Connection-oriented: Two applications that use TCP must establish a TCP connection before they exchange data.
Reliable transmission:
- Application data is split into segments of a suitable size and handed to IP.
- Each segment sent is covered by an adaptive timeout and retransmission policy.
- When data is received, the destination sends an acknowledgement.
- The destination verifies the checksum over the header and data. If it detects an error, it discards the segment and sends no acknowledgement.
- IP datagrams may arrive out of order, so TCP reorders the received data and delivers it to the application layer in the correct order.
- Because of timeout retransmission, a segment may be received more than once; the TCP receiver discards the duplicate data.
- TCP provides flow control. Each end has a finite buffer space, and the receiving end only allows the sending end to send as much data as the buffer can hold, which prevents buffer overflow.
Byte-stream oriented: TCP does not insert record markers into the byte stream and does not interpret its content; interpretation is left to the application layer.

Header

To learn TCP, first understand the layout of the header and the purpose of each field in it. Approaching each part of the protocol from the header design makes it easier to follow.

Let’s start with the TCP header:

In the header layout, each row is 4 bytes (one 32-bit word), and the first 5 rows make up the fixed 20 bytes of the header. Let’s look at the fields in the header in 4-byte units:

Source port and destination port: 2 bytes each, ranging from 0 to 65535. Together with the source and destination IP addresses in the IP header, these two values uniquely identify a TCP connection.
Sequence number:
- TCP is a reliable transport protocol that tracks every byte in the byte stream.
- A segment may carry more than one byte of data. The sequence number identifies the first data byte in the current segment.
- For security reasons, the initial sequence number of a TCP connection is usually a random number rather than 0.
- The sequence number occupies 4 bytes and is a 32-bit unsigned number. After it reaches the maximum value, it wraps around to 0.
Acknowledgement number:
- Contains the sequence number of the first data byte of the next segment that the receiver expects, that is, the sequence number of the last successfully received data byte plus 1.
- This field is valid only when the ACK flag is 1.
Data offset:
- Identifies where the data starts in the segment, which is also the length of the segment header.
- The field is 4 bits wide and counts in units of 32 bits (one word), so the maximum offset is 15 words, or 60 bytes. The TCP header therefore cannot exceed 60 bytes.
Flags:
- URG: The urgent pointer field is valid, indicating that this segment contains urgent data that should be delivered as soon as possible instead of waiting in the normal queue.
- ACK: The acknowledgement number is valid. TCP requires ACK to be set to 1 in every segment sent after the connection is established.
- PSH (push): The receiver should deliver the segment data to the application layer as soon as possible rather than waiting for the buffer to fill up.
- RST: When set to 1, indicates a serious error on the TCP connection; the connection must be released immediately and then re-established.
- SYN: Synchronizes sequence numbers to initiate a connection.
- FIN: Indicates that the sender has finished sending data and requests that the connection be released.
Window size:
- Used for TCP flow control. The unit is bytes, and the field is 16 bits wide, so the maximum window size is 65535 bytes.
- Indicates the amount of data that the receiver allows the sender to send.
- The effective value of this field is scaled by the window scale option in the optional part of the header.
Checksum:
- 2 bytes. The checksum covers both the header and the data.
- This is a mandatory field that must be computed and stored by the sender and verified by the receiver.
- To compute it, the sender sets the checksum field to 0, calculates the 16-bit one’s complement sum over the header and data, and stores the one’s complement of that sum in the checksum field.
- If the data length is an odd number of bytes, a padding byte of 0 is appended for the calculation only; the padding byte is not transmitted. The calculation also covers a 12-byte pseudo-header that exists solely for computing the checksum.
- The receiver performs the same calculation on arrival. Because its calculation includes the checksum stored in the header, the result should be all 1 bits; otherwise, the segment is discarded.
Urgent pointer: 2 bytes. A positive offset, in bytes, that identifies where the urgent data ends. It is valid only when the URG flag is set.
Optional fields: Variable length, up to 40 bytes. Common options are as follows:
- MSS (Maximum Segment Size): The largest block of data that TCP sends to the other end in a single segment. When a connection is established, each party announces its MSS in its SYN segment to limit the size of the segments that the other end sends to it.
- Window scale option: Widens the definition of the TCP window from 16 bits to 32 bits. Rather than enlarging the header field, it defines a shift count that scales the 16-bit window size. This option may appear only in SYN segments, and the passive opener may send it only if the SYN it received carried the option. If the active opener sends a scale factor but does not receive the option from the other end, it sets its shift count to 0 to stay compatible with older systems that do not understand the option.
- Timestamp option: Lets the sender place a timestamp value in each segment. The receiver echoes that value in its acknowledgement, which allows the sender to calculate the RTT for each ACK received.
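
To make the header fields and the checksum rule more concrete, here is a minimal Python sketch. It assumes IPv4, a 20-byte header with no options, and an urgent pointer of 0. The function names (`build_segment`, `verify_segment`, `parse_header`) are made up for this illustration and do not come from any library.

```python
import struct

# Fixed 20-byte header: ports, sequence, acknowledgement, offset byte,
# flags byte, window, checksum, urgent pointer (network byte order).
HEADER_FORMAT = "!HHIIBBHHH"
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum; odd-length input is padded with a zero byte."""
    if len(data) % 2:
        data += b"\x00"  # padding is used for the calculation only, never sent
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def pseudo_header(src_ip: bytes, dst_ip: bytes, tcp_length: int) -> bytes:
    """12-byte IPv4 pseudo-header: addresses, zero, protocol 6 (TCP), TCP length."""
    return src_ip + dst_ip + struct.pack("!BBH", 0, 6, tcp_length)


def build_segment(src_ip, dst_ip, src_port, dst_port, seq, ack, flags, window, payload=b""):
    """Build header + payload with a valid checksum (no options, urgent pointer 0)."""
    offset_words = 5  # 5 words * 4 bytes = 20-byte header

    def header(checksum):
        return struct.pack(HEADER_FORMAT, src_port, dst_port, seq, ack,
                           offset_words << 4, flags, window, checksum, 0)

    draft = header(0) + payload  # checksum field is 0 while summing
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(draft)) + draft)
    return header(~total & 0xFFFF) + payload


def verify_segment(src_ip, dst_ip, segment: bytes) -> bool:
    """Receiver side: the sum including the stored checksum must be all 1 bits."""
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    return total == 0xFFFF


def parse_header(segment: bytes) -> dict:
    """Decode the fixed part of the header into a dictionary."""
    (src_port, dst_port, seq, ack, offset, flags,
     window, checksum, urgent) = struct.unpack(HEADER_FORMAT, segment[:20])
    return {
        "src_port": src_port, "dst_port": dst_port, "seq": seq, "ack": ack,
        "header_len": (offset >> 4) * 4,  # data offset is counted in 32-bit words
        "syn": bool(flags & SYN), "ack_flag": bool(flags & ACK),
        "fin": bool(flags & FIN), "rst": bool(flags & RST),
        "window": window, "checksum": checksum, "urgent": urgent,
    }
```

For example, `build_segment(bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]), 12345, 80, 1000, 0, SYN, 65535)` produces a 20-byte SYN segment that `verify_segment` accepts; changing any byte afterwards should make verification fail.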

Establish and terminate connections

TCP is a connection-oriented protocol, that is, before sending data, the two parties must establish a connection.

Connection establishment protocol

The TCP connection establishment process is generally called a “handshake”. Because the sender and receiver exchange three segments to establish a connection, it is also called the “three-way handshake”. The specific steps are as follows:

The sender sends a SYN segment indicating that it wants to connect to the receiver, carrying its initial sequence number (ISN).
After receiving the SYN segment, the receiver sends back a segment with both the SYN and ACK flags set to 1. Its acknowledgement number is the sender’s ISN + 1 (the SYN consumes one sequence number), and it carries the receiver’s own ISN.
After receiving that segment, the sender replies with an ACK segment whose acknowledgement number is the receiver’s ISN + 1.
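
The sequence and acknowledgement numbers in these three steps can be traced with a small Python sketch. The dictionary layout and the function name `three_way_handshake` are purely illustrative; only the arithmetic reflects the steps described above.

```python
MOD = 2 ** 32  # sequence numbers are 32-bit unsigned and wrap around


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three handshake segments with their flags, seq and ack values."""
    # Step 1: the active opener sends SYN with its own ISN.
    syn = {"flags": {"SYN"}, "seq": client_isn, "ack": None}
    # Step 2: the passive opener acknowledges ISN + 1 and sends its own ISN.
    syn_ack = {"flags": {"SYN", "ACK"}, "seq": server_isn,
               "ack": (syn["seq"] + 1) % MOD}
    # Step 3: the active opener acknowledges the peer's ISN + 1.
    ack = {"flags": {"ACK"}, "seq": syn_ack["ack"],
           "ack": (syn_ack["seq"] + 1) % MOD}
    return [syn, syn_ack, ack]
```

Calling `three_way_handshake(100, 0xFFFFFFFF)` shows the wrap-around case: the final ACK carries acknowledgement number 0 because the server’s ISN was the largest 32-bit value.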

The end that sends a SYN to establish a connection selects an initial sequence number (ISN) for the connection. The ISN changes over time (it behaves like a counter that is incremented by 1 every 4 microseconds), so each connection has a different ISN.

The end that sends the first SYN performs an active open, and the end that receives this SYN and sends back the next SYN performs a passive open.

Schematic diagram:

Why do we need to exchange three packet segments to establish a connection?

The three segments are exchanged so that each party can confirm the communication capability of both sides:

First: The receiving end learns that the sending end can send and that it can itself receive.
Second: The sending end learns that both parties can send and receive.
Third: The receiving end learns that both parties can send and receive.

What is a SYN flood attack, and how can it be defended against?

The SYN flood is one of the most common DDoS attacks. The attacker sends a large number of forged TCP connection requests to exhaust the resources of the targeted host.

After receiving a SYN, the receiving end enters the SYN_RCVD state. At this point the connection is not fully established, and the receiving end keeps it in a queue called the half-connection queue.

If a client forges a large number of SYN requests with non-existent source IP addresses in a short period of time, the half-connection queue at the receiving end keeps growing, and because its replies are never answered, the receiving end retransmits them until the retry limit is reached. During this period, if the queue holds enough entries, the resources of the receiving end are used up. As a result, it cannot accept legitimate SYN requests, leading to congestion or even a complete outage. This is a typical DoS/DDoS attack.

The following is a brief list of defense methods, which can be adjusted according to the actual business:

Monitor the size of the half-connection queue and raise alarms so that attacks can be handled promptly.
Shorten the connection timeout period.
Increase the length of the half-connection queue appropriately.
Reduce the number of timeout retries.
Limit the number of concurrent connections from a single IP address.
Add per-IP rate limiting and block abnormal IP addresses.
Use the SYN cookies algorithm.
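
To give a feel for the last item, the following Python sketch shows the core idea behind SYN cookies: the server derives its initial sequence number from the connection details and a secret, so it does not need to keep an entry in the half-connection queue. This is a simplified illustration only. Real implementations use a different bit layout and also encode the MSS; the secret, the 64-second time slot, and all function names here are assumptions made for the example.

```python
import hashlib
import hmac
import struct

SECRET = b"hypothetical-server-secret"  # assumption: a random key in practice
SLOT_SECONDS = 64                       # assumption: granularity of cookie lifetime
MOD = 2 ** 32


def _cookie(src_ip, src_port, dst_ip, dst_port, client_isn, slot) -> int:
    """Keyed hash of the connection details, truncated to 32 bits."""
    msg = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{client_isn}|{slot}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return struct.unpack("!I", digest[:4])[0]


def make_cookie(src_ip, src_port, dst_ip, dst_port, client_isn, now) -> int:
    """Value the server would use as its ISN in the SYN+ACK; nothing is stored."""
    return _cookie(src_ip, src_port, dst_ip, dst_port, client_isn,
                   int(now) // SLOT_SECONDS)


def check_ack(src_ip, src_port, dst_ip, dst_port, seq, ack, now) -> bool:
    """Validate the final ACK, which carries seq = client ISN + 1, ack = cookie + 1."""
    client_isn = (seq - 1) % MOD
    cookie = (ack - 1) % MOD
    slot = int(now) // SLOT_SECONDS
    # Accept the current and the previous time slot to tolerate slot boundaries.
    return any(
        hmac.compare_digest(
            struct.pack("!I", _cookie(src_ip, src_port, dst_ip, dst_port, client_isn, s)),
            struct.pack("!I", cookie),
        )
        for s in (slot, slot - 1)
    )
```

Only a client that actually received the SYN+ACK can echo a valid acknowledgement number, so forged SYNs with unreachable source addresses cost the server no queue space in this scheme.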

Connection termination protocol

The TCP connection termination process is generally called “waving”. Establishing a connection takes a three-way handshake, while terminating one takes four waves, that is, the sender and receiver exchange four segments to terminate the connection.

The party that sends the first FIN performs an active close (A), and the other party performs a passive close (B).

A sends a FIN segment to close the data flow from A to B.
After receiving the FIN, B sends back an ACK whose acknowledgement number is the received sequence number + 1 (the FIN consumes one sequence number).
After it has finished sending its data, B sends a FIN segment to A to close the connection.
After receiving the FIN, A sends an ACK whose acknowledgement number is the received sequence number + 1. B closes the connection as soon as it receives this ACK, while A waits for 2MSL after sending it before closing the connection.
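
The states that A and B pass through during these four steps can be written down as a small transition table. The sketch below covers only the normal close described above (no simultaneous close); the state names follow the usual TCP state diagram, while the event strings and the `run` helper are invented for the illustration.

```python
# (current state, event) -> next state, for a normal four-segment close
TRANSITIONS = {
    # Active closer (A)
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN, send ACK"): "TIME_WAIT",
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
    # Passive closer (B)
    ("ESTABLISHED", "recv FIN, send ACK"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def run(events, state="ESTABLISHED"):
    """Apply the events in order and return every state visited."""
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]  # KeyError means an unexpected event
        path.append(state)
    return path


active = run(["send FIN", "recv ACK", "recv FIN, send ACK", "2MSL timeout"])
passive = run(["recv FIN, send ACK", "send FIN", "recv ACK"])
```

Here `active` ends in `CLOSED` only after passing through `TIME_WAIT`, while `passive` reaches `CLOSED` directly from `LAST_ACK`.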

Schematic diagram:

Why does establishing a connection take three handshakes, but terminating one take four waves?

Half-close: One end can still receive data from the other end after it has finished sending. Full duplex: Data can be transmitted in both directions at the same time.

Terminating a connection takes four segments because of TCP’s half-close. A TCP connection is full duplex, so each direction must be closed separately. When one end has finished sending data, it sends a FIN to terminate that direction of the connection. When an end receives a FIN, it must notify its application layer that the other end has stopped sending data.

Why does the active closer wait for 2MSL at the end?

MSL (Maximum Segment Lifetime): The longest time a segment can exist in the network. A segment that exceeds it is discarded.

Waiting for 2MSL ensures that the last ACK sent by A can reach B. If B does not receive the last ACK, it retransmits its FIN; A then receives the FIN again and resends the last ACK.

In addition, the port cannot be reused during the 2MSL wait. This ensures that all segments generated by the current connection have disappeared from the network, so that segments of the old connection cannot show up in a later connection that uses the same port.

Generally, the client performs the active close and enters the TIME_WAIT state, while the server performs the passive close and does not enter TIME_WAIT. This matters because a server uses a well-known fixed port: if you terminate a server program that has established connections and try to restart it immediately, it cannot reuse its original port until the 2MSL wait is over.
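
A common workaround for the restart problem, not covered further in this article, is to set the `SO_REUSEADDR` socket option before binding. The sketch below shows how this might look with Python’s standard `socket` module. The helper name `make_listener` and the default address are assumptions for the example, and the exact semantics of `SO_REUSEADDR` differ between operating systems.

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a listening TCP socket that asks to reuse a local address.

    port=0 lets the operating system pick a free port (handy for experiments);
    a real server would pass its fixed, well-known port instead.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(); it asks the OS to allow binding even if
    # old connections on this port are still in the TIME_WAIT state.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock
```

The option does not remove the 2MSL wait for the old connections; it only lets the new listening socket bind while they finish.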

Is FIN_WAIT_2 a half-closed state?

If the active closer is in the FIN_WAIT_2 state, it has sent a FIN and received the ACK for it from the other end. Unless the application is deliberately using a half-close, it is waiting for the other end to send its own FIN, and it enters the TIME_WAIT state only after receiving that FIN. If the other end never sends one, the connection could stay in FIN_WAIT_2 forever.

Reliable transport

To ensure reliable transmission, TCP has to handle data corruption, loss, duplication, and reordering. It implements several mechanisms to address these problems.

Sliding window protocol

The protocol allows the sender to send several packets in a row before stopping to wait for confirmation, which speeds up data transmission because the sender does not have to pause for an acknowledgement after every packet. Each time the sender receives an ACK, it slides the window forward past the packets that the ACK confirms.

Generally, the receiver uses cumulative acknowledgement. Instead of acknowledging each packet one by one, it waits until several packets have arrived and then acknowledges the last packet that arrived in sequence, which indicates that all packets up to that one have arrived correctly.

The sliding window protocol can be visualized as follows:
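
The sender side of a sliding window with cumulative acknowledgement can also be sketched in a few lines of Python. For simplicity the sketch numbers whole packets rather than bytes and ignores loss and retransmission; the class and method names are invented for the illustration.

```python
class SlidingWindowSender:
    """Tracks which packets may be sent under a fixed-size sliding window."""

    def __init__(self, total_packets: int, window_size: int):
        self.total = total_packets
        self.window = window_size
        self.base = 0      # oldest packet that is not yet acknowledged
        self.next_seq = 0  # next packet that has not been sent yet

    def sendable(self) -> list:
        """Send every unsent packet that fits in the window; return their numbers."""
        limit = min(self.base + self.window, self.total)
        sent = list(range(self.next_seq, limit))
        self.next_seq = max(self.next_seq, limit)
        return sent

    def on_cumulative_ack(self, next_expected: int) -> None:
        """The ACK names the next packet the receiver expects.

        Everything before it is confirmed, so the window slides forward.
        Stale ACKs (at or below the current base) are ignored.
        """
        if next_expected > self.base:
            self.base = min(next_expected, self.next_seq)
```

With 10 packets and a window of 4, the sender first emits packets 0 to 3, then stalls until an ACK arrives; an ACK for "next expected = 2" frees two slots, so packets 4 and 5 can go out.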

Timeout and retransmission

One of the ways TCP provides a reliable transport layer is to acknowledge data received from the other end. However, both data and acknowledgements can be lost. TCP therefore starts a timer when it sends data. If the timer expires before an acknowledgement arrives, TCP treats this as a timeout and retransmits the data. The key to this mechanism is how the timeout period and the retransmission strategy are chosen.

TCP manages four different timers for each connection:

The retransmission timer is used when waiting for an acknowledgement from the other end.
The persist timer keeps window size information flowing even if the other end has closed its receive window.
The keepalive timer detects when the other end of an idle connection crashes or restarts.
The 2MSL timer measures how long a connection has been in the TIME_WAIT state.

Choosing the timeout period is the most important part of the timeout retransmission mechanism. It is derived from the round-trip time (RTT) of the connection. As routes and network traffic change, the RTT may change frequently, and TCP adjusts the timeout period accordingly.

To measure RTT, the TCP sender times some segments by incrementing a counter each time the 500 ms TCP timer routine is invoked. This means that timeouts are measured in units of 500 ms (1 tick), with an error of up to 500 ms. If no acknowledgement arrives after a retransmission, the timeout period doubles for each further retransmission, which is called “exponential backoff”, until the maximum number of retransmissions is reached. At that point, the TCP connection is considered broken, and the sender sends a reset and forcibly closes the connection.
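
The doubling of the timeout can be illustrated with a short Python function. The initial timeout, the retry limit, and the 64-second upper bound used below are assumptions chosen for the example, not values stated in this article; actual values depend on the implementation.

```python
def backoff_schedule(initial_rto: float, max_retries: int, max_rto: float = 64.0) -> list:
    """Timeouts (in seconds) for the first transmission and each retransmission.

    The timeout doubles after every unanswered transmission and is capped
    at max_rto (an assumed upper bound).
    """
    timeouts = []
    rto = initial_rto
    for _ in range(max_retries + 1):  # first transmission + retransmissions
        timeouts.append(rto)
        rto = min(rto * 2, max_rto)
    return timeouts
```

For instance, `backoff_schedule(1.5, 8)` yields 1.5, 3, 6, 12, 24, 48 and then stays at 64 seconds for the remaining attempts.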

Slow start algorithm

Sending several segments into the network at once can improve efficiency. However, there may be multiple routers and slow links between the sender and the receiver; some intermediate routers must queue packets and may run out of buffer space. Sending too much at once can therefore seriously reduce the throughput of a TCP connection.

For these reasons, TCP supports an algorithm called “slow start”, which adds another window on the sender side: the congestion window (CWND). The sender uses the smaller of the congestion window and the receiver’s advertised window (the notification window) as its sending limit. When the connection is established, the congestion window is initialized to one segment. Each ACK received increases the congestion window by one segment, so the window roughly doubles every round trip and grows exponentially. At some point the capacity of the network path may be reached and an intermediate router starts dropping packets, which tells the sender that the congestion window has grown too large.
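
The exponential growth comes from the per-ACK increase: every segment sent in one round trip produces one ACK, and every ACK adds one segment to the window. A simplified Python model, assuming no loss and one ACK per segment, with a made-up function name:

```python
def slow_start(rounds: int, advertised_window: int, initial_cwnd: int = 1) -> list:
    """Number of segments sent in each round trip during slow start.

    Units are whole segments. Assumes no loss and one ACK per segment.
    """
    cwnd = initial_cwnd
    sent_per_round = []
    for _ in range(rounds):
        # The sender is limited by the smaller of the two windows.
        in_flight = min(cwnd, advertised_window)
        sent_per_round.append(in_flight)
        # One ACK comes back per segment, and each ACK grows cwnd by one.
        cwnd += in_flight
    return sent_per_round
```

With a large advertised window the result is 1, 2, 4, 8, 16, and so on; with an advertised window of 10 segments the sender levels off at 10 even though the congestion window keeps growing.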

Congestion avoidance algorithm

As mentioned for slow start, packets are dropped once the data rate reaches the limit of an intermediate router. The congestion avoidance algorithm is a way to deal with such lost packets.

The algorithm assumes that packet loss caused by damage is very rare, so a lost packet signals congestion somewhere in the network between the source and destination hosts. Loss is detected in two ways: a timeout occurring and duplicate acknowledgements being received. In practice, congestion avoidance is implemented together with slow start.

For a given connection, a congestion window (CWND) and a slow start threshold (SSTHRESH) are initialized: CWND to 1 segment and SSTHRESH to 65535 bytes.
TCP never sends more than the minimum of the congestion window and the advertised (notification) window. The congestion window begins in slow start and grows exponentially.
When congestion occurs, SSTHRESH is set to half of the current window, that is, half of the smaller of CWND and the advertised window, but at least 2 segments. In addition, if the congestion was detected by a timeout, CWND is set to 1 segment (slow start begins again).
When new data is acknowledged by the receiver after this adjustment, CWND is increased. If CWND <= SSTHRESH, TCP is in slow start and CWND grows exponentially. Otherwise, TCP is in congestion avoidance and CWND grows linearly, by about 1 segment per round-trip time.
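
The interplay of the two algorithms can be sketched as a small state object. To keep the arithmetic readable, the sketch counts in segments instead of bytes, so the initial threshold of 64 segments is an illustrative stand-in rather than the 65535 bytes mentioned above; the class and method names are likewise assumptions.

```python
class CongestionState:
    """Slow start plus congestion avoidance, measured in segments."""

    def __init__(self, advertised_window: float = 64, ssthresh: float = 64):
        self.cwnd = 1.0                  # congestion window starts at 1 segment
        self.ssthresh = float(ssthresh)  # slow start threshold
        self.rwnd = float(advertised_window)

    def send_limit(self) -> float:
        """The sender may have at most this many segments outstanding."""
        return min(self.cwnd, self.rwnd)

    def on_new_ack(self) -> None:
        """Called when an ACK confirms new data."""
        if self.cwnd <= self.ssthresh:
            self.cwnd += 1.0              # slow start: +1 per ACK, doubles per RTT
        else:
            self.cwnd += 1.0 / self.cwnd  # avoidance: about +1 per RTT

    def on_congestion(self, timeout: bool) -> None:
        """Called on a timeout or when duplicate ACKs signal loss."""
        self.ssthresh = max(self.send_limit() / 2, 2.0)  # at least 2 segments
        if timeout:
            self.cwnd = 1.0               # a timeout restarts slow start
```

Starting with `ssthresh=8`, eight ACKs take `cwnd` from 1 to 9 in slow start; from then on each ACK adds only `1/cwnd`, which is the linear phase.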

Fast retransmission algorithm

When TCP receives an out-of-order segment, it generates a duplicate ACK immediately; this ACK is not delayed. Its purpose is to let the sender know that a segment arrived out of order and to tell it which sequence number is expected.

The sender does not know whether a duplicate ACK is caused by a lost segment or merely by several consecutive segments arriving in a different order, so it waits for a small number of duplicate ACKs to arrive. If segments were only reordered, they are likely to generate just one or two duplicate ACKs before the reordered segment is processed and a new ACK is produced. If three or more duplicate ACKs are received in a row, it is highly likely that a segment has been lost. The sender therefore retransmits the missing segment without waiting for the retransmission timer to expire. This is the fast retransmission algorithm.
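
The duplicate ACK counting can be expressed in a few lines of Python. The class below is a minimal sketch with invented names; it only decides when to retransmit and leaves the actual sending to the caller.

```python
class FastRetransmitDetector:
    """Signals a retransmission on the third duplicate ACK."""

    DUP_THRESHOLD = 3

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack: int):
        """Return the sequence number to retransmit, or None."""
        if ack == self.last_ack:
            self.dup_count += 1
            if self.dup_count == self.DUP_THRESHOLD:
                return ack  # the receiver is still waiting for this byte
        else:
            # A different acknowledgement number resets the counter.
            self.last_ack = ack
            self.dup_count = 0
        return None
```

Feeding it the acknowledgement numbers 100, 200, 200, 200, 200 triggers a retransmission of sequence number 200 on the fifth call, which is the third duplicate.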

Fast recovery algorithm

The fast recovery algorithm lets the TCP sender run congestion avoidance instead of slow start after a fast retransmission has been triggered. The reason is that receiving several duplicate ACKs indicates not only that a segment was lost, but also that later segments are still arriving. Data is still flowing between the two ends, so there is no need to cut the flow abruptly by falling back to slow start.

1. When the third duplicate ACK is received, set SSTHRESH to CWND/2, retransmit the missing segment, and set CWND to SSTHRESH plus 3 times the segment size. The extra 3 segments account for the three segments that have left the network and triggered the duplicate ACKs, which keeps CWND from suddenly becoming too small.
2. Each time another duplicate ACK arrives, increase CWND by one segment size and send a segment if the new value of CWND allows it.
3. When the next ACK arrives that acknowledges new data, set CWND to SSTHRESH (the value set in Step 1).
- This ACK should acknowledge the retransmission from Step 1 as well as all intermediate segments sent between the lost segment and the first duplicate ACK received.
- This step is congestion avoidance, since the rate is cut to half of what it was when the packet was lost.
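
The window adjustments in these steps can be followed with a minimal Python sketch. It tracks only CWND and SSTHRESH in bytes and leaves out the actual retransmission; the class and method names are invented for the illustration.

```python
class FastRecovery:
    """Tracks CWND and SSTHRESH (in bytes) through one fast recovery episode."""

    def __init__(self, cwnd: int, segment_size: int):
        self.cwnd = cwnd
        self.segment_size = segment_size
        self.ssthresh = None
        self.in_recovery = False

    def on_third_duplicate_ack(self) -> None:
        """Step 1: halve the threshold and inflate CWND by three segments."""
        self.ssthresh = self.cwnd // 2
        self.cwnd = self.ssthresh + 3 * self.segment_size
        self.in_recovery = True  # the missing segment is retransmitted here

    def on_additional_duplicate_ack(self) -> None:
        """Step 2: each further duplicate ACK grows CWND by one segment."""
        if self.in_recovery:
            self.cwnd += self.segment_size

    def on_new_ack(self) -> None:
        """Step 3: an ACK for new data deflates CWND back to SSTHRESH."""
        if self.in_recovery:
            self.cwnd = self.ssthresh
            self.in_recovery = False
```

Starting from a CWND of 8000 bytes with 1000-byte segments, the third duplicate ACK gives SSTHRESH 4000 and CWND 7000, two more duplicate ACKs raise CWND to 9000, and the ACK for new data brings it down to 4000.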