This article is published by the Cloud + community

TCP is a complex protocol, and each of its mechanisms brings new problems along with its benefits. Nagle's algorithm and the delayed ACK mechanism both reduce the number of packets, on the sender and the receiver respectively, which lowers network load and helps avoid congestion. In certain scenarios, however, they conflict: Nagle's algorithm allows only one unacknowledged small packet in flight, while delayed ACK waits for more data before sending an ACK. The two ends then wait for each other, producing a temporary deadlock that is broken only when the delayed ACK timer expires. As a result, the latency seen by the application is high. Other articles have already described the two mechanisms and this latency scenario. Based on a concrete tcpdump capture, this article analyzes when delayed ACK is triggered, the related kernel parameters, and ways to avoid the delay.

Background

We added a proxy layer in front of Redis. During stress testing, we found that the performance of write commands dropped significantly when the data length exceeded 2 KB, to only 1/10 of that of connecting directly to redis-server. GET requests were not affected as noticeably.

Analysis

We checked the system load and the network packet volume. The packet volume was relatively low, and the time spent inside the proxy was also short. With no other leads, we fell back on tcpdump, and it did reveal something odd.

After TCP request packet no. 22, the server returned the ACK only 42 ms later. We suspected that a delay at the network layer was increasing the request time. Because Nagle's algorithm is enabled on the client and delayed ACK is not disabled on the server, the server sends the ACK only after the delayed ACK timer expires, which causes the delay.

Principles

Nagle's algorithm, from Wikipedia:

if there is new data to send
  if the window size >= MSS and available data is >= MSS
    send complete MSS segment now
  else
    if there is unconfirmed data still in the pipe
      enqueue data in the buffer until an acknowledge is received
    else
      send data immediately
    end if
  end if
end if

In a nutshell, the rules of Nagle's algorithm are:

  1. If at least one full MSS of data is ready to send, send it immediately.
  2. If no previously sent packet is still unacknowledged, send the data immediately.
  3. If a previously sent packet is still unacknowledged, buffer the data.
  4. When an ACK is received, send the buffered data immediately.
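The rules above can be sketched as a small decision function. This is only an illustrative model, not kernel code: the names `nagle_action` and `nagle_decide` and the parameter layout are assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical action names for this sketch. */
typedef enum { NAGLE_SEND_NOW, NAGLE_BUFFER } nagle_action;

/*
 * Decide what to do with `pending` bytes of unsent data.
 *   window    - bytes the receiver's advertised window still allows
 *   mss       - maximum segment size
 *   in_flight - true if previously sent data is still unacknowledged
 */
nagle_action nagle_decide(size_t pending, size_t window, size_t mss,
                          bool in_flight)
{
    if (pending == 0)
        return NAGLE_BUFFER;        /* nothing to send */
    if (pending >= mss && window >= mss)
        return NAGLE_SEND_NOW;      /* a full segment is ready */
    if (in_flight)
        return NAGLE_BUFFER;        /* wait for the outstanding ACK */
    return NAGLE_SEND_NOW;          /* pipe is empty: send the small data */
}
```

Rule 4 corresponds to calling `nagle_decide` again with `in_flight` set to false once the ACK arrives, at which point the buffered data goes out.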

The source code for delayed ACK is in net/ipv4/tcp_input.c in the Linux kernel.

The basic logic is as follows:

  1. If more than one MSS of data has been received, send an ACK immediately.
  2. If the receive window can be advanced for the data received (a window update is due), send an ACK immediately.
  3. If the connection is in quick mode, send an ACK immediately.
  4. If out-of-order data has been received, send an ACK immediately.
  5. Otherwise, delay the ACK.
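The five conditions can be sketched as one decision function. This is a simplified model and not the kernel's code: the struct `rx_state`, its field names, and `ack_decide` are hypothetical, and rule 2 is modeled as a plain flag on the assumption that it refers to a pending window update.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { ACK_NOW, ACK_DELAYED } ack_action;

/* Hypothetical snapshot of receiver state; field names are illustrative. */
struct rx_state {
    size_t unacked_bytes;  /* bytes received but not yet acknowledged */
    size_t mss;            /* receiver's estimate of the sender's MSS */
    bool   window_update;  /* assumed meaning of rule 2: window update due */
    bool   quick_mode;     /* connection is in quick mode */
    bool   out_of_order;   /* out-of-order data is queued */
};

ack_action ack_decide(const struct rx_state *s)
{
    if (s->unacked_bytes > s->mss) return ACK_NOW;  /* rule 1 */
    if (s->window_update)          return ACK_NOW;  /* rule 2 */
    if (s->quick_mode)             return ACK_NOW;  /* rule 3 */
    if (s->out_of_order)           return ACK_NOW;  /* rule 4 */
    return ACK_DELAYED;                             /* rule 5 */
}
```

With 1424 bytes received, an MSS of 1448, and none of the flags set, the function returns `ACK_DELAYED`.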

The other conditions are self-explanatory. How is quick mode determined? Let's continue with the code:

One factor that affects quick mode is the pingpong state. pingpong is a status flag that records whether the current TCP connection appears to be interactive, that is, whether it follows a W-R-W-R-W-R pattern. If so, delayed ACK can be used so that the ACK for received data is piggybacked on the response data sent back to the peer.

As shown in the figure above, the default is pingpong = 0, which means non-interactive: when the server receives data, it returns an ACK immediately. Once the server sends response data, it sets pingpong = 1.
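The effect of the pingpong flag on quick mode can be sketched as follows. This is a simplified model under stated assumptions: the struct `ack_state`, the `quick` counter, and both function names are hypothetical and do not reproduce the kernel's data layout or its exact conditions.

```c
#include <stdbool.h>

/* Hypothetical per-connection ACK state for this sketch. */
struct ack_state {
    int  quick;     /* assumed counter: immediate ACKs still allowed */
    bool pingpong;  /* true once the connection looks interactive */
};

/* Quick mode holds only while the connection is not considered interactive. */
bool in_quick_mode(const struct ack_state *s)
{
    return s->quick > 0 && !s->pingpong;
}

/* Called when the local side sends response data: mark it interactive. */
void on_data_sent(struct ack_state *s)
{
    s->pingpong = true;
}
```

In this model, a fresh connection starts with pingpong unset and acknowledges immediately; after the first response is sent, quick mode no longer applies and ACKs become eligible for delay.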

The problem

According to the analysis above, the ACK should be delayed on every request. Why, then, was performance unaffected in our tests with data smaller than 2 KB?

Tcpdump:

Under Nagle's algorithm and the delayed ACK mechanism, the interaction above proceeds as shown in the figure below. Because each packet sent contains a complete request, the server can process it and return the command response to the client, piggybacking the ACK for the request on that response and saving one network packet.

Reanalyzing the 2K scenario:

As shown in the table below, packet no. 22 carries less data than one MSS. Meanwhile, pingpong = 1, so the connection is treated as interactive and the server expects to save a packet by piggybacking the ACK on its response. However, the data the server has received is not a complete request, so it cannot generate a response. The server sends the ACK only after the 40 ms delayed ACK timeout.

Meanwhile, from the client's point of view, sending one more packet would push the data received by the server past one MSS and trigger an immediate ACK. However, the client is constrained by Nagle's algorithm: only one small packet may be unacknowledged at a time, so the remaining data can only be buffered until the ACK arrives.

Triggering scenarios

The data in a TCP request packet is not enough for the server to generate a response, or is smaller than one MSS.

Workarounds

If Nagle's algorithm is enabled on the client and delayed ACK is enabled on the server at the same time, the deadlock described above can occur. It can be addressed at either end of the TCP connection.

Server:

  1. Turn off delayed ACK so that every TCP request packet is acknowledged without delay. How to do it: echo 1 > /proc/sys/net/ipv4/tcp_no_delay_ack. However, every TCP request packet is then answered with a separate ACK packet, which increases the number of network packets. After delayed ACK is disabled, the number of network packets increases by about 80%.

  2. Set the TCP_QUICKACK socket option. However, it must be set again after every recv call. This does not suit our scenario, because it would require modifying the Redis server source code.
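For the second server-side option, a minimal Linux-only sketch is shown below. The helper names `set_quickack` and `recv_quickack` are hypothetical and are not taken from Redis; the only real API used is `setsockopt` with `TCP_QUICKACK`.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: request immediate ACKs on a TCP socket (Linux only).
 * Returns 0 on success, -1 on error, like setsockopt(). */
int set_quickack(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
}

/* Hypothetical helper: receive data, then set TCP_QUICKACK again,
 * because the option does not stay in effect permanently. */
ssize_t recv_quickack(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);
    (void)set_quickack(fd);
    return n;
}
```

Every receive path in the server would have to go through a wrapper like this, which is why this option requires source changes.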

Client:

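  1. Disable Nagle's algorithm by setting the TCP_NODELAY socket option:

     #include <netinet/in.h>
     #include <netinet/tcp.h>
     #include <sys/socket.h>

     /* Disable Nagle's algorithm on fd so small writes are sent at once. */
     static void _set_tcp_nodelay(int fd) {
         int enable = 1;
         setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (void*)&enable, sizeof(enable));
     }
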
  2. Avoid the write-write-read pattern by merging multiple small writes into one large write. In our scenario, the 1424 bytes of packet no. 22 would be buffered first; once more than one MSS of data is sent, the server returns an ACK immediately, and the client continues to send the remaining data to complete the interaction without the delay.

This article has been published by the Tencent Cloud+ Community with the author's authorization.