Background

Recently a project test ran into a strange phenomenon: in the test environment, calling the back-end HTTP service through Apache HttpClient took 39.2ms on average. At first glance that may not look odd at all, so what is strange about it? The HTTP service has no real logic: it just converts a string to uppercase and returns a string only 100 characters long. The ping latency is about 1.9ms, so the call should theoretically take about 2-3ms. Why, then, is the average time 39.2ms?

Because of my day job, call-time problems are familiar territory; I often help teams with call-timeout issues in our internal RPC framework, but this was the first time I had hit one with an HTTP call. The troubleshooting routine, however, is the same: investigate from the outside in and from the top down. Let's start with some of the peripheral indicators and see whether they give us any clues.

Peripheral indicators

System indicators

Look mainly at the peripheral system metrics (note: check both the calling machine and the called machine), for example load and CPU. A top command is all you need.

Both the CPU and the load turned out to be idle. Since I did not take screenshots at the time, I won't show them here.

Process indicators

For a Java process, the main metrics to look at are GC and thread stacks (again, on both the calling and the called machine).
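
For example, a quick way to check both is with the standard JDK tools (pick the right PID for your process; the one-second interval below is just an illustration):

jstat -gcutil <pid> 1000     # print GC utilization and counts every second
jstack <pid> > stack.txt     # dump all thread stacks for offline inspection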

Young GCs were very infrequent and took less than 10ms each, so there was no long STW pause.

Since the average call time is 39.2ms, which is quite large, the thread stacks should reveal something if the time were being spent in code. The stacks of the relevant service threads mainly showed thread-pool threads waiting for tasks, which means the threads were not busy at all.

It feels like we have run out of leads. What next?

Local reproduction

If the problem can be reproduced locally (my local machine runs macOS), that is also a great help for troubleshooting.

So I wrote a simple test program locally with Apache HttpClient that calls the back-end HTTP service directly, and found the average time was about 55ms. Hmm, that is a bit different from the 39.2ms seen in the test environment. This is mainly because my machine and the test environment's back-end HTTP server are in different regions, and the ping latency between them is around 26ms, so the latency goes up. Still, there is clearly a problem locally too: since the ping latency is 26ms and the back-end HTTP service logic is so simple that it takes almost no time, the average local call should be around 26ms. Why 55ms?

Are you getting confused, unsure where to even start?

Suspecting that I might be using Apache HttpClient incorrectly, I wrote another simple test program using the JDK's built-in HttpURLConnection. The result was the same.
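
For reference, here is a minimal sketch of that kind of timing test; the endpoint URL, payload and round count are placeholders, not the actual test code.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpTimingTest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://test-host:8080/upper"); // hypothetical endpoint
        int rounds = 100;
        long totalMillis = 0;
        for (int i = 0; i < rounds; i++) {
            long start = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write("hello".getBytes()); // small request body
            }
            try (InputStream in = conn.getInputStream()) {
                while (in.read() != -1) { /* drain the short response */ }
            }
            totalMillis += System.currentTimeMillis() - start;
        }
        System.out.println("average time: " + (totalMillis / (double) rounds) + " ms");
    }
}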

Diagnosis

Positioning

Judging from the peripheral system metrics, the process metrics, and the local reproduction, it is fairly clear that the cause is not in the program itself. What about the TCP layer?

Those of you with network programming experience will know which TCP parameter can cause this phenomenon. Yes, you guessed it: TCP_NODELAY.

Which side failed to set it, the caller or the callee?

The caller uses Apache HttpClient, where tcpNoDelay defaults to true. Now let's look at the called side, our back-end HTTP service, which uses the JDK's built-in HttpServer:

HttpServer server = HttpServer.create(new InetSocketAddress(config.getPort()), BACKLOGS);

There is no interface for setting tcpNoDelay directly, so I dug into the source code. It turns out the ServerConfig class has this static block that reads the startup parameters, and ServerConfig.noDelay defaults to false.

static {
    AccessController.doPrivileged(new PrivilegedAction<Void>() {
        public Void run() {
            ServerConfig.idleInterval = Long.getLong("sun.net.httpserver.idleInterval", 30L) * 1000L;
            ServerConfig.clockTick = Integer.getInteger("sun.net.httpserver.clockTick", 10000);
            ServerConfig.maxIdleConnections = Integer.getInteger("sun.net.httpserver.maxIdleConnections", 200);
            ServerConfig.drainAmount = Long.getLong("sun.net.httpserver.drainAmount", 65536L);
            ServerConfig.maxReqHeaders = Integer.getInteger("sun.net.httpserver.maxReqHeaders", 200);
            ServerConfig.maxReqTime = Long.getLong("sun.net.httpserver.maxReqTime", -1L);
            ServerConfig.maxRspTime = Long.getLong("sun.net.httpserver.maxRspTime", -1L);
            ServerConfig.timerMillis = Long.getLong("sun.net.httpserver.timerMillis", 1000L);
            ServerConfig.debug = Boolean.getBoolean("sun.net.httpserver.debug");
            ServerConfig.noDelay = Boolean.getBoolean("sun.net.httpserver.nodelay");
            return null;
        }
    });
}

Validation

So I added the startup parameter -Dsun.net.httpserver.nodelay=true to the back-end HTTP service and tried again. The effect was obvious: the average time dropped from 39.2ms to 2.8ms.
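
For example, assuming the service is launched directly from the command line (the jar name here is just a placeholder):

java -Dsun.net.httpserver.nodelay=true -jar backend-http-service.jar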

The problem is solved, but stopping here would be selling this case short; it would be a waste. There are still plenty of questions:

  • Why is the delay reduced from 39.2ms to 2.8ms when TCP_NODELAY is added?

  • Why is the average latency for local tests 55ms, instead of 26ms for ping?

  • How does TCP actually send packets?

Come on, let's strike while the iron is hot.

Clearing up the doubts

What is TCP_NODELAY?

In socket programming, the TCP_NODELAY option controls whether the Nagle algorithm is enabled. In Java, true means the Nagle algorithm is turned off, and false means it is turned on. Now you have to ask: what is the Nagle algorithm?
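
At the socket level it is just a flag on the connection; a minimal sketch (host and port are placeholders):

import java.net.Socket;

public class TcpNoDelayExample {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("test-host", 8080)) { // hypothetical host and port
            socket.setTcpNoDelay(true); // true disables the Nagle algorithm; false (the default) enables it
            System.out.println("TCP_NODELAY = " + socket.getTcpNoDelay());
        }
    }
}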

What the hell is the Nagle algorithm?

The Nagle algorithm is a way to improve the efficiency of TCP/IP networks by reducing the number of packets sent over the network. It is named after its inventor, John Nagle, who first used it in 1984 to solve a network congestion problem at Ford Motor Company.

If your application generates data one byte at a time and sends each byte to a remote server as its own network packet, it is easy to overload the network with too many packets. In this extreme case it takes 40 bytes of headers (20-byte IP header + 20-byte TCP header) to carry just 1 byte of valid data, so payload utilization is extremely low (about 2.4%).

The content of the Nagle algorithm is relatively simple. Here is the pseudocode:

if there is new data to send
  if the window size >= MSS and available data is >= MSS
    send complete MSS segment now
  else
    if there is unconfirmed data still in the pipe
      enqueue data in the buffer until an acknowledge is received
    else
      send data immediately
    end if
  end if
end if

Concretely, the rules are:

  • If the data to send reaches one full MSS, send it immediately.

  • If there is no unacknowledged data in flight, send the new data immediately.

  • If there is unacknowledged data in flight, buffer the new data until an ACK arrives.

  • When the ACK arrives, send the buffered data immediately. (The MSS is the maximum segment size, i.e. the largest amount of data a TCP segment can carry at a time.)

What the hell is Delayed ACK?

As is well known, to guarantee reliable delivery TCP requires the receiver to acknowledge every packet it receives. Sending a bare acknowledgement is expensive (20-byte IP header + 20-byte TCP header). TCP Delayed ACK tries to address this and improve network performance by combining several ACKs into a single response, or by piggybacking the ACK on the response data, thereby reducing protocol overhead.

The specific rules are:

  • When there is response data to send, the ACK is sent to the other side immediately, together with the response data.

  • When there is no response data, the ACK is delayed to wait and see whether it can ride along with response data. On Linux, the default delay is 40ms.

  • If a second packet from the other side arrives while an ACK is still being delayed, the ACK must be sent immediately. However, if three packets arrive in succession, whether an ACK is sent immediately when the third segment arrives again depends on the two rules above.

What kind of chemistry happens between Nagle and Delayed ACK?

Nagle and Delayed ACK each improve the efficiency of network transmission, but put together they do harm with the best of intentions. For example, consider the following scenario:

Data is being transferred between A and B: A runs the Nagle algorithm and B runs Delayed ACK.

If A sends a packet to B, B will not respond immediately because of Delayed ACK. Meanwhile A, running the Nagle algorithm, keeps waiting for B's ACK and will not send the second packet until that ACK arrives. If those two packets belong to the same request, the request is delayed by about 40ms.
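
Here is a minimal sketch of that scenario, assuming a hypothetical back-end on test-host:8080; the point is simply that the request is written in two small pieces (much like an HTTP client writing headers and body separately) while Nagle is on:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class NagleDelayedAckDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("test-host", 8080)) { // hypothetical host and port
            socket.setTcpNoDelay(false); // Nagle on (this is the default anyway)
            OutputStream out = socket.getOutputStream();
            InputStream in = socket.getInputStream();

            long start = System.currentTimeMillis();
            // First small write: the request headers.
            out.write(("POST /upper HTTP/1.1\r\nHost: test-host\r\n"
                    + "Content-Length: 5\r\n\r\n").getBytes());
            // Second small write: the body. With Nagle on, it sits in the send
            // buffer until the ACK for the first write arrives -- and the server,
            // still waiting for the body, delays that ACK for up to 40ms.
            out.write("hello".getBytes());
            in.read(new byte[1024]); // wait for the response
            System.out.println("elapsed: " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}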

Capturing packets for fun

Let's capture some packets to verify. This can be done easily by running the following command on the back-end HTTP service machine.

sudo tcpdump -i eth0 tcp and host 10.48.159.165 -s 0 -w traffic.pcap

The red box in the capture is one complete POST request being processed. The gap between packet number 130 and packet number 149 is about 41ms (0.1859 − 0.1448 = 0.0411s ≈ 41ms). This is the chemical reaction between Nagle and Delayed ACK: 10.48.159.165 was running Delayed ACK and 10.22.29.180 was running the Nagle algorithm. 10.22.29.180 was waiting for an ACK, while 10.48.159.165 had triggered Delayed ACK, so the ACK was held back for about 40ms.

This explains why the test environment time was 39.2ms: most of it was the 40ms Delayed ACK.

But what about the local reproduction: why was the average local latency 55ms rather than the 26ms ping latency? Let's capture packets there, too.

As shown in the figure below, the red box is one complete POST request being processed. The gap between packet number 8 and packet number 9 is about 25ms. The one-way network delay is about 13ms (half the ping round trip), so the Delayed ACK accounts for roughly 12ms, because the local macOS system behaves a little differently from Linux here:

  1. Linux controls the Delayed ACK time through the /proc/sys/net/ipv4/tcp_delack_min setting; the default on Linux is 40ms.
  2. macOS controls Delayed ACK through the net.inet.tcp.delayed_ack sysctl (see the commands after this list):
     delayed_ack=0: respond after every packet (off)
     delayed_ack=1: always employ delayed ACK, 6 packets can get 1 ACK
     delayed_ack=2: immediate ACK after 2nd packet, 2 packets per ACK (compatibility mode)
     delayed_ack=3: auto-detect when to employ delayed ACK, 4 packets per ACK (default)
     In short, setting it to 0 disables delayed ACKs, and setting it to 1 always delays them.
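
To check or change the macOS setting, something like the following should work (this changes system-wide TCP behaviour, so use with care):

sysctl net.inet.tcp.delayed_ack            # read the current value
sudo sysctl -w net.inet.tcp.delayed_ack=0  # turn Delayed ACK off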

Why does TCP_NODELAY solve the problem?

TCP_NODELAY disables the Nagle algorithm: the next packet is sent even if the ACK for the previous one has not arrived yet, which breaks the interaction with Delayed ACK. In network programming it is generally strongly recommended to enable TCP_NODELAY to improve response latency.
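
On the calling side, a sketch of pinning the option explicitly with Apache HttpClient 4.3+ (where tcpNoDelay already defaults to true) might look like this:

import org.apache.http.config.SocketConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ClientFactory {
    public static CloseableHttpClient newClient() {
        // Disable Nagle on every socket the client opens.
        SocketConfig socketConfig = SocketConfig.custom()
                .setTcpNoDelay(true)
                .build();
        return HttpClients.custom()
                .setDefaultSocketConfig(socketConfig)
                .build();
    }
}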

Of course, the problem could also have been solved by tuning the Delayed ACK related system settings, but since changing machine configuration is inconvenient, that approach is not recommended.

Conclusion

This article walked through the troubleshooting of a simple HTTP call whose latency was unexpectedly large. The problem was analysed from the outside in, then located, and the fix was validated. Finally, Nagle and Delayed ACK in TCP transmission were explained in depth, making the analysis of this case more complete.

Read three things ❤️

If you found this post helpful, I’d like to invite you to do three small favors for me:

  1. Like and share; your "likes and comments" are what drive me to keep writing.

  2. Follow the public account "Java rotten pigskin", where I share original content from time to time.

  3. And look forward to the follow-up article, which is in progress 🚀

Author: polyester raw source: club.perfma.com/article/200…