Preface

It turns out that reading the Linux kernel source code has great benefits, especially for troubleshooting. The moment you see an error, the phenomenon, cause, and solution pop into your mind, and even obscure corner cases can be quickly recognized for what they are. I have read some of the Linux TCP protocol stack source code, and it made solving the following problem feel remarkably smooth.

At the scene of the Bug

First of all, this is not a hard problem to solve, but the phenomenon is interesting. Let me describe it first: I was stress testing a DUBBO protocol tunnel gateway I wrote (the gateway design is also quite interesting, and I plan to cover it in a later post). Let's take a look at the topology of the stress test:

To test the standalone performance of my gateway, only one gateway instance was kept at each end, namely Gateway1 and Gateway2. Once the pressure reached a certain level, errors started to appear and the stress test stopped. The natural assumption was that the gateway could not handle the load.

Gateway situation

I checked Gateway2's machine and found nothing wrong. Gateway1, however, showed a large number of 502 errors. 502 is Bad Gateway, a classic Nginx error, so my first thought was that Gateway2 had been overwhelmed and kicked out of the upstream by Nginx.

But Gateway2 was using only one core of a 4-core 8G machine, so there was no bottleneck at all. Could IO be the problem? A glance at the meager NIC traffic dispelled that speculation.

The CPU utilization of the machine running Nginx is close to 100%

At this point came an interesting discovery: Nginx was the one maxing out the CPU!

Running top on the machine where Nginx lives showed that Nginx's 4 workers each occupied one core, eating up all the CPU.

What? Nginx, famed for its performance, is this weak? What about the much-touted event-driven, epoll edge-triggered, pure-C implementation? It must be that I was holding it wrong!

Remove Nginx and connect directly: no pressure at all

Since Nginx was suspected to be the bottleneck, let's take it out. With Gateway1 and Gateway2 connected directly, the TPS in the stress test soared, and Gateway2's CPU ate at most 2 cores, with no pressure at all.

Go to the Nginx machine and look at the logs

Since I did not have access to the Nginx machine, I had not paid attention to its logs at first. Now I contacted the Ops colleague in charge to take a look. The access log contained a large number of 502 errors, so it was indeed Nginx. Looking at the error log, I found a large number of

Cannot assign requested address 

Having read the TCP source code, my instant reaction was: the port numbers are exhausted! Because Nginx's upstream connections to the Backend are short-lived by default, a flood of requests generates a large number of TIME_WAIT connections.

These TIME_WAIT sockets occupy port numbers and take about one minute to be reclaimed by the kernel.

cat /proc/sys/net/ipv4/ip_local_port_range
32768    61000 

In other words, generating 28232 (61000 - 32768) TIME_WAIT sockets within one minute exhausts the port range, and that is only 470.5 TPS (28232/60), a load easily reached in a stress test. Note that this restriction applies to the client side; there is no such restriction on the server side, because a server only listens on a single well-known port such as 8080. In the upstream connection, Nginx plays the client and Gateway2 plays the server.
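
To make the arithmetic concrete, here is a small C sketch (my own illustration, not kernel code) that reproduces the numbers above from the ip_local_port_range values and the roughly 60-second TIME_WAIT lifetime:

#include <stdio.h>

/* Illustrative arithmetic only: where the 470.5 TPS ceiling comes from. */
int main(void)
{
    int low = 32768, high = 61000;   /* ip_local_port_range            */
    int ports = high - low;          /* ~28232 usable ephemeral ports  */
    double tw_seconds = 60.0;        /* TIME_WAIT lifetime (2MSL)      */

    /* Every short connection to one backend ip:port burns one local port
     * for ~60s, so the sustainable rate toward that single backend is: */
    printf("port budget: %d\n", ports);
    printf("TPS ceiling toward one ip:port: %.1f\n", ports / tw_seconds);
    return 0;
}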

Why is Nginx at 100% CPU

I also quickly figured out why Nginx was eating up the machine's CPU: the problem lies in the port number search process.

Let's look at the most performance-consuming function:

int __inet_hash_connect(...)
{
    static u32 hint;
    u32 offset = hint + port_offset;
    ...
    inet_get_local_port_range(&low, &high);
    remaining = (high - low) + 1;   /* remaining is roughly 61000 - 32768 */
    ...
    for (i = 1; i <= remaining; i++) {
        port = low + (i + offset) % remaining;
        /* check whether this port is available */
        ...
        goto ok;
    }
    ...
ok:
    hint += i;
    ...
}

Looking at the code above, if no port number is available at all, the loop has to run remaining times before declaring exhaustion, i.e. 28232 iterations. In the normal case, thanks to the hint, each search starts at the next port number to be allocated, so a port is found within a handful of iterations. As shown in the figure below:

Therefore, when the port numbers are exhausted, Nginx's worker processes sit in the above for loop and eat up all the CPU.
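
To get a feel for why exhaustion pins a worker to the CPU, here is a toy simulation of the search above (my own sketch: a simple array stands in for the kernel's availability check, and hint behaves as in __inet_hash_connect):

#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define LOW  32768
#define HIGH 61000
#define REMAINING (HIGH - LOW)

static bool in_use[REMAINING];   /* toy stand-in for the kernel's port check      */
static unsigned int hint;        /* mirrors the static hint in __inet_hash_connect */

/* Returns the chosen port, or -1 if the whole range had to be scanned in vain. */
static int pick_port(int *iterations)
{
    for (int i = 1; i <= REMAINING; i++) {
        int idx = (i + hint) % REMAINING;
        if (!in_use[idx]) {
            in_use[idx] = true;
            hint += i;           /* the next search starts just past this port */
            *iterations = i;
            return LOW + idx;
        }
    }
    *iterations = REMAINING;     /* scanned all 28232 slots and found nothing */
    return -1;
}

int main(void)
{
    int it;

    pick_port(&it);
    printf("normal case: found a port after %d iteration(s)\n", it);

    memset(in_use, true, sizeof(in_use));   /* simulate full port exhaustion */
    pick_port(&it);
    printf("exhausted case: gave up after %d iterations\n", it);
    return 0;
}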

Why is it OK for Gateway1 to call Nginx

Simply because I had configured keepalive for Gateway1's calls to Nginx, so it uses long-lived connections and never hits the port exhaustion limit.

If there are multiple machines behind Nginx

Since the 100% CPU is caused by the port number search, and as long as a port is available the hint makes the search finish almost immediately, the number of iterations can be the difference between 1 and 28232.

This is because the port number restriction applies per remote ip:port. So as long as Nginx's Backend has several machines, or even several different ports on the same machine, Nginx is under no pressure, provided the load toward each does not exceed the critical point.
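
Assuming the roughly 470 TPS ceiling derived earlier applies per distinct backend ip:port, a back-of-the-envelope sketch (my own, not from the source) shows how the ceiling scales with the number of backend endpoints:

#include <stdio.h>

/* The local port budget is consumed per remote ip:port, so each distinct
 * backend endpoint gets its own ~28232-port budget. */
int main(void)
{
    int ports = 61000 - 32768;      /* per-destination ephemeral port budget */
    double tw_seconds = 60.0;       /* TIME_WAIT lifetime                    */

    for (int backends = 1; backends <= 4; backends++)
        printf("%d backend ip:port endpoint(s) -> ~%.0f TPS before exhaustion\n",
               backends, backends * ports / tw_seconds);
    return 0;
}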

Increase the port number range

The least clever solution, of course, is to enlarge the port range so that more TIME_WAIT sockets can be tolerated. tcp_max_tw_buckets caps the number of TIME_WAIT sockets kept by the kernel; if the size of the port range minus tcp_max_tw_buckets stays above a certain value, a free port is always available, so we avoid breaking down again once the (now higher) critical point is crossed.

cat /proc/sys/net/ipv4/ip_local_port_range
22768    61000
cat /proc/sys/net/ipv4/tcp_max_tw_buckets
20000 
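
Using the values shown above, a quick sketch of the headroom this leaves (my own arithmetic on these numbers):

#include <stdio.h>

/* How much headroom the settings above leave: even if the kernel holds the
 * maximum number of TIME_WAIT sockets, this many local ports stay free. */
int main(void)
{
    int low = 22768, high = 61000;   /* enlarged ip_local_port_range */
    int max_tw = 20000;              /* tcp_max_tw_buckets           */

    int ports = high - low;          /* ~38232 ephemeral ports       */
    printf("ports always free: %d\n", ports - max_tw);   /* ~18232   */
    return 0;
}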

Enable tcp_tw_reuse

Linux actually already provides a solution to this problem: the tcp_tw_reuse parameter.

echo '1' > /proc/sys/net/ipv4/tcp_tw_reuse 

TIME_WAIT sockets linger because it takes 1 minute for them to be reclaimed. This 1 minute is the 2MSL time specified by the TCP protocol, which Linux fixes at 1 minute.

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
                  * state, about 60 seconds    */ 

The reason for 2MSL is to let stale packets left on the network die out so they cannot corrupt a new socket with the same five-tuple; in other words, reusing the five-tuple within 2MSL (1 min) is risky. To solve this problem, Linux takes a number of measures so that, in most cases, a TIME_WAIT socket older than 1s can be reused. The following code checks whether a TIME_WAIT socket can be reused.

__inet_hash_connect
 |->__inet_check_established

static int __inet_check_established(......)
{
    ...
    /* Check TIME-WAIT sockets first. */
    sk_nulls_for_each(sk2, node, &head->twchain) {
        tw = inet_twsk(sk2);

        if (INET_TW_MATCH(sk2, net, hash, acookie,
                          saddr, daddr, ports, dif)) {
            if (twsk_unique(sk, sk2, twp))
                goto unique;
            else
                goto not_unique;
        }
    }
    ...
}

The core function is twsk_unique, and its logic is as follows:

int tcp_twsk_unique(......)
{
    ...
    if (tcptw->tw_ts_recent_stamp &&
        (twp == NULL || (sysctl_tcp_tw_reuse &&
                         get_seconds() - tcptw->tw_ts_recent_stamp > 1))) {
        /* Advance write_seq to tw_snd_nxt + 65535 + 2 so that, at a data
         * transfer rate <= 80Mbit/s, the sequence numbers will not overlap. */
        tp->write_seq = tcptw->tw_snd_nxt + 65535 + 2;
        ...
        return 1;
    }
    return 0;
}

The logic of the above code looks like this:

With tcp_timestamps and tcp_tw_reuse enabled, a port whose old socket sits in TIME_WAIT can be reused by connect() as long as the last timestamp recorded on that TIME_WAIT socket is more than 1 second old. At the same time, to prevent potential sequence number conflicts, write_seq is advanced by 65537, so that as long as a single socket's transfer rate stays below 80Mbit/s, the sequence numbers will not overlap (conflict).
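
Restating that condition as plain C may make it easier to read; this is a simplified sketch of the check in tcp_twsk_unique, not the kernel's actual code, and all names are illustrative:

#include <stdbool.h>
#include <stdio.h>

/* Simplified sketch of the reuse decision: a port held by a TIME_WAIT socket
 * may be taken over by a new connect() only if that socket recorded a
 * timestamp and the timestamp is more than 1 second old (with tcp_tw_reuse
 * and tcp_timestamps enabled). */
static bool tw_port_reusable(long tw_ts_recent_stamp,  /* last timestamp seen (s) */
                             long now_seconds,         /* current time (s)        */
                             bool tcp_tw_reuse_on)
{
    return tw_ts_recent_stamp != 0 &&
           tcp_tw_reuse_on &&
           now_seconds - tw_ts_recent_stamp > 1;
}

int main(void)
{
    /* A TIME_WAIT socket last saw a timestamp at t=100s. */
    printf("reusable after 2s: %d\n", tw_port_reusable(100, 102, true));
    printf("reusable after 1s: %d\n", tw_port_reusable(100, 101, true));
    return 0;
}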

Meanwhile, the timing of when tw_ts_recent_stamp is set is shown in the figure below:

So if packets keep arriving for a socket in TIME_WAIT, its tw_ts_recent_stamp keeps being refreshed, which delays the moment when the corresponding port becomes reusable. With this parameter enabled, the TPS each upstream can tolerate jumps from 470.5 TPS (28232/60) to 28232 TPS (28232/1), a 60-fold increase.

If that is still not enough, you can further raise the TPS by combining this with the port range increase and the tcp_max_tw_buckets reduction described above. However, reducing tcp_max_tw_buckets carries a risk of sequence number overlap, because sockets dropped for exceeding the limit never go through the 2MSL stage at all.

Do not turn on tcp_tw_recycle

Enabling the tcp_tw_recycle parameter has a significant impact in NAT environments, so it is recommended not to enable it. For details, see my blog post:

https://my.oschina.net/alchemystar/blog/3119992 

Nginx upstream: use long connections

In fact, the whole series of problems above is caused by Nginx using short connections to the Backend. Since version 1.1.4, Nginx has supported keepalive (long) connections to backend machines. An upstream enables them with the following configuration:

upstream backend {
    server 127.0.0.1:8080;
    # It should be particularly noted that the keepalive directive does not limit
    # the total number of connections to upstream servers that an nginx worker
    # process can open. The connections parameter should be set to a number small
    # enough to let upstream servers process new incoming connections as well.
    keepalive 32;
    # Set the maximum idle time of a backend connection to 30s
    keepalive_timeout 30s;
}

With this, both the front end and the back end use long connections, and everyone can play happily again.

Risk points arising therefrom

Exhausting the ports toward a single remote ip:port maxes out the CPU, so be careful when configuring the Nginx upstream. Imagine this scenario: a PE scales out Nginx and, to play it safe, starts with only a single Backend to see how things go. If the traffic is fairly heavy, crossing the critical point causes a flood of errors (while the application itself is under no pressure at all, since the critical value is only 470.5 TPS (28232/60)), and even requests for other domains served by that Nginx get no response because the CPU is exhausted. A few more Backends, or tcp_tw_reuse, may be a good choice.

Conclusion

No matter how powerful an application is, it still runs on top of the kernel and cannot escape the Linux kernel's grasp. So tuning the Linux kernel's own parameters makes a lot of sense. And reading some kernel source code undoubtedly gives us great power in troubleshooting production problems, and can also guide us around quite a few pitfalls!