Preface

As it turns out, reading the Linux kernel source code pays off, especially when troubleshooting. The moment you see an error, the symptom, its cause, and the solution can flash through your mind, and even obscure corners of the symptom can be explained quickly. Having read part of the Linux TCP stack source, I found the following problem very smooth to solve.

At the scene of the Bug

First of all, it's not a hard problem to solve, but it is an interesting phenomenon. To reproduce it: I was stress testing my own Dubbo protocol tunnel gateway (the gateway design itself is also interesting and will be covered in a later blog post). First, look at the topology of the stress test:

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/501064d41dcf4f40bea8e753e7debe4d?from=pc)

To test the performance of a single gateway, only one gateway is kept on each side, namely Gateway1 and Gateway2. When the load reaches a certain level, errors start to appear and the stress test grinds to a halt. The natural assumption is that the gateway can't handle the load.

Gateway situation

Logging in to the Gateway2 machine showed no errors at all, while Gateway1 had a large number of 502 errors, i.e. Bad Gateway responses.

![solution Bug — Nginx 502 Bad Gateway](https://p1-tt.byteimg.com/origin/pgc-image/d2f327ec6de943e89778dc2db44d609e?from=pc)

Next, check the load on Gateway2 through monitoring. It turns out Gateway2 uses only one core of its 4-core 8G machine, nowhere near a bottleneck. Is it an IO problem then? One look at the meagre network card traffic dispels that conjecture.

The CPU utilization of the Nginx machine is close to 100%

Nginx is running out of CPU!

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/dfd9dcdc19da41ee955b6e1efd17dfea?from=pc)

Nginx has 4 workers, each occupying one core, and the CPU is full.

![solution Bug — Nginx 502 Bad Gateway](https://p1-tt.byteimg.com/origin/pgc-image/1400f2d38f4a49709f3b83b9d75ca10a?from=pc)

What? The supposedly powerful Nginx is this weak? What about the promised event-driven model, epoll edge triggering, and pure C implementation? It must be that I'm holding it wrong!

Bypassing Nginx and connecting directly: no pressure at all

Since Nginx is suspected to be the bottleneck, let's take it out of the picture and connect Gateway1 to Gateway2 directly. The TPS in the stress test goes up, and Gateway2's CPU uses at most 2 cores, no pressure at all.

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/3d3843b0a36d4425b30853bdca23192b?from=pc)

Go to Nginx and take a look at the logs

Since I didn't have access to the Nginx machine, I hadn't paid attention to its logs at first. Now I contacted the ops folks responsible for it to take a look. The access log contained a large number of 502 errors, so it was indeed Nginx. Looking at the error log, I then found a large number of

```
Cannot assign requested address
```

Having read the TCP source code, I realized in an instant: the port numbers are exhausted! Because Nginx uses short connections to its upstream Backend by default, a flood of request traffic creates a large number of TIME_WAIT connections.

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/741cf50e0232448aa89f05e9f3180988?from=pc)

These TIME_WAIT sockets occupy port numbers and generally take about 1 minute to be reclaimed by the kernel.

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/87f0d029d0a549ba86dada05ea85685b?from=pc)
```
cat /proc/sys/net/ipv4/ip_local_port_range
32768	61000
```

In other words, producing 28232 (61000 - 32768) TIME_WAIT sockets within one minute exhausts the port numbers, which works out to 470.5 TPS (28232/60), a load that a stress test reaches easily. Note that this limitation exists on the client side; there is no such limitation on the server side, because a server only listens on its one well-known port, such as 8080. On the upstream leg, Nginx acts as the client and Gateway2 acts as the server.

![solution Bug — Nginx 502 Bad Gateway](https://p3-tt.byteimg.com/origin/pgc-image/cf95760ad33443bf901539da3c70f3de?from=pc)
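If you want to watch this happening on the Nginx machine, a quick sanity check is to count the sockets sitting in TIME_WAIT. A minimal sketch, assuming the ss tool from iproute2 is available:

```bash
# Count sockets currently in TIME_WAIT (subtract 1 for the header line).
# In this single-backend setup, a count approaching the ~28k port range
# means connect() is about to fail with "Cannot assign requested address".
ss -tan state time-wait | wc -l
```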

Why is the Nginx CPU at 100%

I also quickly figured out why Nginx was eating up the machine's CPU: the problem lies in the port number search.

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/e8bfea7fc4e44320a739eeffe5334f1e?from=pc)

Let’s take a look at the most performance-intensive part of the function:

```c
int __inet_hash_connect(...)
{
	// note that hint is a static variable
	static u32 hint;
	// thanks to hint, the search does not start from 0 but from the next
	// port number to be allocated
	u32 offset = hint + port_offset;
	......
	inet_get_local_port_range(&low, &high);
	// remaining is 61000 - 32768
	remaining = (high - low) + 1;
	......
	for (i = 1; i <= remaining; i++) {
		port = low + (i + offset) % remaining;
		/* check whether the port is occupied */
		......
		goto ok;
	}
	......
ok:
	hint += i;
	......
}
```

If no port number is available at all, the loop runs through all 28232 remaining ports before declaring them exhausted. In the normal case, thanks to the hint, each search starts from the next port number to be allocated, so a free port is found within single-digit iterations, as shown below:

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/b785282e55784eaea55a3fd7f54ad70b?from=pc)

So once the port numbers run out, the Nginx worker processes bury themselves in this for loop and eat up the CPU.

![solution Bug — Nginx 502 Bad Gateway](https://p1-tt.byteimg.com/origin/pgc-image/11870a77d6bb4b94b645abb31330c257?from=pc)
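A profiler view is a handy way to confirm where a worker is burning its cycles. A rough sketch, assuming perf is installed and you have the required permissions (the way of picking a worker pid below is only an illustration); with the ports exhausted you would expect kernel symbols from the connect path, such as __inet_hash_connect, near the top:

```bash
# Sample one busy nginx worker and look for time spent in the kernel's
# ephemeral-port search (__inet_hash_connect / __inet_check_established).
perf top -p "$(pgrep -f 'nginx: worker process' | head -n 1)"
```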

Why is there no problem with Gateway1 calling Nginx

Gateway1 calls Nginx over keep-alive (long) connections, so it is not subject to port number exhaustion.

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/ac76c34b0df744f89f4a4e349fe8c230?from=pc)

There are multiple machines behind Nginx

The 100% CPU is caused by the port number search, and as long as any port number is still available, the hint makes the difference between roughly 1 iteration and 28232 iterations.

![solution Bug — Nginx 502 Bad Gateway](https://p1-tt.byteimg.com/origin/pgc-image/cce2e8db73c84b1ea619e91f94a7ac90?from=pc)

The port number restriction is specific to a particular remote server:port pair. So as long as Nginx's Backend consists of multiple machines, or even multiple different ports on the same machine, each destination has its own port space, and Nginx is under no pressure as long as no single destination crosses the critical point.

![solution Bug — Nginx 502 Bad Gateway](https://p3-tt.byteimg.com/origin/pgc-image/ca8014b01d4b4b92a81f11859ac65241?from=pc)
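Because the limit is per destination, it also helps to see how the TIME_WAIT sockets are distributed across remote ip:port pairs. A small sketch with ss and awk, assuming the default ss column layout where the peer address is the 4th column:

```bash
# Group TIME_WAIT sockets by remote address:port; a single destination
# sitting near ~28k entries means that destination's port space is exhausted.
ss -tan state time-wait | awk 'NR > 1 {print $4}' | sort | uniq -c | sort -rn | head
```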

Enlarge the port number range

A no-brainer mitigation is of course to enlarge the port number range so the system can absorb more TIME_WAIT sockets. Together with tcp_max_tw_buckets, which caps the number of TIME_WAIT sockets the kernel keeps, as long as (port range) - tcp_max_tw_buckets stays above a certain margin, a free port is always available. That way, even if the raised critical point is approached again, it will not be broken through.

```
cat /proc/sys/net/ipv4/ip_local_port_range
22768	61000
cat /proc/sys/net/ipv4/tcp_max_tw_buckets
20000
```
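For reference, a sketch of how these two knobs could be adjusted at runtime with sysctl (the values are simply the ones shown above, not a recommendation; persist them in /etc/sysctl.conf if they work out for you):

```bash
# Widen the ephemeral port range and cap the number of TIME_WAIT buckets,
# keeping (port range) - tcp_max_tw_buckets comfortably positive.
sysctl -w net.ipv4.ip_local_port_range="22768 61000"
sysctl -w net.ipv4.tcp_max_tw_buckets=20000
```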

Enable tcp_tw_reuse

Linux already has a solution to this problem: the tcp_tw_reuse parameter.

```bash
echo '1' > /proc/sys/net/ipv4/tcp_tw_reuse
```
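One caveat: as the kernel code quoted below shows, tcp_tw_reuse relies on TCP timestamps (tw_ts_recent_stamp), and it only applies to outgoing, client-side connections. A quick check, assuming the default has not been changed (timestamps are enabled, i.e. 1, on most distributions):

```bash
# tcp_tw_reuse depends on TCP timestamps; verify they are enabled.
cat /proc/sys/net/ipv4/tcp_timestamps
```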

In fact, the reason TIME_WAIT sockets pile up is that it takes 1 minute to recycle them. This 1 minute is the 2MSL of the TCP protocol, which Linux hard-codes as 60 seconds.

```c
#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
				  * state, about 60 seconds	*/
```

The purpose of 2MSL is to keep packets still in flight on the network from affecting a new socket with the same quintuple, which means reusing a quintuple within 2MSL (1 min) is risky. To solve this, Linux takes a number of precautions so that, in most cases, a TIME_WAIT becomes reusable within 1s. The following code checks whether a given TIME_WAIT can be reused.

```c
__inet_hash_connect
	|->__inet_check_established

static int __inet_check_established(......)
{
	......
	/* Check TIME-WAIT sockets first. */
	sk_nulls_for_each(sk2, node, &head->twchain) {
		tw = inet_twsk(sk2);
		// if a TIME_WAIT socket matching the quintuple is found,
		// check whether it can be reused
		if (INET_TW_MATCH(sk2, net, hash, acookie,
				  saddr, daddr, ports, dif)) {
			if (twsk_unique(sk, sk2, twp))
				goto unique;
			else
				goto not_unique;
		}
	}
	......
}
```

The core function is twsk_unique, and its logic is as follows:

```c
int tcp_twsk_unique(......)
{
	......
	if (tcptw->tw_ts_recent_stamp &&
	    (twp == NULL || (sysctl_tcp_tw_reuse &&
			     get_seconds() - tcptw->tw_ts_recent_stamp > 1))) {
		// set write_seq to tw_snd_nxt + 65535 + 2; this ensures the
		// sequence numbers do not overlap as long as the data transfer
		// rate is <= 80Mbit/s
		tp->write_seq = tcptw->tw_snd_nxt + 65535 + 2;
		......
		return 1;
	}
	return 0;
}
```

The logic of the above code is as follows:

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/4bb71f935bb7404696990e9175db7e19?from=pc)

With tcp_timestamps and tcp_tw_reuse enabled, connect() can reuse a port as soon as more than 1s has passed since the last timestamp recorded by the TIME_WAIT socket occupying it, shortening the previous 1 minute to 1s. At the same time, to prevent potential sequence number conflicts, write_seq is advanced by 65537, so that sequence numbers will not overlap (conflict) as long as the single-socket transfer rate stays below 80Mbit/s. The timing of when tw_ts_recent_stamp is set is shown below:

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/41aebeb1b2044b39845ed4251031490a?from=pc)

Therefore, if a socket has entered TIME_WAIT but packets for it keep arriving and refreshing its timestamp, the port it occupies stays unavailable. With this parameter enabled, the shortening from 1 min to 1s raises the TPS Nginx can sustain per upstream from 470.5 TPS (28232/60) to 28232 TPS (a 60-fold increase). If even more headroom is needed, this can be combined with a smaller tcp_max_tw_buckets and a larger port range, although lowering tcp_max_tw_buckets has its own trade-offs.

Do not enable tcp_tw_recycle

You are advised not to enable tcp_tw_recycle, because it causes serious problems in NAT environments.

Enable keepalive on the Nginx upstream

In fact, all of the above problems are caused by Nginx using short connections to the Backend. Since version 1.1.4, Nginx has supported long (keepalive) connections to backend machines. To enable them in the upstream block:

```nginx
upstream backend {
    server 127.0.0.1:8080;
    # Note in particular that the keepalive directive does not limit the total
    # number of connections to upstream servers that an nginx worker process
    # can open. The connections parameter should be set to a number small
    # enough to let upstream servers process new incoming connections as well.
    keepalive 32;
    # maximum idle time of a backend connection: 30s
    keepalive_timeout 30s;
}
```
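A usage note that is easy to miss: according to the Nginx documentation, for the upstream keepalive pool to actually be used, the location that proxies to this upstream also needs `proxy_http_version 1.1;` and `proxy_set_header Connection "";`; otherwise Nginx keeps issuing HTTP/1.0 requests and closes the connection after each one.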

With long connections on both the front-end and back-end legs, everyone can play happily again.

![solution Bug — Nginx 502 Bad Gateway](https://p6-tt.byteimg.com/origin/pgc-image/490b183c134f462c921f9e05bb89775d?from=pc)

The resulting risk points

Exhausting the ports toward a single remote ip:port can drive the CPU to 100%, so be careful when configuring Nginx upstreams. If the traffic through Nginx is large enough to break through the critical point, a flood of errors follows (while the application itself is under no pressure, since the critical value is only 470.5 TPS (28232/60)). Even requests for other domains served by the same Nginx go unanswered because the CPU is exhausted. A few more Backend machines, or tcp_tw_reuse, may be a good choice.

Conclusion

No matter how powerful an application is, it still runs on top of the kernel and cannot escape the cage of the Linux kernel. So tuning the parameters of the Linux kernel itself makes sense. And reading some kernel source code undoubtedly helps a great deal when investigating production problems, and steers us around quite a few pitfalls!