Solving a Bug – A Performance Bottleneck Caused by NAT

I recently solved a very twisty problem, starting from a packet capture and tracing it all the way down to subtle differences between kernel versions, until every observed phenomenon could be explained. This post records the whole process, in the hope that it helps other readers. (It may be a bit long, but it's worth it.)

Environment introduction

Let’s start with the offending environment. The call topology is shown below:

Call topology



The partner's machines sit behind NAT, which maps their many source IPs to a single egress IP 20.1.1.1, while on our side multiple Nginx instances are exposed through the same destination IP 30.1.1.1. As a result, between the firewall and the LVS, all requests travel over the same IP address pair (20.1.1.1, 30.1.1.1).

The destination port number is always 443.

Short connections – HTTP/1.0

The partner talks to our Nginx over short connections using HTTP/1.0, so our Nginx actively closes the connection after each request completes, producing a large number of TIME_WAIT sockets.

Note that which side ends up in TIME_WAIT depends on who closes the socket first, the Server or the Client. Since Nginx, acting as the Server, closes first, the TIME_WAIT sockets pile up on the Server side.

Kernel Parameter Configuration

The kernel parameters are as follows:

cat /proc/sys/net/ipv4/tcp_tw_reuse
0
cat /proc/sys/net/ipv4/tcp_tw_recycle
0
cat /proc/sys/net/ipv4/tcp_timestamps
1

tcp_tw_recycle is already set to 0, which avoids the well-known connection failures caused by tcp_timestamps plus tcp_tw_recycle behind NAT. See my previous blog post for details:

https://my.oschina.net/alchemystar/blog/3119992

The bug scene

Ok, now that we’ve introduced the environment, we can formally describe the Bug scene.

A large number of connection creation exceptions occur on the Client, but the Server cannot detect them

On the surface, the partner's application reports a large number of connection-creation exceptions, yet there is no corresponding error log on the Server side, as if those requests never happened.

LVS monitoring curve

After the problem occurred, I reviewed the corresponding LVS monitoring curves, and one of them looked very strange, as shown below:

What's going on? It looks as if connections can barely be established. Yet even with all those business errors, the traffic is still high: the logs show up to 550 requests per second, which contradicts the mere 30 or so new connections per second shown on this curve!

The problem occurs at almost the same time every day

After a few days of observation, I found that the errors started around 10 o'clock every day and slowly cleared up around 12 o'clock.

It felt as if some daily 10 o'clock activity pushed traffic past a system bottleneck and exposed the problem; after about 11:40 the traffic slowly dropped and the system recovered. Could LVS really be the limit? Falling apart at a mere 550 TPS?

Is it a network problem?

Could it be the legendary "network problem"? A look at the monitoring showed that traffic was indeed up, but it was only about 1/8 of the available bandwidth, nowhere near saturating the network.

Packet capture

Never mind the theories, capture packets first!

Capture results

Here the author gives a typical packet capture result:

No.  Time         Source    Destination  Src Port  Dst Port  Info
1    09:57:30.60  30.1.1.1  20.1.1.1     443       33735     [FIN,ACK] Seq=507 Ack=2195 TSval=1164446830
2    09:57:30.64  20.1.1.1  30.1.1.1     33735     443       [FIN,ACK] Seq=2195 Ack=508 TSval=2149756058
3    09:57:30.64  30.1.1.1  20.1.1.1     443       33735     [ACK] Seq=508 Ack=2196 TSval=1164446863
4    09:59:22.06  20.1.1.1  30.1.1.1     33735     443       [SYN] Seq=0 TSval=2149514922
5    09:59:22.06  30.1.1.1  20.1.1.1     443       33735     [ACK] Seq=1 Ack=1487349876 TSval=1164558280
6    09:59:22.08  20.1.1.1  30.1.1.1     33735     443       [RST] Seq=1487349876

As the capture shows, the connection 33735->443 first goes through a normal four-way close. Nearly two minutes later, the same port 33735 tries to establish a new connection to 443, but 443 replies with an inexplicable ACK, which makes 33735 send a RST!

How does this phenomenon arise?

First of all, the most suspicious thing: why was that mysterious ACK sent back at all? Note that the Ack value shown was adjusted by Wireshark for display: for convenience, Wireshark shows sequence numbers relative to an initial Seq of 0, while the real Seq is a random number. Could this relative display be misleading us in some corner case? Better to look at the raw numbers, so I turned off Wireshark's

Relative sequence numbers

option. With relative sequence numbers disabled, the capture immediately became much more interesting. The adjusted result is as follows:

No.  Time         Source    Destination  Src Port  Dst Port  Info
1    09:57:30.60  30.1.1.1  20.1.1.1     443       33735     [FIN,ACK] Seq=909296387 Ack=1556577962 TSval=1164446830
2    09:57:30.64  20.1.1.1  30.1.1.1     33735     443       [FIN,ACK] Seq=1556577962 Ack=909296388 TSval=2149756058
3    09:57:30.64  30.1.1.1  20.1.1.1     443       33735     [ACK] Seq=909296388 Ack=1556577963 TSval=1164446863
4    09:59:22.06  20.1.1.1  30.1.1.1     33735     443       [SYN] Seq=69228087 TSval=2149514922
5    09:59:22.06  30.1.1.1  20.1.1.1     443       33735     [ACK] Seq=909296388 Ack=1556577963 TSval=1164558280
6    09:59:22.08  20.1.1.1  30.1.1.1     33735     443       [RST] Seq=1556577963

Looking at the table, the Seq and Ack in the server's spurious ACK (packet 5) are exactly the same as those in the final ACK of the earlier four-way close (packet 3). In other words, the socket for this quintuple did not disappear after the four-way close; it was still alive roughly 111.5 seconds later! According to the TCP/IP state transition diagram, only a socket in the TIME_WAIT state behaves like this.
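As a quick sanity check on that timeline, the gap between the last ACK of the four-way close and the new SYN can be computed from the capture timestamps alone (a throwaway sketch of mine, not output from any capture tool):

    #include <stdio.h>

    /* Convert an hh:mm:ss.xx capture timestamp to seconds since midnight. */
    static double to_seconds(int h, int m, double s)
    {
        return h * 3600 + m * 60 + s;
    }

    int main(void)
    {
        /* Final ACK of the four-way close vs. the new SYN on the same port 33735 */
        double last_ack = to_seconds(9, 57, 30.64);
        double new_syn  = to_seconds(9, 59, 22.06);

        /* Prints about 111.4s: just under two minutes, within the ~112.5s
         * worst-case TIME_WAIT window derived later in this post. */
        printf("gap = %.2f s\n", new_syn - last_ack);
        return 0;
    }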

We can look at the Linux kernel source code for TIME_WAIT processing:

    switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
    // TCP_TW_SYN: reuse the quintuple and go to SYN_RECV via the listener
    case TCP_TW_SYN: {
        struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
                                                &tcp_hashinfo,
                                                iph->saddr, th->source,
                                                iph->daddr, th->dest,
                                                inet_iif(skb));
        if (sk2) {
            inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
            inet_twsk_put(inet_twsk(sk));
            sk = sk2;
            goto process;
        }
        /* Fall through to ACK */
    }
    // TCP_TW_ACK: re-send the last ACK (the mysterious ACK in our capture)
    case TCP_TW_ACK:
        tcp_v4_timewait_ack(sk, skb);
        break;
    // TCP_TW_RST: send a reset
    case TCP_TW_RST:
        tcp_v4_send_reset(sk, skb);
        inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
        inet_twsk_put(inet_twsk(sk));
        goto discard_it;
    // TCP_TW_SUCCESS: discard the packet without any response
    case TCP_TW_SUCCESS:;
    }
    goto discard_it;

The branch we care about is TCP_TW_ACK, which re-sends the last ACK to the peer — exactly the mysterious ACK in our capture. There is also TCP_TW_SYN, which means the quintuple in TIME_WAIT can be reused immediately, skipping the 2MSL wait and moving straight to SYN_RECV!



The decision is made in the tcp_timewait_state_process function. Let's focus on the branches we care about:

    enum tcp_tw_status tcp_timewait_state_process(struct inet_timewait_sock *tw,
                                                  struct sk_buff *skb,
                                                  const struct tcphdr *th)
    {
        bool paws_reject = false;
        ......
        paws_reject = tcp_paws_reject(&tmp_opt, th->rst);

        if (!paws_reject &&
            (TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
             (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq || th->rst))) {
            ......
            // Duplicate ACK: discard this packet
            return TCP_TW_SUCCESS;
        }

        // A SYN that passes the PAWS check and whose sequence number or
        // timestamp moves forward may reuse the TIME_WAIT quintuple and
        // re-establish a connection
        if (th->syn && !th->rst && !th->ack && !paws_reject &&
            (after(TCP_SKB_CB(skb)->seq, tcptw->tw_rcv_nxt) ||
             (tmp_opt.saw_tstamp &&
              (s32)(tcptw->tw_ts_recent - tmp_opt.rcv_tsval) < 0))) {
            ......
            return TCP_TW_SYN;
        }

        // The PAWS check failed: bump the LINUX_MIB_PAWSESTABREJECTED counter
        if (paws_reject)
            NET_INC_STATS_BH(twsk_net(tw), LINUX_MIB_PAWSESTABREJECTED);

        if (!th->rst) {
            // A segment that fails the PAWS check or carries an ACK re-arms
            // the TIME_WAIT timer for another TCP_TIMEWAIT_LEN
            if (paws_reject || th->ack)
                inet_twsk_schedule(tw, &tcp_death_row, TCP_TIMEWAIT_LEN,
                                   TCP_TIMEWAIT_LEN);

            // Our out-of-order-timestamp SYN ends up here: reply with the last ACK
            return TCP_TW_ACK;
        }
        ......
    }

Whenever the PAWS (Protect Against Wrapped Sequence numbers) check rejects a segment, the LINUX_MIB_PAWSESTABREJECTED counter increases. It can be read on Linux with:

netstat -s | grep reject
216576 packets rejects in established connections because of timestamp

Sure enough, this counter was huge on the Nginx machines. And from my observation, the counter climbed fastest exactly when the problem occurred, around 10:00-12:00 every morning, increasing by roughly 10,000 each time. Let's look at tcp_paws_reject:

    static inline int tcp_paws_reject(const struct tcp_options_received *rx_opt,
                                      int rst)
    {
        if (tcp_paws_check(rx_opt, 0))
            return 0;
        // An RST is tolerated if ts_recent is older than TCP_PAWS_MSL
        if (rst && get_seconds() >= rx_opt->ts_recent_stamp + TCP_PAWS_MSL)
            return 0;
        return 1;
    }

    static inline int tcp_paws_check(const struct tcp_options_received *rx_opt,
                                     int paws_win)
    {
        // The check passes if the remembered ts_recent is not newer than the
        // timestamp carried by the current segment (rcv_tsval); here paws_win is 0
        if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win)
            return 1;
        // It also passes if more than 24 days have elapsed since ts_recent
        // was last recorded
        if (unlikely(get_seconds() >= rx_opt->ts_recent_stamp + TCP_PAWS_24DAYS))
            return 1;
        return 0;
    }

The TSval in the four-way close is 2149756058, while the TSval in the subsequent SYN is 2149514922:

No.  Time         Source    Destination  Src Port  Dst Port  Info
2    09:57:30.64  20.1.1.1  30.1.1.1     33735     443       [FIN,ACK] TSval=2149756058
4    09:59:22.06  20.1.1.1  30.1.1.1     33735     443       [SYN] TSval=2149514922

So the PAWS check fails. But why is the SYN's timestamp smaller than the timestamp of the earlier close? Because of NAT, of course: it hides multiple machines behind the same IP address, yet each machine's TCP timestamp is its own (roughly, time since that machine booted, not wall-clock time), so the values the server sees coming from "20.1.1.1" are drawn from different clocks and are not monotonic.
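To make this concrete, here is a minimal user-space sketch of the check (my own toy, not kernel code): ts_recent and the NAT-case TSval are taken from the capture above, while the single-machine TSval is a made-up illustration of a clock that simply keeps ticking.

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified form of the tcp_paws_check() logic shown earlier: the check
     * passes when the remembered ts_recent is not newer than the incoming
     * TSval (paws_win taken as 0, the 24-day case ignored). */
    static int paws_ok(uint32_t ts_recent, uint32_t rcv_tsval)
    {
        return (int32_t)(ts_recent - rcv_tsval) <= 0;
    }

    int main(void)
    {
        /* ts_recent remembered by the TIME_WAIT socket: the last TSval seen
         * from the peer, i.e. packet #2 of the capture. */
        uint32_t ts_recent = 2149756058u;

        /* Case 1: the SYN comes from the same machine, whose timestamp clock
         * kept ticking (hypothetical value, larger than ts_recent). */
        uint32_t syn_same_machine = 2149760000u;

        /* Case 2: the SYN comes from a different machine behind the same NAT,
         * booted later, so its TSval is smaller (packet #4 of the capture). */
        uint32_t syn_other_machine = 2149514922u;

        printf("single IP:  PAWS %s\n",
               paws_ok(ts_recent, syn_same_machine)
                   ? "passes -> TCP_TW_SYN, quintuple reused" : "rejects");
        printf("behind NAT: PAWS %s\n",
               paws_ok(ts_recent, syn_other_machine)
                   ? "passes" : "rejects -> TCP_TW_ACK -> client sends RST");
        return 0;
    }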

But one question remains: I remembered that TIME_WAIT, i.e. the 2MSL wait, is defined as 60s in the Linux source. Why did the socket in our capture survive for nearly two minutes?

Duration of TIME_WAIT

So I started reading the TIME_WAIT timer source code; the details are in another blog post of mine:

How long does TIME_WAIT last? A look at the Linux source: https://my.oschina.net/alchemystar/blog/4690516

The conclusion is as follows: when there are a large number of TIME_WAIT sockets, a TIME_WAIT can survive for up to 112.5s, nearly two minutes, which matches our capture. Of course, this calculation only applies to the Linux 2.6 and 3.10 kernels; the 3.10.1127 kernel maintained by Red Hat contains a further change, which leads to a phenomenon I found quite bizarre and will come back to later.

Conditions for the problem to occur

As explained above, the problem can only occur when a source port is reused while the Server still holds the corresponding socket in TIME_WAIT, and the new SYN arrives with an out-of-order timestamp. Since the machines behind the NAT all run on different timestamp clocks, out-of-order timestamps are highly likely!

When do source ports get reused?

As far as I know, the firewall selects source ports in a RoundRobin fashion: it walks from 2048 up to 65535 and then wraps back around to 2048.
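A minimal sketch of that selection logic as I understand it (my own illustration; the real firewall implementation is not available to me, and the 2048 lower bound is the assumption stated above):

    #include <stdint.h>
    #include <stdio.h>

    /* Round-robin source-port allocation: 2048..65535, then wrap back to 2048.
     * Once the counter wraps, a port is handed out again even if the previous
     * connection that used it is still in TIME_WAIT on the server side. */
    static uint16_t next_source_port(void)
    {
        static uint32_t cur = 2048;
        uint16_t port = (uint16_t)cur;

        cur = (cur == 65535) ? 2048 : cur + 1;
        return port;
    }

    int main(void)
    {
        /* Fast-forward to just before the wrap to show the roll-over point. */
        for (uint32_t i = 0; i < 63486; i++)
            next_source_port();

        /* Prints: 65534 65535 2048 */
        printf("%u %u %u\n", next_source_port(), next_source_port(),
               next_source_port());
        return 0;
    }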

Why didn't the load test show the problem?

When we ran load tests against this setup, the rate was far above 560 TPS, yet the problem never appeared. Why? Because of the TCP_TW_SYN branch: our load generator sits behind no NAT, so from its single IP the timestamps are always monotonically increasing. Even above 560 TPS, each SYN passes the PAWS check, the TIME_WAIT quintuple is reused, and the connection moves straight to SYN_RECV and succeeds.

How to explain the monitoring curve of LVS?

Wait a moment — 564 TPS? That is almost exactly the rate at which the LVS new-connection curve drops off a cliff! Could it be that once port numbers are reused, LVS stops creating new connections (more precisely, new session entries), so its counter stops growing? I went straight to the LVS source:

    tcp_conn_schedule
        |->ip_vs_schedule   // create a new conn entry; on success, new-connection count++
            |->ip_vs_conn_stats

    In our entry function ip_vs_in:

    static unsigned int ip_vs_in(unsigned int hooknum, struct sk_buff *skb,
                                 const struct net_device *in,
                                 const struct net_device *out,
                                 int (*okfn)(struct sk_buff *))
    {
        ......
        // Look up an existing connection entry for this packet's quintuple
        cp = pp->conn_in_get(af, skb, pp, &iph, iph.len, 0, &res_dir);
        if (likely(cp)) {
            /* For full-nat/local-client packets, it could be a response */
            if (res_dir == IP_VS_CIDX_F_IN2OUT) {
                return handle_response(af, skb, pp, cp, iph.len);
            }
        } else {
            /* create a new connection: this is where conn++ happens */
            int v;

            if (!pp->conn_schedule(af, skb, pp, &v, &cp))
                return v;
        }
        ......
    }

Clearly, if an entry for the current quintuple already exists it is simply reused, and a new entry is created only when none exists. An entry only disappears after the LVS FIN timeout expires (120s in my environment). So once the port wrap-around interval drops below the entry lifetime, LVS just reuses the old entry, the new-connection counter stops moving, and we get the curve shown earlier.

When traffic drops and port numbers are no longer being reused, the curve shoots back up, consistent with this conjecture. In other words, once the rate on this fixed (source IP, destination IP, destination port) triple exceeds 529 TPS (63487/120), the new-connection count for this service stops growing.

The 30 or so new connections per second still visible in the figure (roughly 560 - 529) come from the VIP of another service; that VIP carries little traffic, so its ports are not being reused. Since LVS counts new connections across all services, a small number of new connections remains even after port reuse kicks in.

It is worth noting that once a port number is reused, LVS forwards the packets using the existing mapping entry. The same quintuple is therefore always forwarded to the same Nginx rather than going through WRR (Weighted Round Robin) load balancing again; in other words, port reuse gives the traffic a certain affinity.
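Here is a toy model of why port reuse produces this affinity (my own sketch of the behavior described above, not LVS code): as long as the old entry for a quintuple is still alive, the lookup wins and the scheduler is never consulted.

    #include <stdio.h>

    #define NPORTS      65536
    #define NGINX_COUNT 2

    /* conn_table[src_port]: 0 means "no entry", otherwise 1 + backend index.
     * Source IP, destination IP and destination port are all fixed in our
     * scenario, so the source port alone identifies the quintuple. */
    static int conn_table[NPORTS];

    static int forward(int src_port)
    {
        static int rr; /* stand-in for the WRR scheduler */

        if (conn_table[src_port] == 0)           /* no entry: schedule, conn++ */
            conn_table[src_port] = 1 + (rr++ % NGINX_COUNT);

        return conn_table[src_port] - 1;         /* entry exists: same backend */
    }

    int main(void)
    {
        printf("first use of port 33735         -> nginx %d\n", forward(33735));
        printf("33735 reused before FIN timeout -> nginx %d\n", forward(33735));
        printf("fresh port 33736                -> nginx %d\n", forward(33736));
        return 0;
    }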

The performance bottleneck of a fixed IP address pair under NAT

Now we can draw a conclusion. With the source IP, destination IP and destination port all fixed, the only variable left in the quintuple is the source port, which tops out at 65535. Excluding the reserved ports (0-2048, assuming the firewall reserves the first 2048), at most 63487 ports are usable.

Under high load, every port used leaves behind a TIME_WAIT of up to 112.5s on the server. So at 63487/112.5, roughly 564 TPS of short connections, reused ports start landing on sockets that are still in TIME_WAIT. Add the PAWS check on top and you get a flood of connection-creation exceptions!
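The two thresholds quoted in this post can be double-checked with a few lines of arithmetic, using only the numbers already assumed above (63487 usable ports, a 120s LVS FIN timeout, a 112.5s worst-case TIME_WAIT):

    #include <stdio.h>

    int main(void)
    {
        double usable_ports = 65535 - 2048;   /* 63487, reserved ports excluded */

        /* LVS entries live for the 120s FIN timeout: above this rate a source
         * port wraps around while its old entry still exists, so the LVS
         * new-connection curve stops growing. */
        printf("LVS entry reuse from ~%.0f TPS\n", usable_ports / 120.0);

        /* TIME_WAIT lasts up to 112.5s on the 2.6/3.10 kernels: above this rate
         * a reused port can hit a socket still in TIME_WAIT and trip PAWS. */
        printf("TIME_WAIT reuse from ~%.0f TPS\n", usable_ports / 112.5);
        return 0;
    }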

This conclusion is consistent with both the application errors and the LVS monitoring curves I observed.

The LVS curve anomaly and the errors start at almost the same time

The LVS curve starts its vertical drop at 529 TPS and TIME_WAIT port reuse starts at 564 TPS; the two thresholds are so close that when the LVS curve goes wrong, the errors usually start at roughly the same time. That said, the LVS anomaly only tells us that entries are being reused, not that anything is broken: with suitable kernel parameters, ports can be reused without any errors at all.

When port numbers are being reused, the LVS new-connection count no longer reflects the real TPS.

Attempted fixes

Setting tcp_max_tw_buckets

First, I tried to cap the number of TIME_WAIT sockets on the Linux machines running Nginx:

echo '5000' > /proc/sys/net/ipv4/tcp_max_tw_buckets

The idea is simple: the fewer TIME_WAIT sockets there are, the lower the probability that a reused port hits one. After the change the error count did drop a lot. But since the port numbers keep wrapping around once TPS exceeds the threshold, errors kept trickling in; it was no fundamental fix.

Setting tcp_max_tw_buckets to 0 should, in theory, make the problem disappear entirely. But that amounts to throwing away TCP's carefully designed TIME_WAIT state, which I considered too risky, so I did not try it.

Trying to expand the source addresses

The root cause is that with four of the five tuple elements pinned down, only the source port (2048-65535) can vary. If we relax the source-IP constraint, for example by using three source IPs instead of one, the sustainable TPS obviously triples.

Expanding the destination addresses would achieve a similar effect.

However, according to our network engineers, the partner exits their firewall with a single IP, and our firewall cannot map that one IP onto several, so the source addresses cannot be expanded without the partner changing their network settings. Expanding the destination IP would likewise require changes on the partner's side. In the spirit of customer service, I looked for other solutions first.

Scale out Nginx? No effect

Before I understood the weird LVS curve, I did not know that LVS shows this "affinity" under port reuse. By ordinary load-balancing logic, adding Nginx instances should lower the probability of a reused port hitting a TIME_WAIT quintuple, so we doubled the number of Nginx machines. But because of LVS's affinity under port reuse, the reused quintuples still landed on the same old Nginx, and the hits on TIME_WAIT sockets did not drop at all!

The odd result of the Nginx expansion

Once I understood LVS's "affinity", I fully expected the Nginx expansion to make the errors worse, but reality slapped me in the face: the total error volume stayed roughly the same as before. Even more strangely, the inactive-connection monitoring (connections not in the ESTABLISHED state) showed a load imbalance once port reuse started, as shown in the figure below.

Logging on to the new Nginx machines, I found only a small number of PAWS-related errors: the counter grew very slowly, by only 100 or so per hour, while on the old Nginx it grew by more than 10,000 in a single hour!

Given that error ratio, the curve is easy to explain. Because of LVS affinity, a reused port that maps to an old Nginx will most likely keep failing. After the FIN timeout expires and load balancing is performed again, the connection will statistically succeed if it lands on a new Nginx, but will keep failing if it lands on an old one, and so on. This creates a natural selection process that produces the curve.

Of course the actual process is a little bit more complicated than that, a few more steps, but that’s the general idea.

Once port reuse stops, connections succeed no matter which Nginx they land on, so the load gradually balances out again.

Why is the new Nginx doing so well?

Why did the newly added Nginx machines perform so well at this TPS? I puzzled over it for a whole day. After sleeping on it and comparing the two sets of machines, it suddenly clicked: the new Nginx machines run kernel 3.10! I quickly diffed the relevant code between the original 2.6 kernel and 3.10, but found nothing… and got stuck.

Linux official 3.10 and Red Hat branch 3.10.1127 differences

Wait — our production kernel is 3.10.1127, not the mainline kernel. Could the code differ? I immediately downloaded the 3.10.1127 source, and this comparison finally revealed the reason. See the following code:

    void inet_twdr_twkill_work(struct work_struct *work)
    {
        struct inet_timewait_death_row *twdr =
            container_of(work, struct inet_timewait_death_row, twkill_work);
        bool rearm_timer = false;
        int i;

        BUILD_BUG_ON((INET_TWDR_TWKILL_SLOTS - 1) >
                     (sizeof(twdr->thread_slots) * 8));

        while (twdr->thread_slots) {
            spin_lock_bh(&twdr->death_lock);
            for (i = 0; i < INET_TWDR_TWKILL_SLOTS; i++) {
                if (!(twdr->thread_slots & (1 << i)))
                    continue;

                while (inet_twdr_do_twkill_work(twdr, i) != 0) {
                    rearm_timer = true;
                    if (need_resched()) {
                        spin_unlock_bh(&twdr->death_lock);
                        schedule();
                        spin_lock_bh(&twdr->death_lock);
                    }
                }

                twdr->thread_slots &= ~(1 << i);
            }
            spin_unlock_bh(&twdr->death_lock);
        }

        // The addition in 3.10.1127: if there is still work left, re-arm the
        // timer to fire 1s later instead of waiting for the next time-wheel tick
        if (rearm_timer)
            mod_timer(&twdr->tw_timer, jiffies + HZ);
    }

As the code shows, 3.10.1127 speeds up the TIME_WAIT time wheel: the extra wait per pass shrinks from 7.5s to 1s. The corrected time wheel works out as follows:

The maximum TIME_WAIT duration drops from 112.5 seconds to 60.5 seconds (8.5 * 7 + 1).

In this state, the critical TPS for hitting a TIME_WAIT socket rises to (65535 - 2048) / 60.5 ≈ 1049 TPS. Since production traffic never reached that rate, reused ports on the newly added Nginx machines almost never hit a socket still in TIME_WAIT, so the error count did not grow there. Of course, the overall error count did not shrink either, because of the old Nginx machines.
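The same back-of-the-envelope check for the Red Hat 3.10.1127 kernel, using the 8.5 * 7 + 1 formula quoted above and the same 63487 usable ports:

    #include <stdio.h>

    int main(void)
    {
        double usable_ports = 65535 - 2048;   /* 63487 */
        double tw_len = 8.5 * 7 + 1;          /* 60.5s worst-case TIME_WAIT on 3.10.1127 */

        /* Prints ~1049 TPS: only above this rate can a wrapped-around port
         * hit a TIME_WAIT socket on the new kernel. */
        printf("critical rate on 3.10.1127: ~%.0f TPS\n", usable_ports / tw_len);
        return 0;
    }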

However, thanks to the selection effect described above, the longer port reuse goes on, the fewer errors are reported per second, until eventually all the LVS entries point at the new Nginx machines and the errors stop entirely.

What if TPS rises to 1049?

Of course, by the calculation above, if TPS keeps climbing to 1049 the errors will return; the new kernel merely raises the threshold. So I kept looking for a more thorough solution.

A quick aside

The Linux TCP stack in these versions handles TIME_WAIT with a time wheel, which I don't find particularly elegant. Linux already has a red-black-tree-based timer facility, yet that perfectly good scheme was set aside in favor of a low-precision and rather complex time wheel. Fortunately, in Linux 4.x the centralized time wheel was abandoned and each TIME_WAIT socket simply gets its own timer, which feels much more natural!

Turning off tcp_timestamps

I did not want to touch this parameter at first, because changing it turns off the PAWS check altogether, losing one layer of defense against out-of-order or wrapped segments. But to avoid asking the partner to change anything, adjusting this parameter was the only option left. And since this traffic runs over a dedicated line, the risk is manageable.

echo '0' > /proc/sys/net/ipv4/tcp_timestamps

Since then, business feedback has been good — the problem is finally solved! I should add that turning off tcp_timestamps was simply the choice available under our constraints; expanding the source or destination addresses would have been the better option.

Conclusion

Solving this problem was a bumpy ride. Along the way I read the LVS source, studied how the Linux 2.6 kernel handles the TIME_WAIT state, and then found the subtle difference between the 3.10 kernel and the 3.10.1127 kernel. I kept chasing clues until every doubt could be explained. The problem could probably have been worked around by trial and error without digging this deep, but those strange curves kept haunting me day and night, and the epiphany when the clues finally clicked into place is a wonderful feeling. That feeling is what keeps me motivated to tackle complex problems!

Welcome to follow my WeChat official account — it's full of useful content, and there's a welcome gift too!