Restarting does not solve the problem: MySQL connections were being closed on the far side, leaving a large number of sockets stuck in CLOSE_WAIT.

I recently ran into a production service whose socket resources kept filling up. I used various tools to analyze the problem and locate the offending code. Below is a review of how the problem was found and fixed.

Let's start with two screenshots: one shows the socket metrics when the service is healthy, the other shows them during the incident.

Figure 1: Normal monitoring

Figure 2: Monitoring when exceptions occur

As the chart shows, starting at about 04:00 socket usage keeps climbing; each trough is a restart that brings it back to normal, after which it climbs again without ever being released, and the interval before the next peak gets shorter and shorter.

After the first restart I checked the logs, saw no panic, and did not dig any deeper. I genuinely thought the restart had fixed it.

Background

The service is written in Go, has been running for nearly a year, is called by other services, and uses DB/Redis/MQ as its underlying resources.

To make the rest of the explanation easier to follow, here is the service's architecture diagram.

Figure 3: Service architecture

The architecture is very simple.

The problem started at about 08:20 in the morning, when alerts came in that the service was returning 504s. My first reaction was that the service had not been restarted for a long time (nearly two months) and might have some memory leak, so I restarted it directly without thinking much about it. That is the dip at the bottom of Figure 2: the service came back to normal after the restart.

At about 14:00, the 504 alerts fired again. By then I felt something was wrong, but since a big promotion was running that day, I immediately restarted the service once more. The alert came back about an hour later, and after several restarts the interval between them kept getting shorter. At that point it was clearly not a simple problem: restarting genuinely could not solve it, so I immediately applied for access to the machines and started troubleshooting. (The screenshots below are all from my local reproduction and have nothing to do with the production environment.)

Finding the problem

When a problem occurs, the first step is to analyze and hypothesize, then verify, and finally locate and fix. Based on the symptoms at the time, I made the following conjectures.

PS: all screenshots that follow were taken from my local reproduction.

Conjecture 1

Socket resources keep filling up. This has never happened before and appeared suddenly today, so I suspected the service was being overwhelmed by too many requests.

After checking the real-time QPS, I abandoned this idea: although traffic had increased, it was still well within what the server could handle (far below the load-test baseline).

Conjecture 2

Both machines failed at the same time, and restarting one also relieved the other. For a service deployed independently on two clusters, that is very strange.

Based on this, the conclusion is that one of the underlying resources the service depends on must have a problem; otherwise it would be impossible for independently deployed clusters of the same service to fail at the same time.

Since the monitoring pointed to a socket problem, I used the netstat command to check the current TCP connection states (this is a local test; the real values online were much larger):

```
/go/src/hello # netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 2
CLOSE_WAIT 23
```

Surprisingly, most connections are in the CLOSE_WAIT state. I then used `netstat -an` to look at them individually.

Figure 4: Lots of CLOSE_WAIT

  • CLOSED: the socket is not in use.
  • LISTENING: listening for incoming connections.
  • SYN_SENT: actively trying to establish a connection.
  • SYN_RECEIVED: performing the initial connection synchronization.
  • ESTABLISHED: the connection is established.
  • CLOSE_WAIT: the remote peer has closed the connection; waiting for the local socket to be closed.
  • FIN_WAIT_1: the local socket has been closed; waiting for the remote peer's acknowledgement.
  • CLOSING: the local socket was closed, then the remote close arrived; waiting for final confirmation.
  • LAST_ACK: waiting for the final acknowledgement after the remote peer has closed.
  • FIN_WAIT_2: the local socket has been closed; waiting for the close signal from the remote peer.
  • TIME_WAIT: the connection is closed; waiting in case the remote peer's FIN needs to be acknowledged again.
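If you want the same per-state summary from inside a program (for example to feed it into monitoring), a minimal Go sketch is shown below. It reads /proc/net/tcp, the Linux source that netstat itself consults; the hex state codes are the kernel's standard values, and everything else is my own illustration.

```go
// tcpstates.go: count local TCP sockets by state, similar to the awk
// one-liner above. Linux-only; reads IPv4 sockets from /proc/net/tcp.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// Hex state codes used by the Linux kernel in /proc/net/tcp.
var tcpStates = map[string]string{
	"01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
	"04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
	"07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
	"0A": "LISTEN", "0B": "CLOSING",
}

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Scan() // skip the header line
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 3 {
			counts[tcpStates[fields[3]]]++ // fields[3] is the state column
		}
	}
	for state, n := range counts {
		fmt.Println(state, n)
	}
}
```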

Why are there so many MySQL connections in CLOSE_WAIT? To make that clear, let's take a short detour into TCP's four-way close.

TCP's four-way close

First, the four-way close sequence:

Figure 5: TCP four-way close

In plain words, the exchange goes like this:

Client: I'm done, I'm ready to disconnect.

Server: Got it, but wait a moment, I still have to finish up.

After these two steps, the server is in the CLOSE_WAIT state. A while later, the server finishes its work.

Server: I'm done too, let's disconnect. (The server sends its FIN.)

Client: Goodbye. (The client sends an ACK back to the server.)

At this point the server can leave, but the client cannot: it must wait another 2MSL. Why can't the client leave right away? Because if the server never receives that last ACK, it will resend its FIN and ask again; if the client had already gone after sending the ACK, nobody would answer the retransmission. The length of the wait is also deliberate:

MSL (Maximum Segment Lifetime) is the maximum time a segment may live in the network; after that, it is discarded.

Don't confuse the client/server in the diagram with the client/server in the project. Just remember: the actively closing side sends the FIN (the "client" in the diagram), and the passively closing side (the "server") replies with an ACK and enters CLOSE_WAIT. If everything goes well, the passively closing side later sends its own FIN and moves to LAST_ACK.
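To make this concrete, here is a minimal toy sketch (my own illustration, not the production service): a TCP server whose kernel ACKs the client's FIN automatically, but whose code never calls Close(), so its end of the connection sits in CLOSE_WAIT indefinitely.

```go
// closewait_demo.go: deliberately leak connections into CLOSE_WAIT.
package main

import (
	"io"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			// Drain until the client closes its side; receiving the FIN
			// (which the kernel ACKs) moves this socket into CLOSE_WAIT.
			io.Copy(io.Discard, c)
			// Bug on purpose: no c.Close(), so we never send our own FIN
			// and the socket never leaves CLOSE_WAIT.
			time.Sleep(time.Hour) // keep the connection referenced
		}(conn)
	}
}
```

Connect with something like `nc 127.0.0.1 9000`, quit, and `netstat -an | grep 9000` will show the server side stuck in CLOSE_WAIT. With that picture in mind, back to the capture from the incident.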

TCP packet capture analysis:

```
/go # tcpdump -n port 3306
# Three-way handshake
11:38:15.679863 IP 172.18.0.5.38822 > 172.18.0.3.3306: Flags [S], seq 4065722321, win 29200, options [mss 1460,sackOK,TS val 2997352 ecr 0,nop,wscale 7], length 0
11:38:15.679923 IP 172.18.0.3.3306 > 172.18.0.5.38822: Flags [S.], seq 780487619, ack 4065722322, win 28960, options [mss 1460,sackOK,TS val 2997352 ecr 2997352,nop,wscale 7], length 0
11:38:15.679936 IP 172.18.0.5.38822 > 172.18.0.3.3306: Flags [.], ack 1, win 229, options [nop,nop,TS val 2997352 ecr 2997352], length 0

# MySQL disconnects: the load balancer sends me a FIN packet
11:38:45.693382 IP 172.18.0.3.3306 > 172.18.0.5.38822: Flags [F.], seq 123, ack 144, win 227, options [nop,nop,TS val 3000355 ecr 2997359], length 0
# I reply with an ACK. I should then have sent my own FIN, but I never did, so CLOSE_WAIT appeared. Why is that?
11:38:45.740958 IP 172.18.0.5.38822 > 172.18.0.3.3306: Flags [.], ack 124, win 229, options [nop,nop,TS val 3000360 ecr 3000355], length 0
```

```
src > dst: flags data-seqno ack window urgent options
```

`src > dst` means from the source address to the destination address. `flags` are the TCP flags in the packet: S (SYN), F (FIN), P (PUSH), R (RST), or "." (no flag). `data-seqno` is the sequence-number range of the data in the packet, `ack` is the sequence number expected next, and `window` is the size of the receive window.

Putting this together in plain words: the MySQL load balancer sent a FIN packet to my service, and my service replied with an ACK, entering the CLOSE_WAIT state. But as the passively closing side, it never sent its own FIN, so it stayed in CLOSE_WAIT and could never reach CLOSED.

From there I inferred several possible reasons for this situation:

  1. The load balancer stopped unexpectedly.

    Unlikely: if the load balancer had failed, it would have affected many services, not just mine.

  2. The MySQL load balancer's timeout is set too short, causing it to close the TCP connection before the business code finishes processing (see the pool-settings sketch after this list).

    Also unlikely, as the service has no time-consuming operations. I checked the load balancer configuration anyway: the timeout is set to 60s.

  3. MySQL connections are not being released

    This looks like a code quality problem, combined with an anomaly in today's data that triggered a code path never exercised before. For now, this seems the most likely cause.
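Even though cause 2 was ruled out here, it is generally worth keeping pooled connections from sitting idle longer than whatever timeout the load balancer enforces (60s in this setup), so the driver retires them before the middle box sends its FIN. Below is a minimal sketch using Go's standard database/sql pool settings; the go-sql-driver/mysql import and the DSN are assumptions for illustration, not the project's actual configuration.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // an assumed driver choice
)

func openDB() *sql.DB {
	// The DSN is made up for the example.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	db.SetConnMaxLifetime(50 * time.Second) // stay below the LB's 60s timeout
	db.SetMaxOpenConns(100)                 // cap concurrent connections
	db.SetMaxIdleConns(10)                  // keep a small idle pool
	return db
}

func main() {
	db := openDB()
	defer db.Close()
	log.Println(db.Ping())
}
```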

Finding the root cause

Since I did not write the business logic myself, I was worried it would take too long to spot the problem by reading it, so I used perf to draw a flame graph of all the call relationships (a pprof-based alternative is sketched right after the list below). Given the inference above that the code is not releasing MySQL connections, the cause had to be one of the following:

  1. Close is indeed never called
  2. There is a time-consuming operation (very visible in a flame graph) that causes a timeout
  3. A MySQL transaction is not handled correctly, e.g. never rolled back or committed

Because the flame graph contains so much information, I collapsed some irrelevant parts to make it easier to read.

Figure 6: Flame graph showing the problem

The flame graph clearly shows a transaction being started, but nowhere in the rest of the graph is there a Commit or Rollback. That is bound to cause problems, and it pinpoints where the issue is:

inside the MainController.update method. So, without further ado, I went straight into update and found the following code:

```go
func (c *MainController) update() (flag bool) {
	o := orm.NewOrm()
	o.Using("default")
	o.Begin()

	nilMap := getMapNil()
	if nilMap == nil { // only checks for nil; no rollback or commit is performed
		return false
	}

	nilMap[10] = 1
	nilMap[20] = 2
	if nilMap == nil && len(nilMap) == 0 {
		o.Rollback()
		return false
	}

	sql := "update tb_user set name=%s where id=%d"
	res, err := o.Raw(sql, "Bug", 2).Exec()
	if err == nil {
		num, _ := res.RowsAffected()
		fmt.Println("mysql row affected nums: ", num)
		o.Commit()
		return true
	}

	o.Rollback()
	return false
}
```

At this point the analysis is essentially done: getMapNil returns nil here, but the branch that handles it never rolls back the transaction. The fix is simply:

```go
if nilMap == nil {
	o.Rollback()
	return false
}
```
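Beyond the one-line fix, a more defensive shape I would consider (my own sketch against beego's ORM as used in the snippet above, not the fix that was actually committed) is to register the rollback with defer right after Begin, so every early return releases the transaction and its MySQL connection:

```go
// A sketch, not the committed fix: arrange the rollback up front so no
// early return can leave the transaction (and the connection) open.
func (c *MainController) update() (flag bool) {
	o := orm.NewOrm()
	o.Using("default")

	if err := o.Begin(); err != nil {
		return false
	}
	committed := false
	defer func() {
		if !committed {
			o.Rollback() // fires on every path that did not commit
		}
	}()

	nilMap := getMapNil()
	if nilMap == nil {
		return false // the deferred Rollback now runs here too
	}

	sql := "update tb_user set name=%s where id=%d"
	res, err := o.Raw(sql, "Bug", 2).Exec()
	if err != nil {
		return false
	}
	num, _ := res.RowsAffected()
	fmt.Println("mysql row affected nums: ", num)

	if err := o.Commit(); err != nil {
		return false
	}
	committed = true
	return true
}
```

This way, any early return added to the method later cannot reintroduce the leak.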

Conclusion

The whole analysis took quite a while. The biggest factor was my own bias: the service had run for a year without problems, so why would it suddenly break? Hence the initial suspicion fell on SRE, the DBA, and various pieces of infrastructure (we always suspect others first), and that wasted a lot of time.

The correct line of analysis is:

  1. When a problem occurs, check the logs immediately (here the logs indeed showed nothing);
  2. The monitoring clearly showed sockets growing, so `netstat` should have been used right away to see which process was responsible;
  3. Based on the `netstat` result, use `tcpdump` to capture packets and analyze why the connections are being closed passively (TCP knowledge is important here);
  4. If you are familiar with the code, inspect the business code directly; if not, use `perf` to print the code's call chains;
  5. Whether by reading the code or the flame graph, at this point you should be able to locate the problem quickly.

Why CLOSE_WAIT? Most of you probably know by now, but to put it very simply:

Because that branch never rolls back the transaction, the connection is never closed from the service's side. The MySQL load balancer therefore actively closes the connection when its 60-second timeout is reached, but as the packet capture shows, my service never completes the close, because the transaction in the code is never finished; as a result a large number of ports and connections stay occupied. Here is the capture again:

```
# MySQL disconnects: the load balancer sends me a FIN packet
11:38:45.693382 IP 172.18.0.3.3306 > 172.18.0.5.38822: Flags [F.], seq 123, ack 144, win 227, options [nop,nop,TS val 3000355 ecr 2997359], length 0
# I reply with an ACK but never send my own FIN
11:38:45.740958 IP 172.18.0.5.38822 > 172.18.0.3.3306: Flags [.], ack 124, win 229, options [nop,nop,TS val 3000360 ecr 3000355], length 0
```

I hope this article helps you troubleshoot production problems. To aid understanding, flame graphs for the correct and the broken versions are attached below; you can compare them yourself.

  • Flame graph of the correct version
  • Flame graph of the broken version

One of the articles I referred to poses two questions about this situation that I find very meaningful; think them over for yourself:

  1. Why can a machine with only a few hundred connections in CLOSE_WAIT no longer serve requests? Don't we often say that a machine has 65535 file descriptors available?
  2. Given that there is load balancing in front, why did CLOSE_WAIT appear on both machines running the service at almost the same time?

Reference article:

  1. See also CLOSE_WAIT
  2. TCP four-way close

My official account is dayuTalk

Contact email: [email protected]

GitHub:github.com/helei112g
