Problem description

The problem first showed up as clients being unable to connect every day or two while the server was running. When I logged in to the server to check, the server process had not crashed. The log, however, showed that log writing had been interrupted for a period of time.

Analysis

top showed that CPU usage stayed above 150%, yet nothing was being written to the log. Under normal operation the server writes at least some statistics to the log periodically, so I traced the logging thread and found that fopen() was returning NULL with errno set to 24 (too many open files). Running lsof -p <pid> showed thousands of connections in the CLOSE_WAIT state, and ulimit -n showed that the maximum number of files a single process could open was 1024, so the problem became obvious. The first thing to fix is the open-file limit: it is too small and should be raised appropriately, since such a low limit wastes the capacity of current hardware. The more important question is why there are so many CLOSE_WAIT connections.
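To make the two checks above concrete, here is a minimal Python sketch of inspecting and raising the per-process open-file limit, and of recognizing errno 24 when an open fails. The server's own language is not shown in this post, and the helper names raise_nofile_soft_limit and is_fd_exhaustion are made up for illustration.

```python
import errno
import resource


def raise_nofile_soft_limit(target):
    """Raise the soft open-file limit toward `target`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # A soft limit that is already unlimited or high enough is left alone.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


def is_fd_exhaustion(exc):
    """True if an OSError means the process ran out of file descriptors.

    errno.EMFILE is the "too many open files" error (24 on Linux).
    """
    return isinstance(exc, OSError) and exc.errno == errno.EMFILE
```

Raising the limit only buys time. If sockets are leaking, a larger limit is eventually exhausted as well, which is why the CLOSE_WAIT question still has to be answered.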

Since the server had been running before without ever hitting this problem, I first suspected a configuration issue. The lsof output looked like this:

TCP localhost:40824->localhost:http (CLOSE_WAIT)
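With thousands of such lines, it helps to group them by remote endpoint to see which peer the leaked sockets were talking to. The following is a rough Python sketch. It assumes lsof text output whose NAME column has the local->remote (STATE) shape shown above, and count_close_wait is a hypothetical helper name.

```python
from collections import Counter


def count_close_wait(lsof_output):
    """Count CLOSE_WAIT sockets per remote endpoint in `lsof` text output."""
    counts = Counter()
    for line in lsof_output.splitlines():
        if "(CLOSE_WAIT)" not in line or "->" not in line:
            continue
        # The NAME column looks like "local:port->remote:port (CLOSE_WAIT)".
        remote = line.split("->", 1)[1].split()[0]
        counts[remote] += 1
    return counts
```

Feeding it the captured output of lsof -p <pid> and calling most_common() on the result would list the busiest remote endpoints first.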

So these are HTTP connections in CLOSE_WAIT. HTTP is used here for the push notification service, and nginx is also set up to handle PHP requests. Because the CLOSE_WAIT sockets belong to the game server process, which makes it the passive closer, it must be nginx that closed the connections. In the nginx configuration, keepalive_timeout is set to 60s. When user activity is low, connections easily sit idle past that timeout, and nginx then actively closes them. The problem did not occur before because the open-file limit had originally been set fairly high and there were many active users who kept the connections in use.

Clearly, there should not be a large number of CLOSE_WAIT connections even if the nginx timeout is very short, so the root cause is that the game server itself never closes its end of a connection after nginx has closed it.
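The fix this conclusion points to is for the game server to close its own socket as soon as it sees that the peer has closed. The game server's code is not shown here, so the following is only a Python sketch of the idea, with drain_and_close as a made-up helper name: a read that returns no data means the peer has sent FIN, and the local socket stays in CLOSE_WAIT until close() is called on it.

```python
import socket


def drain_and_close(sock, bufsize=4096):
    """Read from `sock` until the peer closes, then close our end too."""
    chunks = []
    try:
        while True:
            data = sock.recv(bufsize)
            if not data:
                # Empty read: the peer has sent FIN. On a TCP socket our side
                # is now in CLOSE_WAIT and stays there until we call close().
                break
            chunks.append(data)
    finally:
        # Closing releases the file descriptor even if recv() raised.
        sock.close()
    return b"".join(chunks)
```

A server built on an event loop would do the equivalent in its read callback: treat a zero-length read as end of stream, unregister the descriptor, and close it.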