1. Background

I used a WebSocket service to push real-time data to an H5 page. One day the product team reported that the page's real-time refresh had stopped working, so I started investigating and fixing it, and recorded the process here.

The service is checked every minute by a monitor that verifies the WebSocket is working properly; if it is not, the process is killed and restarted.
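A minimal sketch of such a per-minute health check (run from cron, for example); the listen port and restart command are hypothetical and would need to match the actual deployment:

```python
import socket
import subprocess

WS_HOST, WS_PORT = "127.0.0.1", 9501                  # hypothetical Swoole listen address
RESTART_CMD = ["systemctl", "restart", "websocket"]   # hypothetical restart command


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Invoked every minute, e.g. via cron: * * * * * python3 check_ws.py
    if not is_listening(WS_HOST, WS_PORT):
        try:
            subprocess.run(RESTART_CMD, check=False)  # kill/restart the service
        except OSError:
            pass  # restart command not available on this machine
```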

After receiving the feedback that day, I checked the service and found that the monitoring program was running normally and did restart the WebSocket service. However, the WebSocket master node hung again within 30 seconds of each restart.

With that context, let's walk through the troubleshooting process.

2. Why did the master node fail?

The official documentation lists three conditions that can cause the service to fail:

(1) The system is overloaded, Swoole fails to allocate memory, and the process hangs.
(2) A segmentation fault occurs inside Swoole.
(3) The Server process uses too much memory and is killed by the kernel (OOM), or is killed by some other program by mistake.

However, none of these matched the current environment, so the root cause could not be identified at this point.
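Cause (3), for instance, can be ruled out by scanning the kernel log for OOM kills. A small helper over `dmesg` output (the sample usage and log line are illustrative):

```python
import re
import subprocess


def find_oom_kills(dmesg_text: str) -> list:
    """Return kernel log lines that indicate a process was OOM-killed."""
    pattern = re.compile(r"out of memory|killed process", re.IGNORECASE)
    return [line for line in dmesg_text.splitlines() if pattern.search(line)]


# Usage on a live machine (requires permission to read the kernel log):
#   text = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   print(find_oom_kills(text))
```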

3. Tracking down the problem through the logs

(1) Nginx error.log

```
13247#0: *176909901 connect() failed (111: Connection refused) while connecting to upstream,
```

Looking at the Nginx configuration, the limits were small to begin with, so I increased several of them:

```nginx
worker_processes 1;          # number of worker processes
worker_rlimit_nofile 1024;   # maximum number of open files per worker process
worker_connections 1024;     # maximum concurrent connections per worker process (including all connections)
```

(2) Swoole's own log contained many errors like:

```
ERROR   swServer_master_onAccept (ERROR 502): accept() failed. Error: Too many open files[24]
```

(3) The program itself output:

```
WARN    swServer_start_check: serv->max_conn is exceed the maximum value[1024].
```

Both messages point to the `ulimit -n` setting being too low. From the Swoole documentation:

max_connection cannot exceed the operating system's `ulimit -n` value; otherwise a warning is reported and the value is reset to the `ulimit -n` value.

Combining (2) and (3), the culprit is this `ulimit -n` value, which had been modified before but had never actually taken effect.

`ulimit -n` limits how many files a process can open at the same time. There are two ways to change it:

```shell
vim /etc/security/limits.conf   # permanent modification
ulimit -n 1024                  # takes effect immediately, but is lost after a reboot
```
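Since the limit had been "modified" before without actually taking effect, it is worth verifying what a running process really gets. On Linux this can be read from `/proc/<pid>/limits`; the sketch below checks the current process, but for a daemon you would pass its PID:

```python
import os
import resource

# Soft/hard limit of the current process (what `ulimit -n` reports for this shell)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")


def max_open_files(pid: int) -> str:
    """Read the 'Max open files' line from /proc/<pid>/limits (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                return line.strip()
    return "not found"


print(max_open_files(os.getpid()))  # for a daemon, pass its PID instead
```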

4. Follow-up issues

(1) After traffic increased, Redis connections occasionally failed. The cause was that so many short-lived connections were being created that the machine's local ports were exhausted, which can be mitigated by adjusting kernel parameters.

Edit /etc/sysctl.conf and add the following:

```shell
net.ipv4.tcp_tw_reuse = 1     # allow TIME-WAIT sockets to be reused for new TCP connections (default 0: off)
net.ipv4.tcp_tw_recycle = 1   # enable fast recycling of TIME-WAIT sockets (default 0: off)
```

Then run `/sbin/sysctl -p` for the parameters to take effect. (One caveat: `tcp_tw_recycle` is known to break connections from clients behind NAT and was removed entirely in Linux 4.12, so `tcp_tw_reuse` is the safer of the two.)
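Port exhaustion by TIME-WAIT sockets can be confirmed before and after the change by counting connection states, e.g. from `netstat -ant` output. A small parser (it assumes netstat's usual column layout, with the state in the last column):

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connection states from `netstat -ant`-style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1  # state is the last column
    return states


# Usage: feed it real output, e.g.
#   text = subprocess.run(["netstat", "-ant"], capture_output=True, text=True).stdout
#   print(count_tcp_states(text))
```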

(2) After subscribing to Redis, messages stopped arriving after a period of time. The root cause is still unknown; the workaround was to add a connection timeout, catch the resulting exception, and re-establish the subscription.

```php
// Give the blocking subscribe socket a 10-second timeout so it throws
// instead of waiting forever; the exception triggers a re-subscribe.
ini_set('default_socket_timeout', 10);
```
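The workaround amounts to a retry loop around the subscription: when the socket timeout fires (or the connection drops), the exception is caught and the subscription is re-established. A generic sketch of the pattern, not tied to any particular Redis client:

```python
import time


def subscribe_with_retry(connect, handle, retry_delay=1.0, max_retries=3):
    """Consume messages from connect(); on timeout/connection errors,
    wait and re-establish the subscription, up to max_retries failures.

    connect() returns an iterator of messages; handle() processes one message.
    """
    failures = 0
    while failures < max_retries:
        try:
            for message in connect():
                handle(message)
        except OSError:   # covers TimeoutError and ConnectionError
            failures += 1
            time.sleep(retry_delay)
```

In the PHP code above, the same effect comes from `default_socket_timeout` plus a try/catch around the blocking subscribe call.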

5. Afterword

Writing down and reviewing the troubleshooting process is worthwhile in itself: the next time a similar problem comes up, the path to a solution should be much clearer.