Author: Ren Kun

Now living in Zhuhai, I have worked full-time as an Oracle DBA and a MySQL DBA, and I am currently mainly responsible for the maintenance of MySQL, MongoDB, and Redis.

Source of this article: original contribution

* Produced by the ecoson open source community. Original content may not be used without authorization; for reprints, please contact the editors and indicate the source.


1 Background

At 22:30 on June 5, I received several alarm messages: shard2 of an online MongoDB cluster had a master-slave replication delay, which recovered automatically after a few minutes. I was tempted to pretend I had not seen it, but a little later the delay alarm fired again; that one also recovered on its own, yet I felt uneasy not taking a look. The shard's two data nodes are configured with priority = 2 and priority = 1 respectively, which keeps the primary on node 1 as long as the network is healthy.
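For reference, this is a minimal sketch of how such priorities are typically set from the mongo shell, run as a user with cluster administration privileges; the member indexes 0 and 1 are assumptions about this replica set's member order.

# Raise node 1's priority so it is preferred as primary (sketch; member order assumed)
/usr/local/mongodb/bin/mongo --port 27017 --eval '
  cfg = rs.conf();
  cfg.members[0].priority = 2;   // node 1: preferred primary while the network is healthy
  cfg.members[1].priority = 1;   // node 2: takes over only if node 1 is unavailable
  rs.reconfig(cfg);
'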

2 Troubleshooting

Logging in to Grafana to check the primary's monitoring, CPU and memory usage both rose significantly during this period.

QPS dropped straight down to 0 at around 22:30.

The number of connections kept climbing sharply.

From this it can be roughly inferred that a connection storm put the master under pressure and caused the master-slave replication delay. The details have to come from mongod.log. The strength of MongoDB's log files is that they record very detailed information; the weakness is that they record so much that the files take up a lot of space. MongoDB provides a dedicated logRotate command for this, but it only rotates the log and does not clean up historical files, and MongoDB itself cannot schedule a job to rotate automatically, so a separate script is needed.

Create a MongoDB account named backup with the hostManager role (a sketch of the grant follows), and use it in the shell script below, which is executed once an hour (a cron example is given after the script).
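A minimal sketch of creating such an account from the mongo shell; the password here is only a placeholder, matching the one masked in the script.

# Create the backup account used by the rotation script (sketch)
/usr/local/mongodb/bin/mongo --port 27017 --eval '
  db.getSiblingDB("admin").createUser({
    user: "backup",
    pwd: "*****",
    roles: [ { role: "hostManager", db: "admin" } ]
  });
'

The hostManager role carries the logRotate privilege that the script relies on.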

[root@ mongod]# more logrotate_mongo.sh
#!/bin/sh
MONGO_CMD=/usr/local/mongodb/bin/mongo
KEEP_DAY=7

# flush mongod log
datadir=/data/mongodb
port=27017
role=mongod
destdir=/data/backup/mongolog/${role}
if [ ! -d "${destdir}" ]; then
    mkdir -p ${destdir}
fi
$MONGO_CMD --authenticationDatabase admin -ubackup -p"*****" --eval "db.runCommand({logRotate:1})" --port $port
mv ${datadir}/mongod.log.????-??-??T??-??-?? ${destdir}/
find $destdir -name "mongod.log.*" -mtime +${KEEP_DAY} -exec rm -rf {} \;
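The hourly schedule is handled outside the script; as a sketch, a crontab entry like the following would run it at the top of every hour (the script path and output path are assumptions).

# Rotate and archive the mongod log every hour (sketch; adjust paths as needed)
0 * * * * /root/scripts/logrotate_mongo.sh >> /tmp/logrotate_mongo.out 2>&1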

The archived log files are listed below. The log volume generated between 22:00 and 23:00 was an order of magnitude larger than usual, which by itself suggests that MongoDB was under abnormal pressure during that period.

-rw-r--r-- 1 root root  11647760 Jun  5 18:00 mongod.log.2021-06-05T10-00-02
-rw-r--r-- 1 root root  12869316 Jun  5 19:00 mongod.log.2021-06-05T11-00-02
-rw-r--r-- 1 root root  15625361 Jun  5 20:00 mongod.log.2021-06-05T12-00-02
-rw-r--r-- 1 root root       ... Jun  5 21:00 mongod.log.2021-06-05T13-00-02
-rw-r--r-- 1 root root  71500070 Jun  5 22:00 mongod.log.2021-06-05T14-00-40
-rw-r--r-- 1 root root 215216326 Jun  5 23:00 mongod.log.2021-06-05T15-00-02

Querying the log around that point in time shows a large number of worker thread creation failures:

2021-06-05T22:28:00.821+0800 I -        [thread2] failed to create service entry worker thread for 172.31.0.65:39890
2021-06-05T22:28:00.821+0800 I NETWORK  [thread2] connection accepted from 172.31.0.66:45090 #4051839 (32622 connections now open)
2021-06-05T22:28:00.821+0800 I -        [thread2] pthread_create failed: Resource temporarily unavailable
2021-06-05T22:28:00.821+0800 I -        [thread2] failed to create service entry worker thread for 172.31.0.66:45090
2021-06-05T22:28:00.821+0800 I NETWORK  [thread2] connection accepted from 172.31.0.65:39892 #4051840 (32622 connections now open)
2021-06-05T22:28:00.821+0800 I -        [thread2] pthread_create failed: Resource temporarily unavailable

Log from the slave for the same time period:

2021-06-05T22:30:07.325+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 172.31.0.183:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2021-06-05T22:30:07.425+0800 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section

The master was almost unresponsive and could not even create connections for the slave's periodic heartbeats, which triggered a master-slave election; the current slave was temporarily elected master. Once the old master became responsive again and rejoined the cluster, it took back the primary role.

Parsing the logs with mtools confirms this. Under normal circumstances, 183 is the master and 114 is the slave.

[root@vm172-31-0-183 mongod]# mloginfo mongod.log.2021-06-05T15-00-02 --rsstate
RSSTATE
date               host                  state/message
Jun 05 22:28:05    172.31.0.114:27017    PRIMARY      # 114 was promoted to primary, 183 was demoted to secondary

[root@vm172-31-0-114 mongod]# mloginfo mongod.log.2021-06-05T14-48-24 --rsstate
RSSTATE
date               host                  state/message
Jun 05 22:31:50    172.31.0.183:27017    ROLLBACK     # 183 rejoined the cluster as a slave and went through ROLLBACK and RECOVERING
Jun 05 22:34:48    172.31.0.183:27017    RECOVERING
Jun 05 22:35:52    172.31.0.183:27017    SECONDARY    # 183 became SECONDARY, then returned to PRIMARY because it has the higher priority
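For reference, mtools is a Python toolset that can be installed with pip; a minimal sketch of installing it and producing the report above (a working Python/pip environment is assumed).

# Install mtools, which provides mloginfo, mlogfilter, mplotqueries, etc.
pip install mtools
# Summarize replica set state changes recorded in an archived mongod log
mloginfo mongod.log.2021-06-05T15-00-02 --rsstate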

The next step is to identify which application servers caused the connection storm. The shard nodes are accessed only through mongos, so the client connections can only be seen in the logs of the mongos instances. Again using mtools, the output is as follows; this is the connection information between 21:00 and 22:00:

mloginfo mongos.log.2021-06-05T14-00-01 --connections
......

CONNECTIONS
     total opened: 8904
     total closed: 5645
    no unique IPs: 6
socket exceptions: 649

127.0.0.1       opened: 2683    closed: 2683
172.31.0.21     opened: 1531    closed: 733
172.31.0.28     opened: 1525    closed: 676
172.31.0.81     opened: 1520    closed: 704
172.31.0.78     opened: 1495    closed: 699
172.31.0.134    opened: 150     closed: 150

This is the connection information between 22:00 and 23:00; clearly something went wrong in this period:

mloginfo mongos.log.2021-06-05T15-00-01 --connections
......

CONNECTIONS
     total opened: 58261
     total closed: 58868
    no unique IPs: 7
socket exceptions: 3708

172.31.0.78     opened: ...     closed: 14576
172.31.0.21     opened: 13898   closed: 14284
172.31.0.28     opened: 13835   closed: 13617
172.31.0.81     opened: 13633   closed: 13535
127.0.0.1       opened: 2681    closed: 2680
172.31.0.134    opened: 170     closed: ...
172.31.0.4      opened: 3       closed: 0

This confirms that the incident was caused by a connection storm from the application servers (172.31.0.21, 0.28, 0.78 and 0.81 each opened roughly ten times as many connections as in the previous hour); all that remains is to hand this information to the developers for further investigation.
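For similar incidents, the per-IP counts can also be cross-checked without mtools, assuming the "connection accepted from <ip>:<port>" wording of the 3.x-era log format shown above; this is a rough sanity check, not a replacement for mloginfo (the log file name mirrors the listing above).

# Count accepted connections per client IP in a mongos log
grep 'connection accepted from' mongos.log.2021-06-05T15-00-01 \
  | awk '{for (i = 1; i < NF; i++) if ($i == "from") print $(i + 1)}' \
  | cut -d: -f1 | sort | uniq -c | sort -rn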

3 Summary

This case was not particularly hard to diagnose, and it needed only two tools: a shell script that rotates the logs on a schedule, and mtools, written by MongoDB's own engineers. Together they let us quickly comb through the logs and extract the key information needed for diagnosis, which greatly improves efficiency.