Abstract: In our daily use of databases, the monitoring system, as an important auxiliary system for troubleshooting and alerting on faults, plays an important role in problem diagnosis and analysis for DBAs, operations engineers, and business developers. The quality of a monitoring system also greatly affects whether a fault can be located accurately and repaired correctly so that it does not happen again.


In our daily use of databases, the monitoring system, as an important auxiliary system for troubleshooting and alerting on faults, plays an important role in problem diagnosis and analysis for DBAs, operations engineers, and business developers. The quality of a monitoring system also greatly affects whether a fault can be located accurately and repaired correctly so that it does not happen again. Monitoring granularity, completeness of monitoring metrics, and real-time monitoring are the three important factors in evaluating a monitoring system.
In terms of monitoring granularity, many current systems can only achieve minute-level, or at best half-minute-level, monitoring. Such granularity is increasingly inadequate in today's fast-moving software environment: it is helpless against sudden, short-lived bursts of anomalies. However, increasing the granularity multiplies both the volume of monitoring data and the collection frequency, which is a great test of resource consumption.


In terms of the completeness of monitoring metrics, most current systems collect a predefined set of metrics. This approach has a great disadvantage: if you do not realize the importance of a certain metric at the beginning and leave it out, and it turns out to be the key metric of a fault, that fault may well remain an unsolvable mystery.


As for the real-time nature of monitoring: no one cares about the past; they care about the present.


A system that does all three of these well can be called a good monitoring system. Inspector, the second-level monitoring system developed by Alibaba Cloud, achieves a true second-level granularity of one data point per second, collects all metrics without omission, and even automatically collects and displays real-time data for metrics that have never appeared before. With one data point per second, no database jitter can hide; full metric collection gives the DBA comprehensive and complete information; and real-time data display means you learn of a fault the moment it occurs, and of its recovery the moment it recovers.


Today, taking the MongoDB database as an example, let's look at how to troubleshoot DB access timeouts with the second-level monitoring system Inspector.


Case 1


An online service on the business side was using a MongoDB replica set with reads and writes separated. One day, a large number of online read requests suddenly timed out, and Inspector clearly showed that the replication latency of the secondary was extremely high at that time.

High latency on the secondary means that its oplog replay threads cannot keep up with the primary's write speed. If, with the primary and secondary configured identically, the secondary cannot respond as fast as the primary, it can only mean that the secondary was performing some high-cost operations besides normal business operations at that time.
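Outside of Inspector, the same lag can be cross-checked from the replica set status itself. A minimal pymongo sketch (the connection string is a placeholder; replSetGetStatus reports the last applied oplog time per member):

```python
# Sketch: estimate secondary replication lag from replSetGetStatus.
# The connection string is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://primary-host:27017/")
status = client.admin.command("replSetGetStatus")

# optimeDate is the wall-clock time of the last oplog entry applied by each member.
primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
for m in status["members"]:
    if m["stateStr"] == "SECONDARY":
        lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        print(f"{m['name']}: replication lag ~ {lag:.0f}s")
```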


After checking, we found that the db's cache had surged at that time:


As can be seen from the monitoring, cache usage rapidly rose from around 80% to the 95% evict trigger line, and the dirty cache also rose to its evict trigger line.


For WiredTiger, when cache usage reaches the trigger line, WT considers that the evict threads can no longer evict pages fast enough, so user threads join in the eviction work, which causes a large number of timeouts. This can also be verified through the application evict time metric:


It is clear from the figure above that user threads were spending a lot of time on eviction, causing many normal access requests to time out.
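The cache and eviction figures that Inspector graphs come from serverStatus. A rough pymongo sketch of reading them directly (the host is a placeholder, and the field names are those reported by the WiredTiger versions we checked, so verify them on your own release):

```python
# Sketch: compute the cache-fill and dirty-cache ratios from serverStatus.
# Host is a placeholder; field names may vary slightly between versions.
from pymongo import MongoClient

client = MongoClient("mongodb://db-host:27017/")
wt = client.admin.command("serverStatus")["wiredTiger"]

cache = wt["cache"]
used = cache["bytes currently in the cache"]
dirty = cache["tracked dirty bytes in the cache"]
configured = cache["maximum bytes configured"]

print(f"cache used : {used / configured:.1%}")
print(f"cache dirty: {dirty / configured:.1%}")

# Time user (application) threads spent evicting, in microseconds; in the
# versions we checked this lives under the "thread-yield" section, but the
# exact key name may differ between releases, hence the defensive lookup.
app_evict = wt.get("thread-yield", {}).get("application thread time evicting (usecs)")
print("application evict time (usecs):", app_evict)
```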


The business side then found that the cache had been filled by a large number of data migration jobs. After throttling the migration jobs and enlarging the cache, the whole DB ran smoothly again.


Case 2


One day, an online business using a sharded cluster suddenly reported access timeout errors, which recovered shortly afterwards. Judging from experience, some lock operations were probably causing the access timeouts.


Inspector showed that the lock queue on one shard was high at the time of the failure:
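The lock queue Inspector graphs corresponds to the globalLock.currentQueue section of serverStatus, so the same spike can be confirmed by polling it directly. A minimal sketch (host and sample count are placeholders):

```python
# Sketch: poll the global lock queue on a shard once per second; a sustained
# non-zero queue matches the spike Inspector showed. Host is a placeholder.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://shard-host:27017/")
for _ in range(10):
    q = client.admin.command("serverStatus")["globalLock"]["currentQueue"]
    print(f"queued readers={q['readers']} writers={q['writers']} total={q['total']}")
    time.sleep(1)
```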


This basically confirmed our earlier conjecture that locking caused the access timeouts. So what exactly caused the lock queue to spike?


Soon, by checking the commands currently running, we found that the number of authentication commands on that shard had suddenly increased:
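One way to quantify this from the command line is to diff the per-command counters that serverStatus exposes under metrics.commands. A rough sketch (host is a placeholder; confirm that these counters exist on your version):

```python
# Sketch: measure how fast authentication-related commands are arriving by
# diffing serverStatus counters over a short window. Host is a placeholder.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://shard-host:27017/")

def auth_command_totals():
    cmds = client.admin.command("serverStatus")["metrics"]["commands"]
    return {name: cmds[name]["total"]
            for name in ("saslStart", "saslContinue") if name in cmds}

before = auth_command_totals()
time.sleep(10)
after = auth_command_totals()
for name, total in after.items():
    print(f"{name}: {total - before.get(name, 0)} calls in the last 10 seconds")
```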


By reading the code, we found that although mongos and mongod use a keyfile for authentication, the authentication is actually carried out through the SCRAM mechanism of the SASL commands. Authentication takes a global lock, so the large volume of authentication at that time caused the global lock queue to spike, which in turn caused the access timeouts.

Therefore, we finally adjusted the number of client connections to reduce the global lock contention caused by the sudden surge of authentication.
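On the client side, the fix amounts to capping and reusing connections so that a traffic burst does not turn into a burst of new connections, each of which has to authenticate. With a pymongo client, for example, this is controlled through pool settings; the numbers below are illustrative only, not our production values:

```python
# Sketch: bound the connection pool so a traffic burst cannot translate into
# a burst of new connections (each new connection triggers SASL authentication).
# Host and pool sizes are illustrative placeholders.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongos-host:27017/",
    maxPoolSize=100,       # upper bound on concurrent connections per host
    minPoolSize=10,        # keep some connections warm instead of reconnecting
    maxIdleTimeMS=300000,  # recycle idle connections gradually, not all at once
)
```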
The two cases above show that sufficiently fine monitoring granularity and sufficiently comprehensive monitoring metrics are very important for troubleshooting faults, and real-time display also plays an obvious role in monitoring-wall scenarios.


Finally, second-level monitoring is now available on the Alibaba Cloud MongoDB console. Cloud MongoDB users can enable it themselves and experience the high-definition view that second-level monitoring brings.

