The phenomenon of

On a certain afternoon, a large number of interface delay alarms are generated in the test environment of a service. From the monitoring view, the interface P99 takes more than 5s, and the interface delay is getting longer and longer

As the business side continues to retry over time, the database threads fill up, the number of waiting threads spikes, and a large number of threads block

Preliminary screening process

1. Through APM link monitoring, we find that the main time of all links is concentrated in the process of DataSource#getConnection

2, the initial suspicion, may be due to the existence of SLOW SQL query, but later found that the root cause is not this, because the monitoring shows that the slowest SQL99 line at that time is 56.99ms, QPS is less than 0.01 (basically can rule out the cause of slow SQL query)

3. GC analysis: There are 289 GCS in half an hour, and the longest STW is 280ms. (During garbage collection, all threads in the Java application except the garbage collector thread are suspended, meaning that during this time, all threads that the user is working properly are suspended. It was suspected that STW280ms was affecting the DataSource#getConnection, but it is not, because STW is larger than 100ms to only 1.73%, which is not high.)

4. Interface QPS analysis: inService single instanceUnder the premise of, the test student in the cycle running case led to the request number of these three interfaces all reached 3000 times /min, and the average 1s required to process 3000/60=50 requests, while the average time of these three interfaces on line was 20-30ms. At that time, it was suspected that the request load of these three interfaces affected the normal operation of related threads of other interfaces. But that doesn’t explain how long Java threads are waiting for MySQL connections

Initial solution: As the phone was flooded and wechat was flooded with business, in order to solve the problem as soon as possible, we tried to expand the container to three units, and the problem was miraculously alleviated

Further investigation

1. In the preliminary analysis, the preliminary conclusion is that 3000 /min requests of a single interface in a single instance affect the normal query of other interfaces, but this cannot explain the problem of long time consuming of DataSource#getConnection. Therefore, further investigation is needed to investigate the cause

2. In order to reproduce the problem, I used Jmeter to request the two interfaces that took the longest time when the problem occurred at 3000Qpm. However, the problem did not reproduce, and the interface responded quickly, with the average response time within 10ms

3. To further reestablish the problem, I restored the Apollo configuration to the point in time when the service was abnormal (test-on-borrow and test-on-return were changed to false). Then restart Jmeter and the problem reappears, DataSource#getConnection for 3s

Arthas = Arthas = Arthas = Arthas = Arthas = Arthas = Arthas trace com.alibaba.druid.pool.DruidDataSource getConnection ‘#cost > 100’ -n 5

Through the layers of analysis, we finally came to the com. Alibaba. The druid. Util. MySqlUtils# getLastPacketReceivedTimeMs this method

Combined with the analysis of the code, we found the Druid 1.1.22 version of the source, the existence of com. Mysql. JDBC. MySQLConnection class loading, Now we currently using mysql – connector 8 x version does not exist in this class (8. X is used com. Mysql. Cj. MysqlConnection), Lead to com. Alibaba. Druid. Util. Utils# loadClass frequently don’t have a class to load, about a affect the overall performance of the service (returning)

  • Druid already has an Issue with this Issue, and it is officially marked as a Bug. See: Com.alibaba :druid:1.1.20 MysqlUtils writes MysqlUtils to mysql-connector-java 5.1, causing thread blocking and limited performance
  • Commit EF2f87b6: When using mysql-connector 8.x, fix that no class exists for checking connection validity

The test-on-borrow and test-on-return configurations of Druid were changed to false.

Here’s why: When testOnBorrow and testOnReturn is True, will be called method enclosing testConnectionInternal, And this. TestConnectionInternal contains the com. Alibaba. Druid. Util. MySqlUtils# getLastPacketReceivedTimeMs this method call, and then trigger the Bug mentioned in point 4, Interested students can browse the source code

conclusion

  • Problem returning: Druid 1.1.22 version to a nonexistent com. Mysql. JDBC. MySQLConnection kind of repeated loading, cause the thread block, the performance is limited
  • Optimization of item:
    • Mysql recommends setting the test-on-borrow and test-on-return configurations to false. See DruidDataSource reference configuration
    • If you are using mysql-Connector 8.x, you are advised to upgrade the Druid version to 1.2.1 or later