This article was first published on the vivo Internet Technology WeChat official account. Link: mp.weixin.qq.com/s/-OcCDI4L5… Authors: Huang Weibing, Chen Jinxia

1. Deadlock in Tomcat 9.0.26

1.1 Symptom

1.1.1 Context in which the deadlock occurred

We load-tested a single endpoint, /get. After 3 minutes, the successful-transaction TPS dropped from 10,000 (1W) to 0.



1.1.2 Many CLOSE_WAIT connections on the Tomcat server

Under load, the number of TCP connections in the CLOSE_WAIT state on the server rose from 200 to 20,000 (2W).



1.2 Initial diagnosis: start with the thread stack information

Printing Tomcat's thread stacks with jstack revealed one deadlock:

Found one Java-level deadlock:
=============================
"http-nio-8080-exec-409":
waiting to lock monitor 0x00007f064805aa78 (object 0x00000006c0ebf148, a java.util.HashSet),
which is held by "http-nio-8080-ClientPoller"
"http-nio-8080-ClientPoller":
waiting to lock monitor 0x00007f05e8061058 (object 0x00000007bfe40a70, a java.lang.Object),
which is held by "http-nio-8080-exec-205"
"http-nio-8080-exec-205":
waiting to lock monitor 0x00007f0614018448 (object 0x00000006c0e8e088, a java.util.HashSet),
which is held by "http-nio-8080-BlockPoller"
"http-nio-8080-BlockPoller":
waiting to lock monitor 0x0000000001ed06e8 (object 0x00000007bfe110f8, a java.lang.Object),
which is held by "http-nio-8080-exec-380"
"http-nio-8080-exec-380":
waiting to lock monitor 0x00007f064805aa78 (object 0x00000006c0ebf148, a java.util.HashSet),
which is held by "http-nio-8080-ClientPoller"
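The same kind of report can be obtained from inside a JVM. The following is a minimal sketch, not Tomcat code: it deliberately deadlocks two hypothetical threads (`demo-exec-1` and `demo-poller`, names invented for illustration) by taking two monitors in opposite order, then asks the standard `ThreadMXBean` which threads are deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockDetectionDemo {

    // Starts two daemon threads that take two monitors in opposite order,
    // then returns the ids of the threads the JVM reports as deadlocked.
    static long[] provokeAndDetect() {
        final Object lockA = new Object();
        final Object lockB = new Object();
        // Ensures each thread holds its first monitor before trying the second.
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);

        Thread t1 = new Thread(() -> lockInOrder(lockA, lockB, bothHoldFirst), "demo-exec-1");
        Thread t2 = new Thread(() -> lockInOrder(lockB, lockA, bothHoldFirst), "demo-poller");
        // Daemon threads, so the stuck pair does not keep the JVM alive.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Poll for up to about 5 seconds; findDeadlockedThreads returns null if none.
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findDeadlockedThreads();
            if (ids != null) {
                return ids;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return new long[0];
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached: the other thread holds 'second' and waits for 'first'.
            }
        }
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.getThreadInfo(provokeAndDetect())) {
            System.out.println(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}
```

Each printed line carries the same three facts as an entry in the jstack report: the blocked thread, the monitor it is waiting for, and the thread holding that monitor.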

1.2.1 Quick fix

After internal discussion, we suspected a bug in this version of Tomcat. To avoid delaying the project, the simplest workaround was to downgrade the Tomcat 9.0.26 used by Spring Boot to Tomcat 8. We load-tested again after the downgrade and found no problem, so we were almost certain that Tomcat 9.0.26 has a deadlock issue.

1.3 Further tracing of problems

1.3.1 Feedback to the Apache community

To confirm the problem, we filed a bug report with the Tomcat project.



The stack information shows that five threads of three kinds (exec worker threads, ClientPoller, and BlockPoller) acquired locks in inconsistent order, which caused the deadlock. The figure below illustrates the locking sequence.
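The lock-order problem can also be read directly off the jstack report as a wait-for graph. The sketch below copies the five "waiting to lock ... held by" pairs from the dump into a map and follows the edges until a thread repeats. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WaitForGraphDemo {

    // Edges copied from the jstack report:
    // blocked thread -> thread holding the monitor it is waiting for.
    static Map<String, String> waitsFor() {
        Map<String, String> graph = new LinkedHashMap<>();
        graph.put("http-nio-8080-exec-409", "http-nio-8080-ClientPoller");
        graph.put("http-nio-8080-ClientPoller", "http-nio-8080-exec-205");
        graph.put("http-nio-8080-exec-205", "http-nio-8080-BlockPoller");
        graph.put("http-nio-8080-BlockPoller", "http-nio-8080-exec-380");
        graph.put("http-nio-8080-exec-380", "http-nio-8080-ClientPoller");
        return graph;
    }

    // Follows the edges from 'start' and returns the cycle the walk ends in,
    // or an empty list if the walk reaches a thread that is not blocked.
    static List<String> findCycle(Map<String, String> graph, String start) {
        List<String> path = new ArrayList<>();
        String current = start;
        while (current != null && !path.contains(current)) {
            path.add(current);
            current = graph.get(current);
        }
        if (current == null) {
            return new ArrayList<>();
        }
        // The cycle starts at the first thread that was visited twice.
        return new ArrayList<>(path.subList(path.indexOf(current), path.size()));
    }

    public static void main(String[] args) {
        List<String> cycle = findCycle(waitsFor(), "http-nio-8080-exec-409");
        System.out.println("Threads in the cycle: " + cycle);
    }
}
```

Walking from exec-409, the cycle consists of ClientPoller, exec-205, BlockPoller, and exec-380. exec-409 is not on the cycle itself; it is blocked behind ClientPoller, which is why jstack lists it with the other four.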



1.4 Cause analysis

The deadlock sequence was now clear, but where exactly does the code go wrong? Answering that requires digging into the source. First download the OpenJDK source code and the Tomcat 9.0.26 source code, then use the stack information to locate the relevant code. The following figure shows the Tomcat 9.0.26 deadlock process as we worked it out.



Understanding the figure above requires some familiarity with NIO. In Tomcat, that means understanding NioEndpoint.

The Poller is a wrapper around a Selector, while the worker threads named exec-xx operate on wrapped Channels. In NIO, a Channel registers with a Selector, and the resulting SelectionKey records the association between the two. With that, the main characters are on stage.
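The relationship between the three NIO types can be shown with plain JDK calls. This is a minimal sketch, not Tomcat's NioEndpoint code: it uses a `Pipe` source channel as a stand-in for a socket channel so that no network is needed, and the class name is invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class SelectorRegistrationDemo {

    // Registers one non-blocking channel with a selector and reports whether
    // the resulting SelectionKey links the two together.
    static boolean keyLinksChannelAndSelector() {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try {
                // Only non-blocking channels may be registered with a selector.
                pipe.source().configureBlocking(false);
                SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);
                return key.channel() == pipe.source()
                        && key.selector() == selector
                        && selector.keys().contains(key);
            } finally {
                pipe.source().close();
                pipe.sink().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("key links channel and selector: " + keyLinksChannelAndSelector());
    }
}
```

The key is the only object that knows both sides of the registration, which is why both the polling thread and the thread closing a channel end up touching the selector's key sets.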

The Poller's run method executes on a background thread and continuously polls for ready SelectionKeys. During each poll it also has to deregister the SelectionKeys in the Selector's cancelled-key set. An exec-xx thread first checks the connection state while processing a request; on failure, an exception, or similar cases, it calls the Channel's close method to close the connection.

Closing a Channel actually just adds its SelectionKey to the cancelled-key set. Both paths have to acquire locks first, but they do not acquire them in the same order, which results in the deadlock.
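The deferred deregistration described above can be observed with the JDK alone. The sketch below is single-threaded, so it cannot deadlock; it only shows the two steps that the Poller and the exec-xx threads perform from different threads in Tomcat. As before, a `Pipe` source channel stands in for a socket channel and the class name is invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class ChannelCloseCancelsKeyDemo {

    // Returns {valid before close, valid after close, still registered after the next select}.
    static boolean[] closeThenSelect() {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try {
                pipe.source().configureBlocking(false);
                SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);
                boolean validBefore = key.isValid();

                // Step taken by the closing side: closing the channel cancels its key.
                // The selector only queues the key for deregistration at this point.
                pipe.source().close();
                boolean validAfterClose = key.isValid();

                // Step taken by the polling side: the next selection operation
                // processes the cancelled keys and removes them from the key set.
                selector.selectNow();
                boolean registeredAfterSelect = selector.keys().contains(key);

                return new boolean[] {validBefore, validAfterClose, registeredAfterSelect};
            } finally {
                pipe.source().close();
                pipe.sink().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = closeThenSelect();
        System.out.println("valid before close: " + r[0]);
        System.out.println("valid after close: " + r[1]);
        System.out.println("registered after select: " + r[2]);
    }
}
```

Because the closing side and the polling side both work on the same cancelled-key set, they need the same locks. When the two sides run on different threads and take those locks in different orders, the result is the cycle seen in the thread dump.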

1.4.1 Communication with Tomcat developers

After we submitted the bug, Remy Maucherat replied, first pointing out that the deadlock is inside NIO. We then noted that the deadlock inside NIO is triggered when Poller.run and Poller.cancelledKey execute concurrently.

Remy Maucherat quickly made a fix by moving the close call in Poller.cancelledKey into a finally block.

After the fix, we load-tested again with the patched code, and the deadlock did not recur. Remy Maucherat also mentioned that the latest OpenJDK contains a fix for this issue, but only in JDK 11 and 14.

Details of the communication are shown below.



1.4.2 Verifying the fix on GitHub

Github.com/apache/tomc…



1.5 Verification

Using the fixed code from github.com/apache/tomc…, we repackaged tomcat-embed-core.jar to replace the 9.x.x version and load-tested again. TPS held steady at about 15,000 (1.5W).



At this point the problem has been clearly located and fixed. Remy Maucherat also replied “The fix will be in Tomcat 9.0.31+”.

At the time of writing, the latest Tomcat release is 9.0.30, so the fix requires waiting for 9.0.31. Until then, Tomcat 8 is recommended.

2. Related links and references

  1. OpenJDK source code download

  2. Tomcat source code

  3. Investigation and repair of a high-concurrency Tomcat bug triggered by Mtop during a network outage

  4. An in-depth look at the NIO model in Tomcat


Note: To reprint this article, please contact our WeChat account: LABs2020.