This article is participating in the Java Theme Month: Java Debug Notes event.

[Resolved] Server failure caused by a thread deadlock in production: locating and fixing the problem

Overview

Recently, someone on the team ran into a thread deadlock in production. This post describes what happened and how we solved it.

The problem

First, how I found out there was a problem. The online application is load-balanced across 4 machines, and some users reported that certain pages were extremely slow or would not load at all.

Locating the problem

Honestly, I was quite confused at first: when the testers logged in with their accounts, everything ran smoothly. Later, using the links provided by several users, we found that whenever a page hung, the request had been routed to the IP of the first machine.

That made the problem much easier to track down.

First, I followed the links to the logs on that machine. Requests to a handful of methods were basically all timing out, and the error logs showed nothing more than that.

Next, look at the thread states: take a jstack dump on the machine and download it for analysis.

Here is an online analysis site worth sharing: company. Making. IO/threaddump -…

It presents jstack dumps in a more visual way.
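When uploading a dump to an online tool is not an option, a similar grouping (thread name prefix plus thread state) can be approximated inside the JVM with the standard `java.lang.management` API. The sketch below is only an illustration: the class `ThreadStateCounter` and its method are hypothetical names, not part of the original project or of any analysis tool.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (not from the original project): counts live threads
// whose name starts with a given prefix, grouped by thread state.
final class ThreadStateCounter {

    static Map<Thread.State, Integer> countByState(String namePrefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // false, false: skip lock details, only names and states are needed
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadName().startsWith(namePrefix)) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {} here, because this demo JVM has no car_lib_sync threads
        System.out.println(countByState("car_lib_sync"));
    }
}
```

Run inside the affected application (for example from a diagnostic endpoint), a large count of `WAITING` threads for one prefix would point to the same pool that the dump highlights.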

After analysis, the log is simple:

// ... omitted
"car_lib_sync86": awaiting notification on [0x00000007239a07c0]
"car_lib_sync87": awaiting notification on [0x00000007239a07c0]
"car_lib_sync88": awaiting notification on [0x00000007239a07c0]
"car_lib_sync89": awaiting notification on [0x00000007239a07c0]
"car_lib_sync9": awaiting notification on [0x00000007239a07c0]
"car_lib_sync90": awaiting notification on [0x00000007239a07c0]
"car_lib_sync91": awaiting notification on [0x00000007239a07c0]
"car_lib_sync92": awaiting notification on [0x00000007239a07c0]
"car_lib_sync93": awaiting notification on [0x00000007239a07c0]
"car_lib_sync94": awaiting notification on [0x00000007239a07c0]
"car_lib_sync95": awaiting notification on [0x00000007239a07c0]
"car_lib_sync96": awaiting notification on [0x00000007239a07c0]
"car_lib_sync97": awaiting notification on [0x00000007239a07c0]
"car_lib_sync98": awaiting notification on [0x00000007239a07c0]
"car_lib_sync99": awaiting notification on [0x00000007239a07c0]
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

198 threads with this stack:
"DubboServerHandler-172.16.35.143:20880-thread-1": awaiting notification on [0x0000000725ee76f0]
"DubboServerHandler-172.16.35.143:20880-thread-10": awaiting notification on [0x0000000725ee76f0]
"DubboServerHandler-172.16.35.143:20880-thread-100": awaiting notification on [0x0000000725ee76f0]
"DubboServerHandler-172.16.35.143:20880-thread-101": awaiting notification on [0x0000000725ee76f0]
// ... omitted

Here you can see that the car_lib_sync* threads are all parked, awaiting notification on the same object, which is pretty telling.

The thread pool is deadlocked. As an aside: whenever you create a thread pool, give its threads a name prefix.
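As a minimal sketch of that advice, a thread factory can stamp every pool thread with a prefix and a counter. `NamedThreadFactory` is a hypothetical class written for illustration; it is not the `ThreadFactoryImpl` used in the project, whose source is not shown here.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: names threads "<prefix><n>" so that a pool is easy
// to spot in a jstack dump.
final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger(0);

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // Produces car_lib_sync1, car_lib_sync2, ... for prefix "car_lib_sync"
        return new Thread(task, prefix + counter.incrementAndGet());
    }
}
```

Passing such a factory to the `ThreadPoolExecutor` constructor is enough; the pool calls `newThread` whenever it needs a new worker.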

In this case, I simply searched the project for “car_lib_sync”.

See the following code:

@Slf4j
@Service
public class CarThreadPool extends ThreadPoolExecutor {
    public CarThreadPool() {
        // core = 300, max = 500, keep-alive = 1000 ms, bounded queue of 1000 tasks
        super(300, 500, 1000L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadFactoryImpl("car_lib_sync"),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}

Then I looked at the callers and was left speechless. For a start, 300 core threads is already excessive.

The root cause

Class B is called from class A:

@Autowired
private CarThreadPool carThreadPool;

// Parent task: runs on a carThreadPool worker thread
Future<Boolean> validateTask = carThreadPool.submit(() -> {
	// ...
	b.a();
});

// Wait up to 10 seconds for the parent task to finish
Boolean isok = validateTask.get(10, TimeUnit.SECONDS);

Method a() in class B:

@Autowired
private CarThreadPool carThreadPool;

// Child task: submitted to the same carThreadPool as the parent
carThreadPool.submit(() -> {
	// ...
});

The cause of the deadlock is becoming clear.

The root cause is that the call in class A is a parent task that spawns multiple child tasks (B.a), and parent and child tasks share the same business thread pool. When every thread in the pool is running a parent task, and every parent is waiting on children that have not finished, the children can never be scheduled and the pool deadlocks.

I’m going to draw a little diagram here.

Assume the core pool size is 2 and parent tasks A1 and A2 are currently executing. Some child tasks (such as B.a1) have already completed, but others (such as B.a2) are still waiting in the queue because no thread is free to run them. A1 and A2, in turn, cannot complete until their child tasks have finished. The result is a deadlock.
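The scenario can be reproduced in a few lines. The following is a minimal sketch, not the project's code: it assumes a fixed pool of 2 threads, two parent tasks standing in for A1 and A2, and one child task per parent. The class name, the latches, and the 500 ms timeout are illustrative choices so that the demo terminates instead of hanging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only; the names here are not from the original project.
final class PoolStarvationDemo {

    /** Runs two parents on a 2-thread pool; returns how many never got their child's result. */
    static int starvedParents() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch bothRunning = new CountDownLatch(2);
        CountDownLatch bothGaveUp = new CountDownLatch(2);
        try {
            List<Future<Boolean>> parents = new ArrayList<>();
            for (int i = 0; i < 2; i++) {
                parents.add(pool.submit(() -> {
                    // Wait until both parents occupy a worker thread.
                    bothRunning.countDown();
                    bothRunning.await();
                    // The child goes to the SAME pool and sits in its queue.
                    Future<Boolean> child = pool.submit(() -> true);
                    boolean childRan;
                    try {
                        childRan = child.get(500, TimeUnit.MILLISECONDS);
                    } catch (TimeoutException e) {
                        childRan = false; // no free thread picked the child up
                    }
                    // Hold the worker until both parents have decided, so the
                    // result does not depend on timing.
                    bothGaveUp.countDown();
                    bothGaveUp.await();
                    return childRan;
                }));
            }
            int starved = 0;
            for (Future<Boolean> parent : parents) {
                if (!parent.get()) {
                    starved++;
                }
            }
            return starved;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("starved parents: " + starvedParents()); // prints 2
    }
}
```

Both parents time out because the only threads able to run their children are the ones the parents themselves are holding. Without the timeout on `get`, the parents would wait forever.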

Whether this actually happens in production is a matter of probability; it is not guaranteed to occur.

But once you know the cause, you can fix it.

The fix

Because the fix had to be released urgently, the solution is crude and simple: every class gets its own separate thread pool. This isolates parent tasks from child tasks.

Also, the number of threads in each pool was reduced to 10:

private static final ExecutorService CAR_THREAD_POOL = new ThreadPoolExecutor(10, 10, 60, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(), new ThreadPoolExecutor.CallerRunsPolicy());

The pools also use unbounded queues.
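A rough sketch of why the isolation helps: with the same pool settings as the fix (10 threads, unbounded queue), parents running on one pool can always hand their children to a second pool whose threads are never held by parents. The class name, the count of 30 parents, and the 5 second timeout are assumptions made for the demo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; the names here are not from the original project.
final class SeparatePoolsDemo {

    static ThreadPoolExecutor newPool() {
        // Same settings as the fix: 10 core, 10 max, 60 s keep-alive, unbounded queue
        return new ThreadPoolExecutor(10, 10, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(), new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Submits more parents than there are threads; returns how many got their child's result. */
    static int completedParents(int parentCount) throws Exception {
        ExecutorService parentPool = newPool();
        ExecutorService childPool = newPool();
        try {
            List<Future<Boolean>> parents = new ArrayList<>();
            for (int i = 0; i < parentCount; i++) {
                parents.add(parentPool.submit(() -> {
                    // Children run on their own pool, so a full parent pool cannot block them.
                    Future<Boolean> child = childPool.submit(() -> true);
                    return child.get(5, TimeUnit.SECONDS);
                }));
            }
            int completed = 0;
            for (Future<Boolean> parent : parents) {
                if (parent.get()) {
                    completed++;
                }
            }
            return completed;
        } finally {
            parentPool.shutdownNow();
            childPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("completed parents: " + completedParents(30));
    }
}
```

One trade-off worth keeping in mind: an unbounded `LinkedBlockingQueue` never rejects a task for lack of capacity, so the queue can grow without limit if tasks arrive faster than 10 threads can process them.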

I won't go into more refined solutions here, such as one that eliminates this kind of online deadlock entirely.

If you have a better approach, you are welcome to discuss it in the comments section.

Conclusion

This incident was entirely due to the developers' insufficient understanding of thread pools.

  • Do not set the number of threads too high; an appropriate size is best. The recommended value is CPU cores * 2 + 1

  • Do not use the same thread pool for parent and child tasks
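The sizing rule from the first point can be written down directly. This is a small sketch under the article's rule of thumb only; `PoolSizing` and its methods are hypothetical names, and the 60 second keep-alive and unbounded queue are carried over from the fix above rather than being part of the rule.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; the names here are not from the original project.
final class PoolSizing {

    // The article's rule of thumb: CPU cores * 2 + 1
    static int recommendedThreads(int cpuCores) {
        return cpuCores * 2 + 1;
    }

    static ThreadPoolExecutor newSizedPool() {
        int threads = recommendedThreads(Runtime.getRuntime().availableProcessors());
        return new ThreadPoolExecutor(threads, threads, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
    }

    public static void main(String[] args) {
        System.out.println(recommendedThreads(4)); // prints 9
    }
}
```

On a 4-core machine this gives 9 threads per pool, far below the 300 core threads the original pool started with.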