Background: After development and testing of the new feature were complete, we were ready to release to production. While the third machine was being released, monitoring showed that the CPU of one machine suddenly spiked to 300% and the number of active Dubbo threads jumped to 1000+. We had to stop the release and immediately roll back the problematic machine, after which it returned to normal. I kept watching the other two machines that had already been released; in the end none of them survived, and they were all rolled back as well.

Here is my troubleshooting process:

Monitoring Log Analysis

First, I checked the application logs around the time of the failure and found that many methods were taking a long time, especially the filterMission method, with calls taking 30+ seconds. filterMission is the interface with the highest QPS in this service (200 million calls per day), so it was also the most likely suspect. I therefore reviewed the implementation of filterMission again: there is no complex computation, no processing of large data sets, and no lock contention. This release did not touch filterMission at all, so I ruled it out as the cause.

The logs also showed a large number of request timeouts, but that only tells us the system was overloaded; it does not locate the problem.

Code Review

Since the application logs gave no clue, I had to do another code review. First, I checked whether the code uses any synchronization logic, mainly to rule out the possibility of deadlock; then I examined code with complex processing logic; finally, I compared the modified code against the previous version. No problems were found.

Thread Dump Analysis

At this point the logs could not explain the problem, and blindly reading code could not pinpoint it either. The only option left was to release one machine again and, once the problem reappeared, have operations dump the application's thread stacks so I could analyze the jstack file. Before we start analyzing the dump file, let's review some basics.

Thread state

Description of each thread state:

New: the state a thread is in once the thread object has been created but before it is started;

Runnable: ready to run; a thread enters this state after thread.start() is called.

Running: the thread is currently executing. When thread.yield() is called or the time slice is used up, the CPU reschedules. Note: after thread.yield() is called, the thread gives up the CPU, but rescheduling may immediately grant it a time slice again.

Waiting: threads that call Thread.join(), Object.wait(), or LockSupport.park() are in this state, which means the thread is waiting for some resource or condition to wake it up. Thread.join() and Object.wait() require a notify()/notifyAll() on the object, or an interrupt, to wake the thread; LockSupport.park() requires LockSupport.unpark(). These wake-ups move the thread back into the Runnable state so it can participate in CPU scheduling again.

Thread.join(): waits for the target thread to terminate; the calling thread does not continue until the target thread completes. Internally, join() calls Object.wait() on the thread object as long as thread.isAlive() returns true; each time the wait is woken up, isAlive() is checked again and wait() is called again, until the target thread has terminated and isAlive() returns false.
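As a side note, here is a minimal runnable sketch of that behavior (the class and thread names are my own, not from the original service): while the main thread is blocked in join(), it is reported as WAITING, just like the parked threads we will see in the jstack log later.

public class JoinDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread main = Thread.currentThread();
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(500);
                // While main is blocked in join(), its state is WAITING,
                // which is how such threads appear in a thread dump.
                System.out.println("main state during join(): " + main.getState()); // WAITING
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "worker");

        worker.start();
        worker.join(); // roughly: while (worker.isAlive()) { worker.wait(0); }
        System.out.println("worker terminated, isAlive = " + worker.isAlive()); // false
    }
}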

Object.wait(): a thread may only call wait() on an object while it holds that object's monitor. After entering the waiting state, the thread releases that object's lock but still holds any locks it has on other objects. A thread that wants to wake it with notify()/notifyAll() must also hold the same object's monitor.
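A minimal wait/notify sketch (illustrative names only) showing the rule above: both the waiting thread and the notifying thread must hold the same object's monitor.

public class WaitNotifyDemo {
    private static final Object lock = new Object();
    private static boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            synchronized (lock) {              // must hold the monitor before calling wait()
                while (!ready) {
                    try {
                        lock.wait();           // releases the monitor; thread goes WAITING
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                System.out.println("condition met, resuming");
            }
        }, "waiter");
        waiter.start();

        Thread.sleep(200);
        synchronized (lock) {                  // the notifier must also hold the monitor
            ready = true;
            lock.notifyAll();                  // moves the waiter out of the WaitSet
        }
        waiter.join();
    }
}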

LockSupport.park(): suspends the current thread; it does not participate in thread scheduling again until LockSupport.unpark() is called on it.
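A minimal LockSupport sketch (illustrative names only); unlike wait()/notify(), no monitor is needed, and the parked thread is reported as WAITING (parking), which is exactly how the Dubbo threads appear in the dump later.

import java.util.concurrent.locks.LockSupport;

public class ParkDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread parker = new Thread(() -> {
            System.out.println("parking...");
            LockSupport.park();                 // shows up as WAITING (parking) in jstack
            System.out.println("unparked, continuing");
        }, "parker");
        parker.start();

        Thread.sleep(200);
        System.out.println("state before unpark: " + parker.getState()); // WAITING
        LockSupport.unpark(parker);             // grants the permit and wakes the thread
        parker.join();
    }
}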

Timed_Waiting: calling Thread.sleep(long), LockSupport.parkNanos(long), Thread.join(long), or Object.wait(long) puts the thread into this state. The difference from Waiting is that the wait has a time limit.

Thread.sleep(): pauses the current thread for the specified number of milliseconds without releasing any locks or other resources it holds.

Blocked: the thread is waiting to acquire a lock. A thread enters this state when it tries to enter a synchronized block or method and fails to acquire the monitor, or when it is waiting for I/O. Note that the java.util.concurrent Lock classes do not put a thread into this state.
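A small sketch (illustrative names only) that reproduces the BLOCKED state: one thread holds a synchronized monitor while another tries to enter it; in a jstack dump the second thread would show "waiting for monitor entry".

public class BlockedDemo {
    private static final Object monitor = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread holder = new Thread(() -> {
            synchronized (monitor) {
                try {
                    Thread.sleep(2000);        // hold the monitor for a while
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");

        Thread contender = new Thread(() -> {
            synchronized (monitor) {           // cannot acquire the monitor, so it blocks
                System.out.println("finally acquired the monitor");
            }
        }, "contender");

        holder.start();
        Thread.sleep(100);                     // let the holder grab the monitor first
        contender.start();
        Thread.sleep(100);
        System.out.println("contender state: " + contender.getState()); // BLOCKED

        holder.join();
        contender.join();
    }
}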

Dead: a thread enters the Dead state after its execution completes or an uncaught exception is thrown.

Some of the states above (such as Running and "sleeping") are only descriptive conveniences and are not actual Java thread states. Java explicitly defines the thread state values as follows:

NEW, RUNNABLE, BLOCKED, WAITING, TIMED_WAITING, and TERMINATED

Analyzing the jstack Log

  • Dubbo thread in WAITING state
"DubboServerHandler-172.24.16.78:33180-thread-1220" #1700 Daemon prio=5 OS_PRIo =0 TID =0x00007f3394988800 nID =0x4aae waiting on condition [0x00007f32d75c0000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000866090c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)Copy the code

The log shows that thread DubboServerHandler-172.24.16.78:33180-thread-1220 is in the WAITING state. The thread is trying to take a task from the thread pool's work queue, but the queue is empty, so LockSupport.park() is eventually called and the thread waits until it is notified that the queue is no longer empty.

So when will a new thread be created to handle a request? When a new request arrives, if the number of threads in the pool is still below corePoolSize, the thread pool creates a new thread to handle it.
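A minimal sketch of this standard ThreadPoolExecutor behavior (pool sizes and names are illustrative, not the service's actual configuration): the first tasks create new worker threads because the pool is still below corePoolSize, and once the tasks finish, the idle workers park inside getTask(), just like the stack above.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolGrowthDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(100));

        // Each of the first tasks causes a new worker thread to be created,
        // because the pool size is still below corePoolSize (4).
        for (int i = 0; i < 3; i++) {
            pool.execute(() -> { /* handle a request */ });
        }
        Thread.sleep(200);

        // The workers have finished and are now parked inside getTask(),
        // waiting on the work queue; this is the WAITING stack seen above.
        System.out.println("pool size: " + pool.getPoolSize());          // 3
        System.out.println("active threads: " + pool.getActiveCount());  // 0

        pool.shutdown();
    }
}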

According to the jstack log, 195 Dubbo threads are in the WAITING state while taking tasks from a ScheduledThreadPoolExecutor. In theory, Dubbo threads only handle client requests and should not be processing scheduled tasks, so why are they taking tasks from a scheduled-task queue? I leave this question open here; I do not know the answer yet and would be glad if someone could explain it.

  • Another group of Dubbo threads in the WAITING state:
"DubboServerHandler-172.24.16.78:33180-thread-1489" #1636 daemon prio=5 os_prio=0 tid=0x00007f33b0122800 nid=0x48ec waiting on condition [0x00007f32db600000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000089d717a8> (a java.util.concurrent.SynchronousQueue$TransferStack)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
	at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
	at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:924)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

This group of Dubbo threads (309 of them) is parked while trying to take the next task from the ThreadPoolExecutor. After handling a request, these threads return to the pool and wait for a new one. Because the Dubbo request-handling pool here is a fixed-size pool backed by a SynchronousQueue, each new request creates a new thread by default until the pool's thread limit is reached; requests are not buffered, and only once the maximum number of threads exists are new requests handed off to idle threads blocked in getTask(), which are then reused. A sketch of such a pool follows.
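A hedged sketch of such a pool (sizes and names are illustrative; this is not the project's actual Dubbo configuration): core size equals maximum size and the work queue is a zero-capacity SynchronousQueue, so a request either triggers a new worker thread or is handed off to an idle worker parked in take().

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class FixedPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // Roughly the shape of such a fixed pool: core == max, and a
        // zero-capacity SynchronousQueue, so tasks are never buffered.
        // A limit of 2 is purely illustrative; the incident pool allowed up to 1500.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());

        Runnable request = () ->
                System.out.println("handled by " + Thread.currentThread().getName());

        pool.execute(request);   // below the thread limit: a new worker is created
        pool.execute(request);   // still below the limit: a second worker is created
        Thread.sleep(100);       // both workers finish and park in SynchronousQueue.take()

        System.out.println("pool size: " + pool.getPoolSize());  // 2, workers kept alive
        pool.execute(request);   // limit reached: handed off to an idle worker
                                 // parked inside getTask(), which is then reused
        pool.shutdown();
    }
}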

  • In addition, a number of Dubbo threads are in the BLOCKED state; see the log:
"DubboServerHandler-172.24.16.78:33180-thread-236" #1727 daemon prio=5 OS_PRIo =0 TID =0x00007f336403b000 nID = 0x4B8b waiting for monitor entry [0x00007f32d58a4000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.checkRollover(RollingFileManager.java:149) - waiting to lock <0x0000000085057998> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager) at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:88) at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:155) at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:128) at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:119) at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84) at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:390) at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:375) at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:359) at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:349) at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63) at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146) at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1993) at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1852) at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:179) at com.alibaba.dubbo.common.logger.slf4j.Slf4jLogger.info(Slf4jLogger.java:42) at com.alibaba.dubbo.common.logger.support.FailsafeLogger.info(FailsafeLogger.java:93) at com.dianwoba.universe.dubbo.filter.ResponseFilter$1.run(ResponseFilter.java:116) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)Copy the code

waiting for monitor entry: indicates that the current thread is in the monitor's EntryList queue, waiting to acquire the monitor lock.

Note: the semantics of synchronized in Java are implemented mainly through the Java object header and the monitor. A monitor can be owned by only one thread at a time; other threads can only wait to acquire it. A monitor has two queues: the WaitSet and the EntryList. When a thread that owns the monitor calls Object.wait(), it releases the monitor and waits in the WaitSet queue; when another thread calls notify()/notifyAll() on that object, the waiting thread is woken up and competes for the monitor again. A thread in the WaitSet is in the WAITING state and is shown in the dump as "in Object.wait()". When a thread tries to enter a synchronized block or method and fails to acquire the monitor, it enters the EntryList queue; such a thread is in the BLOCKED state and is shown in the dump as "waiting for monitor entry".

According to the jstack log, 377 Dubbo threads are waiting to lock the resource <0x0000000085057998>; judging from the stack, these threads are all competing for the monitor of a RollingRandomAccessFileManager. Let's find the thread that currently owns this monitor:

"DubboServerHandler-172.24.16.78:33180-thread-429" #553 daemon prio=5 OS_PRIO =0 TID =0x00007f339c09f800 nID =0x4467 waiting for monitor entry [0x00007f331f53a000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.checkRollover(RollingFileManager.java:149) - locked <0x0000000085057998> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager) at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:88) at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:155) at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:128) at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:119) at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84) at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:390) at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:375) at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:359) at org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:381) at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:376) at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:359) at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:349) at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63) at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146) at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1993) at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1852) at org.apache.logging.slf4j.Log4jLogger.warn(Log4jLogger.java:239) at com.alibaba.dubbo.common.logger.slf4j.Slf4jLogger.warn(Slf4jLogger.java:54) at com.alibaba.dubbo.common.logger.support.FailsafeLogger.warn(FailsafeLogger.java:107) at com.alibaba.dubbo.rpc.filter.TimeoutFilter.invoke(TimeoutFilter.java:48) at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91) at com.alibaba.dubbo.rpc.protocol.dubbo.filter.TraceFilter.invoke(TraceFilter.java:78) at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91) at com.alibaba.dubbo.rpc.filter.ContextFilter.invoke(ContextFilter.java:60) at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91) at com.alibaba.dubbo.rpc.filter.GenericFilter.invoke(GenericFilter.java:112) at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91) at com.alibaba.dubbo.rpc.filter.ClassLoaderFilter.invoke(ClassLoaderFilter.java:38) at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91) at com.alibaba.dubbo.rpc.filter.EchoFilter.invoke(EchoFilter.java:38) at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:91) at com.alibaba.dubbo.rpc.protocol.dubbo.DubboProtocol$1.reply(DubboProtocol.java:108) at com.alibaba.dubbo.remoting.exchange.support.header.HeaderExchangeHandler.handleRequest(HeaderExchangeHandler.java:84) at  com.alibaba.dubbo.remoting.exchange.support.header.HeaderExchangeHandler.received(HeaderExchangeHandler.java:170) at 
com.alibaba.dubbo.remoting.transport.DecodeHandler.received(DecodeHandler.java:52) at com.alibaba.dubbo.remoting.transport.dispatcher.ChannelEventRunnable.run(ChannelEventRunnable.java:82) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)Copy the code

This thread has locked the resource <0x0000000085057998> but is itself still shown as BLOCKED. Does that smell like a deadlock? It is not one; the most likely explanations are:

  • The thread is not getting the resources (e.g. CPU time) it needs to run;
  • The JVM is running a full GC.

The logs show that most threads are blocked while writing monitoring logs, which is why so many requests timed out. CPU usage is high, the thread holding the lock makes very slow progress, more and more threads pile up contending for the lock, and overall performance keeps deteriorating.

At this point I still could not find the root cause, so I looked at resource usage again and noticed that when the problem occurred, memory usage kept climbing and never came back down. I then asked operations to pull the GC logs, which showed that the JVM was running full GC continuously and that heap usage barely changed after each GC: the program had a memory leak. The leak was eventually traced to a paging query interface whose SQL statement was missing the limit and offset clauses, which was enough to explain the symptoms.

That query pulled the entire table at once (3 million rows) and then looped over all 3 million records; no wonder memory blew up. Because of this, the memory could not be released for a long time, the JVM could not reclaim it even with full GC, so full GC kept running, and the CPU was exhausted and unable to serve application requests.
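For illustration, a hedged sketch of the fix under the assumption of a DAO-style paging query (MissionDao, MissionRecord, queryPage, and the page size are hypothetical names, not the project's real code): process the table in bounded pages instead of loading all 3 million rows at once.

import java.util.List;

public class PagedProcessor {

    // Hypothetical DAO method; in the real project this would be the paging
    // query whose SQL was missing the limit/offset clauses.
    interface MissionDao {
        List<MissionRecord> queryPage(int offset, int limit);
    }

    static class MissionRecord { /* fields omitted */ }

    static void processAll(MissionDao dao) {
        final int pageSize = 1000;       // bounded batch instead of the whole table
        int offset = 0;
        while (true) {
            List<MissionRecord> page = dao.queryPage(offset, pageSize);
            if (page.isEmpty()) {
                break;                   // no more rows
            }
            for (MissionRecord record : page) {
                // handle one record; the previous page becomes eligible for GC
                // once the next iteration replaces it
            }
            offset += page.size();
        }
    }
}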

In summary, the culprit was a memory leak, which caused the sudden CPU spike, the growth in active Dubbo threads, the lock contention, and the request-processing timeouts. The analysis also exposed another unreasonable setting: the Dubbo thread pool is configured far too large, with a maximum of 1500 threads, meaning up to 1500 threads may end up handling user requests. When concurrency is high and the thread count climbs, the CPU spends its time on frequent context switches and overall throughput actually drops. This is a point worth optimizing.

I am recording this process partly to document the pit we stepped into and partly to improve our ability to analyze and handle failures. When a problem hits, don't panic: analyze it step by step until you find the cause. For online problems in particular, you cannot attach a debugger, so the application must have monitoring in place, and when something goes wrong you must analyze it together with the logs: business problems call for business logs, performance problems call for memory dumps, thread dumps, GC logs, and so on.

Corrections to this article are welcome. If you reprint it, please credit the original source: my.oschina.net/7001/blog