The title sounds like something from a clickbait marketing account, but it really was a production accident caused by using @Async without understanding Spring Boot's default thread pool configuration. I'm owning the mistake and standing at attention for the beating.

TL;DR (too long; didn't read version)

  • Spring Boot's @Async annotation uses Spring's default thread pool. In that pool, both queueCapacity and maxPoolSize default to the upper bound of int (Integer.MAX_VALUE). A thread pool works like this: once the number of threads in the pool has reached the core pool size, a new task is queued as long as the task queue is not full, rather than a new thread being started to handle it. Obviously there is no way we could ever fill a queue that large, so only the 8 core threads ever do any work for the entire lifetime of the pool, which leads to a backlog of tasks.

  • Introduces a few tools for troubleshooting problems in a live environment.

The problem code

// openid / user_name here belong to the accepting user
satisfyMsgService.sendActivitySatisfyMsg(ActivityWrapper.buildActivityMessageBean(
        activity, StrUtil.EMPTY, StrUtil.EMPTY, userId, acceptUserName, acceptOpenId,
        maAppId, inviteSceneId, PinYinConstants.ZERO, PinYinConstants.ZERO,
        channelId, msgType, sceneId, yyChannelId));

// 5. assemble the enrollment event and publish it
publisher.publishEvent(ActivityWrapper.activityEnrollEventForSend(
        activityEnrollDTO, userId, activity.getPicbookCollectionId(), activity.getPid(),
        appId, inviteUserId, discountFlag, yyChannelId));

@Async
public void sendActivitySatisfyMsg(ActivityMessageBean activityMessageBean) {
    // ... sends the satisfy message asynchronously
}

This code uses two asynchronous mechanisms for decoupling (message sending via @Async and event publishing via Spring's ApplicationEventPublisher), and asynchronous execution is triggered three times. However, both asynchronous mechanisms use the thread pool provided by Spring by default. Debugging shows the following default thread pool configuration.
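For reference, the sketch below shows roughly what the Spring Boot default task executor corresponds to, assuming Spring Boot 2.1+ (values taken from the TaskExecutionProperties defaults; exact numbers can differ between versions, so verify against your own environment).

import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

// A minimal sketch (not the actual Spring source) of what Spring Boot's default
// applicationTaskExecutor roughly corresponds to, based on the TaskExecutionProperties
// defaults in Spring Boot 2.1+. Exact defaults may differ between versions.
public class DefaultPoolSketch {
    public static void main(String[] args) {
        ThreadPoolTaskExecutor defaults = new ThreadPoolTaskExecutor();
        defaults.setCorePoolSize(8);                  // only 8 core threads
        defaults.setMaxPoolSize(Integer.MAX_VALUE);   // 2147483647
        defaults.setQueueCapacity(Integer.MAX_VALUE); // 2147483647 -> effectively unbounded queue
        defaults.setKeepAliveSeconds(60);
        defaults.setThreadNamePrefix("task-");
        defaults.initialize();
        System.out.println("core=" + defaults.getCorePoolSize()
                + ", max=" + defaults.getMaxPoolSize());
    }
}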

We can see that the default queueCapacity is the upper bound of int, 2147483647, which means the task queue can be considered unbounded.

This matters because a thread pool works as follows: when [the task queue is not yet full] and [the number of threads in the pool has already reached the core pool size], a new task is queued first rather than a new thread being started to process it. The result is a thread pool in which only eight threads ever work throughout its lifetime, which leads to a backlog of tasks.
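To make this queuing behaviour concrete, here is a small, self-contained sketch using the plain JDK ThreadPoolExecutor (illustrative code, not the project's code): with an effectively unbounded queue, submissions never get past the queueing step, so the pool never grows beyond its core size.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Plain-JDK sketch of the submission order: core threads -> queue -> extra threads.
// With an effectively unbounded queue (as in the default pool), the "extra threads"
// step is never reached, so the pool never grows beyond its core size.
public class QueueFirstDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2,                               // corePoolSize
                10,                              // maximumPoolSize
                60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());    // unbounded queue, like the default pool
        for (int i = 0; i < 100; i++) {
            pool.execute(() -> {
                try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
            });
        }
        Thread.sleep(500);
        // Only the 2 core threads are ever created; the remaining 98 tasks sit in the queue.
        System.out.println("poolSize=" + pool.getPoolSize()
                + ", queued=" + pool.getQueue().size());
        pool.shutdownNow(); // stop the demo without draining the queue
    }
}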

Problem description

  • In the event-publishing code behind the activityService interface, the listeners stopped responding to events. The last response time of listener ActivityEnrollListerner was 2020-10-18 22:03:02.121 (ES index e90d9e88ebe165594841e932c5fd41df); the last response time of listener ActivityInviteListener was 2020-10-18 22:03:07.831 (ES index f4fc7f83f424318cb591f7e4dec7d05c).

  • Affected function: after a user registers, a listener should be triggered to write the registration to the database. Once the event listeners stopped responding, users were registered but never persisted, so when such a user clicked a picture book to read it, their status was still "unregistered" and they could not read normally. From 2020-10-18 22:03 until the fix, 1327 users were affected.

Problem cause and code analysis

  • Reviewed the server logs between the last listener response at 2020-10-18 22:03:02.41 and the first missed response at 2020-10-18 22:03:09.382 [ES index 653a4d9ab61c3e144937b193da9da947, log range 10586273 to 10587014]; no errors or exceptions were found.

  • Alibaba Cloud machine / RDS / memory / disk resource monitoring in Grafana was normal during this period [CPU usage below 20%, memory normal, network access normal].

  • SkyWalking instance monitoring was normal, and the JVM heap and stack contents were normal.

  • In SkyWalking, service monitoring was normal and no error logs were reported during this period.

  • Having walked through all of the above and found nothing wrong, I turned to the @Async source code. Analysis of org.springframework.aop.interceptor.AsyncExecutionInterceptor showed that the thread pool it resolves to is indeed backed by an unbounded queue, so my guess was that tasks were piling up in the pool (one way to check this at runtime is sketched below).
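The Spring source itself is not reproduced here, but the backlog suspicion can be checked at runtime. The following is a hypothetical troubleshooting endpoint (the controller class, URL, and qualifier are illustrative assumptions, not part of the original project) that dumps the live state of a ThreadPoolTaskExecutor so a backlog becomes visible at a glance.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical troubleshooting endpoint: dump the live state of a ThreadPoolTaskExecutor
// so a task backlog (huge queue size, pool stuck at core size) is visible at a glance.
// Adjust the qualifier to the pool you want to inspect
// (e.g. Spring Boot's default bean is named "applicationTaskExecutor").
@RestController
public class PoolStatusController {

    @Autowired
    @Qualifier("applicationTaskExecutor")
    private ThreadPoolTaskExecutor executor;

    @GetMapping("/monitor/pool")
    public String poolStatus() {
        return "poolSize=" + executor.getPoolSize()
                + ", active=" + executor.getActiveCount()
                + ", queued=" + executor.getThreadPoolExecutor().getQueue().size()
                + ", completed=" + executor.getThreadPoolExecutor().getCompletedTaskCount();
    }
}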

Verifying the problem

To verify this, I wrote a demo

1: Custom thread pool [used to reproduce the problem and then tune it with different parameter configurations]

import java.util.concurrent.ThreadPoolExecutor;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

/**
 * @author yanghaolei
 * @date 2020-10-22 10:13
 */
@Configuration
public class ThreadConfig {

    @Bean("msgThreadTaskExecutor")
    public ThreadPoolTaskExecutor getMsgSendTaskExecutor() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(10);
        taskExecutor.setMaxPoolSize(25);
        taskExecutor.setQueueCapacity(800);
        taskExecutor.setAllowCoreThreadTimeOut(false);
        taskExecutor.setAwaitTerminationSeconds(60);
        taskExecutor.setThreadNamePrefix("msg-thread-");
        /*
         * When the task queue is full and the number of threads has reached maximumPoolSize,
         * any additional task is handled by the rejection policy:
         * ThreadPoolExecutor.AbortPolicy: discard the task and throw RejectedExecutionException.
         * ThreadPoolExecutor.DiscardPolicy: discard the task silently, without throwing an exception.
         * ThreadPoolExecutor.DiscardOldestPolicy: discard the oldest task in the queue, then retry the new task.
         * ThreadPoolExecutor.CallerRunsPolicy: run the rejected task directly in the submitting (caller) thread.
         */
        taskExecutor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        taskExecutor.initialize();
        return taskExecutor;
    }
}

2: Controller [dispatches 100 tasks]

@GetMapping("/test/async") public void testAsync() { CountDownLatch countDownLatch = new CountDownLatch(100); System.out.println(" main thread execution starts "); for (int i = 0; i < 100; i++) { testService.doAsync(countDownLatch,i); } try { countDownLatch.await(); } catch (InterruptedException e) { e.printStackTrace(); } system.out. println(" main thread execution end......" ); }Copy the code

3: Service [business code]

@Async("msgThreadTaskExecutor") public void doAsync(CountDownLatch countDownLatch, int i) { System.out.println(Thread.currentThread().getName() + "running"); try { //todo Thread.sleep(2000); Log.debug (" Asynchronous method executed......" + "+ I + "); } catch (Exception e) { e.printStackTrace(); } countDownLatch.countDown(); System.out.println(Thread.currentThread().getName() + "over"); }Copy the code
Conclusion:
  • After simulating parameters similar to those on the night of 10-18, we found that with tasks piling up it took about 12000 ms to process the 100 tasks. At that level of delay, listeners being dropped or failing to respond becomes far more likely, which further explains the cause of the accident.

  • Based on the result of this experiment, we set a fairly small queue capacity (800 in the online environment). The advantage is that once queued tasks exceed the queue capacity, new threads are started to handle incoming tasks until the pool reaches its maximum size, which copes much better with task accumulation. The downside is increased CPU consumption, since the extra threads have to be created and destroyed (and the discarded thread objects garbage-collected). However, CPU is easy to monitor, so although there are both advantages and disadvantages, in this case the trade-off converts hard-to-monitor thread pool resources into easy-to-monitor CPU resources.

  • As a follow-up, I also went through Alibaba's technical specifications and talked with other technical partners. The advice turned out to be fairly uniform: asynchronous code such as @Async should always use a custom thread pool, and the default pool should be avoided as much as possible (one common way to do this globally is sketched below). Frequent GC caused by badly sized pools and poor pool management, and the OOM that frequent GC leads to, is also one of the most common causes of accidents.
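As a minimal sketch of that advice (illustrative configuration, not the project's actual code), a custom pool can be made the global default for @Async by implementing Spring's AsyncConfigurer, so that even @Async methods without an explicit qualifier stop falling back to the default pool. The pool sizes below are assumptions to be tuned for your own workload.

import java.util.concurrent.Executor;

import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.AsyncConfigurer;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

// Sketch: make a custom pool the global default executor for @Async,
// so un-qualified @Async methods no longer use Spring's default pool.
@EnableAsync
@Configuration
public class AsyncConfig implements AsyncConfigurer {

    @Override
    public Executor getAsyncExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);       // illustrative values; tune per workload
        executor.setMaxPoolSize(25);
        executor.setQueueCapacity(800);
        executor.setThreadNamePrefix("async-default-");
        executor.initialize();
        return executor;
    }
}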

Reflections
  • At the code level, we have improved the thread pool configuration: queueCapacity and maxPoolSize have been explicitly reconfigured.

  • As traffic grows rapidly, so does the probability of running into unknowable risks. This accident reminded me of that NASDAQ-listed company once valued at US$20 billion (maybe not any more, it tumbled thirty percent yesterday, sigh 🐶), which fought 500 rounds with the short sellers with no clear winner while its share price swung up five-fold, harvesting the leeks of American capitalism as the new "light of the nation" after Luckin, and of my harrowing experience of losing 20k on it.