1. Background

For historical reasons, there is a dedicated service for processing MQ messages. It uses Aliyun RocketMQ with SDK version 1.2.6 (2016). As the business has grown, the number of consumers in this application has grown to more than 200. As a result, the ECS instance hosting the application has been under high load for a long time and alarms are triggered frequently.


2. Phenomenon analysis

The load of the ECS server hosting the application stays high for long periods (only this one service runs on the ECS), while CPU, I/O, and memory utilization remain low. The following figure shows the system load:

ECS configuration: 4 cores, 8 GB RAM; number of physical CPUs = 4; cores per physical CPU = 1; single-core, multi-processor

In terms of system load, a multi-core CPU behaves the same as multiple CPUs, so when judging system load, divide the load average by the total number of cores. As long as the load per core does not exceed 1.0, the system is running normally. In general, for an N-core CPU, a load below N indicates a normal system load.
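As a quick illustration, a minimal sketch (Linux-only, since it reads /proc/loadavg; the sample values in the comment are made up) of computing the per-core load:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class LoadPerCore {
    public static void main(String[] args) throws Exception {
        // /proc/loadavg looks like: "3.58 4.21 3.97 2/612 30241"
        // The first three fields are the 1-, 5- and 15-minute load averages.
        String[] fields = Files.readAllLines(Paths.get("/proc/loadavg")).get(0).split("\\s+");
        int cores = Runtime.getRuntime().availableProcessors(); // 4 on this ECS

        double load1m  = Double.parseDouble(fields[0]);
        double load5m  = Double.parseDouble(fields[1]);
        double load15m = Double.parseDouble(fields[2]);

        // Rule of thumb from above: load / cores <= 1.0 means the system is not overloaded.
        System.out.printf("load_1m/core=%.2f load_5m/core=%.2f load_15m/core=%.2f%n",
                load1m / cores, load5m / cores, load15m / cores);
    }
}
```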

Applying the rule above: load_15m and load_5m sit between 3 and 5, which means the medium- and long-term load of the system is at a high level. load_1m fluctuates heavily and is much larger than the number of CPU cores for long stretches. Busy in the short term and strained in the medium to long term usually marks the onset of congestion.


3. Cause analysis

Check the cause of the high load

Tips: a high system load does not necessarily mean that CPU resources are insufficient. A high load only indicates that too many tasks are queued to run; the tasks in the queue may be waiting on the CPU, on I/O, or on other resources.

Resource usage breakdown:

  • User processes =8.6%
  • Kernel processes =9.7%
  • Idle = 80%
  • Percentage of CPU time consumed by I/O waits =0.3%

In the preceding figure, CPU usage, memory usage, and I/O usage are all low while the load is high, so insufficient CPU resources can be ruled out as the cause of the high load.

Run the vmstat command to check the overall system running status, such as process, memory, and I/O, as shown in the following figure:

CPU registers are the CPU's built-in memory, small in capacity but extremely fast. The program counter stores the location of the instruction the CPU is currently executing, or of the next instruction to be executed. These are the environment the CPU depends on to run any task, and are therefore called the CPU context. A CPU context switch saves the context of the previous task (its CPU registers and program counter), loads the context of the new task into the registers and program counter, and finally jumps to the location indicated by the program counter to run the new task.

The general direction of the investigation: frequent interrupts and thread context switches (since only one Java service runs on this ECS, focus on that process).

Use the vmstat command to view the total number of CPU context switches, and use pidstat -wt 1 to view context switches at the thread level.

  1. First, the number of Java threads is unusually high
  2. Second, some threads regularly show 100+ context switches per second (the specific reasons are analyzed later)

To verify where these Java threads come from, check the number of threads under the application process: cat /proc/17207/status
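As a side note, the live thread count can also be checked from inside the JVM with the standard ThreadMXBean (a small illustrative sketch, not something used in the original troubleshooting):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Should be in the same ballpark as the "Threads:" line in /proc/<pid>/status.
        System.out.println("live threads = " + threads.getThreadCount());
        System.out.println("peak threads = " + threads.getPeakThreadCount());
    }
}
```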

Troubleshooting direction:

  • Too many threads
  • Some threads have too many context switches per second

Look at the main suspect first, the threads with too many context switches per second: pull a stack dump of the process in production, find the IDs of the threads whose context switches reach 100+/s, convert those thread IDs to hexadecimal, and search for them in the stack dump.
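The conversion is a plain decimal-to-hex step, since jstack prints each thread's native thread ID as a hexadecimal nid (the TID value below is made up for illustration):

```java
public class TidToHex {
    public static void main(String[] args) {
        long tid = 17321; // hypothetical thread ID reported by pidstat
        // jstack output shows the native thread ID as nid=0x... in hex.
        System.out.println("search the stack dump for: nid=0x" + Long.toHexString(tid));
        // prints: search the stack dump for: nid=0x43a9
    }
}
```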

From the figure above, the thread is in the TIMED_WAITING state and the code points to com.alibaba.ons.open.trace.core.dispatch.impl.AsyncArrayDispatcher. Checking several other threads that switch context frequently shows essentially the same stack.

Next, look at the issue of too many threads. Analyzing the stack shows a large number of ConsumeMessageThread threads (communication, listener, heartbeat, and other threads are ignored for now).

A search of the RocketMQ source code by thread name basically locates the following section of code:


4. Code analysis

  1. ConsumeMessageThread_ threads are managed by a thread pool. Looking at the key parameters of that pool: the core pool size is this.defaultMQPushConsumer.getConsumeThreadMin(), the maximum pool size is this.defaultMQPushConsumer.getConsumeThreadMax(), and the work queue is an unbounded LinkedBlockingQueue.
The default capacity of a LinkedBlockingQueue is Integer.MAX_VALUE. A ThreadPoolExecutor only creates threads beyond the core size once its queue is full, so with an effectively unbounded queue the maximum pool size never takes effect, as the sketch below illustrates.
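A minimal sketch of that behavior (not the SDK's actual code, just the same ThreadPoolExecutor configuration): with an unbounded LinkedBlockingQueue, the pool never grows past its core size.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ConsumePoolSketch {
    public static void main(String[] args) {
        // Same shape as the consume-message pool: core 20, max 64, unbounded queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                20,                           // corePoolSize (consumeThreadMin default)
                64,                           // maximumPoolSize (consumeThreadMax default)
                60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>()); // capacity Integer.MAX_VALUE, effectively unbounded

        for (int i = 0; i < 10_000; i++) {
            pool.execute(() -> { /* simulate consuming one message */ });
        }
        // Threads beyond the core size are only created when the queue is full,
        // which never happens here, so the pool stays at 20 threads.
        System.out.println("pool size = " + pool.getPoolSize());
        pool.shutdownNow();
    }
}
```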

Looking at the message-consumer configuration for the core and maximum thread counts: nothing is set at the code level, so the defaults are used, as shown in the following figure.

At this point, we can roughly locate the reason why there are too many threads:

Because the number of consuming threads (consumeThreadNums) is not specified, the defaults apply: 20 core threads and a maximum of 64. Each consumer creates a thread pool with 20 core threads, so in all likelihood each consumer ends up with 20 message-consuming threads, which makes the total thread count spike (20 × the number of consumers; with 200+ consumers that is on the order of 4,000 threads). Most of these consuming threads are sleeping or waiting, however, so they have little impact on context switching.

Code-level investigation: this code cannot be found in the open-source RocketMQ code base. The application uses the Aliyun SDK, and searching the SDK and following the call chain shows that this code belongs to the trace-reporting ("track back") module.

Combining the code with an analysis of the trace-reporting module flow (AsyncArrayDispatcher), the process can be summarized as follows:

Locate the code in the thread stack log in the SDK source code as follows:

traceContextQueue.poll(5, TimeUnit.MILLISECONDS); traceContextQueue is a bounded blocking queue. When polling, if the queue is empty the call blocks for up to 5 ms, which causes the dispatch thread to switch frequently between RUNNABLE and TIMED_WAITING.

poll(5, TimeUnit.MILLISECONDS) is used here instead of take(); I think this is to reduce network I/O.

The poll() method returns the head of the queue and removes it; if the queue is empty, it returns null without blocking the current thread. The dequeuing logic calls the dequeue() method. There is also an overloaded method, poll(long timeout, TimeUnit unit), which waits for the given time if the queue is empty.

traceContextQueue is an ArrayBlockingQueue, a bounded blocking queue backed by an array, which uses a lock for concurrent access and orders elements FIFO.
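For illustration, a simplified sketch of what such a dispatch loop looks like (class and field names here are hypothetical, not the SDK's actual code): a single dispatch thread polls a bounded ArrayBlockingQueue with a 5 ms timeout and hands any trace data to a reporting pool.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical names, for illustration only.
class TraceDispatcherSketch implements Runnable {
    private final ArrayBlockingQueue<Object> traceContextQueue = new ArrayBlockingQueue<>(1024);
    private final ExecutorService traceExecutor = Executors.newFixedThreadPool(2);
    private volatile boolean stopped = false;

    @Override
    public void run() {
        while (!stopped) {
            try {
                // Blocks for at most 5 ms when the queue is empty; every empty poll
                // moves this thread from RUNNABLE to TIMED_WAITING and back again.
                Object context = traceContextQueue.poll(5, TimeUnit.MILLISECONDS);
                if (context != null) {
                    traceExecutor.submit(() -> report(context));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    private void report(Object context) {
        // send the trace data to the broker (omitted)
    }
}
```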

So-called fair access means that blocked threads access the queue in the order in which they blocked; non-fair access means that when the queue becomes available, all blocked threads compete for it, so a thread that blocked first may end up accessing the queue last.

Since each consumer opens only one trace dispatch thread, there is no contention on this queue.

Let’s look at the blocking implementation of ArrayBlockingQueue again

As the code above shows, blocking is ultimately implemented through the park method; Unsafe.park is a native method.

park blocks the current thread and returns only when one of the following four things happens (see the sketch after this list):

  • The corresponding unpark is executed, or has already been executed
  • The thread is interrupted
  • The wait time specified by the time parameter elapses
  • The call returns spuriously, that is, for no reason
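A minimal sketch of this parking behavior using java.util.concurrent.locks.LockSupport, which wraps Unsafe.park, with a 5 ms timeout like the dispatch thread above:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

public class ParkSketch {
    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                // Parks for at most 5 ms; returns early on unpark, interrupt,
                // or a spurious wakeup. Each expiry takes the thread from
                // TIMED_WAITING back to RUNNABLE, i.e. a context switch.
                LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(5));
            }
        }, "park-sketch");

        waiter.start();
        Thread.sleep(50);           // let it go through a few park cycles
        LockSupport.unpark(waiter); // one way park returns
        waiter.interrupt();         // another way: interruption (also ends the loop)
        waiter.join();
    }
}
```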

So far, the reasons for the frequent thread switching and interrupts can be summarized as follows:

  • In the trace-reporting module of the Aliyun SDK, each consumer has one dispatch thread, one trace queue, and one trace-reporting thread pool. The dispatch thread polls the trace queue; if it gets nothing it blocks for 5 ms, and if it gets data it submits it to the trace-reporting thread pool, which then reports it. Too many dispatch threads switching frequently between RUNNABLE and TIMED_WAITING drives the system load up.
  • Because the code does not set the minimum or maximum number of message-consuming threads per consumer, each consumer starts 20 core threads for consumption, so a large number of mostly idle threads consume system resources.

5. Optimization plan

Targeted optimizations based on the causes above:

  • At the code level, add a thread-count configuration item for each consumer. Each consumer can set its core thread count according to the workload it actually carries, reducing the overall number of threads and avoiding large numbers of idle threads (see the sketch after this list).
  • The analysis above is based on version 1.2.6 of the Aliyun ONS SDK, which has since iterated to version 1.8.5. Analyzing the trace-reporting module of 1.8.5 shows that a switch has been added for trace reporting, and that trace reporting can be configured in singleton mode: all consumers share one dispatch thread, one bounded trace queue, and one trace-reporting thread pool. This can be regarded as confirmation of the analysis above.
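A sketch of the first item, assuming the Aliyun ONS consumer API (ONSFactory and PropertyKeyConst); the key ConsumeThreadNums exists in the ONS SDK, but the endpoint, credentials, and the value of 4 below are placeholders to be tuned per consumer:

```java
import java.util.Properties;

import com.aliyun.openservices.ons.api.Consumer;
import com.aliyun.openservices.ons.api.ONSFactory;
import com.aliyun.openservices.ons.api.PropertyKeyConst;

public class ConsumerThreadConfigSketch {
    public static Consumer buildConsumer() {
        Properties properties = new Properties();
        properties.put(PropertyKeyConst.GROUP_ID, "GID_example");      // placeholder group
        properties.put(PropertyKeyConst.AccessKey, "yourAccessKey");   // placeholder
        properties.put(PropertyKeyConst.SecretKey, "yourSecretKey");   // placeholder
        properties.put(PropertyKeyConst.NAMESRV_ADDR, "http://onsaddr.example.com:8080"); // placeholder
        // Cap the consuming thread pool per consumer instead of the default 20/64,
        // sized to the throughput this particular consumer actually needs.
        properties.put(PropertyKeyConst.ConsumeThreadNums, "4");
        return ONSFactory.createConsumer(properties);
    }
}
```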