01

Small tests of thread count and CPU utilization

There is a basic principle from operating systems (not perfectly rigorous, but good enough for intuition): a CPU core can execute only one thread's instructions at any instant, so in theory a single thread that executes instructions non-stop can drive one core to full utilization.

To verify this, write an example of an endless loop:

**Test environment:** AMD Ryzen 5 3600, 6 cores / 12 threads.

```java
public class CPUUtilizationTest {
    public static void main(String[] args) {
        // Busy loop: do nothing, just keep the core executing instructions
        while (true) {
        }
    }
}
```

After running the example, take a look at the current CPU utilization:

As you can see from the graph, core 3 is already at full utilization. Based on the theory above, what happens if I open more threads, say 6?

```java
public class CPUUtilizationTest {
    public static void main(String[] args) {
        // Start 6 busy-loop threads, one per physical core
        for (int j = 0; j < 6; j++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    while (true) {
                    }
                }
            }).start();
        }
    }
}
```

In this case, cores 1/2/5/7/9/11 are running at full utilization.

What about 12 threads? Will that run all the cores to full utilization? The answer is yes:

What happens if I increase the number of threads in the example above to 24?

As you can see from the figure above, the CPU utilization is still 100% for all cores, but the load has increased from 11.x to 22.x, indicating that the CPU is busier and the tasks of the threads cannot be executed in a timely manner.

Load Average explanation reference:

https://scoutapm.com/blog/understanding-load-averages

Modern CPUs are basically all multi-core, such as the AMD Ryzen 5 3600 tested here, with 6 cores and 12 threads (hyper-threading). We can simply think of it as a 12-core CPU, so it can execute 12 threads truly in parallel, without any switching.

If the number of threads to execute is greater than the number of cores, then it needs to be scheduled by the operating system. The operating system allocates CPU time slices to each thread and then switches them continuously to achieve the effect of “parallel” execution.

But is it really faster? As you can see from the above example, a single thread can run to full utilization of a core.

If each thread is "bossy", executing instructions without ever giving the CPU idle time, and the number of runnable threads is greater than the number of CPU cores, the operating system has to switch between threads more frequently to ensure that every thread gets a chance to run.

However, there is a cost to switching, each switching will be accompanied by register data update, memory page table update and other operations.

Although the cost of a single switch is trivial compared to an I/O operation, if there are too many threads they switch too frequently; in the extreme, more time per unit of time is spent switching than executing the program. Too much CPU is then wasted on context switching instead of real work, which is counterproductive.

The example above is a bit extreme and would be unlikely under normal circumstances.

Most programs have some I/O operations when they run, such as reading and writing files, receiving and sending network messages, etc. These I/O operations need to wait for feedback.

For example, during a network read or write, the thread has to wait for packets to be sent or received; while waiting, the thread is blocked and does not use the CPU at all.

The operating system then dispatches the CPU to execute instructions from other threads, making perfect use of the idle CPU period and increasing CPU utilization.

In the example above, the program loops over and over doing nothing, and the CPU has to keep executing instructions, leaving little free time.

What happens if you insert an I/O operation, and the CPU is idle during the I/O operation?

Let’s look at the results for a single thread:

```java
public class CPUUtilizationTest {
    public static void main(String[] args) {
        for (int n = 0; n < 1; n++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    while (true) {
                        // Compute phase: an empty loop of 100 million iterations
                        for (long i = 0; i < 100_000_000L; i++) {
                        }
                        // Wait phase: sleep 50 ms to simulate an I/O wait,
                        // during which this thread uses no CPU
                        try {
                            Thread.sleep(50);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }).start();
        }
    }
}
```

Wow, core 9, the only busy one, sits at about 50% utilization, half the 100% we saw without the sleep.

Now adjust the number of threads to 12 and see:

With 12 threads, single-core utilization is around 60%, not far from the single-thread result. The cores are still not full, so increase the thread count to 18:

At this point, single-core utilization is close to 100%. Because a thread's I/O waits consume no CPU, the operating system can schedule the CPU to execute more threads at the same time.

Now increase the frequency of I/O events: halve the loop count to 50_000_000, still with 18 threads:

At this point, the utilization of each core is only about 70 percent.

02

Summary of thread count and CPU utilization

The above example is just an aid to better understand the thread count/program behavior /CPU state relationship.

To summarize briefly:

  • A single "extreme" thread (one that computes continuously without pausing) can drive one core to full utilization, and a multi-core CPU can execute at most as many such "extreme" threads simultaneously as it has cores.
  • If every thread is that "extreme" and the number of runnable threads exceeds the number of cores, the result is unnecessary switching, higher load, and slower overall execution.
  • When pause operations such as I/O occur, the CPU goes idle; the operating system schedules other threads onto it, improving CPU utilization and allowing more threads to run in the same period.
  • The higher the frequency of I/O events, or the longer the wait/pause time, the more idle the CPU is, the lower its utilization, and the more threads the operating system can schedule onto it.

03

Formula for thread count planning

All of the above is just to build intuition; now let's look at how the book defines it.

"Java Concurrency in Practice" introduces a formula for calculating the number of threads:

If you want the program to run at a target CPU utilization, the number of threads required is:

Nthreads = Ncpu × Ucpu × (1 + W/C)

where Ncpu is the number of cores, Ucpu is the target CPU utilization (between 0 and 1), and W/C is the ratio of wait time to compute time.

The formula is very clear, so let’s try it in the example above.

**If I want the target utilization to be 90% (across all cores), the required thread count is:** 12 cores × 0.9 utilization × (1 + 50 (sleep time, ms) / 50 (≈ time of the 50_000_000-iteration compute loop, ms)) ≈ 22.

Now set the thread count to 22 and see what happens:

CPU utilization is now around 80%+, close to expectations. That the actual value is a little lower is not surprising, given the other threads in the process, the overhead of context switching, and the lack of rigor in the test case.

Inverting the formula, you can also estimate CPU utilization from the thread count:

22 threads / (12 cores × (1 + 50 (sleep time) / 50 (compute time))) ≈ 0.9.
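As a sanity check, the two calculations above can be sketched in a few lines of Java. Note that the 50/50 wait/compute split is the rough estimate from this article's example, not a measured value:

```java
public class ThreadCountFormula {
    public static void main(String[] args) {
        int nCpu = 12;            // logical cores
        double targetUtil = 0.9;  // desired CPU utilization (0..1)
        double waitMs = 50;       // estimated wait (sleep) time per cycle
        double computeMs = 50;    // estimated compute time per cycle

        // nThreads = nCpu * U * (1 + W/C)
        long nThreads = Math.round(nCpu * targetUtil * (1 + waitMs / computeMs));
        System.out.println(nThreads); // prints 22 (rounded from 21.6)

        // Inverted: utilization implied by a given thread count
        double util = nThreads / (nCpu * (1 + waitMs / computeMs));
        System.out.println(util); // roughly 0.92, close to the 0.9 target
    }
}
```

In practice you would measure waitMs and computeMs with a profiler rather than hard-code them.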

The formula is good, but in real programs it is generally hard to obtain accurate wait and compute times, because real code does far more than just "compute".

A piece of code mixes memory reads and writes, computation, I/O, and other operations, so it is hard to measure those two indicators precisely. Sizing threads by the formula alone is therefore too idealistic.

04

The number of threads in a real program

What is the appropriate number of threads (thread pool size) in a real application, or in some Java business systems?

**There is no fixed answer.** First set expectations, such as the CPU utilization, load, GC frequency, and so on that I can accept, then test and adjust until reaching a reasonable thread count.

Take a typical Spring Boot based business system, with the defaults of Tomcat container + HikariCP connection pool + G1 collector, and suppose a business scenario also needs a thread pool to execute business flows asynchronously or in parallel.

If I plan the number of threads according to the above formula, the error will be very large.

Tomcat has its own thread pool, HikariCP has its own background threads, the JVM has JIT compilation threads, and even G1 has its own background threads.

These threads are also running on the current process, on the current host, and consume CPU resources.

Therefore, under the interference of the environment, it is difficult to accurately plan the number of threads by the formula alone, so it must be verified by testing.

The process is like this:

  • Check whether other processes interfere with the current host.
  • Analyze whether there are other running or potentially running threads on the current JVM process.
  • Set goals. Target CPU utilization: how high can I tolerate the CPU going? Target GC frequency/pause time: multi-threaded execution will increase GC frequency, so what frequency and pause times are acceptable? Execution throughput: for example, in batch processing, how many threads do I need so that each batch finishes in time?
  • If the number of threads is too large, resources at some node on the call chain may run out, leaving many threads waiting on them (for example, third-party interfaces are rate-limited, connection pools are capped, or middleware cannot cope with the pressure).
  • Continuously increase/decrease the number of threads to test, test to the highest requirements, and finally get a “meet the requirements” number of threads.
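The tuning loop above can be sketched as a pool whose size is externalized, so each test run can try a different value without code changes. The property name `biz.pool.size` is invented for this sketch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TunablePool {
    public static void main(String[] args) {
        // Default to the logical core count; override per test run with
        // -Dbiz.pool.size=N on the command line (property name is hypothetical)
        int defaultSize = Runtime.getRuntime().availableProcessors();
        int poolSize = Integer.getInteger("biz.pool.size", defaultSize);
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        System.out.println("pool size: " + poolSize);
        // ... submit business tasks here, observe CPU / load / GC,
        // then adjust -Dbiz.pool.size and rerun
        pool.shutdown();
    }
}
```

Run it repeatedly with different `-Dbiz.pool.size` values while watching the metrics you set as goals.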

One more thing! How to think about thread counts varies from scenario to scenario:

  • maxThreads in Tomcat means different things under blocking I/O and non-blocking I/O.
  • Dubbo still uses a single connection by default, and distinguishes I/O thread (pool)s from business thread (pool)s. I/O threads are generally not the bottleneck, so they need not be numerous, but business threads easily become one.
  • Redis 6.0 is also multi-threaded, but only for I/O; "business" (command) processing is still single-threaded.

So, don’t worry about how many threads you have. There is no standard answer, you must combine the scenario, with the goal, through testing to find the most appropriate number of threads.

Some students may have a question: "Our system is under no pressure and doesn't need a carefully tuned thread count; it's just a simple asynchronous scenario that must not affect the rest of the system."

That is perfectly normal. Many internal business systems need no special performance; running stably and meeting the requirements is enough. For those, my recommended thread count is: the number of CPU cores.
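That recommendation is a one-liner with the JDK's standard executors; a minimal sketch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DefaultSizedPool {
    public static void main(String[] args) {
        // Size the pool to the logical core count reported by the JVM
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        pool.submit(() -> System.out.println("running on a pool of " + cores + " threads"));
        pool.shutdown();
    }
}
```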

05

The appendix

Number of CPU cores obtained by Java:

```java
// Returns the number of logical cores; e.g. 6 cores / 12 threads returns 12
Runtime.getRuntime().availableProcessors();
```

Getting the CPU core counts on Linux:

```shell
# Total physical cores = number of physical CPUs x cores per physical CPU
# Total logical CPUs   = number of physical CPUs x cores per physical CPU x hyper-threads per core
# Number of physical CPUs
cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
# Cores per physical CPU
cat /proc/cpuinfo | grep "cpu cores" | uniq
# Number of logical CPUs
cat /proc/cpuinfo | grep "processor" | wc -l
```