Many of you have probably seen this theory of how to set the thread count:

  • CPU-intensive program – number of cores + 1
  • I/O-intensive program – number of cores × 2

Really? Really? Does anyone really plan the number of threads according to this theory?

Small tests of thread count and CPU utilization

Setting aside operating-system and computer-architecture details, here's a basic premise: a CPU core can execute the instructions of only one thread at any given moment.

So in theory, I can max out a core's utilization just by keeping it executing instructions.

To verify this, write an example of an endless loop:

Test environment: AMD Ryzen 5 3600, 6 cores / 12 threads

```java
public class CPUUtilizationTest {
    public static void main(String[] args) {
        // endless loop, doing nothing
        while (true) {
        }
    }
}
```

After running the example, take a look at the current CPU utilization:

As you can see from the graph, core #3 is already running at full utilization.

So, based on the theory above, how many more threads should I open? Let's start 6, one per physical core:

```java
public class CPUUtilizationTest {
    public static void main(String[] args) {
        // start 6 threads, each spinning in an endless loop
        for (int j = 0; j < 6; j++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    while (true) {
                    }
                }
            }).start();
        }
    }
}
```

This time, cores 1/2/5/7/9/11 are running at full utilization.

What about 12 threads? Will that run every core to full utilization? The answer is yes:

What happens if I increase the number of threads in the example above to 24?

As you can see from the figure above, CPU utilization is still 100% on every core, but the load has climbed from 11.x to 22.x: the CPU is busier, and the threads' tasks can no longer all be executed promptly.

Modern CPUs are almost all multi-core, like the Ryzen 5 3600 tested here, with 6 cores and 12 threads (hyper-threading). We can treat it, roughly, as a 12-core CPU, so my CPU can do 12 things at once without any switching.
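As an aside, you can query this number from Java itself rather than a spec sheet; availableProcessors() is the standard API for it (the class name here is mine):

```java
public class CoreCount {
    public static void main(String[] args) {
        // number of logical processors visible to the JVM
        // (12 on this machine: 6 physical cores with hyper-threading)
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}
```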

If the number of threads to execute is greater than the number of cores, then it needs to be scheduled by the operating system. The operating system allocates CPU time slices to each thread and then switches them continuously to achieve the effect of “parallel” execution.

But is more threads always faster? As the example above shows, a single thread can max out one core. If every thread is this "greedy", executing instructions without ever giving the CPU idle time, and the number of runnable threads exceeds the number of cores, the operating system must switch between threads more frequently so that each one gets a turn to run.

Switching is not free, though: each switch involves saving and restoring register state, updating memory page tables, and so on. The cost of a single switch is trivial compared with an I/O operation, but when there are too many threads and switches become too frequent, the time spent switching per unit of time can even exceed the time spent actually running the program. Too much CPU is then wasted on context switching instead of useful work, which defeats the purpose.
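If you want to observe this cost yourself, here is a rough sketch (my own construction, not a rigorous benchmark): the same fixed amount of spin work is divided among an increasing number of threads. On a 12-thread CPU, you would expect little or no gain past roughly 12 threads, and possibly a slowdown as switching overhead grows:

```java
import java.util.concurrent.CountDownLatch;

public class SwitchOverheadTest {
    // total spin iterations, fixed so that only the thread count varies
    static final long TOTAL_WORK = 12_000_000_000L;
    // written from each thread so the JIT cannot eliminate the spin loop
    static volatile long sink;

    public static void main(String[] args) throws InterruptedException {
        for (int threads : new int[]{6, 12, 24, 48}) {
            final long perThread = TOTAL_WORK / threads;
            CountDownLatch done = new CountDownLatch(threads);
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                new Thread(() -> {
                    long acc = 0;
                    for (long i = 0; i < perThread; i++) {
                        acc += i; // pure computation, no I/O
                    }
                    sink = acc;
                    done.countDown();
                }).start();
            }
            done.await();
            System.out.printf("%d threads: %d ms%n",
                    threads, (System.nanoTime() - start) / 1_000_000);
        }
    }
}
```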

These examples are deliberately extreme; normal programs are unlikely to behave this way.

Most programs perform some I/O while they run: reading and writing files, sending and receiving network messages, and so on. These I/O operations involve waiting for a response; during a network read or write, for example, the thread must wait for the packet to be sent or to arrive. While it waits, the thread is in a waiting state and consumes no CPU, so the operating system schedules the CPU to execute instructions from other threads. This puts the otherwise idle CPU time to use and raises overall CPU utilization.
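You can observe this waiting state directly from Java. A minimal sketch, using the standard Thread.getState() API (the class name and timings are mine):

```java
public class ThreadStateDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(2_000); // stand-in for an I/O wait
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        Thread.sleep(100); // give the thread time to enter its sleep
        // prints TIMED_WAITING: the thread exists but is consuming no CPU
        System.out.println(t.getState());
    }
}
```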

In the endless-loop examples earlier, the program spins doing nothing, so the CPU has to keep executing instructions and gets almost no idle time. What happens if we insert an operation that leaves the CPU idle, like an I/O wait? Let's look at a single thread first:

```java
public class CPUUtilizationTest {
    public static void main(String[] args) throws InterruptedException {
        // start n threads; 1 for now, adjusted in the later experiments
        for (int n = 0; n < 1; n++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    while (true) {
                        // simulate computation: an empty loop of 100 million iterations
                        for (int i = 0; i < 100_000_000; i++) {
                        }
                        // simulate an I/O wait: sleep for 50 ms
                        try {
                            Thread.sleep(50);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }).start();
        }
    }
}
```

Wow: core 9, the only one doing work, sits at about 50% utilization, half of the 100% we saw without the sleep. Now adjust the thread count to 12 and see:

Per-core utilization is around 60%, not far from the single-thread result, and the cores are still not fully utilized. Let's increase the thread count to 18:

At this point, per-core utilization is close to 100%. So when threads spend part of their time in I/O waits that do not occupy the CPU, the operating system can schedule the CPU to run more threads in the same period.

Now let's raise the frequency of the I/O events: cut the loop count in half, to 50_000_000, still with 18 threads:

At this point, the utilization of each core is only about 70%.

Summary of thread count and CPU utilization

To better understand the relationship between thread count, program behavior, and CPU state, here is a quick summary of the examples above:

  • A single "extreme" thread (one that continuously performs computation) can max out a single core, and a multi-core CPU can run at most as many such threads simultaneously as it has cores
  • If every thread is this "extreme" and the number of runnable threads exceeds the number of cores, the result is unnecessary switching, higher load, and slower execution
  • When a thread pauses, e.g. for I/O, the CPU it was using goes idle, and the operating system schedules other threads onto it, improving CPU utilization and letting more threads run in the same period
  • The higher the frequency of I/O events, or the longer the wait/pause time, the more idle time the CPU has, the lower its utilization, and the more threads the operating system can schedule onto it

Formula for thread count planning

All of that was to build intuition; now let's look at the definition in the book, which gives a formula for calculating the number of threads:

If you want the program to run at a target CPU utilization, the required number of threads is:

N_threads = N_cpu * U_cpu * (1 + W/C)

where N_cpu is the number of cores, U_cpu is the target CPU utilization (between 0 and 1), W is the thread's wait time, and C is its compute time.

The formula is clear, so let’s try the example above:

If I expect a target utilization of 90% (across all cores), the required number of threads works out to:

N_threads = 12 (cores) * 0.9 (utilization) * (1 + 50 ms sleep / 50 ms compute) ≈ 22

where 50 ms compute is roughly how long one pass of the 50_000_000-iteration loop takes.

Now set the thread count to 22 and see what happens:

CPU utilization now sits around 80-something percent, close to the expectation. The actual figure is a bit lower because of the large thread count, the context-switching overhead, and the loose nature of the test case.

Inverting the formula, you can also calculate CPU utilization from the number of threads:

U_cpu = 22 (threads) / (12 (cores) * (1 + 50 ms sleep / 50 ms compute)) ≈ 0.9
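Both directions of the formula are easy to wrap in a tiny helper. This is only a sketch of the arithmetic above (the class and method names are mine, not from any library); wait time and compute time just need to be in the same unit:

```java
public class ThreadCountFormula {
    // N_threads = N_cpu * U_cpu * (1 + W / C)
    static long threadCount(int cores, double targetUtilization,
                            double waitTime, double computeTime) {
        return Math.round(cores * targetUtilization * (1 + waitTime / computeTime));
    }

    // inverted: U_cpu = N_threads / (N_cpu * (1 + W / C))
    static double cpuUtilization(int threads, int cores,
                                 double waitTime, double computeTime) {
        return threads / (cores * (1 + waitTime / computeTime));
    }

    public static void main(String[] args) {
        // the example from this article: 12 cores, 90% target, 50 ms sleep vs ~50 ms compute
        System.out.println(threadCount(12, 0.9, 50, 50));   // 22
        System.out.println(cpuUtilization(22, 12, 50, 50)); // ~0.92
    }
}
```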

The formula is nice, but in real programs it is usually hard to obtain accurate wait and compute times, because real code is far more complex than pure "computation". A piece of code mixes memory reads and writes, computation, I/O, and other operations, so these two numbers are difficult to measure precisely, and calculating the thread count from the formula alone is too idealistic.