Threads and Hardware

Both the number of CPU cores and hyperthreading affect the performance of software threads, but not equally: doubling the number of physical cores can roughly double the performance of a program, while doubling the logical cores via hyperthreading cannot.

The rest of the examples in this article run on a CPU with 4 physical cores and 8 logical cores, which makes it possible to show the difference between a hyperthreaded and a non-hyperthreaded CPU. (This is not to say that hyperthreading is unimportant: it is a free 20-40% improvement in performance or throughput, and the gain can be much higher in favorable cases, such as when much of your code consists of logically independent work; see the Hyper-Threading Technology wiki for details.) On the Java side, a hyperthread should always be treated as a de facto CPU.
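The JVM itself takes this view: it reports logical processors, so each hyperthread counts as a CPU. A one-line check using the standard JDK API:

// availableProcessors() counts logical CPUs, so each hyperthread counts individually.
int cpus = Runtime.getRuntime().availableProcessors(); // 8 on a 4-core/8-thread CPU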

Thread pools and ThreadPoolExecutor

In Java, you can manage threads by writing your own code or by using a thread pool. Java servers typically use one or more thread pools to process client requests, and other Java applications can use the ThreadPoolExecutor class to execute tasks in parallel.

Some web frameworks also build on ThreadPoolExecutor. Spring's ThreadPoolTaskExecutor, for example, is a re-encapsulation of ThreadPoolExecutor that allows the pool to be configured as a Spring bean, with its core parameters (corePoolSize, maxPoolSize, keepAliveSeconds) updatable at runtime. That makes the class well suited to applications that need live thread management and monitoring.
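As a rough sketch of what that looks like (assuming a Spring application; the bean and the sizes below are illustrative, not taken from the original):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class ExecutorConfig {
    @Bean
    public ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);      // these three setters also work on a live pool
        executor.setMaxPoolSize(16);
        executor.setKeepAliveSeconds(60);
        executor.setQueueCapacity(1000);
        return executor;
    }
}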

Of course, there are frameworks that do not use ThreadPoolExecutor to manage threads, mostly because they predate the ThreadPoolExecutor class.

While the implementation of thread pools may vary slightly from framework to framework, the basic concepts are generally the same.

The single most critical factor in using a thread pool is the size of the thread pool. The performance of a thread pool varies with its size, and in some cases an excessively large size can degrade its performance.

The various kinds of thread pools all work in broadly the same way: tasks are submitted to one or more queues, and a certain number of threads pull tasks from the queue and execute them. In a web server, the result of a task is sent back to the client; in other cases it may be kept in local memory, persisted to a database, and so on. After a thread completes a task, it pulls the next one from the queue, or waits for one to arrive if the queue is empty.

A thread pool has a minimum number of threads and a maximum number of threads. The minimum number of threads are persistent, waiting for tasks to be assigned to them. Because creating a thread is a fairly expensive operation, keeping these long-lived threads around speeds up task execution. On the other hand, every thread consumes system resources, including native memory, so too many threads waste resources that other applications could use. The maximum number of threads therefore acts as a necessary threshold to prevent too many tasks from executing at once.

Set the maximum number of threads

How should the optimal maximum number of threads be set for given hardware and a given workload? There is no simple formula: as with GC tuning, it depends on the characteristics of the workload and the hardware it runs on. One factor stands out, though: how often tasks block.

The rest of the discussion revolves around a CPU with four physical cores.

Obviously, the maximum number of threads should be set to at least 4. Some threads in the JVM do other work, but they rarely occupy an entire core unless a concurrent/parallel garbage collector such as G1, ZGC, or Shenandoah is in use; those collectors need enough threads of their own to reclaim memory.

Does it make sense to have more than four threads? Consider the simplest case: every task is pure CPU computation, with no network I/O and no lock contention. I used a recursive Fibonacci calculation (recursion here just makes the program run long enough to measure), submitted the same task 16 times, and recorded the elapsed time.
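The FiboTask class itself is not shown here; a minimal sketch, assuming a plain recursive implementation with an argument large enough to keep a core busy for several seconds, might be:

// Hypothetical reconstruction of FiboTask: pure CPU work, no I/O, no locks.
static class FiboTask implements Runnable {
    @Override
    public void run() {
        fib(42); // the argument only needs to be big enough to run for a few seconds
    }

    private long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }
}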

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public static void main(String[] args) throws InterruptedException {
    // Vary the pool size (1, 2, 4, 8, 16) to produce the table below.
    ExecutorService e = Executors.newFixedThreadPool(1);
    List<FiboTask> tasks = new ArrayList<>();
    for (int i = 0; i < 16; i++) {
        tasks.add(new FiboTask());
    }
    long t = System.currentTimeMillis();
    for (FiboTask ft : tasks) {
        e.submit(ft);
    }
    e.shutdown();
    // Wait for all 16 tasks to finish, then report the total elapsed time.
    e.awaitTermination(1, TimeUnit.DAYS);
    System.out.println(System.currentTimeMillis() - t);
}
The results:

Number of threads | Time (ms) | Baseline
1  | 70505 | 100%
2  | 37990 | 53.88%
4  | 18646 | 26.45%
8  | 18911 | 26.82%
16 | 19294 | 27.37%

As you can see, although the numbers wobble slightly because of other programs on the machine, running with 2 and 4 fully parallel threads takes roughly 50% and 25% of the single-threaded time, respectively. Perfectly linear scaling is impossible in practice for several reasons: threads have to coordinate to pull tasks from the queue, and with four threads CPU usage is already at 100%; even ignoring other user programs, the system's own processes consume some CPU, so the JVM never gets every cycle. Note also that in this test, even with far more threads than CPU cores, the performance penalty was quite small.

Let's see what happens with hyperthreading enabled on a 2 physical core CPU (using Docker to limit the CPUs available to the container). With hyperthreading enabled, there are now 2 physical cores and 4 logical cores. The test results are as follows:

Number of threads | Time (ms) | Baseline
1  | 70671 | 100%
2  | 37528 | 53.10%
4  | 35094 | 49.66%
8  | 35977 | 50.91%
16 | 36113 | 51.10%

Up to two threads the improvement is linear; beyond that it is minimal, because this workload never blocks. The benefits of hyperthreading are much more obvious when threads perform I/O or wait on locks.

As mentioned earlier, performance bottleneck analysis is one of the keys to performance tuning. In the example above, the bottleneck is clearly on the CPU, and it makes no sense to use more than four threads.

This example is extreme. Real threads usually perform I/O, especially in a web server, where a thread may be querying a database, writing to disk, and so on. In those cases the CPU is not always the bottleneck; the bottleneck can be external, such as a database server or disk performance.

If the bottleneck really is external, expanding the thread pool is the wrong move. As a simple example, suppose we have a client issuing HTTP requests and a web server running the application. Setting the server's capacity aside: with one client thread making requests, the server CPU might sit at 25% while the client CPU is almost completely idle; raise the client to 4 threads and the server CPU reaches 100% while the client's is perhaps 20%.

Looking only at the client, it does seem wasteful to leave so much capacity idle. But does that mean we should add more client threads? Let's run a test and see:

Number of client threads | Average response time (ms) | Baseline
1  | 232 | 100%
2  | 277 | 119.40%
4  | 287 | 123.71%
8  | 297 | 128.02%
16 | 354 | 152.59%
32 | 539 | 232.33%

By the time the client reached 32 threads, response times had degraded severely. So once the server becomes the bottleneck, it is distinctly harmful for the client to keep adding threads. The same holds when interacting with a database, especially once the database itself is the bottleneck.

This is one reason auto-tuning of thread pools is hard to implement: a pool has some visibility into the amount of work it handles and the hardware it runs on, but it lacks visibility into the overall environment, including external systems. One could argue that Java's cached thread pool expands automatically, but I don't think it should be used in production at all; the risk of using it is quite high.

In this test, the server had 4 physical cores and created 16 threads by default at startup, which makes sense for a web server and, as mentioned earlier, is expected: while one call blocks waiting for a response, other threads can run other tasks. So a web server creating more threads than cores is a reasonable compromise: it carries a slight penalty for CPU-intensive tasks and increases throughput for I/O-intensive ones.

To sum up, setting the maximum size of a thread pool is more an art than a fixed procedure. A carefully tuned pool will typically deliver 80-90% of the achievable performance in production, and even overestimating the number of threads you actually need carries an acceptable penalty. But when pool sizing does cause a problem, it can be a huge problem, so thorough testing is essential to minimize that risk.

Set the minimum number of threads

After exploring the setting of the maximum number of threads, the next step is the minimum number of threads. In most cases, both can be set to the same value.

The reason to set the minimum lower, for example to 1, is to keep the system from creating too many threads and so save system resources. In practice, though, we design systems for their maximum expected throughput, which means creating the maximum number of threads anyway; and if the system cannot handle that maximum, tuning the minimum will not help.

Should threads be pre-created?

By default, a newly created ThreadPoolExecutor starts with zero threads; core threads are created as tasks are submitted. Say a pool is configured with 8 core threads and 16 maximum threads. The core threads are effectively the minimum: they remain in the pool even when idle, while the other 8 are created on demand and reclaimed after they have been idle for the keep-alive time.

For a server, this means a slight delay on the first eight requests. The delay is small, but if you want to avoid it you can pre-create those threads. Consider this method:

/**
 * Starts a core thread, causing it to idly wait for work. This
 * overrides the default policy of starting core threads only when
 * new tasks are executed. This method will return {@code false}
 * if all core threads have already been started.
 *
 * @return {@code true} if a thread was started
 */
public boolean prestartCoreThread() {
    return workerCountOf(ctl.get()) < corePoolSize &&
        addWorker(null, true);
}
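Its companion prestartAllCoreThreads() starts every core thread at once and returns how many it started. A usage sketch (the sizes are placeholders; the classes come from java.util.concurrent):

ThreadPoolExecutor pool = new ThreadPoolExecutor(
        8, 16, 60L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(100));
int started = pool.prestartAllCoreThreads(); // creates all 8 core threads up front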

On the other hand, the drawback of specifying a smaller minimum is only nominal. The cost appears just once, when several tasks are pending and the pool has to create new threads to meet demand. Creating threads does hurt performance, which is why thread pools are needed in the first place, but once a thread is kept in the pool afterward, that one-time creation cost is negligible.

For example, in a batch program such as a BI job, it doesn't matter whether threads are allocated at creation time or on demand. In other programs the new threads tend to be allocated during the warm-up period, where their cost has negligible impact. And even when thread creation happens while serving live traffic, as long as only a limited number of threads are created, the impact is unlikely to be noticed.

Another value worth tuning is the idle time of a thread. Suppose a pool has a minimum of 1 thread and a maximum of 4, and the program settles into a cycle of two tasks every 15 seconds. During the first cycle, the pool creates a second thread. It clearly makes sense for that second thread to stay in the pool for a while: we want to avoid the pattern where it is created, finishes its task within 5 seconds, idles for the next 5 seconds, and then exits, because 5 seconds later the next cycle starts and needs a second thread again. In general, once a thread beyond the minimum has been created, it should stay in the pool for at least a few minutes to absorb possible subsequent spikes. If you have a reliable queueing-theory model, you can calculate the retention time from it; otherwise, think in minutes: a retention time of at least 10 to 30 minutes is a sensible default.
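ThreadPoolExecutor exposes this directly through setKeepAliveTime(); a sketch (the 10-minute figure follows the guidance above and is a starting point to measure against, not a rule):

// Threads above the core size now linger for 10 minutes after their last task.
pool.setKeepAliveTime(10, TimeUnit.MINUTES);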

Keeping idle threads around usually has little impact on an application. The thread objects themselves do not occupy much heap space. The exceptions are a thread that holds a large amount of thread-local storage, or a thread whose runtime objects reference a large amount of memory; in either case, releasing the thread frees significant heap. Neither situation should arise in the first place, though: when a pool thread is idle, make sure it does not reference any runtime objects; if it does, there is a bug somewhere. Depending on the pool implementation, thread-local variables may be retained for reuse, but the amount of memory those objects occupy must be bounded.

There is one important exception to this rule: pools that may grow very large. If the task queue is expected to receive 20 tasks per execution cycle, then 20 is the recommended minimum for the pool.

But if that pool runs on a powerful machine sized to handle a peak of 2,000 tasks, keeping 2,000 idle threads in the pool hurts performance during the periods when only 20 tasks are running. This is usually not a problem at small and medium scale, but where it is, make sure the pool's minimum size is set appropriately.

Number of thread pool tasks

A pool's pending tasks are held in a queue or list; when a pool thread is free to execute a task, it pulls one from the queue. The rates of production and consumption can fall out of balance, and the queue can then grow very large, with queued tasks waiting a very long time for everything ahead of them to complete. Imagine a web server under heavy load: if a task joins the queue and still has not run 3 seconds later, the user experience is terrible. Thread pools therefore typically limit the size of the pending-task queue. ThreadPoolExecutor does this in various ways depending on the data structure it is configured with, and servers usually expose a parameter for it, such as acceptCount in Tomcat.

As with the maximum number of threads, there is no universal strategy for sizing the queue. Suppose a server has a queue 30,000 entries long and four available CPUs: if each task takes only 50 ms, the queue can be drained in about 6 minutes (30,000 × 50 ms ÷ 4 ≈ 375 seconds), which may still be acceptable; at 1 second per task it takes about 2 hours. So again, estimate and then measure your application's actual needs to decide how to adjust this value.

In any case, once the queue's length limit is reached, further submissions fail. ThreadPoolExecutor hands the overflow to its RejectedExecutionHandler, whose rejectedExecution() method by default throws a RejectedExecutionException. When that happens, the server should return a reasonable response to the client, such as status code 429 (Too Many Requests) or 503 (Service Unavailable).
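A sketch of wiring in a custom handler (the sizes are placeholders, and mapping the exception to a 429/503 would happen at the service boundary):

ThreadPoolExecutor pool = new ThreadPoolExecutor(
        8, 8, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1000),
        (task, executor) -> {
            // Invoked when the queue is full and the pool is at its maximum size.
            throw new RejectedExecutionException("server overloaded, try again later");
        });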

Adjust the ThreadPoolExecutor

The intuitive policy for a thread pool goes like this: start with the minimum number of threads; if a task arrives while all threads are busy, start a new thread and run it immediately; once the maximum number of threads are all busy, queue the task; and if the queue is full, reject it. ThreadPoolExecutor, however, can behave quite differently.

ThreadPoolExecutor determines when to start a new thread based on the type of queue used to hold tasks. There are three possibilities:

  • SynchronousQueue

When a SynchronousQueue is used, the pool behaves as expected with respect to the number of threads: if all threads are busy and the pool is below its maximum size, a new thread is started for the new task. But this queue has no way to hold pending tasks, so if a task arrives while the maximum number of threads are already busy, it is always rejected. The option is therefore good for managing a small number of tasks but may be inappropriate elsewhere. The JDK documentation for the class suggests specifying a very large maximum thread count, which may work if the tasks are completely CPU-bound but can be counterproductive otherwise. On the other hand, if you need a pool whose thread count is easy to adjust, this is also a good choice.

In this case, the core count is the minimum number of threads (the ones that remain in the pool even when idle), and the maximum count is simply the cap on the pool's size.
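This is in fact how Executors.newCachedThreadPool() is constructed inside the JDK:

// No core threads, an effectively unbounded maximum, 60-second idle timeout.
new ThreadPoolExecutor(0, Integer.MAX_VALUE,
        60L, TimeUnit.SECONDS,
        new SynchronousQueue<Runnable>());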

  • Unbounded queues

When an unbounded queue such as LinkedBlockingQueue is used, no task is ever rejected, because the queue can always grow. In this case the pool never uses more than the core number of threads, and the maximum thread count is ignored. This essentially mimics a traditional fixed-size pool with the core count as its size; but because the queue is unbounded, there is a risk of excessive memory consumption if tasks are submitted much faster than they are consumed.

This is also the type of pool returned by Executors.newFixedThreadPool() and Executors.newSingleThreadScheduledExecutor(): the former's core (and thus effective maximum) thread count is the parameter you pass in; the latter's core count is 1.
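For reference, newFixedThreadPool(nThreads) is built along these lines in the JDK:

// Core and maximum are equal, so the pool is fixed at nThreads;
// the unbounded LinkedBlockingQueue absorbs any backlog.
new ThreadPoolExecutor(nThreads, nThreads,
        0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());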

  • Bounded queues

Pools using a bounded queue, such as ArrayBlockingQueue, employ a more involved algorithm to decide when to start a new thread. Suppose the pool has a core count of 4, a maximum of 8, and an ArrayBlockingQueue holding at most 10 tasks. As tasks arrive and are queued, the pool runs at most four threads; until the queue fills, four is all it will use. A new thread is created only when the queue is full and another task is submitted.

At that point, instead of rejecting the new task, the pool starts a new thread, and that thread runs the newly submitted task directly rather than placing it in the full queue.

In this example, the pool reaches its eighth thread, the configured maximum, only when 7 tasks are executing, 10 tasks are waiting in the queue, and a new task is submitted that would have been the 11th item in the full queue.
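Spelled out, the configuration used in this walkthrough would be (sizes purely illustrative):

// Core 4, max 8, bounded queue of 10: the 5th through 8th threads start
// only when the queue is full and yet another task is submitted.
ThreadPoolExecutor pool = new ThreadPoolExecutor(
        4, 8, 60L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(10));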

The idea is that the pool will run with its configured core size most of the time, even when a reasonable number of tasks are waiting in the queue, which lets the pool act as a throttle. Only if the backlog becomes too large does the pool try to run more threads to work it off, subject to a second throttle, the maximum thread count.

If the system has no external bottlenecks and sufficient CPU resources, then the algorithm idea is fine: add new threads to consume tasks in the queue faster.

On the other hand, the algorithm has no idea why the queue is growing. If the cause is external, adding threads is a mistake; it is also a mistake if the pool is running on a machine short of CPU. Adding threads makes sense only when the backlog comes from extra load on the system (such as an increase in client requests). (Which raises the question: if that is the case, why wait for the queue to reach a threshold at all? If resources are available, creating the threads earlier would put them to use sooner.)

Each of these choices has pros and cons, but when trying to maximize application performance, apply the KISS principle (Keep It Simple, Stupid). A good general rule is to avoid the Executors factory methods in production: the default pools they return give you no control over your application's memory use. Build your own ThreadPoolExecutor instead, ideally with the core and maximum thread counts equal, and use an ArrayBlockingQueue to limit the number of requests held in memory while waiting to execute.
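A sketch of that recommendation (the sizes are placeholders to measure and tune, not prescriptions; the classes come from java.util.concurrent):

int nThreads = Runtime.getRuntime().availableProcessors();
ThreadPoolExecutor pool = new ThreadPoolExecutor(
        nThreads, nThreads,               // core == max: fixed, predictable size
        0L, TimeUnit.MILLISECONDS,        // no idle timeout needed when core == max
        new ArrayBlockingQueue<>(1000));  // bounds the requests waiting in memory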

Simple summary

A thread pool is a type of object pool. Threads are expensive to create, and thread pools can limit the number of threads on a system.

Thread pools must be tuned carefully; blindly increasing the number of threads can, in some cases, degrade performance.

Using a simpler configuration for ThreadPoolExecutor generally provides the best and most predictable performance.

Resources: Java Performance, O'Reilly