Local variables are 5 times faster than global variables?

Hello, everyone, Lei’s performance optimization is here again!

The reason I write this performance optimization series is simple. First, there is no good series of articles on performance optimization out there, paid ones included. Second, I want to cover topics that others do not: if everyone is writing about SpringBoot, I will not focus on SpringBoot. Few articles deal with performance optimization, and that is why I write about it.

Is it useful? Is it a must-have? I think everyone has their own answer. A good swordsman stays obsessed with the sword all his life, and I’m sure you, reading this, are the same.

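We will use JMH (Java Microbenchmark Harness), the microbenchmark tool from Oracle and the OpenJDK project, to compare the performance of local variables and global variables (in Java terms, fields of an object). First, add the JMH dependency: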

<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-core -->
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>{version}</version>
</dependency>

Then write the test code:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime) // Measure the average time per invocation
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS) // Warm up for 2 rounds, 1s each
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS) // Measure for 5 rounds, 3s each
@Fork(1) // Run in 1 forked JVM process
@State(Scope.Thread) // One state instance per benchmark thread
public class VarOptimizeTest {

    char[] myChars = ("Oracle Cloud Infrastructure Low data networking fees and " +
            "automated migration Oracle Cloud Infrastructure platform is built for " +
            "enterprises that are looking for higher performance computing with easy " +
            "migration of their on-premises applications to the Cloud.").toCharArray();

    public static void main(String[] args) throws RunnerException {
        // Start the benchmark
        Options opt = new OptionsBuilder()
                .include(VarOptimizeTest.class.getSimpleName()) // The benchmark class to run
                .build();
        new Runner(opt).run(); // Execute the test
    }

    @Benchmark
    public int globalVarTest() {
        int count = 0;
        // Reads the myChars field on every iteration
        for (int i = 0; i < myChars.length; i++) {
            if (myChars[i] == 'c') {
                count++;
            }
        }
        return count;
    }

    @Benchmark
    public int localityVarTest() {
        char[] localityChars = myChars; // Copy the field into a local variable once
        int count = 0;
        for (int i = 0; i < localityChars.length; i++) {
            if (localityChars[i] == 'c') {
                count++;
            }
        }
        return count;
    }
}

The globalVarTest method uses the global variable myChars for loop traversal, while the localityVarTest method uses the local variable localityChars for loop traversal. The JMH test results are as follows:

What the hell? Isn’t the performance of the two methods about the same? Why did you say it was five times worse?

CPU Cache

The two methods perform almost the same because the value of the global variable myChars gets cached: instead of being fetched from the object’s instance field (where the object actually stores it) on every access, it is served from a cached copy, either in the CPU cache or in a register once the JIT compiler has optimized the loop. Hence the result above.

To expose the real difference between local and global variables, we declare the global variable myChars with the volatile keyword so that a cached copy cannot be reused. Strictly speaking, volatile does not switch off the hardware cache; it guarantees that every read observes the latest written value, which forces each access inside the loop to actually read the field again.

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime) // Measure the average time per invocation
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS) // Warm up for 2 rounds, 1s each
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS) // Measure for 5 rounds, 3s each
@Fork(1) // Run in 1 forked JVM process
@State(Scope.Thread) // One state instance per benchmark thread
public class VarOptimizeTest {

    volatile char[] myChars = ("Oracle Cloud Infrastructure Low data networking fees and " +
            "automated migration Oracle Cloud Infrastructure platform is built for " +
            "enterprises that are looking for higher performance computing with easy " +
            "migration of their on-premises applications to the Cloud.").toCharArray();

    public static void main(String[] args) throws RunnerException {
        // Start the benchmark
        Options opt = new OptionsBuilder()
                .include(VarOptimizeTest.class.getSimpleName()) // The benchmark class to run
                .build();
        new Runner(opt).run(); // Execute the test
    }

    @Benchmark
    public int globalVarTest() {
        int count = 0;
        // Every access to the volatile field is a real field read
        for (int i = 0; i < myChars.length; i++) {
            if (myChars[i] == 'c') {
                count++;
            }
        }
        return count;
    }

    @Benchmark
    public int localityVarTest() {
        char[] localityChars = myChars; // Read the volatile field once
        int count = 0;
        for (int i = 0; i < localityChars.length; i++) {
            if (localityChars[i] == 'c') {
                count++;
            }
        }
        return count;
    }
}

The final test results are:

As the results show, the local variable version is about 5.02 times faster than the global variable version.

Why local variables are faster than global variables is something we will get to later. Let’s talk about CPU caching first.

In a computer system, the CPU cache is a component that reduces the average time the processor needs to access memory. In the memory hierarchy pyramid it is the second layer from the top, just below the CPU registers, as shown below:

The CPU cache has a much smaller capacity than main memory, but its speed can be close to the processor frequency. When the processor issues a memory access request, it first checks whether the requested data is in the cache. If it is (a hit), the data is returned without accessing memory; if it is not (a miss), the data is loaded from memory into the cache and then returned to the processor.

CPU caches are divided into a level 1 cache (L1) and a level 2 cache (L2), and some high-end CPUs also have a level 3 cache (L3). Technical difficulty and manufacturing cost decrease from L1 to L3, so capacity increases in the same order. When the CPU wants to read a piece of data, it first looks in the L1 cache, then in the L2 cache if that misses, and then in the L3 cache or main memory if it still has not found it.
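
To make the lookup order concrete, here is a minimal Java sketch of the L1, L2, L3, memory chain described above. It is a toy model, not how hardware is implemented: each level is an unbounded map, and the class name CacheHierarchySketch, the read and evictFromL1 methods, and the fill-every-level-on-a-miss policy are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a three-level cache in front of main memory (illustrative only).
class CacheHierarchySketch {
    private final Map<Integer, Integer> l1 = new HashMap<>();
    private final Map<Integer, Integer> l2 = new HashMap<>();
    private final Map<Integer, Integer> l3 = new HashMap<>();
    private final int[] memory; // stands in for main memory

    // Records where the most recent read was served from.
    private String lastSource = "none";

    CacheHierarchySketch(int[] memory) {
        this.memory = memory;
    }

    int read(int address) {
        Integer value = l1.get(address);
        if (value != null) {
            lastSource = "L1";
            return value;
        }
        value = l2.get(address);
        if (value != null) {
            lastSource = "L2";
            l1.put(address, value); // promote to the faster level
            return value;
        }
        value = l3.get(address);
        if (value != null) {
            lastSource = "L3";
            l2.put(address, value);
            l1.put(address, value);
            return value;
        }
        // Miss at every level: load from memory and fill the caches (assumed policy).
        int loaded = memory[address];
        l3.put(address, loaded);
        l2.put(address, loaded);
        l1.put(address, loaded);
        lastSource = "memory";
        return loaded;
    }

    // Hypothetical helper that simulates L1 dropping an entry.
    void evictFromL1(int address) {
        l1.remove(address);
    }

    String lastSource() {
        return lastSource;
    }

    public static void main(String[] args) {
        CacheHierarchySketch cpu = new CacheHierarchySketch(new int[] {10, 20, 30, 40});
        cpu.read(2);
        System.out.println(cpu.lastSource()); // memory
        cpu.read(2);
        System.out.println(cpu.lastSource()); // L1
        cpu.evictFromL1(2);
        cpu.read(2);
        System.out.println(cpu.lastSource()); // L2
    }
}
```

The first read goes all the way to memory, the second is an L1 hit, and after the entry is dropped from L1 the read falls through to L2, which is the fallback order the paragraph describes.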

Here is a comparison of cache and memory response times at each level:

(Image credit: Cenalulu)

As you can see from the figure above, memory is much slower to respond than CPU cache.

Why are local variables fast?

To understand why local variables are faster than global variables, we only need to compile the class with javac and inspect the resulting bytecode with javap, which looks like this:

javap -c VarOptimize
Warning: File ./VarOptimize.class does not contain class VarOptimize
Compiled from "VarOptimize.java"
public class com.example.optimize.VarOptimize {
  char[] myChars;

  public com.example.optimize.VarOptimize();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: aload_0
       5: ldc           #7                  // String Oracle Cloud Infrastructure Low data networking fees and automated migration Oracle Cloud Infrastructure platform is built for enterprises that are looking for higher performance computing with easy migration of their on-premises applications to the Cloud.
       7: invokevirtual #9                  // Method java/lang/String.toCharArray:()[C
      10: putfield      #15                 // Field myChars:[C
      13: return

  public static void main(java.lang.String[]);
    Code:
       0: new           #16                 // class com/example/optimize/VarOptimize
       3: dup
       4: invokespecial #21                 // Method "<init>":()V
       7: astore_1
       8: aload_1
       9: invokevirtual #22                 // Method globalVarTest:()V
      12: aload_1
      13: invokevirtual #25                 // Method localityVarTest:()V
      16: return

  public void globalVarTest();
    Code:
       0: iconst_0
       1: istore_1
       2: iconst_0
       3: istore_2
       4: iload_2
       5: aload_0
       6: getfield      #15                 // Field myChars:[C
       9: arraylength
      10: if_icmpge     33
      13: aload_0
      14: getfield      #15                 // Field myChars:[C
      17: iload_2
      18: caload
      19: bipush        99
      21: if_icmpne     27
      24: iinc          1, 1
      27: iinc          2, 1
      30: goto          4
      33: return

  public void localityVarTest();
    Code:
       0: aload_0
       1: getfield      #15                 // Field myChars:[C
       4: astore_1
       5: iconst_0
       6: istore_2
       7: iconst_0
       8: istore_3
       9: iload_3
      10: aload_1
      11: arraylength
      12: if_icmpge     32
      15: aload_1
      16: iload_3
      17: caload
      18: bipush        99
      20: if_icmpne     26
      23: iinc          2, 1
      26: iinc          3, 1
      29: goto          9
      32: return
}

The key is the getfield instruction, which fetches the value of an instance field from an object on the heap. As the bytecode shows, globalVarTest executes getfield twice on every pass through the loop: once for the length check and once for the element read. localityVarTest executes getfield only once, before the loop, and from then on works on the local variable with aload_1, which reads from the method’s stack frame. Fetching a field from the heap is much slower than reading a local from the stack frame, so using the global variable is much slower than using the local one. I will cover the heap and the stack separately in a later article on JVM optimization.
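
The difference in the number of field reads can be made visible with a small sketch. The code below is not part of the benchmark: the class FieldReadCount, the readField method that stands in for one getfield instruction, and the short sample string are all hypothetical, introduced only to count how often each loop style touches the field.

```java
// Counts simulated field reads for the two loop styles (illustrative only).
class FieldReadCount {
    // Stand-in for the myChars field; the text is a short made-up sample.
    private final char[] myChars = "cloud computing cache".toCharArray();

    // Number of simulated getfield executions.
    private int fieldReads = 0;

    // Plays the role of one getfield instruction.
    private char[] readField() {
        fieldReads++;
        return myChars;
    }

    // Mirrors globalVarTest: the field is read in the loop condition and in the body.
    int countViaField() {
        int count = 0;
        for (int i = 0; i < readField().length; i++) {
            if (readField()[i] == 'c') {
                count++;
            }
        }
        return count;
    }

    // Mirrors localityVarTest: the field is read once, then only the local is used.
    int countViaLocal() {
        char[] local = readField();
        int count = 0;
        for (int i = 0; i < local.length; i++) {
            if (local[i] == 'c') {
                count++;
            }
        }
        return count;
    }

    int fieldReads() {
        return fieldReads;
    }

    public static void main(String[] args) {
        FieldReadCount viaField = new FieldReadCount();
        viaField.countViaField();
        System.out.println(viaField.fieldReads()); // 43 reads for a 21-character array

        FieldReadCount viaLocal = new FieldReadCount();
        viaLocal.countViaLocal();
        System.out.println(viaLocal.fieldReads()); // 1 read
    }
}
```

For an array of n characters the field-based loop performs 2n + 1 reads (n + 1 length checks plus n element reads), while the local-variable loop performs exactly one, which matches the getfield pattern in the bytecode.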

About the cache

Some might argue that it doesn’t matter: a global variable benefits from the CPU cache anyway, so its performance is almost the same as a local variable’s, and either can be used freely.

However, Lei’s advice is not to use a global variable where a local variable will do, because there are three problems with CPU caching:

The CPU cache evicts entries using LRU and random replacement algorithms, so infrequently used entries and randomly chosen entries get removed. What if the evicted entry happens to be the global variable you are using?
The CPU cache has a hit ratio, that is, there is always some probability that an access misses the cache;
Some CPUs have only two levels of cache (L1 and L2), so the available space is limited.

To sum up, we can’t entrust the performance of a program to hardware behavior that is not fully predictable, so use local variables wherever you can instead of global variables.
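
The LRU eviction mentioned in the first point can be sketched in a few lines of Java. This is a software analogy only, not how a CPU implements replacement: the class name LruCacheSketch, the capacity of 2, and the keys are assumptions, and the sketch relies on the access-order mode of java.util.LinkedHashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A tiny LRU cache built on LinkedHashMap's access-order mode (illustrative only).
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCacheSketch(int capacity) {
        super(16, 0.75f, true); // true = order entries by access, not by insertion
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, Integer> cache = new LruCacheSketch<>(2);
        cache.put("myChars", 1);
        cache.put("other", 2);
        cache.get("myChars");   // touch myChars, so "other" becomes least recently used
        cache.put("third", 3);  // exceeds the capacity and evicts "other"
        System.out.println(cache.keySet()); // prints [myChars, third]
    }
}
```

An entry survives only as long as it keeps being used more recently than its neighbors, which is why a rarely touched value can drop out of a cache at an inconvenient moment.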

The take-home message: Write code that works for you, balancing performance, readability, and usability.

Conclusion

Global variables are fetched from the heap with the getfield instruction, while local variables are read from the stack, and stack access is faster than heap access. So prefer local variables over global variables where you can.

The battle between masters is all about the details.

