Introduction

It is better to travel ten thousand miles than to read ten thousand books. We have talked a lot about the principles and optimizations of assembly and the JVM. Today we are going to do something different: explore how assembly can be used to understand a problem we couldn't explain before.

A strange phenomenon

Brother F, you've talked so much about JIT compilation and performance optimization in the JVM. Seriously, do we really need to know these things for our work? What does this stuff actually do for us?

Um… that's a good question. Knowing how JIT compilation and its optimizations work can help you write better code if you pay a little more attention, but that's just at the micro level.

At enterprise scale, a hardware upgrade, an added cache, or an architectural change can be far more useful than a small code optimization.

For example, if our project has a performance problem, our first instinct is to look for flaws in the architecture and opportunities to optimize it; we rarely go down to the code level to see whether the code itself has room for optimization.

First, as long as the business logic of the code is sound, it won't run too slowly.

Second, the benefits of code optimization are too small and the amount of work is too large.

So is this kind of marginal optimization really necessary?

In fact, this is like studying physics, chemistry, and mathematics: you learn so much knowledge that you rarely use in daily life. So why study it?

I think there are two reasons. One is to give you a more fundamental understanding of the world and how it works. The other is to train your habits of thinking and teach you how to solve problems.

Take algorithms: do you really need them to write a program nowadays? Not really, but algorithms still matter, because they shape your habits of thinking.

So understanding how the JVM works, or even assembly, is not about using them to optimize your code right away; it's about knowing, oh, so this is how the code actually runs. At some point in the future, maybe I can use that.

All right, back to the point. Today I showed a very strange example to my junior colleague:

private static int[] array = new int[64 * 1024 * 1024];

    @Benchmark
    public void test1() {
        int length = array.length;
        for (int i = 0; i < length; i = i + 1)
            array[i]++;
    }

    @Benchmark
    public void test2() {
        int length = array.length;
        for (int i = 0; i < length; i = i + 2)
            array[i]++;
    }

Which of the two examples above do you think runs faster?

Of course it's the second one. It steps by 2 each time, so the loop runs half as many times and must execute faster.

Okay, we’ll take it with a grain of salt.

Second question: above we stepped by +1 and +2. If we keep going with +3, +4, all the way up to +128, what do you think happens to the running time?

It must be a linear decrease.

Well, that’s two questions, so let’s get the answers.


The answers to the two questions

Again, we use JMH to test our code. The full harness is quite long, so I won't list it all here; if you're interested, you can download the runnable code from the link below.
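That said, a minimal sketch of the harness skeleton helps when reading the numbers. The average-time mode, millisecond units, and 5 measurement iterations are read off the results table below; the warmup and fork settings are assumptions:

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @Warmup(iterations = 5)       // assumed
    @Measurement(iterations = 5)  // matches Cnt = 5 in the results
    @Fork(1)                      // assumed
    public class CachelineUsage {

        private static int[] array = new int[64 * 1024 * 1024];

        // test2 ... test128 are identical except for the loop step.
        @Benchmark
        public void test1() {
            int length = array.length;
            for (int i = 0; i < length; i = i + 1)
                array[i]++;
        }
    }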

Let’s run the results directly:

Benchmark               Mode  Cnt   Score   Error  Units
CachelineUsage.test1    avgt    5  27.499 ± 4.538  ms/op
CachelineUsage.test2    avgt    5  31.062 ± 1.697  ms/op
CachelineUsage.test3    avgt    5  27.187 ± 1.530  ms/op
CachelineUsage.test4    avgt    5  25.719 ± 1.051  ms/op
CachelineUsage.test8    avgt    5  25.945 ± 1.053  ms/op
CachelineUsage.test16   avgt    5  28.804 ± 0.772  ms/op
CachelineUsage.test32   avgt    5  21.191 ± 6.582  ms/op
CachelineUsage.test64   avgt    5  13.554 ± 1.981  ms/op
CachelineUsage.test128  avgt    5   7.813 ± 0.302  ms/op

Well, it’s not intuitive, so let’s use a graph:

As the chart shows, the execution time is relatively stable, around 25 to 31 ms/op, when the step size is between 1 and 16, and then drops as the step size increases further.

CPU cache line

So, to answer the second question first: the execution time levels off first and then falls; it does not decrease linearly.

Why is the curve so flat for steps 1 through 16?

Main memory can't keep up with the CPU. To speed up memory access, modern CPUs have something called the CPU cache.

This CPU cache is divided into an L1 cache, an L2 cache, and often even an L3 cache.

The L1 cache is private to each CPU core. Within the cache there is another concept called a cache line. To speed up processing, the CPU reads one whole cache line at a time, not a single byte.

How do we check the size of a cache line?

On a Mac, we can run: sysctl machdep.cpu

On my machine, the CPU cache line is 64 bytes, and the reported per-core cache size is 256 KB.
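The exact keys and values vary by machine, but on an Intel Mac the relevant lines of the output look something like this (values are illustrative; note that the cache size is reported in KB):

    machdep.cpu.cache.linesize: 64
    machdep.cpu.cache.size: 256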

Okay, now back to the question of why steps 1 to 16 take roughly the same time.

We know that an int takes up 4 bytes, so 16 ints take up exactly 64 bytes, exactly one cache line. Therefore, for any step from 1 to 16, every cache line of the array still has to be loaded, so the CPU fetches the same amount of data in each case. That is why their execution speeds are about the same.
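As a quick sanity check, here is the arithmetic in code form (the 64-byte line size is taken from the sysctl output above):

    public class CacheLineMath {
        public static void main(String[] args) {
            int cacheLineBytes = 64;                          // machdep.cpu.cache.linesize
            int intsPerLine = cacheLineBytes / Integer.BYTES; // 64 / 4 = 16
            System.out.println(intsPerLine);
            // Any step <= 16 still touches every cache line of the array,
            // so the memory traffic is the same; larger steps skip whole
            // cache lines, and that is when the time starts to drop.
        }
    }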

inc and add

Brother F, the explanation above is a bit too perfect; there seems to be a loophole. If steps 1 through 16 all touch the same cache lines, their execution times should, if anything, decrease gradually as the loop runs fewer iterations. So why does step 2 take longer than step 1?

This is a really good question, and it doesn't seem to be explained by the code or the cache line, so let's look at it from the assembly perspective.

Again using JMH, we turn on the PrintAssembly option and look at the output.
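For reference, one way to do this (a sketch; it assumes the benchmarks are packaged as the usual JMH uber-jar benchmarks.jar, and that the hsdis disassembler library is installed, since PrintAssembly needs it):

    java -jar benchmarks.jar CachelineUsage.test1 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"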

Let's look at the output of the test1 method, and then at the output of the test2 method (the two disassembly listings are shown as screenshots in the original post).

What's the difference?

The basic structure is the same, except that test1 uses inc and test2 uses add.
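Since the screenshots aren't reproduced here, the difference boils down to something like the following (an illustrative sketch in Intel syntax, not the actual JIT output; register choices and the surrounding loop code will differ):

    ; test1: loop counter stepped by 1
    inc esi        ; i = i + 1
    ; test2: loop counter stepped by 2
    add esi, 2     ; i = i + 2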

I'm not that familiar with assembly, but my guess is that the difference comes down to inc versus add: add takes an extra immediate operand, which might make it slightly slower.

Conclusion

Assembly isn't something you'll use often, but it is very useful for explaining mysterious phenomena like this one.

The examples for this article: github.com/ddean2009/l…

Author: Flydean (Program Stuff)

Link to this article: www.flydean.com/jvm-jit-cac…

Source: Flydean’s blog

Welcome to follow my official account, Program Stuff, where more great content awaits you!