Translated from: http://igoro.com/archive/gall… Also refer to this article: http://wsfdl.com/linux/2016/0…

The original code is C#; I converted part of it to Go. My computer is a 2018 Mac Pro.

Example 1: Memory access performance

package main

import (
    "fmt"
    "time"
)

// clearCache touches a separate 8 MB array so the test array's data is
// pushed out of the caches between measurements.
func clearCache() {
    var arr [1024 * 1024]int
    for i := range arr {
        arr[i]++
    }
    time.Sleep(time.Second)
}

func main() {
    var arr [4 * 1024 * 1024]int
    for i := range arr {
        arr[i]++
    }

    clearCache()
    // Loop 1: touch every element.
    tt := time.Now()
    for i := 0; i < len(arr); i++ {
        arr[i] *= 3
    }
    fmt.Println("Loop 1", time.Since(tt))

    clearCache()
    // Loop 2: touch every 8th element (one per 64-byte cache line).
    tt = time.Now()
    for i := 0; i < len(arr); i += 8 {
        arr[i] *= 3
    }
    fmt.Println("Loop 2", time.Since(tt))
}

The results:

Loop 1 4.956348ms
Loop 2 3.191618ms

The reason the two loops take roughly the same amount of time has to do with memory: their running time is dominated by the memory accesses to the array, not by the integer multiplications. And, as Example 2 will show, the hardware performs the same main-memory accesses for both loops.

Example 2: Impact of cache lines

Let’s explore this example in more depth. We will try other step values, not just 1 and 16.
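The measurement loop itself is not shown above; a minimal Go sketch of such a test might look like this (the array size and the list of step values are my own choices, and with Go’s 8-byte int a step of 8 already spans one 64-byte cache line):

package main

import (
    "fmt"
    "time"
)

func main() {
    // 64 MB worth of 8-byte ints; too large to fit in any cache level.
    arr := make([]int, 8*1024*1024)

    // Try several step sizes. With 8-byte ints, a step of 8 spans one
    // 64-byte cache line, so steps 1 through 8 touch every cache line.
    for _, step := range []int{1, 2, 4, 8, 16, 32, 64, 128} {
        start := time.Now()
        for i := 0; i < len(arr); i += step {
            arr[i] *= 3
        }
        fmt.Printf("step %4d: %v\n", step, time.Since(start))
    }
}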

Here are the running times of this loop for different step values (K). (My machine is 64-bit, so with 8-byte ints the exponential drop starts at a smaller step than in the original 4-byte-int C# version.)

Note that while the step is in the range of 1 to 16, the running time of the for loop barely changes. But starting at 16, each time the step doubles, the running time roughly halves.

The reason behind this is that today’s CPUs do not access memory byte by byte. Instead, they fetch memory in (typically) 64-byte chunks called cache lines. When you read a particular memory location, the entire cache line is fetched from main memory into the cache, and accessing other values from that same cache line is cheap!

Since 16 (4-byte) integers occupy 64 bytes (one cache line), for loops with steps between 1 and 16 must touch the same number of cache lines: all of the cache lines in the array. But once the step is 32, we touch only about every other cache line, and once it is 64, only about every fourth.
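In the Go version the boundary falls at a step of 8 rather than 16, since a Go int is 8 bytes on a 64-bit machine; a quick sketch of the arithmetic:

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    const cacheLine = 64 // bytes, as reported by hw.cachelinesize below
    var x int
    intSize := int(unsafe.Sizeof(x)) // 8 on a 64-bit platform
    // 64 / 8 = 8 ints per cache line, hence the step of 8 in the Go code above.
    fmt.Println("ints per cache line:", cacheLine/intSize)
}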

Understanding cache lines is important for certain kinds of program optimization. For example, the alignment of data can determine whether an operation touches one or two cache lines. As the example above shows, this can easily mean that the unaligned case is twice as slow.

Example 3: L1 and L2 cache sizes

Today’s computers come with two or three levels of cache, commonly referred to as L1, L2, and L3. If you want to know the sizes of the different caches, you can use the CoreInfo SysInternals tool or the GetLogicalProcessorInformation API call on Windows, or sysctl on macOS. Besides the cache sizes, these tools also report the cache line size.

On my machine, sysctl reports a 32 KB L1 data cache, a 32 KB L1 instruction cache, a 256 KB L2 cache, and a 4 MB L3 cache. The L1 caches are per-core, while the last-level cache is shared between cores:

$ sysctl -a

hw.pagesize: 4096
hw.pagesize32: 4096

hw.cachelinesize: 64
hw.l1icachesize: 32768
hw.l1dcachesize: 32768
hw.l2cachesize: 262144
hw.l3cachesize: 4194304

Let’s verify these numbers experimentally. To do this, we’ll traverse an array, incrementing every 16th integer, which is a cheap way to modify every cache line. When we reach the last value, we loop back to the beginning. We will try different array sizes, and we should see drops in performance at the sizes where the array no longer fits in a given cache level.

Here is the program:

int steps = 64 * 1024 * 1024; // Arbitrary number of steps
int lengthMod = arr.Length - 1;
for (int i = 0; i < steps; i++)
{
    arr[(i * 16) & lengthMod]++; // (x & lengthMod) equals (x % arr.Length) because arr.Length is a power of two
}
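Since part of my code is in Go, here is a rough Go equivalent of the same test. This is only a sketch: the slice sizes below are my own choices, and the step of 8 corresponds to one 64-byte cache line with 8-byte ints.

package main

import (
    "fmt"
    "time"
)

// touchEveryCacheLine bumps one int per 64-byte cache line (8-byte ints,
// so a step of 8), wrapping around the slice, for a fixed number of steps.
func touchEveryCacheLine(arr []int, steps int) time.Duration {
    lengthMod := len(arr) - 1 // valid because len(arr) is a power of two
    start := time.Now()
    for i := 0; i < steps; i++ {
        arr[(i*8)&lengthMod]++ // (x & lengthMod) == x % len(arr)
    }
    return time.Since(start)
}

func main() {
    const steps = 64 * 1024 * 1024 // arbitrary number of steps
    // Slice sizes from 16 KB to 64 MB, all powers of two.
    for size := 2 * 1024; size <= 8*1024*1024; size *= 2 {
        arr := make([]int, size)
        fmt.Printf("%6d KB: %v\n", size*8/1024, touchEveryCacheLine(arr, steps))
    }
}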

Example 4: Instruction-level parallelism

Now let’s look at something different. Which of these two loops would you expect to run faster?

int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i=0; i<steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i=0; i<steps; i++) { a[0]++; a[1]++; }

The second loop turned out to be about twice as fast as the first, at least on all of the machines the original author tested. Why is that? It has to do with the dependencies between the operations in the two loop bodies. (On my local machine, the two loops ran equally fast.)

In the first loop body, the operations depend on each other as follows:

    a[0]++ → a[0]++ → a[0]++ → ...  (each increment must wait for the previous one)
But in the second example, we only have these dependencies:

    a[0]++ → a[0]++ → ...
    a[1]++ → a[1]++ → ...  (two independent chains)
Modern processors have a bit of parallelism in their various parts: they can access two memory locations in L1 at the same time, or perform two simple arithmetic operations. In the first loop the processor cannot exploit this instruction-level parallelism, but in the second loop it can.

[Update]: A lot of people on Reddit asked about compiler optimizations, and whether { a[0]++; a[0]++; } gets optimized to { a[0] += 2; }. In fact, the C# compiler and the CLR JIT do not perform this optimization when array accesses are involved. I built all the tests in release mode (that is, with optimizations enabled), but I inspected the JITted assembly to verify that optimizations were not distorting the results.
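For reference, a rough Go translation of the two loops is below. It is only a sketch: unlike the CLR JIT described above, the Go compiler is free to fold the two a[0]++ increments into a single add, which would hide the dependency effect and could explain why both loops ran equally fast on my local machine.

package main

import (
    "fmt"
    "time"
)

func main() {
    const steps = 256 * 1024 * 1024
    a := make([]int, 2)

    // Loop 1: both increments target a[0], so each one depends on the
    // result of the previous one.
    start := time.Now()
    for i := 0; i < steps; i++ {
        a[0]++
        a[0]++
    }
    fmt.Println("Loop 1", time.Since(start))

    // Loop 2: the increments target different elements, so the CPU can
    // execute the two chains in parallel.
    start = time.Now()
    for i := 0; i < steps; i++ {
        a[0]++
        a[1]++
    }
    fmt.Println("Loop 2", time.Since(start))
}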

Example 5: Cache associativity

A key decision in cache design is whether each block of main memory can be stored in any cache slot, or only in some of these slots.

There are three possible ways to map memory blocks to cache slots:

  • Direct mapped cache: a given piece of data can live in only one fixed location in the cache.
  • N-way set associative cache: a given piece of data can live in any of N locations in the cache.
  • Fully associative cache: a given piece of data can live anywhere in the cache.

From a hardware point of view, a direct mapped cache is easy to design, while a fully associative cache is complicated and becomes very expensive as the cache grows. However, in a direct mapped cache each piece of data has exactly one fixed slot, so collisions are common, which lowers the cache hit rate and hurts performance. Current CPUs therefore use N-way set associative caches, usually with 4, 8, or 16 ways, as a trade-off between cost and performance.

For example, a 32 KB cache with 64-byte cache lines leads to a different hardware layout under each of these mapping policies; the referenced article illustrates each one with a diagram.
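As a rough illustration of how the N-way mapping works, the sketch below computes which set a given address falls into for a 32 KB cache with 64-byte lines. The 8-way associativity is an arbitrary example value, not a measurement of my machine.

package main

import "fmt"

// cacheSet returns the set index an address maps to in an N-way set
// associative cache: the cache line number modulo the number of sets.
func cacheSet(addr, cacheSize, lineSize, ways uint64) uint64 {
    numSets := cacheSize / (lineSize * ways)
    return (addr / lineSize) % numSets
}

func main() {
    const (
        cacheSize = 32 * 1024 // 32 KB
        lineSize  = 64        // bytes
        ways      = 8         // example associativity, not a measured value
    )
    // Addresses that differ by cacheSize/ways = 4096 bytes land in the
    // same set and compete for the same 8 slots.
    for _, addr := range []uint64{0, 64, 4096, 8192} {
        fmt.Printf("addr %5d -> set %d\n", addr, cacheSet(addr, cacheSize, lineSize, ways))
    }
}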

public static long UpdateEveryKthByte(byte[] arr, int K)
{
    Stopwatch sw = Stopwatch.StartNew();
    const int rep = 1024 * 1024; // Number of iterations -- arbitrary

    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;
        p += K;
        if (p >= arr.Length)
            p = 0;
    }

    sw.Stop();
    return sw.ElapsedMilliseconds;
}

This method increments every K-th byte in the array, wrapping back to the beginning once it reaches the end, and stops after running long enough (2^20 increments).
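A rough Go equivalent of this routine, in case you want to reproduce the experiment (the byte-slice type and the 8 MB example array are my own choices):

package main

import (
    "fmt"
    "time"
)

// updateEveryKthByte bumps every K-th byte of arr, wrapping back to the
// start, for 2^20 iterations, and returns the elapsed time.
func updateEveryKthByte(arr []byte, k int) time.Duration {
    const rep = 1024 * 1024 // number of iterations -- arbitrary
    start := time.Now()
    p := 0
    for i := 0; i < rep; i++ {
        arr[p]++
        p += k
        if p >= len(arr) {
            p = 0
        }
    }
    return time.Since(start)
}

func main() {
    arr := make([]byte, 8*1024*1024) // e.g. an 8 MB array
    fmt.Println(updateEveryKthByte(arr, 512))
}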

I ran UpdateEveryKthByte() with different array sizes (in 1 MB increments) and different step sizes. Here are the results, with blue for long running times and white for short ones:

The blue regions (long running times) are where the updated values cannot all be kept in the cache at the same time as we iterate over them repeatedly. The deep blue regions correspond to roughly 80 milliseconds of running time, the nearly white regions to roughly 10 milliseconds.

Let’s explain the blue part of the chart:

  1. Why are there vertical lines? The vertical lines mark step values that touch too many memory locations from the same cache set (more than 16). For those steps, the 16-way set associative cache on my machine cannot hold all of the touched values at the same time.

Some of the bad step values are powers of two: 256 and 512. For example, consider a step of 512 on an 8 MB array. With a 4 MB, 16-way cache, cache lines whose addresses differ by a multiple of 4 MB / 16 = 262,144 bytes compete for the same set. The 8 MB array contains 32 values spaced exactly 262,144 bytes apart, and every pass of our loop updates all of them, because 512 divides 262,144 evenly.

Since 32 > 16, these 32 values keep competing for the same 16 slots in the cache (the short sketch after this list counts them).

Some step values that are not powers of two are simply unlucky and end up accessing a disproportionate number of values from the same set; they show up as blue lines as well.

  2. Why do the vertical lines stop at an array length of 4 MB? For arrays of 4 MB or smaller, a 16-way set associative cache is just as good as a fully associative cache.

A 16-way set associative cache can hold up to 16 cache lines whose addresses are multiples of 262,144 bytes apart. Within 4 MB there is no group of 17 or more cache lines all aligned on 262,144-byte boundaries, because 16 * 262,144 = 4,194,304 bytes = 4 MB.

  3. Why is there a blue triangle in the upper left corner? In the triangle region we cannot keep all the necessary data in the cache at the same time, not because of associativity, but simply because of the cache size limit.

For example, consider the 16 MB array with a step of 128. We repeatedly update every 128th byte of the array, which means we touch every other 64-byte memory block. Storing every other cache line of a 16 MB array would require 8 MB of cache, but my machine only has 4 MB.

Even if the 4 MB cache on my machine were fully associative, it still couldn’t hold 8 MB of data.

  4. Why does the triangle fade out toward the left? Notice the gradient from a step of 0 to 64 bytes, one cache line! As Examples 1 and 2 showed, additional accesses to the same cache line are nearly free. For example, with a step of 16 bytes, it takes four steps to reach the next cache line, so we get four memory accesses for the price of one.

Since the number of steps is the same in all cases, a cheaper step results in a shorter run time.
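To make the arithmetic concrete, the small sketch below counts how many of the offsets touched in an 8 MB array with a step of 512 fall into the same 262,144-byte residue class, and therefore into the same cache set. The 4 MB, 16-way figures are the ones used above.

package main

import "fmt"

func main() {
    const (
        arrLen  = 8 * 1024 * 1024 // 8 MB array
        step    = 512             // one of the problematic step sizes
        setSpan = 262144          // 4 MB cache / 16 ways: offsets this far apart share a set
        ways    = 16
    )
    // Count the touched offsets that fall into the same residue class
    // modulo setSpan -- they all compete for the same 16-slot set.
    conflicts := 0
    for p := 0; p < arrLen; p += step {
        if p%setSpan == 0 {
            conflicts++
        }
    }
    fmt.Printf("offsets mapping to one set: %d (the cache has only %d ways)\n", conflicts, ways)
}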

As you extend the diagram, these patterns continue:

Cache associativity is interesting to understand and can certainly be demonstrated, but it tends to be less of a problem than the other issues discussed in this article. It certainly shouldn’t be at the forefront of your mind when you write programs.

Example 6: False cache line sharing

On multicore machines, caches run into another problem: coherence. Different cores have completely or partially separate caches. On my machine the L1 caches are per-core (which is very common), and there are two pairs of cores, each pair sharing an L2 cache. While the details vary, modern multicore machines have a multi-level cache hierarchy in which the faster, smaller caches belong to individual cores.

When one core modifies a value in its cache, the other cores can no longer use the old value: that memory location is invalidated in all the other caches. And because caches operate at the granularity of cache lines rather than individual bytes, the entire cache line is invalidated in all of those caches!

To demonstrate this, consider the following example:

private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

On my quad-core machine, if I call UpdateCounter from four different threads with the parameters 0, 1, 2, and 3, it takes 4.3 seconds for all the threads to finish.

On the other hand, if I call UpdateCounter with the parameters 16, 32, 48, and 64, the operation completes in 0.28 seconds!

Why is that? In the first case, all four values are very likely to end up on the same cache line. Each time a core increments its counter, it invalidates the cache line that holds all four counters, and every other core takes a cache miss the next time it accesses its own counter. This kind of thread behavior effectively disables caching and cripples the program's performance.
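Here is a rough Go sketch of the same experiment using goroutines. The counter positions (0, 1, 2, 3 versus 16, 32, 48, 64) mirror the C# version; the exact timings will of course differ.

package main

import (
    "fmt"
    "sync"
    "time"
)

var counters [1024]int64

// updateCounter repeatedly increments a single slot of the shared array.
func updateCounter(position int, wg *sync.WaitGroup) {
    defer wg.Done()
    for j := 0; j < 100000000; j++ {
        counters[position] += 3
    }
}

// run launches one goroutine per position and waits for them all.
func run(positions []int) time.Duration {
    var wg sync.WaitGroup
    start := time.Now()
    for _, p := range positions {
        wg.Add(1)
        go updateCounter(p, &wg)
    }
    wg.Wait()
    return time.Since(start)
}

func main() {
    // Positions 0..3 are 8-byte counters within one 64-byte cache line,
    // so the four goroutines keep invalidating each other's cached copy.
    fmt.Println("adjacent counters:  ", run([]int{0, 1, 2, 3}))
    // Positions 16, 32, 48 and 64 are 128 bytes apart, on separate lines.
    fmt.Println("spread-out counters:", run([]int{16, 32, 48, 64}))
}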

Example 7: Hardware complexity