Question origin

This question comes from a Bilibili video I watched (https://www.bilibili.com/video/BV12p4y1W7Dz), which raised a question about CPUs that I had never considered before and that got me interested. The specific question is as follows:

Which of the following two pieces of code, fnc1 and fnc2 (I have simplified them here; they have the same effect), executes faster?

const N = 2000
var arr2D [N][N]int

func fnc1() {
  for i := 0; i < N; i++ {
    for j := 0; j < N; j++ {
      arr2D[i][j] = 0
    }
  }
}

func fnc2() {
  for i := 0; i < N; i++ {
    for j := 0; j < N; j++ {
      arr2D[j][i] = 0
    }
  }
}

The final answer is that fnc1 is faster than fnc2. Here I combine the two into a single piece of code:

package main

import (
  "fmt"
  "time"
)

const N = 2000

func main() {
  var arr2D1 [N][N]int
  var arr2D2 [N][N]int

  // fnc1: row-major traversal
  start1 := time.Now().UnixNano()
  for i := 0; i < N; i++ {
    for j := 0; j < N; j++ {
      arr2D1[i][j] = 0
    }
  }
  ct1 := time.Now().UnixNano() - start1

  // fnc2: column-major traversal
  start2 := time.Now().UnixNano()
  for i := 0; i < N; i++ {
    for j := 0; j < N; j++ {
      arr2D2[j][i] = 0
    }
  }
  ct2 := time.Now().UnixNano() - start2

  fmt.Println("fnc1 time :", ct1)
  fmt.Println("fnc2 time :", ct2)
}

In most runs, the output time of fnc1 is smaller than that of fnc2. (Occasionally fnc1 comes out larger, when the CPU happens to be busy with other processes at the time.)

$ go run main.go
fnc1 time : 3499675
fnc2 time : 10823762
$ go run main.go
fnc1 time : 3475027
fnc2 time : 307537210
...
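By the way, one-off timings with UnixNano are noisy; Go's built-in benchmark tooling gives steadier numbers. Here is a minimal sketch, assuming fnc1 and fnc2 are defined as above in the same package (save it as main_test.go and run go test -bench=.):

package main

import "testing"

func BenchmarkFnc1(b *testing.B) {
  for n := 0; n < b.N; n++ {
    fnc1() // row-major traversal: arr2D[i][j]
  }
}

func BenchmarkFnc2(b *testing.B) {
  for n := 0; n < b.N; n++ {
    fnc2() // column-major traversal: arr2D[j][i]
  }
}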

At the time I was puzzled: the time complexity of the two functions is exactly the same, so what causes the difference? This brings us to today's topic, the CPU cache.

Introduction to CPU Architecture

When we buy a computer, we usually look at how much RAM it has, and we also know that one of the reasons Redis is so fast is that it is an in-memory database. It can seem as if the CPU manipulates memory directly and memory is the fastest storage there is. But compared with main memory, there is a much faster storage medium: the CPU cache.

First, let's look at the hierarchical structure of a computer's memory, in the image taken from Computer Systems: A Programmer's Perspective.

All modern computers use this storage structure: the closer a layer is to the CPU, the more expensive it is and the less capacity it has. Registers are the fastest; they sit closest to the CPU core, store the least data, and typically hold things like function arguments. Then come the L1 to L3 caches, the focus of today's session, which act as buffers between the CPU and main memory. After that comes the most familiar layer, main memory, which is what we usually just call "memory"; and finally the disks and remote file storage systems.

Zooming in on the CPU's storage architecture, here is a common dual-core CPU architecture diagram:

A brief introduction to each CPU component:

Control Unit (CU): the CPU's command center. Following the compiled program, it fetches instructions from memory and directs the work of the rest of the CPU.

Arithmetic Logic Unit (ALU): performs arithmetic and logical operations; all of its operations are directed by the control unit.

Registers: sit right next to the control unit and the ALU; they are the fastest storage but also the smallest. Most registers on a 32-bit CPU can store 4 bytes, while most registers on a 64-bit CPU can store 8 bytes.

L1 cache: each CPU core has its own L1 cache. The L1 cache is actually split into an instruction cache (i-cache) and a data cache (d-cache); the two are separated because they have different update policies. An access takes 2 to 4 clock cycles. On Linux, the sizes can be read with the following commands:

# data cache (d-cache)
$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
# instruction cache (i-cache)
$ cat /sys/devices/system/cpu/cpu0/cache/index1/size

L2 cache: each core also has an L2 cache, which is larger than the L1 cache but slower because it is farther from the CPU core; an access takes 10 to 20 clock cycles.

$ cat /sys/devices/system/cpu/cpu0/cache/index2/size

L3 cache: the L3 cache is shared by all cores. It is larger and slower to access than the L2 cache; an access takes 20 to 60 clock cycles. (All three cache sizes can also be read programmatically; see the Go sketch after this list.)

$ cat /sys/devices/system/cpu/cpu0/cache/index3/size
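As a convenience, here is a small Go sketch that walks cpu0's cache indexes and prints each cache's level, type, and size. It assumes the Linux sysfs layout used in the commands above and simply stops if the files are absent:

package main

import (
  "fmt"
  "os"
  "path/filepath"
  "strings"
)

func main() {
  // Walk cpu0's cache indexes (index0, index1, ... in sysfs).
  base := "/sys/devices/system/cpu/cpu0/cache"
  for i := 0; ; i++ {
    dir := filepath.Join(base, fmt.Sprintf("index%d", i))
    level, err := os.ReadFile(filepath.Join(dir, "level"))
    if err != nil {
      break // no more cache indexes (or not a Linux sysfs system)
    }
    typ, _ := os.ReadFile(filepath.Join(dir, "type"))
    size, _ := os.ReadFile(filepath.Join(dir, "size"))
    fmt.Printf("L%s %s: %s\n",
      strings.TrimSpace(string(level)),
      strings.TrimSpace(string(typ)),
      strings.TrimSpace(string(size)))
  }
}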

CPU fetching process

When the CPU needs to read some data, it first checks the registers; if the data is there, it is used directly. Otherwise it looks in the L1 cache; on a miss it looks in the L2 cache, then in the L3 cache; and if the data is in none of them, it is finally fetched from main memory.
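To make that fall-through order concrete, here is a toy model in Go. It is purely illustrative: registers are omitted, the cycle costs are stand-ins picked from the ranges quoted above, and real hardware does all of this transparently.

package main

import "fmt"

// level is one layer of the cache hierarchy in this toy model.
type level struct {
  name   string
  cycles int
  data   map[int]int // address -> value
}

func main() {
  mem := map[int]int{42: 7} // main memory holds everything; caches start empty
  levels := []level{
    {"L1", 4, map[int]int{}},
    {"L2", 14, map[int]int{}},
    {"L3", 40, map[int]int{}},
  }

  read := func(addr int) (val, cost int) {
    for _, lv := range levels {
      cost += lv.cycles
      if v, ok := lv.data[addr]; ok {
        return v, cost // hit at this level
      }
    }
    cost += 100 // assumed main-memory latency
    v := mem[addr]
    for _, lv := range levels {
      lv.data[addr] = v // fill the caches on the way back
    }
    return v, cost
  }

  _, c1 := read(42)
  _, c2 := read(42)
  fmt.Println("first read cost :", c1, "cycles") // misses every level
  fmt.Println("second read cost:", c2, "cycles") // L1 hit
}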

The data in the CPU cache is a copy of data in main memory, and it is not read one element at a time but block by block: when the CPU fetches even a small piece of data, the whole surrounding memory block is loaded into the cache, so that subsequent accesses can be served directly from the CPU cache. A block of this size is called a "Cache Line". On Linux you can see its size with the following command:

# cache line size of cpu0's L1 d-cache; on my machine it is 64 bytes
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
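With 64-byte lines and 8-byte ints, one cache line holds 8 consecutive array elements; this is exactly the 0_0 through 0_7 span used in the analysis below. A quick check:

package main

import (
  "fmt"
  "unsafe"
)

func main() {
  const cacheLineSize = 64 // bytes, as reported by sysfs above
  // On a 64-bit machine an int occupies 8 bytes, so one cache line
  // holds 64 / 8 = 8 consecutive elements of an int array.
  fmt.Println(cacheLineSize / unsafe.Sizeof(int(0))) // 8
}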

Back to the example

Returning to the example above: we first define a two-dimensional array, which is stored contiguously in memory like this:

var arr [2][2]int

It is laid out contiguously in memory, row by row, in the order 0_0, 0_1, 1_0, 1_1 (writing i_j for arr[i][j]), with each element occupying 8 bytes.

We can verify that by printing the addresses:

package main

import (
  "fmt"
  "unsafe"
)

func main() {
  var arr [2][2]int
  fmt.Println(uintptr(unsafe.Pointer(&arr[0][0])))
  fmt.Println(uintptr(unsafe.Pointer(&arr[0][1])))
  fmt.Println(uintptr(unsafe.Pointer(&arr[1][0])))
  fmt.Println(uintptr(unsafe.Pointer(&arr[1][1])))
}
-----------
824634298120
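If you run this yourself the exact addresses will differ, but the four printed values should increase by 8 each, the size of an int on a 64-bit machine, confirming the row-by-row contiguous layout.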

So, in our case, when we iterate in i_j order, fetching the first element 0_0 loads a 64-byte cache line from memory, i.e., 0_0 through 0_7 all land in the CPU cache. When the next iteration reaches 0_1, its data is already in the CPU cache and is used directly. This takes full advantage of the CPU cache.

const N = 2000
var arr2D [N][N]int

func fnc1() {
  for i := 0; i < N; i++ {
    for j := 0; j < N; j++ {
      arr2D[i][j] = 0
    }
  }
}

However, when the assignments are made in j_i order, the second iteration fetches 1_0, which is not in the cache line just loaded, so the CPU has to load another block from main memory into the cache. This kind of cache miss happens on nearly every access, which naturally hurts performance. That is why fnc2 takes longer than fnc1.
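For contrast, here is fnc2 again with the access stride annotated (the 16000-byte figure assumes N = 2000 and 8-byte ints, as above):

func fnc2() {
  for i := 0; i < N; i++ {
    for j := 0; j < N; j++ {
      // Consecutive iterations write arr2D[j][i] and then arr2D[j+1][i],
      // which are N*8 = 16000 bytes apart -- far beyond a single 64-byte
      // cache line, so almost every write misses the cache.
      arr2D[j][i] = 0
    }
  }
}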

Further reading

  • How to write code that makes the CPU execute faster
  • Link to “Developing Inner Gong Cultivation”
  • Chapter 6, "The Memory Hierarchy", in Computer Systems: A Programmer's Perspective