I recently read a great article on Medium called “Go and CPU Caches” at teivah.medium.com/go-and-cpu-…

CPU cache system

Most modern computer processor architectures use symmetric multiprocessing (SMP). In this design, each core is treated as a separate processor, and all processors share the same main memory under a single operating system.

To speed up memory access, processors have multiple levels of cache: L1, L2, and L3. The exact architecture varies by vendor, processor model, and so on. The most common arrangement today embeds the L1 and L2 caches locally in each CPU core, while the L3 cache is shared across cores.

A CPU usually contains multiple cores. Each core has its own L1 and L2 caches, and the L1 cache is split into a data cache (dCache) and an instruction cache (iCache).

The closer the cache is to the CPU core, the smaller the capacity, but the lower the access latency.

Of course, the exact numbers vary by processor model. The conclusion, however, is clear: the processor accesses the L1 cache far faster than it accesses main memory directly, by a factor of at least several tens.

When the CPU reads data from main memory into the cache, it does not read a single byte at a time. Instead, it copies a contiguous block of memory; this unit of transfer is called a cache line. The rationale is the principle of locality.

Temporal locality: If an item of information is being accessed, it is likely to be accessed again in the near future.

Spatial locality: Information to be used in the near future is likely to be spatially adjacent to information being used now.

The L1 cache line size is generally 64 bytes, and the L2 and L3 cache line sizes are greater than or equal to the L1 cache line size, usually no more than twice as large. In addition, cache lines must be smaller than a memory page (typically 4 KB).

Take my machine as an example; here is the system report.

However, the L1 cache and the cache line size are not shown there. We can use the following command to confirm that the local cache line size is 64 bytes.

$ sysctl -a | egrep 'cachesize|cachelinesize'
hw.cachesize: 8589934592 32768 262144 6291456 0 0 0 0 0 0
hw.cachelinesize: 64
hw.l1icachesize: 32768
hw.l1dcachesize: 32768
hw.l2cachesize: 262144
hw.l3cachesize: 6291456

This means that when the processor loads an element of a Go []int64 slice into the cache, it copies eight elements at a time rather than one. If our program stores data in contiguous memory (such as an array), the cache hit rate will be high when the processor accesses those elements, improving performance by reducing how often data must be fetched from main memory.
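As a quick sanity check (a minimal sketch, not from the original article), eight int64 values occupy exactly one 64-byte cache line:

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	var block [8]int64
	// 8 elements x 8 bytes each = 64 bytes, i.e. exactly one cache line on this machine.
	fmt.Println(unsafe.Sizeof(block)) // 64
}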

Cache lines in Go programs

Let’s take a look at a concrete example that demonstrates the benefits of using CPU caching.

func createMatrix(size int) [][]int64 {
	matrix := make([][]int64, size)
	for i := 0; i < size; i++ {
		matrix[i] = make([]int64, size)
	}
	return matrix
}

const matrixLength = 6400

func BenchmarkMatrixCombination(b *testing.B) {
	matrixA := createMatrix(matrixLength)
	matrixB := createMatrix(matrixLength)

	for n := 0; n < b.N; n++ {
		for i := 0; i < matrixLength; i++ {
			for j := 0; j < matrixLength; j++ {
				matrixA[i][j] = matrixA[i][j] + matrixB[i][j]
			}
		}
	}
}

func BenchmarkMatrixReversedCombination(b *testing.B) {
	matrixA := createMatrix(matrixLength)
	matrixB := createMatrix(matrixLength)

	for n := 0; n < b.N; n++ {
		for i := 0; i < matrixLength; i++ {
			for j := 0; j < matrixLength; j++ {
				matrixA[i][j] = matrixA[i][j] + matrixB[j][i]
			}
		}
	}
}

The two benchmarks differ in a single line: the first computes matrixA[i][j] = matrixA[i][j] + matrixB[i][j], while the second computes matrixA[i][j] = matrixA[i][j] + matrixB[j][i], i.e. it reads matrix B in column order. So what happens when we run these two variants?

BenchmarkMatrixCombination-8                    16   67211689 ns/op
BenchmarkMatrixReversedCombination-8             3  480798925 ns/op

As you can see, the second approach is roughly seven times slower than the first. Why is that?

Now, let’s look at a couple of pictures to get a better sense of what’s going on. The blue circle represents the current element of matrix A, and the pink circle the current element of matrix B. The figures illustrate the second benchmark, where we compute matrixA[i][j] = matrixA[i][j] + matrixB[j][i].

Note: in these figures, we represent the matrix with x and y coordinates, and (0,0) is the upper-left corner of the matrix. In real computer storage, all the rows of a matrix are allocated in one contiguous piece of memory, but we draw it as a grid to keep things intuitive.

In addition, in the following examples the matrix size is a multiple of the cache line size, so a cache line never spills over into the next row.

How do we go through the two matrices?

The blue circle moves to the right until it reaches the last column, then moves to the next row at position (5,0), and so on. The pink circle, by contrast, moves down and then switches to the next column.

When the pink circle is at coordinate (0,4), the processor caches the line containing that element (in these diagrams, we assume a cache line holds 4 elements). So when the pink circle reaches coordinate (0,5), can we assume the variable at that coordinate is already in the L1 cache? It depends on the size of the matrix.

If the matrix is small enough relative to the L1 cache, then yes, the element at (0,5) is already in L1. Otherwise, the cache line will be evicted from L1 before (0,5) is accessed. At that point there is a cache miss, and the processor must fetch the variable some other way (for example, from L2). The program’s state then looks like this:

So, what size should the matrix be in order to take full advantage of L1 Cache? Let’s do some math.

$ sysctl hw.l1dcachesize
hw.l1dcachesize: 32768

On my machine, the L1 data cache is 32 KB and the cache line size is 64 bytes, so the L1 data cache can hold up to 512 cache lines (the arithmetic is spelled out in the sketch below). What happens if we change matrixLength to 512? The benchmark results follow the sketch.
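Here is that arithmetic as a minimal sketch; the 32 KB and 64-byte figures are simply the machine-specific values reported by sysctl above:

package main

import "fmt"

func main() {
	const (
		l1dCacheSize  = 32 * 1024 // hw.l1dcachesize on this machine
		cacheLineSize = 64        // hw.cachelinesize on this machine
	)
	// Number of cache lines the L1 data cache can hold at once.
	fmt.Println(l1dCacheSize / cacheLineSize) // 512
}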

BenchmarkMatrixCombination-8           	    3379	    360017 ns/op
BenchmarkMatrixReversedCombination-8   	    1801	    585807 ns/op

We have narrowed the performance gap considerably (with the 6400×6400 matrix, the reversed version was roughly seven times slower), but a significant gap remains. So what’s the problem?

In our benchmark we deal with two matrices, so the CPU must cache lines for both. Under perfectly ideal conditions (no other programs running during the benchmark, which is never really the case), the L1 cache would be half filled by the first matrix and half by the second. So let’s shrink the matrices by another factor of four, setting matrixLength to 128 (in the original article the results converge at 256; on my machine it happens at 128), and see how the benchmark behaves.

BenchmarkMatrixCombination-8                 64750      17665 ns/op
BenchmarkMatrixReversedCombination-8         57712      20404 ns/op

At this point, the two results are finally (nearly) equal.

From the above, we know we need to reduce cache misses when working with large matrices. A technique called loop nest optimization (also known as loop blocking or tiling) helps here: it maximizes cache utilization by traversing the matrix in fixed-size blocks.

In the following example, we use a block of 4×4 elements. In the first matrix, we iterate from (4,0) to (4,3) and then switch to the next row; in the second matrix, we iterate from (0,4) to (3,4) and then switch to the next column.

When the pink circle traverses the first column of the block, the processor caches the corresponding cache lines in L1. The rest of the block can then be read directly from L1, which significantly improves access speed.

Let’s implement this technique in Go. First, we must choose the block size carefully. In the illustration above, a block row occupied exactly one cache line; the block should not be any smaller than that, otherwise a cache line would hold elements that are never accessed, wasting cache space. In our Go benchmark the element type is int64 (8 bytes), and the cache line size is 64 bytes, i.e. 8 elements, so the block size should be at least 8.

func BenchmarkMatrixReversedCombinationPerBlock(b *testing.B) {
	matrixA := createMatrix(matrixLength)
	matrixB := createMatrix(matrixLength)
	blockSize := 8

	for n := 0; n < b.N; n++ {
		for i := 0; i < matrixLength; i += blockSize {
			for j := 0; j < matrixLength; j += blockSize {
				for ii := i; ii < i+blockSize; ii++ {
					for jj := j; jj < j+blockSize; jj++ {
						matrixA[ii][jj] = matrixA[ii][jj] + matrixB[jj][ii]
					}
				}
			}
		}
	}
}

With matrixLength back at 6400, here is the comparison against the original reversed traversal.

BenchmarkMatrixReversedCombinationPerBlock-8         6  184520538 ns/op
BenchmarkMatrixReversedCombination-8                 3  480904016 ns/op

As you can see, by traversing in small blocks, the overall traversal is already nearly three times faster than the original version. Taking full advantage of the CPU cache can help us design more efficient algorithms.

Cache coherency and false sharing

Note that the examples so far are single-threaded. With multiple threads we run into the problem of false sharing. To understand false sharing, we first need to understand cache coherency.

Suppose you have a dual-core CPU with two threads running in parallel, one on each core. They read two different pieces of data, A and B, from memory at the same time. If the two pieces of data are contiguous (or very close) in physical memory, both A and B end up in the L1 cache of each core.

To speed up data access, each CPU core has its own small but fast cache holding the most frequently accessed data. Because accessing main memory directly is slow, when data is modified the processor first changes only the cache contents and does not immediately write the change back to memory, and that is where problems begin.

In the example above, suppose thread 1 modifies data A and thread 2 modifies data B. Now the caches disagree with each other and with main memory: the copy of B that thread 1 sees is stale, as is the copy of A that thread 2 sees (and any other thread reading from memory still gets the old values). This dirty data is a huge problem for a program, so multi-core cache coherency must be maintained.

The naive solution is simple: whenever data changes in a cache line shared by multiple cores, notify every core to update its copy, or to discard it and re-read from memory on the next access.

Such constraints clearly have a performance cost. Many protocols exist for maintaining cache coherency; the best known is the MESI protocol used in Intel CPUs.

The MESI protocol

To understand the MESI protocol, it is important to know that all data transfers, between caches and memory and between the caches themselves, take place on a shared data bus that is visible to all CPU cores.

MESI is a snooping protocol: a cache not only uses the bus to communicate with memory, it also constantly listens to traffic on the bus and tracks what other caches are doing. So when a cache reads or writes memory on behalf of its core, or modifies data, the other cores are notified and use that information to keep their own caches in sync.

The four letters of MESI stand for the four states a cache line can be in; each cache line is always in exactly one of them, tracked with two extra bits per line.

  • Modified: The data is cached only in this core’s cache and has been modified, so it differs from main memory.
  • Exclusive: The data is cached only in this core’s cache and has not been modified, so it matches main memory.
  • Shared: The data may be cached by multiple cores and matches main memory.
  • Invalid: This cache line is no longer valid and its data must be re-fetched.

Again, let’s look at how the processor uses MESI to ensure cache consistency.

  1. Suppose thread 1 reads data A first. Because A and B are physically adjacent in memory, data B is loaded into Core 1’s cache line as well, and that cache line is marked Exclusive.

  2. Thread 2 then reads data B, pulling A and B from memory into a cache line on Core 2. Since the same line already exists in Core 1’s cache, the processor marks both copies as Shared.

  3. Thread 1 on Core 1 now wants to modify data A. It finds that its cache line is Shared, so it sends a message over the data bus telling Core 2 to mark the corresponding cache line Invalid; it then modifies A and marks the cache line on Core 1 as Modified.

  4. After that, thread 2 wants to modify data B, but its cache line on Core 2 is now Invalid, while the corresponding line on Core 1 (which also contains B) is Modified. So Core 2 asks Core 1, over the bus, to write that cache line back to memory; Core 2 then reads the line from memory into its cache, modifies B, and marks the line Modified, and finally Core 1 marks its copy Invalid.
So if the threads on Core 1 and Core 2 keep modifying A and B alternately, steps 3 and 4 repeat over and over, and the cache effectively stops working as a cache.

Variables A and B have no logical relationship to each other, but because they live in the same cache line, modifying either one invalidates the whole line. This phenomenon, where cache lines are repeatedly invalidated because threads on different cores read and write different variables that happen to share a cache line, is called false sharing.
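To see whether two variables are at risk of false sharing, one quick check is whether their addresses fall within the same cache line. The sameCacheLine helper below is a hypothetical illustration, assuming the 64-byte line size reported by sysctl earlier:

package main

import (
	"fmt"
	"unsafe"
)

// sameCacheLine reports whether two addresses fall within the same cache line,
// assuming a 64-byte line size (hypothetical helper, for illustration only).
func sameCacheLine(p, q unsafe.Pointer) bool {
	const cacheLineSize = 64
	return uintptr(p)/cacheLineSize == uintptr(q)/cacheLineSize
}

func main() {
	// Two adjacent struct fields usually end up in the same cache line,
	// making them candidates for false sharing when written by different cores.
	var vars struct {
		a int64
		b int64
	}
	fmt.Println(sameCacheLine(unsafe.Pointer(&vars.a), unsafe.Pointer(&vars.b)))
}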

Memory padding

Is there a way around false sharing? The answer is memory padding: fill enough space between the two variables to ensure that they land in different cache lines. Let’s look at a concrete example.

// M must be large enough; otherwise goroutine 1 may finish before goroutine 2 even starts
const M = 1000000

type SimpleStruct struct {
	n int
}

func BenchmarkStructureFalseSharing(b *testing.B) {
	structA := SimpleStruct{}
	structB := SimpleStruct{}
	wg := sync.WaitGroup{}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		wg.Add(2)
		go func() {
			for j := 0; j < M; j++ {
				structA.n += 1
			}
			wg.Done()
		}()
		go func() {
			for j := 0; j < M; j++ {
				structB.n += 1
			}
			wg.Done()
		}()
		wg.Wait()
	}
}

In this example, we instantiate two structs, structA and structB, one after the other, so they should be allocated contiguously in memory. We then create two goroutines, each accessing one of the structs.

The n field of structA is accessed by goroutine 1, and the n field of structB by goroutine 2. Since the two structs are adjacent in memory, the two n fields end up in the same cache line, which is then loaded into the caches of both CPU cores (assuming the two goroutines are scheduled onto threads on different cores, which of course is not guaranteed). The benchmark results are as follows.

BenchmarkStructureFalseSharing-8   	     538	   2245798 ns/op

Next we apply memory padding: we pad the struct with a placeholder field, CacheLinePad, the size of a cache line.

type PaddedStruct struct {
	n int
	_ CacheLinePad
}

type CacheLinePad struct {
	_ [CacheLinePadSize]byte
}

const CacheLinePadSize = 64
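As an aside, instead of hard-coding 64 bytes, the golang.org/x/sys/cpu package exposes a CacheLinePad type sized for the target architecture. The PaddedStructX variant below is a hypothetical sketch of how it could be used:

package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/cpu"
)

// PaddedStructX is a hypothetical variant of PaddedStruct that borrows the
// architecture-sized pad from golang.org/x/sys/cpu instead of hard-coding 64 bytes.
type PaddedStructX struct {
	n int
	_ cpu.CacheLinePad
}

func main() {
	// Prints the pad size chosen for the target architecture (64 bytes on amd64).
	fmt.Println(unsafe.Sizeof(cpu.CacheLinePad{}))
}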

We then instantiate the two structs as before and again access the two variables from separate goroutines.

// M must be large enough; otherwise goroutine 1 may finish before goroutine 2 even starts
const M = 1000000

func BenchmarkStructurePadding(b *testing.B) {
	structA := PaddedStruct{}
	structB := SimpleStruct{}
	wg := sync.WaitGroup{}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		wg.Add(2)
		go func() {
			for j := 0; j < M; j++ {
				structA.n += 1
			}
			wg.Done()
		}()
		go func() {
			for j := 0; j < M; j++ {
				structB.n += 1
			}
			wg.Done()
		}()
		wg.Wait()
	}
}

In the CPU cache, memory is now laid out as shown in the figure below: because enough padding sits between the two variables, they fall into different cache lines, and each core only invalidates lines it actually writes.

Here is a comparison of the benchmark results for the two approaches.

BenchmarkStructureFalseSharing-8   	     538	   2245798 ns/op
BenchmarkStructurePadding-8        	     793	   1506534 ns/op

As you can see, the padded version in this example is about 30% faster than the original implementation. This is a classic space-for-time trade-off: memory padding increases execution speed, but it also allocates and wastes more memory.
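To quantify that space cost, here is a minimal sketch comparing the two struct sizes with unsafe.Sizeof (the type definitions are repeated from above so the snippet is self-contained; the numbers assume a 64-bit platform):

package main

import (
	"fmt"
	"unsafe"
)

type SimpleStruct struct {
	n int
}

type CacheLinePad struct {
	_ [64]byte
}

type PaddedStruct struct {
	n int
	_ CacheLinePad
}

func main() {
	fmt.Println(unsafe.Sizeof(SimpleStruct{})) // 8 bytes: just n
	fmt.Println(unsafe.Sizeof(PaddedStruct{})) // 72 bytes: 8 for n plus 64 bytes of padding
}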

Conclusion

Mechanical sympathy is an important concept in software development. It comes from the famous words of three-time Formula One world champion Jackie Stewart:

“You don’t have to be an engineer to be a racing driver, but you do have to have mechanical sympathy.”

Just as understanding how a car works can make you a better driver, understanding how computer hardware works can make you a better programmer. You don’t necessarily need to be a hardware engineer, but you do need to understand how the hardware works and design your software with that in mind.

To bridge the performance gap between the CPU and main memory, modern computers introduce a multi-level cache system. With a cache, the CPU usually does not interact directly with main memory but with the much faster L1 cache. Following the principle of locality, data moves between cache and memory in units of a cache line, typically 64 bytes.

Because of cache lines, we should write programs with a high cache hit ratio, reducing how often data must be fetched from main memory and thus improving execution efficiency. At the same time, to keep caches coherent, processors implement the MESI protocol, which gives rise to the false sharing problem of repeated cache invalidation. Finally, we introduced memory padding, a space-for-time technique that improves execution speed at the cost of some wasted memory.