Like and follow so you don't get lost; your support means a lot to me!

🔥 Hi, I’m Chouchou. This article is included in GitHub · Android-Notebook. Welcome to grow together with Chouchou Peng. (Contact information is on GitHub)


Preface

  • The CPU cache is a fundamental topic in computer organization and also commonly used knowledge; interviews may extend on it as well;
  • In this article, I will summarize CPU caching, cache consistency, and false sharing. If it helps you, please like and follow; it really means a lot to me.

Directory


1. CPU Three-Level Cache

  • Background: the CPU's processing speed does not match memory access speed or disk I/O speed (they differ by several orders of magnitude);
  • Objective: improve CPU throughput;
  • Solution: add a cache layer to bridge the speed gap between the two, i.e., insert a layer of "cache" between the CPU and memory whose access speed is as close to the CPU's as possible.

The data loading process is as follows:

  • 1. Programs and data are loaded from disk into memory;
  • 2. Programs and data are loaded from memory into the cache (three-level cache; data is loaded in the order L3 -> L2 -> L1);
  • 3. The CPU loads data from the cache into registers (L0) and performs the calculation;
  • 4. The CPU writes the result back to the cache and, after a certain period, synchronizes it back to memory.

Experience shows that the CPU often reads the same block of data repeatedly, so increasing cache capacity can greatly improve the hit ratio of reads inside the CPU and thus improve CPU throughput. However, due to chip area and cost constraints, caches are not very large. Modern CPU chips use a three-level cache (a rough micro-benchmark at the end of this section illustrates the effect of the hit ratio):

— Picture quoted from Internet

At the top are registers (sometimes called L0), followed by the L1, L2, and L3 caches, and finally memory, local disk, and remote storage. From top to bottom, capacity grows, speed drops, and price per byte falls. Note that in modern CPUs, L0, L1, L2, and L3 are all integrated into the CPU chip; L0, L1, and L2 are private to each processing core, while L3 is shared by all cores of a CPU.

— Picture quoted from Internet
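To get an intuition for how the hit ratio affects throughput, here is a rough micro-benchmark I sketched myself (it is not from the original material; timings are machine-dependent, and a harness such as JMH would measure this far more reliably). It performs the same number of reads over a small array that fits in the cache and over a large array that does not; the cached case is usually clearly faster per access:

public class CacheHitDemo {
    // Keeps summing over the array until roughly `totalAccesses` elements have been read.
    static long sum(int[] data, long totalAccesses) {
        long sum = 0;
        long done = 0;
        while (done < totalAccesses) {
            for (int value : data) {
                sum += value;
            }
            done += data.length;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] small = new int[16 * 1024];        // ~64 KB, fits comfortably in L1/L2
        int[] large = new int[32 * 1024 * 1024]; // ~128 MB, far larger than a typical L3
        long accesses = 256L * 1024 * 1024;      // same number of reads in both cases

        sum(small, accesses); // JIT warm-up
        sum(large, accesses);

        long t1 = System.nanoTime();
        long r1 = sum(small, accesses);
        long t2 = System.nanoTime();
        long r2 = sum(large, accesses);
        long t3 = System.nanoTime();

        // The checksum is printed only so the JIT cannot discard the loops as dead code.
        System.out.printf("small array: %d ms, large array: %d ms (checksum %d)%n",
                (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000, r1 + r2);
    }
}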


2. Cache consistency problem

Modern CPUs typically have multiple cores, and each core has its own private caches (L1 and L2). When multiple cores operate on the same data at the same time, a cache inconsistency arises if core 2 reads the data before core 1 has synchronized its updated value back to memory.

For example, if thread A and thread B execute i++ on the same variable at the same time, a cache inconsistency can occur (a runnable sketch follows the list below):

  • 1. Core A and core B each load the value of i from memory into their own caches; both copies are 0;
  • 2. Core A increments its copy to 1 and eventually writes it back to main memory, which now holds 1;
  • 3. Core B increments its copy to 1 and eventually writes it back to main memory, which still holds 1;
  • 4. The final value in main memory is 1, not the expected 2.
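A minimal Java sketch of this lost-update effect (my own example for illustration; what a Java program actually observes is governed by the JMM rather than by raw MESI, but the outcome is the same: increments get lost):

public class LostUpdateDemo {
    static int i = 0; // shared, not synchronized and not atomic

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int n = 0; n < 100_000; n++) {
                i++; // read-modify-write, not atomic
            }
        };
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        a.join();
        b.join();
        // Expected 200000, but this usually prints less because updates are lost.
        System.out.println("i = " + i);
    }
}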

To resolve cache inconsistencies, there are generally two solutions:

  • 1. Lock the bus

Early CPUs solved cache inconsistency by locking the bus. Locking the bus effectively locks the entire memory: while the bus is locked, other processors cannot access memory at all, which severely degrades CPU performance.

  • 2. Cache Consistency Protocol (MESI)

A cache consistency protocol provides a more efficient way to manage memory data. The bus-locking scheme protects the consistency of the entire memory, whereas a cache consistency protocol achieves essentially the same protection while shrinking the protected range from the entire memory down to a single cache line (the cache line is the basic unit of cache management).

To put it simply: when a CPU core is about to write data and finds that the variable it operates on is shared (that is, other cores also hold a copy of it), it notifies the other cores that their cache line for that variable is invalid and must be re-read from memory.

Specifically, the MESI protocol defines cached data as having four states:

State          Description
E (Exclusive)  Exclusive: only this core's cache holds the line, and it matches main memory
S (Shared)     Shared: the line may be held by several cores' caches, all matching main memory
M (Modified)   Modified: the line has been changed in this cache and differs from main memory
I (Invalid)    Invalid: the line's data is stale and must be re-read from memory

Detailed working principle (a simplified simulation sketch follows this list):

  • 1. Core A loads variable i from memory and sets its cache line to E (Exclusive); from then on it monitors memory operations on i through bus sniffing;
  • 2. Core B loads variable i from memory; the bus sniffing mechanism sets the cache lines of both core A and core B to S (Shared);
  • 3. Core A modifies variable i: its cache line is set to M (Modified), and core B is notified to set its cache line for i to I (Invalid). If several cores try to modify i concurrently, the bus arbitrates which one wins;
  • 4. Core A synchronizes the modified data back to memory, and its cache line becomes E (Exclusive) again;
  • 5. Core B re-reads i, refreshing its cache line, and the cache lines of core A and core B are both set back to S (Shared).
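To make these transitions concrete, here is a deliberately simplified, single-threaded simulation I wrote for illustration (the class and method names are invented; a real MESI implementation lives in hardware and handles far more cases than this sketch):

import java.util.ArrayList;
import java.util.List;

public class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    // One core's view of the single cache line that holds variable i.
    static class CacheLine {
        final String core;
        State state = State.INVALID;
        int value;
        CacheLine(String core) { this.core = core; }
    }

    static final List<CacheLine> lines = new ArrayList<>();
    static int memory = 0; // main memory's copy of i

    // Read: if another core holds the line, everyone drops to SHARED (a MODIFIED copy
    // is written back first); if nobody else holds it, the reader gets it EXCLUSIVE.
    static void read(CacheLine me) {
        boolean othersHoldIt = false;
        for (CacheLine other : lines) {
            if (other != me && other.state != State.INVALID) {
                if (other.state == State.MODIFIED) { memory = other.value; } // write back
                other.state = State.SHARED;
                othersHoldIt = true;
            }
        }
        me.value = memory;
        me.state = othersHoldIt ? State.SHARED : State.EXCLUSIVE;
        System.out.println(me.core + " reads  i=" + me.value + " -> " + me.state);
    }

    // Write: this core's line becomes MODIFIED and every other copy is invalidated
    // (this is the notification delivered by the bus sniffing mechanism).
    static void write(CacheLine me, int newValue) {
        for (CacheLine other : lines) {
            if (other != me) { other.state = State.INVALID; }
        }
        me.value = newValue;
        me.state = State.MODIFIED;
        System.out.println(me.core + " writes i=" + newValue + " -> " + me.state);
    }

    // Write-back: flush a MODIFIED line to memory; with no other sharers it becomes EXCLUSIVE.
    static void writeBack(CacheLine me) {
        memory = me.value;
        me.state = State.EXCLUSIVE;
        System.out.println(me.core + " writes back i=" + me.value + " -> " + me.state);
    }

    public static void main(String[] args) {
        CacheLine coreA = new CacheLine("core A");
        CacheLine coreB = new CacheLine("core B");
        lines.add(coreA);
        lines.add(coreB);

        read(coreA);      // step 1: core A loads i, its line is E (Exclusive)
        read(coreB);      // step 2: core B loads i, both lines become S (Shared)
        write(coreA, 1);  // step 3: core A modifies i -> M (Modified), core B's line -> I (Invalid)
        writeBack(coreA); // step 4: core A syncs i back to memory, its line becomes E again
        read(coreB);      // step 5: core B re-reads i, both lines become S again
    }
}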

Don't confuse the two: the MESI protocol is a CPU-level protocol, while the JMM (Java Memory Model) has its own rules for ensuring visibility and consistency at the Java level.
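For example, a Java program does not talk to MESI directly; visibility between threads is expressed through JMM constructs such as volatile. A minimal sketch of my own (not from the original text): without volatile on the flag, the reader thread is allowed to keep using a stale cached value and may never stop; with volatile, the write is guaranteed to become visible:

public class VisibilityDemo {
    // Try removing `volatile`: the reader may then loop forever on a stale value.
    static volatile boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // busy-wait until the writer's update becomes visible
            }
            System.out.println("reader saw stop == true");
        });
        reader.start();

        Thread.sleep(100);
        stop = true; // the main (writer) thread publishes the change
        reader.join();
    }
}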


3. False Sharing

In the CPU cache, the basic unit of cache management is not a byte but a cache line. The cache line size depends on the CPU and is typically 64 bytes.

The cache line design exploits the fact that after the CPU reads one piece of data, it often needs nearby data soon afterwards (spatial locality). Loading a small contiguous chunk of data into the cache at a time therefore improves efficiency (a small traversal benchmark after the picture below illustrates this).

— Picture quoted from Internet
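As a rough illustration of this spatial locality (my own sketch; exact timings depend on the machine, and JMH would be more rigorous): traversing a 2D array row by row reads memory sequentially, one cache line after another, while traversing it column by column jumps between rows and wastes most of every cache line it loads:

public class LocalityDemo {
    static final int N = 4 * 1024; // a 4096 x 4096 int matrix, about 64 MB

    public static void main(String[] args) {
        int[][] matrix = new int[N][N];

        long t1 = System.nanoTime();
        long rowSum = 0;
        for (int row = 0; row < N; row++) {      // row by row: sequential, cache friendly
            for (int col = 0; col < N; col++) {
                rowSum += matrix[row][col];
            }
        }
        long t2 = System.nanoTime();

        long colSum = 0;
        for (int col = 0; col < N; col++) {      // column by column: strided, cache hostile
            for (int row = 0; row < N; row++) {
                colSum += matrix[row][col];
            }
        }
        long t3 = System.nanoTime();

        // The checksum is printed only so the JIT cannot discard the loops as dead code.
        System.out.printf("row-major: %d ms, column-major: %d ms (checksum %d)%n",
                (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000, rowSum + colSum);
    }
}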

Of course, cache lines are not perfect; they have a side effect: false sharing. False sharing occurs when multiple threads simultaneously read and write different variables that happen to sit in the same cache line. Although the threads access different data, any modification by one side invalidates the whole cache line for the other, which reduces efficiency.
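Here is a rough sketch of my own that can exhibit the effect (field layout is up to the JVM, so this is illustrative rather than guaranteed, and timings vary widely by CPU): two threads update two different counters, but because the two fields are likely to sit in the same cache line, each write keeps invalidating the other core's copy of that line:

public class FalseSharingDemo {
    // Two counters that are likely (but not guaranteed) to share a cache line.
    static class Counters {
        volatile long a;
        volatile long b;
    }

    public static void main(String[] args) throws InterruptedException {
        Counters c = new Counters();
        final long iterations = 100_000_000L;

        Thread t1 = new Thread(() -> { for (long i = 0; i < iterations; i++) c.a++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iterations; i++) c.b++; });

        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        long elapsed = (System.nanoTime() - start) / 1_000_000;

        // Separating the two fields into different cache lines, by manual padding or
        // by @Contended (described below), typically makes the same loops run
        // several times faster on the same machine.
        System.out.println("elapsed: " + elapsed + " ms");
    }
}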

The solution to false sharing is "byte padding": extra bytes are inserted between two variables so that they end up in different cache lines, which avoids the false sharing problem.

When using byte padding, you need to consider which variables change independently and which change together. Variables that change together are placed in one group, while unrelated variables are placed in different groups, so that modifying one variable does not invalidate the cache lines of unrelated variables.

In Java, the processing is different before and after Java 8:

  • Before Java 8: group variables by padding with long fields
public class DataPadding {
    long a1, a2, a3, a4, a5, a6, a7;  // padding: prevent false sharing with the previous object
    int value1;
    long value2;
    long b1, b2, b3, b4, b5, b6, b7;  // padding: prevent false sharing between the two groups of variables
    boolean flag;
    long d1, d2, d3, d4, d5, d6, d7;  // padding: prevent false sharing with the next object
}
  • Java 8 and later: group variables with @sun.misc.Contended

The @Contended annotation is a new annotation in JDK 1.8 for dividing variables into different cache lines.

For example, in Java 8's Thread.java:

 /** The current seed for a ThreadLocalRandom */
@sun.misc.Contended("tlr")
long threadLocalRandomSeed;

/** Probe hash value; nonzero if threadLocalRandomSeed initialized */
@sun.misc.Contended("tlr")
int threadLocalRandomProbe;

/** Secondary seed isolated from public ThreadLocalRandom sequence */
@sun.misc.Contended("tlr")
int threadLocalRandomSecondarySeed;

And in Java 8's ConcurrentHashMap.java:

@sun.misc.Contended
static final class CounterCell {
    volatile long value;
    CounterCell(long x) { value = x; }
}

Tip: for @Contended to take effect on application classes, the JVM must be started with -XX:-RestrictContended.
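As a sketch of applying this in application code (my own hypothetical class, not from the JDK; it assumes Java 8 plus the flag above, and on Java 9+ the annotation moved to jdk.internal.vm.annotation.Contended):

// Hypothetical example: keep two independently updated counters in separate cache lines.
public class ContendedCounters {

    @sun.misc.Contended("writer")
    volatile long writeCount;  // only touched by the writer thread

    @sun.misc.Contended("reader")
    volatile long readCount;   // only touched by the reader thread
}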


4. To summarize

  • Since the CPU's processing speed does not match memory access speed and disk I/O speed, a layer of cache is added to the CPU to bridge the speed gap. The CPU cache is a three-level structure in which L0, L1, and L2 are private to each processing core, while L3 is shared by all cores of a CPU.

  • Since each CPU core has its own private caches (L1 and L2), when multiple cores operate on the same data at the same time, a cache inconsistency arises if core 2 reads the data before core 1 has synchronized its updated value back to memory. Solutions include "locking the bus" and the "cache consistency protocol (MESI)";

  • The basic unit of cache management is the cache line, because after the CPU reads one piece of data it often reads adjacent data soon afterwards. Cache lines have a side effect, false sharing: a cache line is repeatedly invalidated when multiple threads simultaneously read and write different variables that sit in that same line. The solution is "byte padding", which places the variables in different cache lines; before Java 8 this was done by padding with long fields, and since Java 8 the @sun.misc.Contended annotation is used.


Creating content is not easy, and your "triple support" is Chouchou's biggest motivation. See you next time!