This is the third day of my participation in the August Text Challenge. For details, see: August Text Challenge.


  • Under Moore's Law, CPU performance doubles roughly every 18 months, while memory and disk speeds grow far more slowly than the CPU.
  • As a result, high-performance memory and disks are expensive, yet a fast CPU needs data delivered fast. To bridge this gap, a small amount of cache is built into the CPU to smooth over the mismatch between I/O speed and CPU speed.
  • When the CPU accesses storage, both data and instructions tend to cluster in a contiguous region. This is called the principle of locality.
- Temporal locality: once a piece of information is accessed, it is likely to be accessed again soon; for example loops, recursion, and repeated method calls.
- Spatial locality: once a storage location is referenced, nearby locations are likely to be referenced too; for example sequentially executed code, or objects and array elements created in sequence.
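Spatial locality is easy to observe in code: summing a 2D array row-by-row walks memory sequentially, while column-by-column strides across cache lines. Both loops compute the same result, but the row-major loop is typically much faster. A small sketch (the class and method names are my own, for illustration):

```java
// Spatial-locality sketch: Java stores each row of an int[][] contiguously,
// so row-major traversal touches consecutive addresses (cache friendly),
// while column-major traversal jumps between rows on every access.
public class LocalityDemo {
    static long sumRowMajor(int[][] m) {
        long s = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                s += m[i][j];            // consecutive addresses: good spatial locality
        return s;
    }

    static long sumColumnMajor(int[][] m) {
        long s = 0;
        for (int j = 0; j < m[0].length; j++)
            for (int i = 0; i < m.length; i++)
                s += m[i][j];            // strided addresses: poor spatial locality
        return s;
    }
}
```

Both methods return the same sum; only the memory-access pattern differs, which is exactly what the cache rewards or punishes.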

The execution flow of the CPU cache

  1. Programs and data are loaded into main memory
  2. Instructions and data are loaded into the CPU cache
  3. The CPU executes the instruction and writes the result to the CPU cache
  4. The CPU cache writes data to main memory

CPU multilevel cache structure

  • The processing speed of the CPU exceeds even the data I/O capability of a single level of cache, so a multi-level cache structure is used inside the CPU.

Multi-core CPU cache consistency protocol MESI

  • A multi-core CPU contains several caches. How do we keep the data in these caches consistent with one another and prevent corruption? The MESI coherence protocol is introduced for exactly this purpose.
  • MESI takes its name from the first letters of its four states. Every cache line is in one of four states, which can be encoded in two bits:
| State | Description | Bus snooping |
| --- | --- | --- |
| M (Modified) | The cache line is valid. The data has been modified and differs from main memory; it exists only in this cache. | Must snoop every attempt by other caches to read this line from main memory. Such reads are delayed until this cache writes the line back to main memory and changes its state to S (shared). |
| E (Exclusive) | The cache line is valid. The data matches main memory and exists only in this cache. | Must snoop reads of this line by other caches; when one occurs, the line transitions to S (shared). |
| S (Shared) | The cache line is valid. The data matches main memory and may exist in multiple caches. | Must snoop requests from other caches to invalidate or take exclusive ownership of this line, and then mark its own copy I (invalid). |
| I (Invalid) | The cache line is invalid. | None. |
> - Snooping is always accurate for the M and E states, which match the true state of the cache line. The S state may be imprecise: when one cache invalidates its copy of a shared line, another cache may in fact be left holding the only remaining copy, yet it is not promoted to E, because caches do not announce when they discard shared lines and no cache tracks how many copies exist, so exclusivity cannot be confirmed.
> - The E state is therefore a speculative optimization: modifying a line in the S state requires a bus transaction to invalidate all other copies, while modifying a line in the E state needs no bus transaction at all.

MESI state transition

  1. Trigger events

| Trigger event | Description |
| --- | --- |
| Local read | The local cache reads its own copy of the data |
| Local write | The local cache writes to its own copy of the data |
| Remote read | Another (remote) cache reads this cache line |
| Remote write | Another (remote) cache writes to this cache line |
  2. Cache roles (premise: every cache holds a copy of the same main-memory line)
    1. Local cache: the cache belonging to the current CPU.
    2. Triggering cache: the cache that initiates the read/write event.
    3. Other caches: all caches other than the two above.
  • Note: for a local event, the local cache and the triggering cache are one and the same.
| State | Local read | Local write | Remote read | Remote write |
| --- | --- | --- | --- | --- |
| M (modified) | Local/triggering cache stays M; other caches stay I. | Local/triggering cache stays M; other caches stay I. | Local cache M→S: it first writes the line back to main memory (briefly holding it exclusively), then shares it; triggering and other caches go I→S. | Local cache M→I: it writes the line back as for a remote read, then the triggering cache takes the line and ends in M; other caches end in I. |
| E (exclusive) | Local/triggering cache stays E; other caches stay I. | Local/triggering cache E→M; other caches stay I, since a write invalidates all other copies. | Local cache E→S; triggering cache I→S: once another cache reads the line, every valid copy becomes S (shared). | Local cache E→I; triggering cache ends in M; other caches stay I: the writer takes exclusive ownership of the line and then modifies it. |
| S (shared) | Local, triggering, and other caches all stay S. | Local/triggering cache S→M (briefly passing through E while the other copies are invalidated); other caches S→I. | Local, triggering, and other caches all stay S. | Local cache S→I; triggering cache S→M (via E); other caches S→I: the writer invalidates all other copies, then modifies the line. |
| I (invalid) | Local/triggering cache I→S if another cache holds a copy (an M copy is written back first), or I→E if no other copy exists; other caches in E or M drop to S, caches in I stay I. | Local/triggering cache I→M: the line is fetched, taken exclusively, then modified; other caches end in I. | No effect: the line is invalid here, so remote operations do not concern this cache. | No effect: the line is invalid here, so remote operations do not concern this cache. |
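The transitions above can be condensed into a tiny state machine. This is a deliberately simplified model of my own (not from the article): it tracks one cache's view of a single line, assumes a local read of an invalid line ends in S, and skips the transient write-back and E steps.

```java
// Minimal single-line MESI state machine sketch (simplified model, not a
// faithful hardware simulator): given the current state of this cache's
// copy of a line and an event, return the next state.
enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }
enum CacheEvent { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

final class MesiLine {
    static MesiState next(MesiState s, CacheEvent e) {
        switch (e) {
            case LOCAL_READ:
                // Reading promotes INVALID to SHARED (this sketch assumes some
                // other cache holds a copy; with no other copy it would be E).
                return s == MesiState.INVALID ? MesiState.SHARED : s;
            case LOCAL_WRITE:
                // Any local write ends in MODIFIED; other copies are invalidated.
                return MesiState.MODIFIED;
            case REMOTE_READ:
                // Another core reads the line: M and E degrade to SHARED
                // (an M copy is written back to main memory first).
                return s == MesiState.INVALID ? MesiState.INVALID : MesiState.SHARED;
            case REMOTE_WRITE:
                // Another core takes ownership and writes: our copy dies.
                return MesiState.INVALID;
        }
        throw new AssertionError("unreachable");
    }
}
```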
  • When a line changes state in one cache, the copies of that line in other caches must hold compatible states. In the matrix below, the row is this cache's state and the column is another cache's state (✓ = allowed together, ✗ = not allowed):

| | M | E | S | I |
| --- | --- | --- | --- | --- |
| M | ✗ | ✗ | ✗ | ✓ |
| E | ✗ | ✗ | ✗ | ✓ |
| S | ✗ | ✗ | ✓ | ✓ |
| I | ✓ | ✓ | ✓ | ✓ |

  • For example, if one cache holds the line containing the variable x = 0 in state S (shared), then every other cache's copy of that line must be in state S (shared) or I (invalid).
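The matrix reads naturally as a lookup function. A small sketch (enum and method names are illustrative, not from any real API):

```java
// The MESI compatibility matrix as a predicate: may two caches hold the
// same line in states a and b at the same time?
enum State { M, E, S, I }

final class Compat {
    static boolean compatible(State a, State b) {
        if (a == State.I || b == State.I) return true;  // I is compatible with everything
        if (a == State.S && b == State.S) return true;  // multiple shared copies allowed
        return false;                                   // M and E demand exclusivity
    }
}
```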

Multi-core caches operate cooperatively

  • Assume CPUs A, B, and C each have their own cache A, B, and C, and that main memory defines a variable x = 0.

- Single-core read: CPU A issues an instruction to read x. The line is loaded from main memory into cache A over the bus, and its state becomes E (exclusive).
- Dual-core read: CPU A reads x from main memory into cache A as above, setting the line to E (exclusive). CPU B then issues a read of x; cache A snoops the address conflict and responds with the data. x now resides in both cache A and cache B, and both copies are in state S (shared).
- Modifying data: after computing, CPU A issues an instruction to modify x. CPU A sets its line to M (modified) and notifies CPU B, which also holds x; CPU B sets its copy in cache B to I (invalid). CPU A then writes the new value of x.
- Synchronizing data: CPU B issues an instruction to read x and notifies CPU A. CPU A writes the new data back to main memory, its line briefly becoming E (exclusive) again; once CPU B has read the value, both cache A and cache B hold x in state S (shared).

Cache line false sharing

  • The CPU cache stores data in units of cache lines; on mainstream CPUs a cache line is 64 bytes. If multiple threads modify variables that happen to share the same cache line, they invalidate each other's copies and unintentionally hurt each other's performance. This is called false sharing.
- For example, if thread t1 accesses a and thread t2 accesses b, and a and b sit in the same cache line, then when t1 modifies a, t2's copy of the line holding b is invalidated along with it.
  • Java 8 provides the annotation @sun.misc.Contended: fields or classes marked with it are automatically padded out to their own cache line. The annotation is ignored by default; you must start the JVM with -XX:-RestrictContended to enable it for user code.
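A common way to observe false sharing is a micro-benchmark along the following lines. This is a sketch, not rigorous benchmarking: timings vary by machine and JVM, the class and field names are mine, and manual padding is used instead of @Contended so it runs without extra JVM flags.

```java
// Two threads increment adjacent volatile longs (likely the same cache
// line) versus padded longs (separate lines). The padded version is
// typically noticeably faster, though exact numbers are machine-dependent.
final class Counters {
    volatile long a;                      // a and b likely share one cache line
    volatile long b;
}

final class PaddedCounters {
    volatile long a;
    long p1, p2, p3, p4, p5, p6, p7;      // manual padding: ~56 bytes between a and b
    volatile long b;
}

public class FalseSharingDemo {
    // Run two workers concurrently and return the elapsed time in nanoseconds.
    static long run(Runnable w1, Runnable w2) {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int N = 5_000_000;
        Counters c = new Counters();
        PaddedCounters p = new PaddedCounters();
        long shared = run(() -> { for (int i = 0; i < N; i++) c.a++; },
                          () -> { for (int i = 0; i < N; i++) c.b++; });
        long padded = run(() -> { for (int i = 0; i < N; i++) p.a++; },
                          () -> { for (int i = 0; i < N; i++) p.b++; });
        System.out.printf("shared line: %d ms, padded: %d ms%n",
                          shared / 1_000_000, padded / 1_000_000);
    }
}
```

Each thread writes only its own field, so there is no logical data race; the slowdown in the unpadded case comes purely from the two cores invalidating each other's copy of the shared line.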

Problems introduced by the conformance protocol MESI

  • Passing coherence messages between caches takes time, which delays state switches. When one cache changes state, it must wait for the other caches to process the message and send acknowledgements; while waiting, the CPU can stall, causing all kinds of performance and stability problems.
  • For example, modifying a value in the local cache requires notifying every other CPU cache that holds the data to mark its copy I (invalid), then waiting for confirmation. This waiting blocks the processor and degrades its performance.

Store Buffers

  • To avoid this wasteful blocking, Store Buffers were introduced. The processor writes the value destined for main memory into its store buffer and moves on to other work; the write is finally committed once all the invalidation acknowledgements have arrived.
  • Store buffers have limited capacity, so the processor must sometimes still wait for acknowledgements. To mitigate this, the Invalidate Queue was introduced:
- An invalidate acknowledge message is sent immediately for every invalidate request received.
- The invalidate is not applied right away; it is placed in a special queue and processed at a convenient time.
- The processor does not send any messages concerning the affected cache entry until it has processed the queued invalidate.
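The interplay of a store buffer and an invalidate queue can be modeled with a deliberately naive toy. All class and method names below are my own invention (this is a thought experiment, not a real simulator): writes park in a store buffer, incoming invalidations park in a queue, and a read can therefore return stale data until both are drained.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of one core: writes go into a store buffer instead of memory,
// and invalidations are acknowledged immediately but applied lazily.
class ToyCore {
    private final Map<String, Integer> memory;                 // shared "main memory"
    private final Map<String, Integer> cache = new HashMap<>();
    private final Map<String, Integer> storeBuffer = new HashMap<>();
    private final Queue<String> invalidateQueue = new ArrayDeque<>();

    ToyCore(Map<String, Integer> memory) { this.memory = memory; }

    void write(String var, int value) {
        storeBuffer.put(var, value);       // buffered: not yet visible to others
    }

    int read(String var) {
        if (storeBuffer.containsKey(var)) return storeBuffer.get(var); // store forwarding
        if (cache.containsKey(var)) return cache.get(var);             // possibly stale!
        int v = memory.getOrDefault(var, 0);
        cache.put(var, v);
        return v;
    }

    void receiveInvalidate(String var) {
        invalidateQueue.add(var);          // ack immediately, act later
    }

    void drainInvalidateQueue() {
        while (!invalidateQueue.isEmpty()) cache.remove(invalidateQueue.poll());
    }

    void flushStoreBuffer() {
        for (Map.Entry<String, Integer> e : storeBuffer.entrySet())
            memory.put(e.getKey(), e.getValue());
        storeBuffer.clear();
    }
}
```

In this model, core B can keep reading a stale x = 0 after core A has already "written" x = 1, until A flushes its store buffer and B drains its invalidate queue, which is precisely the visibility gap that memory barriers exist to close.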

The memory barrier

  • Even with these mechanisms, the processor cannot know on its own when such optimizations are safe and when they are not.
  • So processors hand that decision to the people writing the code, in the form of memory barriers:
- Store (write) memory barrier: tells the processor to apply all writes already sitting in the store buffer before executing any subsequent instruction.
- Load (read) memory barrier: tells the processor to apply all invalidations already sitting in the invalidate queue before executing any subsequent load.
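In Java, explicit fences of this kind are exposed (since Java 9) as static methods on java.lang.invoke.VarHandle. A minimal sketch of the publish/consume pattern the barriers above enable; the class, field, and method names are illustrative, and roughly speaking releaseFence() plays the store-barrier role while acquireFence() plays the load-barrier role:

```java
import java.lang.invoke.VarHandle;

// Sketch: publish data, then a flag, with explicit fences in between.
public class BarrierSketch {
    static int data = 0;
    static boolean ready = false;

    static void writer() {
        data = 42;
        VarHandle.releaseFence();   // order the data store before the flag store
        ready = true;
    }

    static Integer reader() {
        if (ready) {
            VarHandle.acquireFence(); // order the flag load before the data load
            return data;              // guaranteed to see 42 if ready was seen true
        }
        return null;                  // flag not yet visible
    }
}
```

In practice, most Java code gets the same ordering guarantees indirectly by declaring the flag volatile rather than by calling fences by hand.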

Finally

  • Study with an open mind and make progress together