
This article introduces the computer cache coherence protocol and instruction reordering in detail.

1 Hardware efficiency and cache coherence

The cause-and-effect relationship between “letting the computer execute several computing tasks concurrently” and “exploiting the processor’s power more fully” seems logical, but in fact the relationship is not as simple as it appears. One important source of complexity is that the vast majority of computing tasks cannot be completed by the processor alone: the processor must at least interact with memory, for example to read operand data and store results, and this I/O is very difficult to eliminate (you cannot do all computation with registers alone).

Because computer storage and the CPU differ in speed by several orders of magnitude, and to avoid spending most of the time on disk I/O, network communication, or database access, modern computer systems add a layer of cache whose read/write speed is as close as possible to the processor’s, to serve as a buffer between memory and the processor: the data needed for an operation is copied into the cache so that the operation can proceed quickly, and when the operation completes the result is synchronized from the cache back to memory, so the processor does not have to wait for slow memory reads and writes.

Cache-based storage interaction resolves the speed conflict between processor and memory well, but it also introduces a new problem for computer systems: cache coherence. In a multiprocessor system, each processor has its own cache, yet they all share the same main memory, as shown in the figure below:

When multiple processors operate on the same main-memory area, their cached copies of the data may become inconsistent. If that happens, whose cached data should be used when synchronizing back to main memory? To solve this coherence problem, each processor must follow certain protocols when accessing the cache and operate according to them when reading and writing. Such protocols include MSI, MESI (the Illinois protocol), MOSI, Synapse, Firefly, and the Dragon protocol.

1.1. Modern computer caches

Modern computers typically have two or more CPUs, and each CPU may contain multiple cores. Therefore, if an application is multi-threaded, its threads may run in parallel on different CPU cores.

Inside the CPU there is a set of CPU registers, which serve as the CPU’s own memory.

  1. The CPU can manipulate registers much faster than it can access the computer’s main memory.
  2. Between main memory and the CPU registers there is also a CPU cache; the CPU operates on the cache faster than on main memory but slower than on registers. Some CPUs have multiple cache levels (level 1, level 2, and so on). A computer’s main memory, also known as RAM, is accessible to all CPUs and is much larger than the caches and registers mentioned above.
  3. When a CPU needs to access main memory, it first reads part of the main-memory data into the CPU cache, and then from the cache into registers. When the CPU writes data to main memory, it likewise flushes registers to the cache and then, at some point, flushes the cached data back to main memory.

Caching significantly narrows the gap between fast CPUs and slow memory. Take a three-level cache architecture as an example:

  1. If Core0 and Core1 both hit the same memory address, their respective L1 caches will each hold a copy of the same data.
  2. When Core0 modifies the data in its own cache, the copy in Core1’s L1 cache becomes invalid.

In addition to the three levels of cache, the hardware architectures of various vendors contain a variety of other caches, all of which suffer from similar visibility problems. For example, a register acts as a cache between the CPU and the L1 cache.
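These visibility effects surface directly in Java. The sketch below is a minimal illustration (the class name, method name, and timing values are invented for this example): a volatile flag guarantees that the writer’s store becomes visible to the spinning reader thread, whereas without volatile the reader could in principle spin on a stale cached value indefinitely.

```java
// Minimal visibility sketch, assuming a JVM on a multi-core machine.
// 'volatile' forces the read in the loop to observe the latest write
// rather than a stale cached copy.
public class VisibilityDemo {
    // Without 'volatile', the reader below might never see the update.
    static volatile boolean stop = false;

    // Returns true if the reader thread observed the write and exited.
    static boolean runDemo() throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) { /* spin until the store becomes visible */ }
        });
        reader.start();

        Thread.sleep(100);  // let the reader start spinning
        stop = true;        // guaranteed visible because 'stop' is volatile

        reader.join(2000);  // with volatile this returns promptly
        return !reader.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("reader terminated: " + runDemo());
    }
}
```

If the `volatile` modifier is removed, the JIT compiler is free to hoist the read of `stop` out of the loop, and the reader may never terminate.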

1.2 Cache coherence protocol: MESI

With a multi-core CPU there are multiple L1 caches. How do we keep the data inside these caches consistent and prevent the system’s data from becoming chaotic? This is where a coherence protocol, MESI, comes in. MESI stands for the first letters of four states. Each cache line (the unit of data storage in a cache) is in one of four states, which can be represented with two bits:

| State | Description | Listening task |
| --- | --- | --- |
| M (Modified) | The cache line is valid; the data has been modified and is inconsistent with memory; the data exists only in this cache. | The cache line must listen for all attempts by other caches to read the corresponding main-memory location; such reads must be deferred until this cache writes the line back to main memory and changes the state to S (Shared). |
| E (Exclusive) | The cache line is valid; the data is consistent with memory and exists only in this cache. | The cache line must listen for other caches reading this line from main memory; once that happens, the line transitions to S (Shared). |
| S (Shared) | The cache line is valid; the data is consistent with memory and may exist in many caches. | The cache line must listen for requests from other caches to invalidate or take exclusive ownership of this line, and then become I (Invalid). |
| I (Invalid) | The cache line is invalid. | None. |

The basic principles of the MESI protocol are:

  1. After Core0 modifies data V, it sends a signal that marks Core1’s cached copy of V as invalid, and writes the modified value back to memory.
  2. Core0 may modify data V multiple times but sends the signal only once each time it takes ownership (sending the signal locks the bus between the caches), and Core1’s cached copy of V remains invalid.
  3. Before Core1 uses data V, it finds that its cached copy of V is invalid, learns that V has been modified, and reloads V from another cache or from memory.
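The transitions above can be sketched as a toy state machine for a single cache line. This is a deliberate simplification (a real implementation also handles read-for-ownership requests, write-back buffering, and so on), and the class and method names are invented for this illustration:

```java
// Toy MESI state machine for one cache line; illustration only.
public class MesiLine {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    State state = State.INVALID;

    // This core reads the line (reload after a miss or invalidation).
    void localRead(boolean otherCacheHasCopy) {
        if (state == State.INVALID) {
            state = otherCacheHasCopy ? State.SHARED : State.EXCLUSIVE;
        }
        // In M, E, or S the read hits and the state is unchanged.
    }

    // This core writes the line: it becomes MODIFIED here, and the
    // protocol broadcasts an invalidation to all other caches.
    void localWrite() {
        state = State.MODIFIED;
    }

    // Another core reads the same address: an M or E line must become
    // shared (an M line is first written back to main memory).
    void snoopRead() {
        if (state == State.MODIFIED || state == State.EXCLUSIVE) {
            state = State.SHARED;
        }
    }

    // Another core writes the same address: our copy is now stale.
    void snoopWrite() {
        state = State.INVALID;
    }

    public static void main(String[] args) {
        MesiLine core1 = new MesiLine();
        core1.localRead(false);           // miss, no other copy
        System.out.println(core1.state);  // EXCLUSIVE
        core1.snoopWrite();               // Core0 modified the data
        System.out.println(core1.state);  // INVALID
    }
}
```

Here `snoopWrite` models the invalidation signal from step 1, and `localRead` models the reload in step 3.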

1.3 Reordering

1.3.1 Overview of Reordering

In addition to adding caches, the processor may apply out-of-order execution optimization to the input code in order to make full use of its internal execution units, and then rearrange the out-of-order results after computation. The result is guaranteed to equal the result of sequential execution, but the order in which individual statements execute is not guaranteed to match the order in the input code. Therefore, if one computation depends on the intermediate result of another, ordering cannot be guaranteed by code sequence alone. Similar to the processor’s out-of-order execution optimization, the just-in-time compiler of the Java virtual machine performs a similar optimization called instruction reordering.

There are three types of reordering:

  1. Compiler-optimization reordering: the compiler can rearrange the execution order of statements without changing the semantics of a single-threaded program.
  2. Instruction-level parallelism reordering: modern processors use instruction-level parallelism to overlap the execution of multiple instructions. If there is no data dependency, the processor can change the order in which the corresponding machine instructions execute.
  3. Memory-system reordering: the processor uses caches and read/write buffers, which can make load and store operations appear to execute out of order. (The resulting visibility problems can be addressed by the MESI protocol.)

Type 1 above is a compiler reordering; types 2 and 3 are processor reorderings. All of these reorderings can cause memory-visibility problems in multithreaded programs.

1.3.2 Reordering conditions

Reordering is not arbitrary; it must satisfy the following two conditions.

1.3.2.1 Data dependency

If two operations access the same variable and one of them is a write, there is a data dependency between the two operations. There are three types of data dependency:

  1. Read after write: write a variable, then read it (a = 1; b = a;).
  2. Write after write: write a variable, then write it again (a = 1; a = 2;).
  3. Write after read: read a variable, then write it (b = a; a = 1;).

The compiler and processor do not change the execution order of two operations that have a data dependency; that is, such operations are not reordered.

Data dependency here refers only to instruction sequences executed on a single processor and operations performed in a single thread; dependencies between different processors or between different threads are not considered by the compiler or processor.
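A few lines of Java make the three dependency types concrete (the variable names and values are arbitrary, chosen for this sketch):

```java
// Illustration of data dependence within a single thread.
public class DependencyDemo {
    // Returns {a, b, x + y} after executing the sequence below.
    static int[] run() {
        int a = 1;      // S1: write a
        int b = a;      // S2: read a  -> read after write; S1/S2 order fixed
        a = 2;          // S3: write a -> write after write with S1; order fixed
        b = a + 1;      // S4: read a  (b becomes 3)
        a = 5;          // S5: write a -> write after read with S4; order fixed

        int x = 10;     // x and y touch different variables: no data
        int y = 20;     // dependency, so these two may be reordered freely.
        return new int[] { a, b, x + y };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 5 3 30
    }
}
```

Whatever order the compiler or processor chooses for the independent statements, the dependent pairs must keep their program order, so the result is always the same.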

1.3.2.2 The as-if-serial semantics

All operations may be reordered for optimization, but the result must be the same as the result of executing the program code in order. The compiler, runtime, and processor must all obey the as-if-serial semantics. Note that this guarantee does not extend to multithreaded programs.

To comply with the as-if-serial semantics, the compiler and processor do not reorder operations that have data dependencies, because such reordering would change the execution result. However, if there are no data dependencies between operations, they may be reordered by the compiler and processor.

The as-if-serial semantics protect single-threaded programs. Compilers, runtimes, and processors that adhere to the as-if-serial semantics collectively create the illusion, for programmers writing single-threaded programs, that the program executes in order. The as-if-serial semantics free single-threaded programmers from worrying about reordering or memory-visibility issues.
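The area-of-a-circle example commonly used to illustrate as-if-serial, sketched in Java (the class and variable names are invented for this sketch):

```java
// as-if-serial: A and B are independent and may be reordered with each
// other, but C depends on both and can never move before them.
public class AsIfSerialDemo {
    static double area() {
        double pi = 3.14;   // A: write pi
        double r  = 1.0;    // B: write r (independent of A)
        return pi * r * r;  // C: reads pi and r, must execute last
    }

    public static void main(String[] args) {
        // Whether A and B execute as A,B,C or B,A,C, the
        // single-threaded result is identical.
        System.out.println(area()); // 3.14
    }
}
```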

1.3.3 Instruction-level parallel reordering (processor)

As long as reordering does not affect the result of single-threaded, sequential execution and does not violate the two conditions above, two instructions can be reordered. Out-of-order execution is the process by which the processor optimizes the order of instructions to improve computation speed.

Execution without optimization:

  1. Fetch the instruction.
  2. If the input operands are available (for example, already in registers), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable in the current clock cycle (usually because they must be fetched from main memory), the processor waits until they become available.
  3. The instruction is executed in the appropriate functional unit.
  4. The functional unit writes the result back to the register file.

Execution with optimization (out of order):

  1. Fetch the instruction.
  2. The instruction is dispatched to an instruction queue (also called an instruction buffer or reservation station).
  3. The instruction waits in the queue until its operands are available; it may then leave the queue before earlier, older instructions. (This is the out-of-order part.)
  4. The instruction is issued to and executed by the appropriate functional unit.
  5. The result is placed in a queue.
  6. The result of the instruction is written to the register file only after all instructions preceding it have written their results. (This restores the sequential order of the results.)

1.3.4 Compiler optimized reordering

The motivation is the same as for processor out-of-order execution: rather than waiting for a blocked instruction (such as one waiting on a cache flush) to complete, it is better to execute other instructions first. Compared with processor out-of-order execution, compiler reordering can achieve out-of-order optimization over a larger scope.

1.3.5 Impact of reordering on multi-threading

In a single-threaded program, reordering operations with control-dependencies does not change the execution result (which is why the as-if-serial semantics allow reordering operations with control-dependencies). But in multithreaded programs, reordering operations that have control dependencies may change the program’s execution results. Therefore, multithreading requires programmers to make extra efforts to ensure correct code order.

Consider the following example:
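A sketch of the example, following the standard two-thread example from the JSR-133 specification (treat it as a reconstruction; shared variables A and B start at 0, and r1 and r2 are thread-local):

```java
// Classic JSR-133 reordering example: shared A and B start at 0.
public class ReorderDemo {
    static int A = 0, B = 0;
    static int r1, r2;

    static void thread1() {
        r2 = A;   // instruction 1
        B = 1;    // instruction 2
    }

    static void thread2() {
        r1 = B;   // instruction 3
        A = 2;    // instruction 4
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(ReorderDemo::thread1);
        Thread t2 = new Thread(ReorderDemo::thread2);
        t1.start(); t2.start();
        t1.join();  t2.join();
        // On most runs this prints a sequentially consistent result,
        // but under some JVM/hardware reorderings r2=2 r1=1 is legal
        // even though no sequential interleaving can produce it.
        System.out.println("r2=" + r2 + " r1=" + r1);
    }
}
```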

The program uses local variables r1 and r2 and shared variables A and B. The result r2 == 2, r1 == 1 is possible. Intuitively, either instruction 1 or instruction 3 should execute first. If instruction 1 executes first, it should not see the value written by instruction 4; if instruction 3 executes first, it should not see the value written by instruction 2.

If an execution exhibits this behavior, we would have to conclude that instruction 4 executed before instruction 1, instruction 1 before instruction 2, instruction 2 before instruction 3, and instruction 3 before instruction 4. On the face of it, this is absurd.

However, from the perspective of a single thread, the compiler may reorder instructions within that thread as long as the reordering does not affect that thread’s own result. If instruction 1 is reordered with instruction 2, it is easy to see how r2 == 2 and r1 == 1 can occur.

1.4. Forward substitution

There is another compiler optimization that can cause unexpected results in multithreaded programs:
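A sketch of the example under discussion, reconstructed along the lines of the JSR-133 forward-substitution example (assuming p and q alias the same object and p.x starts at 0):

```java
// Reconstruction of the JSR-133 forward-substitution example.
// p and q refer to the same object; p.x is initially 0.
public class ForwardSubDemo {
    static class Point { int x; }

    static Point p = new Point();
    static Point q = p;            // p and q alias

    // Thread 1
    static void reader() {
        Point r1 = p;
        int r2 = r1.x;             // first read of r1.x
        Point r3 = q;
        int r4 = r3.x;             // may observe thread 2's write of 3
        int r5 = r1.x;             // the compiler may reuse r2 here!
        System.out.println(r2 + " " + r4 + " " + r5);
    }

    // Thread 2
    static void writer() {
        Point r6 = p;
        r6.x = 3;
    }

    public static void main(String[] args) {
        reader();                  // single-threaded: prints 0 0 0
        writer();
        reader();                  // single-threaded: prints 3 3 3
    }
}
```

Run concurrently, with `writer` executing between thread 1’s first and third reads, one would naively expect 0 3 3; if the compiler reuses r2 for r5, the surprising result 0 3 0 becomes possible.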

A common compiler optimization would reuse r2 at the point where r5 is used: both read r1.x, and there is no intervening operation on r1.x between them.

r6.x is assigned by thread 2 between thread 1’s first read of r1.x and its read of r3.x. If the compiler decides to reuse the value of r2 for r5, then r2 and r5 are both 0 and r4 is 3. From the programmer’s point of view, the value of p.x goes from 0 to 3 and then back to 0.

Although this behavior may come as a surprise, most JVM implementations allow it.

References:

  1. JSR-133 Specification
  2. The Beauty of Concurrent Programming in Java
  3. The Art of Concurrent Programming in Java
