• The cache
  • Cache coherence protocol
  • Write buffers and invalidate queues
  • Store forwarding
  • Memory reordering
  • Visibility problem
  • Basic memory barrier
  • Synchronization mechanism and memory barrier
  • Virtual machine memory barrier optimization

The cache

  1. Modern processors are far faster than main memory: the time needed for a single main-memory read or write is enough for the processor to execute hundreds of instructions. To bridge this gap between processor speed and main-memory speed, a cache is introduced between the processor and main memory.

    A cache is a storage unit whose access latency is far lower than main memory's, but whose capacity is much smaller. Each processor has its own cache.

  2. Conceptually, the cache stores a copy of each variable in the program: the variable name corresponds to a memory address, and the variable value corresponds to the data stored at that address. The cache does not, however, hold copies of all variables at all times.

  3. A cache behaves like a hardware-implemented hash table that resolves collisions by chaining, whose keys are memory addresses and whose values are copies of data read from, or to be written to, memory. The cache structure is roughly as follows:

    The cache is a classic chained hash table. It consists of several buckets, each holding several cache entries, and every cache entry is divided into three parts: Tag, Data Block, and Flag. The Tag distinguishes which data item an entry holds, and the Flag identifies the entry's state. The Data Block, also known as a cache line, stores data that has been read from main memory or is to be written back to it; a single Data Block may contain the values of several variables.

    When the processor is ready to read a piece of data, it “decodes” the corresponding memory address to get three values:

    • Index – Determines the bucket number

    • Tag – Used to determine the cache entry

    • Offset – Used to determine the offset in the entry where the data resides

    If the data is found in the cache based on its memory address, that is a cache hit; otherwise it is a cache miss, and the processor must fetch the data from main memory (a sketch of this address decoding appears after this list).

  4. Modern processors typically have multiple levels of cache: level 1 (L1), level 2 (L2), and level 3 (L3). The L1 cache is usually integrated into the processor core, so its access latency is very low; in general an L1 read completes in 2 to 4 processor clock cycles. Part of the L1 cache stores instructions (L1i) and part stores data (L1d). As a rule, the closer a cache is to the core, the faster its access, the higher its manufacturing cost, and the smaller its capacity.

    On Linux, you can use the lscpu command to view the cache hierarchy.
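
As a concrete illustration of the Index/Tag/Offset decoding above, here is a minimal Java sketch. The cache geometry it assumes (64-byte data blocks, 256 buckets) is an invented example, not something the text specifies:

    // Minimal sketch of address decoding, assuming a hypothetical geometry:
    // 64-byte data blocks (6 offset bits) and 256 buckets (8 index bits).
    final class CacheAddress {
        static final int OFFSET_BITS = 6;
        static final int INDEX_BITS = 8;

        static void decode(long address) {
            long offset = address & ((1L << OFFSET_BITS) - 1);                 // position within the data block
            long index = (address >>> OFFSET_BITS) & ((1L << INDEX_BITS) - 1); // bucket number
            long tag = address >>> (OFFSET_BITS + INDEX_BITS);                 // identifies the entry within the bucket
            System.out.printf("tag=%x, index=%d, offset=%d%n", tag, index, offset);
        }

        public static void main(String[] args) {
            decode(0x7f3a1b2cL); // an arbitrary example address
        }
    }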

Cache coherence protocol

  1. MESI (Modified-Exclusive-Shared-Invalid) is a widely used cache coherence protocol; the coherence protocol of x86 processors is based on MESI. Its access control resembles a read/write lock: concurrent reads of the same address are allowed, while writes to the same address are exclusive.

    The MESI protocol classifies the states of cache entries into four categories: Modified, Exclusive, Shared, and Invalid.

    • Invalid (I): indicates that the corresponding cache line does not contain a valid copy of the data for any memory address. This is the initial state of a cache entry.

    • Shared (S): indicates that the corresponding cache line contains a copy of the data at the corresponding memory address, and that caches on other processors may also hold copies of the same data. If one processor's cache entry for an address is in the Shared state, then any cache entries on other processors with the same Tag are also in the Shared state. A cache entry in this state holds data identical to that in main memory.

    • Exclusive (E): indicates that the corresponding cache line contains a copy of the data at the corresponding memory address, and holds that copy exclusively; no other processor's cache holds a copy of the same data. A cache entry in this state holds data identical to that in main memory.

    • Modified (M): indicates that the corresponding cache line contains the result of an update to the data at the corresponding memory address. Since the MESI protocol specifies that at any moment only one processor may be modifying the data at a given memory address, no other processor holds a copy of this data. A cache entry in this state holds data that is inconsistent with main memory.

    On the basis of these four states, the MESI protocol defines a group of messages to coordinate read and write operations between processors. By analogy with the HTTP protocol, MESI messages can be divided into request messages and response messages: a processor puts the appropriate request on the bus when performing a memory read or write, and other processors sniff the bus and, under certain conditions, reply with the corresponding response message.

    Message                  Type      Description
    Read                     Request   Notifies the other processors and main memory that the current processor is about to read some data. The message contains the memory address to be read.
    Read Response            Response  Contains the data requested; it may come from main memory or from another processor that sniffed the Read message.
    Invalidate               Request   Instructs the other processors to set the state of the cache entry for the given memory address to I, i.e. to delete their copy of the data.
    Invalidate Acknowledge   Response  A processor that receives an Invalidate message must reply with this message to indicate that it has deleted the corresponding copy from its cache.
    Read Invalidate          Request   A composite of Read and Invalidate: it announces that the current processor is about to update a piece of data (read-modify-write) and asks the other processors to delete their copies. A receiver must reply with both a Read Response and an Invalidate Acknowledge.
    Writeback                Request   Contains the data to be written back to main memory and the corresponding memory address.
  2. MESI read operation flow

    Assume that the DATA at memory address A may be shared by Processor-1 and Processor-2.

    Consider first how Processor-1 reads DATA. Processor-1 locates the candidate cache entry based on address A and reads its Tag and Flag values (the Flag giving the cache entry's state); for simplicity we ignore the details of Tag matching here. If the entry's state is M, E, or S, the processor can read the data for address A directly from the corresponding cache line without sending any message to the bus. If the entry's state is I, the processor's cache holds no valid copy of DATA, so Processor-1 must send a Read message to the bus to read the data at address A, and another processor (Processor-2) or main memory must reply with a Read Response to supply that data.

    When Processor-1 receives the Read Response, it stores the data carried in it (the data block containing DATA) into the corresponding cache line and updates the entry's state to S. The Read Response received by Processor-1 may come from main memory or from another processor: Processor-2 sniffs the messages other processors send on the bus, and when it sniffs a Read message it extracts the memory address to be read and looks up the corresponding cache entry in its own cache. If the entry Processor-2 finds is not in state I, its cache holds a copy of the data being read, so Processor-2 constructs a Read Response message and places into it the entire data block stored in the corresponding cache line (not just the portion Processor-1 asked for). If that entry is in state M, Processor-2 may write the data in the cache line back to main memory before putting the Read Response on the bus, and after sending the Read Response it updates the entry's state to S. If the entry Processor-2 finds is in state I, the Read Response received by Processor-1 comes from main memory instead.

    In other words, even if Processor-2 has updated the data and the update is still sitting in Processor-2's cache (leaving cache and main memory inconsistent), under the coordination of the MESI messages this inconsistency does not cause Processor-1 to read a stale value.

  3. MESI write process

    A processor must own a piece of data before it may write it. When performing a memory write to address A, Processor-1 first finds the corresponding cache entry. If the entry's state is E or M, the processor already owns the data: it can write directly into the corresponding cache line and update the entry's state to M. Otherwise, Processor-1 must send an Invalidate message to the bus to gain ownership. When the other processors receive the Invalidate message, they update the state of the corresponding entry in their caches to I (that is, delete their copy) and reply with an Invalidate Acknowledge. The processor that sent the Invalidate message (the one performing the write) may update the cache line only after it has received the Invalidate Acknowledge messages from all other processors.

    Specifically, if the state of the entry Processor-1 finds is S, caches on other processors (such as Processor-2) may also hold a copy of the data at address A. Processor-1 therefore sends an Invalidate message to the bus; after receiving the Invalidate Acknowledge messages from all other processors, it updates the entry's state to E, at which point it owns the data at address A, and it can then write the data into the cache line and update the state to M. If the state of the entry Processor-1 finds is I, its cache holds no valid copy of the data at address A; Processor-1 then sends a Read Invalidate message to the bus, and after receiving a Read Response plus the Invalidate Acknowledge messages from all other processors, it updates the entry's state to E (acquiring ownership), writes the data into the cache line, and updates the state to M.

    After receiving an Invalidate or Read Invalidate message, a processor must look up the cache entry for the memory address carried in the message. If the entry Processor-2 finds is not in state I, Processor-2 must set it to I, deleting its copy, and reply with an Invalidate Acknowledge on the bus. The Invalidate and Invalidate Acknowledge messages thus ensure that only one processor can write to a given memory address at a time, avoiding the inconsistencies that could arise if several processors updated the same data simultaneously. (A sketch of this write path follows below.)
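
To summarize the write path just described, here is a toy Java sketch. The enum and method names are invented for illustration; real MESI logic lives in hardware, not software:

    // Toy model of the MESI write path: which messages a processor must send
    // before writing, given the state of its cache entry. Illustrative only.
    enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    final class MesiWritePath {
        static String messagesBeforeWrite(MesiState state) {
            switch (state) {
                case MODIFIED:
                case EXCLUSIVE:
                    return "none (the processor already owns the data)";
                case SHARED:
                    return "Invalidate; wait for Invalidate Acknowledge from all other processors";
                case INVALID:
                    return "Read Invalidate; wait for a Read Response plus all Invalidate Acknowledges";
                default:
                    throw new AssertionError(state);
            }
        }

        public static void main(String[] args) {
            for (MesiState s : MesiState.values()) {
                // Once the required messages complete, the write lands in the
                // cache line and the entry's state becomes MODIFIED in every case.
                System.out.println(s + " -> " + messagesBeforeWrite(s) + "; final state: MODIFIED");
            }
        }
    }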

Write buffers and invalidate queues

  1. The MESI protocol addresses cache coherence, but under it a writing processor must wait for all the other processors to delete their copies and reply with Invalidate Acknowledge/Read Response messages before it can write the data into its cache. To mitigate this performance problem, write buffers and invalidate queues were introduced.

  2. Each processor has its own write buffer, which may contain multiple entries. A processor cannot read another processor's write buffer.

  3. If the cache entry is in state I (or S), the processor stores the data for the write operation (the data plus the target memory address) into a write-buffer entry and sends a Read Invalidate (or Invalidate) message. If the corresponding entry on every other processor is also in state I, a so-called write miss occurs, and the Read part of the request is served from main memory. Having written to the buffer, the processor does not wait for the Read Response/Invalidate Acknowledge messages from the other processors; it continues executing other instructions, and only once all those replies have arrived does it move the buffered data into the cache.

  4. Instead of deleting the data in the specified cache entry as soon as an Invalidate message arrives, a processor can store the message in its invalidate queue and immediately reply with an Invalidate Acknowledge, reducing the writer's wait time; it later deletes the copy by processing the messages in the invalidate queue. Some processors, such as x86 processors, have no invalidate queue.

Store forwarding

When a processor performs a read, the variable at the target memory address may have just been written, with the write still sitting in the write buffer and not yet propagated to the cache. So on a read, the processor first checks its write buffer for a matching entry and reads from it if one exists; only otherwise does it read from the cache. A processor cannot, however, read another processor's write buffer. A sketch of this lookup order follows.
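
Here the HashMaps stand in for the hardware structures; the model is purely illustrative:

    import java.util.HashMap;
    import java.util.Map;

    // Toy model of store forwarding: a read consults the processor's own
    // write buffer before falling back to the cache.
    final class StoreForwarding {
        private final Map<Long, Long> writeBuffer = new HashMap<>(); // pending stores
        private final Map<Long, Long> cache = new HashMap<>();

        void store(long address, long value) {
            writeBuffer.put(address, value); // a write lands in the buffer first
        }

        long load(long address) {
            Long pending = writeBuffer.get(address); // forward our own pending store...
            if (pending != null) return pending;
            return cache.getOrDefault(address, 0L);  // ...otherwise read the cache
        }

        public static void main(String[] args) {
            StoreForwarding p0 = new StoreForwarding();
            p0.store(0x10L, 42);
            System.out.println(p0.load(0x10L)); // 42: reads its own buffered store
        }
    }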

Memory reordering

Both write buffers and invalidate queues can cause memory reordering.

  1. Write buffers may cause StoreLoad reordering

    Processor 0          Processor 1
    X = 1;   // S1       Y = 1;   // S3
    r1 = Y;  // L2       r2 = X;  // L4

    As shown in the table above, X and Y are shared variables with initial value 0; r1 and r2 are local variables.

    When Processor 0 executes L2, if the result of S3 is still in Processor 1's write buffer, L2 reads the initial value of Y, 0. Similarly, when Processor 1 executes L4, if the result of S1 is still in Processor 0's write buffer, r2 reads the initial value of X, 0. From Processor 1's perspective, S1 has not yet happened; Processor 0 appears to have executed in the order L2→S1. This is StoreLoad reordering (a runnable sketch appears after this list).

  2. Write buffers may cause StoreStore reordering

    Processor 0             Processor 1
    data = 1;     // S1
    ready = true; // S2
                            while (!ready) continue; // L3
                            print(data);             // L4

    Here data and ready are shared variables with initial values 0 and false. Before execution, their cache entries on Processor 0 are in states S (or I) and E respectively, and their cache entries on Processor 1 are in states S and I respectively.

    1. When S1 executes, Processor 0 stores the result of S1 into its write buffer and then sends an Invalidate message to the other processors.

    2. Because Processor 0 holds ready exclusively, the result of S2 is stored directly into the cache.

    3. L3 successfully reads the new value of ready, true, through the cache coherence protocol.

    4. Because the result of S1 has not yet been synchronized from the write buffer to the cache, the value of data read by L4 is still the old value 0.

    From Processor 1's perspective, ready has changed while data still holds its original value; Processor 0 appears to have executed in the order S2→S1. This is StoreStore reordering.

  3. Invalidate queues may cause LoadLoad reordering

    Processor 0             Processor 1
    data = 1;     // S1
    ready = true; // S2
                            while (!ready) continue; // L3
                            print(data);             // L4

    Again data and ready are shared variables with initial values 0 and false. Before execution, their cache entries on Processor 0 are in states S and E respectively, and their cache entries on Processor 1 are in states S and I respectively.

    1. S1 modifies the data variable: Processor 0 first puts the result of S1 into its write buffer, then sends an Invalidate request carrying data's memory address to the bus. Processor 1 sniffs this request, replies with an Invalidate Acknowledge message, and places the request in its invalidate queue.

    2. Processor 0 holds ready exclusively, so for S2 it changes the value of ready to true directly in the cache.

    3. L3 obtains the new value of ready from Processor 0 through the cache coherence protocol; the while condition becomes false and execution proceeds to L4.

    4. Because the Invalidate request from step 1 is still sitting in Processor 1's invalidate queue, L4 can still read data directly from its own cache; but the value it reads, data = 0, is the stale initial value.

    From Processor 0's perspective, the data read by L4 on Processor 1 is an old value: Processor 1 appears to have executed in the order L4→L3. This is LoadLoad reordering.
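
Here is a runnable, deliberately racy Java sketch of the StoreLoad test from item 1. With plain (non-volatile) fields the outcome r1 == 0 && r2 == 0 is permitted by both the hardware and the Java memory model; whether and when it shows up is timing-dependent:

    // Dekker-style StoreLoad litmus test with plain (non-volatile) fields.
    public class StoreLoadTest {
        static int x, y, r1, r2;

        public static void main(String[] args) throws InterruptedException {
            for (long i = 0; i < 1_000_000; i++) {
                x = 0; y = 0;
                Thread t1 = new Thread(() -> { x = 1; r1 = y; }); // S1, L2
                Thread t2 = new Thread(() -> { y = 1; r2 = x; }); // S3, L4
                t1.start(); t2.start();
                t1.join(); t2.join();
                if (r1 == 0 && r2 == 0) { // both loads saw the pre-store values
                    System.out.println("StoreLoad reordering observed at iteration " + i);
                    return;
                }
            }
            System.out.println("Not observed this run (the race is timing-dependent)");
        }
    }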

Visibility problem

Visibility problems are caused by write buffers and invalidate queues, and the solution is to insert memory barriers when reading or writing shared variables.

Store barrier → flushes the contents of the write buffer to the cache, so that the latest data can be read by other processors.

Load barrier → processes the requests in the invalidate queue, so that the processor does not read a stale value. (A sketch using Java's explicit fences follows.)
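
Java exposes these two barriers (and others) directly through the static fence methods on java.lang.invoke.VarHandle (Java 9+). Below is a sketch of the data/ready example with explicit fences. Caveat: strictly speaking, the Java memory model also requires a stronger-than-plain read of ready for the spin loop to be guaranteed to terminate, so treat this as an illustration of barrier placement and prefer volatile in real code:

    import java.lang.invoke.VarHandle;

    // Barrier placement for the data/ready pattern, using explicit fences.
    public class FenceExample {
        static int data = 0;
        static boolean ready = false;

        static void writer() {
            data = 1;                 // S1
            VarHandle.releaseFence(); // store barrier: commit S1 before S2 becomes visible
            ready = true;             // S2
        }

        static void reader() {
            while (!ready) { Thread.onSpinWait(); } // L3
            VarHandle.acquireFence(); // load barrier: drain stale copies before L4
            System.out.println(data); // L4
        }

        public static void main(String[] args) throws InterruptedException {
            Thread r = new Thread(FenceExample::reader);
            r.start();
            writer();
            r.join();
        }
    }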

Basic memory barrier

  1. For each type of memory reordering it supports (LoadLoad, LoadStore, StoreLoad, and StoreStore reordering), a processor provides an instruction that disables that reordering. These instructions are called LoadLoad barriers, LoadStore barriers, StoreLoad barriers, and StoreStore barriers.

  2. A barrier prevents reordering across it. For example, a LoadStore barrier prevents read instructions before the barrier from being reordered with write instructions after it.

  3. LoadLoad barrier → resolves LoadLoad reordering (caused, as described above, by invalidate queues)

    The LoadLoad barrier is implemented by draining the Invalidate messages in the invalidate queue, which removes stale copies from the cache and thereby ensures the reordering cannot occur.

    (Figure: implementation of the load barrier)

  4. StoreStore barrier → resolves StoreStore reordering

    It is implemented by marking the entries currently in the write buffer and using the marks to enforce commit order: if, while performing a write, the processor finds marked entries in the write buffer, it writes the data into the write buffer rather than the cache, even if the corresponding cache entry's state is E or M. This ensures that writes before the barrier are committed to the cache before writes after the barrier.

  5. StoreLoad barrier → the basic barrier on many processors

    A StoreLoad barrier can substitute for any of the other barriers. Its main operations are flushing the write buffer to the cache and draining the invalidate queue, which also makes it the most expensive barrier.

Synchronization mechanism and memory barrier

  1. Acquire and Release barriers are composite barriers made up of basic barriers

    Acquire barrier → a combination of the LoadLoad barrier and the LoadStore barrier; it prevents any read before the barrier from being reordered with reads and writes after the barrier.

    Release barrier → a combination of the LoadStore barrier and the StoreStore barrier; it prevents any write after the barrier from being reordered with reads and writes before the barrier.

  2. volatile and memory barriers

    (Figure: barrier use for volatile writes)

    (Figure: barrier use for volatile reads)

    In practice, however, x86 processors (common PCs, i.e. Intel and AMD processors, including AMD's newer Zen architecture) permit only StoreLoad reordering (LoadLoad reordering, for instance, never happens), so on x86 the LoadLoad, LoadStore, and StoreStore barriers are empty (no-op) instructions. (A volatile version of the data/ready example appears after this list.)

  3. synchronized and memory barriers: like volatile, synchronized is built from the basic barriers, but synchronized guards whole blocks of code, so it does not distinguish reads from writes, requires more barriers, and is more expensive. On x86 processors, however, only the StoreLoad barrier ends up being a real instruction. (A sketch follows this list.)

    The virtual machine inserts a load barrier after the MonitorEnter instruction, before the start of the critical section, to ensure that other threads' updates to shared variables are synchronized into this processor's cache. It likewise inserts a store barrier after the MonitorExit instruction to ensure that the critical section's updates to shared variables are made visible.

    The virtual machine also inserts an acquire barrier after the MonitorEnter instruction and a release barrier before the MonitorExit instruction.

  4. Memory barrier rules for final fields

    The reordering rule for writing final fields requires the compiler to insert a StoreStore barrier after the final field write and before the constructor returns.

    Reordering rules for reading final fields require the compiler to insert a LoadLoad barrier before reading final fields.

    Since x86 processors do not reorder write-write operations, the StoreStore barrier required for writing final fields is omitted on x86. Likewise, since x86 processors do not reorder operations with an indirect dependency between them, the LoadLoad barrier required for reading final fields is also omitted on x86. In other words, on x86 processors, reads and writes of final fields insert no memory barriers at all! (Sketches for items 2 through 4 follow.)
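
Declaring ready volatile fixes the data/ready example from earlier. The comments mark where barriers are conceptually required; per the text, on x86 only the StoreLoad after the volatile write is a real instruction:

    // The data/ready example made correct with volatile.
    public class VolatileExample {
        static int data = 0;
        static volatile boolean ready = false;

        static void writer() {
            data = 1;
            // conceptually: StoreStore barrier before the volatile write
            ready = true;
            // conceptually: StoreLoad barrier after the volatile write
        }

        static void reader() {
            while (!ready) { Thread.onSpinWait(); }
            // conceptually: LoadLoad + LoadStore barriers after the volatile read
            System.out.println(data); // guaranteed to print 1
        }

        public static void main(String[] args) throws InterruptedException {
            Thread r = new Thread(VolatileExample::reader);
            r.start();
            writer();
            r.join();
        }
    }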
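
For synchronized, the barrier placement described in item 3 can be sketched as comments around a critical section (actual placement is up to the JIT compiler):

    // Conceptual barrier placement around a synchronized block.
    public class SyncBarriers {
        static final Object lock = new Object();
        static int shared;

        static void update() {
            synchronized (lock) { // MonitorEnter
                // load barrier + acquire barrier: other threads' updates become
                // visible before the critical section runs
                shared++;
                // release barrier + store barrier: this section's updates become
                // visible before the lock is released
            } // MonitorExit
        }

        public static void main(String[] args) {
            update();
            System.out.println(shared);
        }
    }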
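
And the classic final-field publication pattern from item 4; the conceptual StoreStore barrier after the final write is what guarantees that a reader who sees the reference also sees the initialized field:

    // Final-field publication: a racy reader may see holder == null,
    // but if it sees a non-null reference, h.f is guaranteed to be 42.
    public class FinalFieldExample {
        static class Holder {
            final int f;
            Holder() {
                f = 42;
                // conceptually: StoreStore barrier here, before the constructor returns
            }
        }

        static Holder holder; // published without synchronization

        public static void main(String[] args) throws InterruptedException {
            Thread writer = new Thread(() -> holder = new Holder());
            writer.start();
            writer.join(); // joined here only so this demo prints deterministically
            Holder h = holder;
            if (h != null) {
                // conceptually: LoadLoad barrier before reading the final field
                System.out.println(h.f); // never 0
            }
        }
    }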

Virtual machine memory barrier optimization

These optimizations include elision, merging, and so on. For example, for two consecutive volatile writes, the virtual machine adds only one StoreLoad barrier, after the last write. And since on x86 processors each MonitorExit already has StoreLoad semantics, a separate StoreLoad barrier is not needed there.
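
For example, for the two consecutive volatile writes in the following sketch, the text's claim is that the JIT keeps only the StoreLoad barrier after the second write:

    // Barrier merging: only one StoreLoad barrier is needed, after b = 2.
    public class BarrierElision {
        static volatile int a, b;

        static void twoVolatileWrites() {
            a = 1; // the StoreLoad barrier that would follow this write is elided
            b = 2; // a single StoreLoad barrier is emitted after this write
        }

        public static void main(String[] args) {
            twoVolatileWrites();
            System.out.println(a + ", " + b);
        }
    }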