A note on originality: all articles in this series are my own original work. Wherever I reference another article, I will mark it; if I have missed a citation, critique is welcome. If you find this article plagiarized somewhere online, please report it and open an issue on the GitHub repository. Thank you for your support!

This article draws on a large number of articles, documents, and papers. The subject is genuinely complicated, my level is limited, and my understanding may fall short in places; if you disagree with anything, please leave a comment. This series will be continuously updated, and questions, corrections, and omissions are welcome in the comments.

If you prefer the single-page version, please visit: The most hardcore Java new memory model analysis and experiments, single-page version (continuously updated with Q&A). If you prefer this split version, here is the table of contents:

  • 1. What is the Java memory model
  • 2. Atomic access and word tearing
  • 3. Core understanding of memory barriers (CPU + compiler)
  • 4. Java's new memory access modes and experiments
  • 5. JVM low-level memory barrier source code analysis

JMM related documents:

  • Java Language Specification Chapter 17
  • The JSR-133 Cookbook for Compiler Writers – Doug Lea
  • Using JDK 9 Memory Order Modes – Doug Lea

Memory barriers, CPU and memory model related:

  • Weak vs. Strong Memory Models
  • Memory Barriers: a Hardware View for Software Hackers
  • A Detailed Analysis of Contemporary ARM and x86 Architectures
  • Memory Model = Instruction Reordering + Store Atomicity
  • Out-of-Order Execution

x86 CPUs

  • x86 wiki
  • Intel® 64 and IA-32 Architectures Software Developer Manuals
  • Formal Specification of the x86 Instruction Set Architecture

ARM CPUs

  • ARM wiki
  • aarch64 Cortex-A710 Specification

Various understandings of consistency:

  • Coherence and Consistency

Aleksey's JMM:

  • Aleksey Shipilev – Don't misunderstand the Java Memory Model (Part 1)
  • Aleksey Shipilev – Don't misunderstand the Java Memory Model (Part 2)

Many Java developers use Java's concurrency and synchronization mechanisms, such as volatile, synchronized, and Lock, and many have also read Chapter 17 of the JLS, "Threads and Locks" (address: docs.oracle.com/javase/spec…), which covers synchronization, wait/notify, sleep and yield, and the memory model. But I believe most people are like me: on a first read it feels like watching a show from the outside; afterwards you know that the rules are what they are, but you have no clear sense of why they are specified this way and not another way. At the same time, combining the specification with the Hotspot implementation and interpretations of the Hotspot source code, we find that because of javac's static compilation optimizations and the JIT compilation optimizations of C1 and C2, the final behavior of the code does not quite match our understanding of it from the specification alone. Such inconsistencies lead to misunderstandings when we learn the Java Memory Model (JMM) and try to understand its design through actual code.

I myself keep trying to understand the Java memory model, rereading the JLS and the analyses of various experts. This series will sort out my personal insights from reading these specifications and analyses, together with some experiments using jcstress, to help everyone understand the Java memory model and the API abstractions after Java 9. Let me emphasize again: the starting point of the memory model's design is to abstract away the underlying details so that you do not have to care about them. It involves a great deal, my level is limited, and my understanding may not be fully in place; I will try to lay out the arguments and references for each point. Please do not take every view here on faith; if you disagree, please refute it with a concrete example and leave a comment.

5. Memory barriers

5.1. Why is a memory barrier needed

A memory barrier, also known as a memory fence or membar (for simplicity, these terms all mean the same thing here), is used to prevent instructions from being reordered (executed out of order).

So why do instructions get reordered in the first place? Mainly because of CPU reordering (which includes CPU memory reordering and CPU instruction reordering) and compiler reordering; memory barriers are used to prevent these reorderings. A memory barrier that constrains both the compiler and the CPU is generally called a hardware memory barrier; one that constrains only the compiler is called a software memory barrier. We will focus on CPU reordering here and briefly cover compiler reordering at the end.

5.2. The CPU memory is out of order

Starting from CPU caches and cache coherence protocols, let's analyze why CPUs reorder at all. We will assume a simple CPU model; please keep in mind that real CPUs are much more complex than the simple model presented here.

5.2.1. Simple CPU Model – starting point for CPU caching – reduce CPU Stall

As we will see, many designs in modern CPUs start from one goal: reducing CPU stalls. What is a CPU stall? A simple example: suppose the CPU must read data directly from memory (ignoring other structures such as CPU caches, the bus and bus events, and so on):

The CPU issues the read request and can do nothing else until memory responds; during this time the CPU is stalling. If the CPU always read directly from memory, each access would take a long time, possibly hundreds of instruction cycles, which means hundreds of cycles of the CPU stalling and doing nothing each time. To reduce this, caches are introduced: a cache is a small piece of memory that sits next to the processor, between the processor and main memory.

Here we do not care about multi-level caches or whether multiple CPUs share a cache; we simply consider the following architecture: when the CPU needs to read the value at an address, it first checks the cache. If the address is present, that is a hit, and the value is read directly; if it is absent, that is a miss. Similarly, when writing a value to an address that is already in the cache, there is no need to access memory. Most programs exhibit high locality:

  • If a processor reads or writes to a memory address, it is likely to read or write to the same address again soon.
  • If a processor reads or writes a memory address, it is likely that it will soon read or write nearby addresses as well.

Because of locality, a cache typically operates not on a single word at a time but on a group of adjacent words, called a cache line.

However, the existence of caches makes it harder to keep memory up to date: when one CPU needs to update the memory behind a cache line, it must invalidate the other CPUs' copies of that cache line. Cache coherence protocols were introduced to keep cache data consistent across CPUs.

5.2.2. Simple CPU model – a simple cache coherence protocol (real CPUs use more complex ones) – MESI

Modern cache coherence protocols and algorithms are complex; a cache line can have dozens of different states. We do not need to explore such complex algorithms here; instead we introduce a classic and simple cache coherence protocol, the four-state MESI protocol. MESI refers to the four states of a cache line:

  • Modified: the cache line has been modified and must eventually be written back to main memory. Until then, no other processor may cache this line.
  • Exclusive: the cache line has not been modified, and no other processor may have it loaded in its cache.
  • Shared: the cache line has not been modified, and other processors may have it loaded in their caches.
  • Invalid: the cache line holds no meaningful data.

As shown in our earlier CPU cache structure diagram, assuming all CPUs share the same bus, the following messages are sent on the bus:

  1. Read: contains the physical address of the cache line to be read.
  2. Read Response: contains the data requested by an earlier Read. The data may come from memory or from another cache; for example, if the requested line is in the Modified state in another cache, that cache must supply the line in its Read Response.
  3. Invalidate: contains the physical address of the cache line to invalidate. All other caches must remove this cache line and respond with an Invalidate Acknowledge message.
  4. Invalidate Acknowledge: the reply sent after removing the corresponding cache line.
  5. Read Invalidate: a combination of Read and Invalidate, containing the physical address of the cache line to be read. It both reads the line, requesting a Read Response in reply, and tells the other caches to remove the line and respond with an Invalidate Acknowledge.
  6. Writeback: contains the memory address and data to be written back. This message also allows a line in the Modified state to be evicted to make room for other data.

The relationship between cache line state transitions and these events:

This is just a sketch, and we will not go deeper into it: MESI is a very compact protocol, and when implemented there are many extra problems that MESI itself cannot solve. Going into detail would drag the reader into working out how the protocol stays correct in various corner cases, which MESI does not actually address. In practice, CPU coherence protocols are much more complex than MESI, but they are generally extensions of MESI.

A simple MESI example:

1. CPU A sends a Read to read data at address A. After receiving the Read Response, it stores the data into its cache and sets the corresponding cache line to Exclusive.

2. CPU B sends a Read for address A. CPU A detects the address conflict and responds with a Read Response. The cache line for address A is now loaded in both A's and B's caches, in the Shared state.

3. CPU B wants to write to A, so it sends an Invalidate message and waits for CPU A's Invalidate Acknowledge, after which its line's state becomes Exclusive. On receiving the Invalidate, CPU A sets the state of the cache line holding A to Invalid.

4. CPU B modifies the data stored in the cache line containing address A and sets the line's state to Modified.

5. CPU A sends a Read for address A. CPU B detects the address conflict and responds with a Read Response. The cache line for address A is again loaded in both A's and B's caches, in the Shared state.

As the MESI protocol shows, after sending an Invalidate message the current CPU must wait for the other CPUs' Invalidate Acknowledge responses. To avoid this stall, the Store Buffer was introduced.

5.2.3. Simple CPU model – avoid the stall of waiting for Invalidate Acknowledge – the Store Buffer

To avoid Stall, add a Store Buffer between the CPU and the CPU cache, as shown below:

With the Store Buffer, when the CPU wants to write it sends its Invalidate message without waiting for the Invalidate Acknowledges: the write is placed in the Store Buffer, and once all Invalidate Acknowledges have been received it is flushed from the Store Buffer into the corresponding cache line of the CPU's cache. But the Store Buffer brings a new problem:

Suppose we have two variables a and b that are not in the same cache line, both initially 0; a currently sits in a cache line of CPU A's cache, and b sits in a cache line of CPU B's cache:

Suppose CPU B wants to execute the following code:
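A minimal C sketch of that code, reconstructed from the step-by-step walkthrough below (this is the classic Store Buffer example from the Memory Barriers paper in the reference list above):

```c
/* shared variables a and b, both initially 0, in different cache lines */
a = 1;           /* a currently lives in CPU A's cache */
b = a + 1;       /* expects to see the a = 1 just written */
assert(b == 2);
```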

We certainly expect b to equal 2 at the end. But will that actually happen? Let's take a closer look at the steps:

1. CPU B runs a = 1:

(1) CPU B does not have a in its cache and needs to modify it, so it sends a Read Invalidate message.

(2) CPU B puts its modification of a (a = 1) into the Store Buffer.

(3) CPU A receives the Read Invalidate, marks the cache line holding a as Invalid, removes it from its cache, and responds with Read Response (a = 0) and Invalidate Acknowledge.

2. CPU B runs b = a + 1:

(1) CPU B receives CPU A's Read Response, in which a is still 0.

(2) CPU B stores the result of a + 1 (0 + 1 = 1) into the cache line that already contains b.

3. CPU B runs assert(b == 2), which fails.

The main cause of this error is that, when loading a value, we did not consider that the Store Buffer might hold a newer value than the cache. The fix is to add a step: when loading a value, the CPU first reads the latest value from its own Store Buffer if one is present (known as store forwarding). This guarantees that the example above ends with b equal to 2:

5.2.4. Simple CPU model – Avoid out-of-order execution from Store Buffer – memory barrier

Let's look at another example. Suppose we have two variables a and b, not in the same cache line, both initially 0, and that CPU A (whose cache holds the line containing b, in the Exclusive state) executes:
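A minimal C sketch of CPU A's code, reconstructed from the walkthrough below:

```c
/* CPU A; a and b both initially 0, in different cache lines */
a = 1;
b = 1;
```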

Suppose CPU B executes:
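And a sketch of CPU B's side, which spins until CPU A's write to b becomes visible:

```c
/* CPU B */
while (b == 0)
    continue;
assert(a == 1);
```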

If everything executed in program order, we would expect CPU B's assert(a == 1) to succeed. But consider the following execution flow:

1. CPU A executes a = 1:

(1) CPU A does not have a in its cache and needs to modify it, so it sends a Read Invalidate message.

(2) CPU A puts its modification of a (a = 1) into the Store Buffer.

2. CPU B runs while (b == 0) continue:

(1) CPU B does not have b in its cache, so it sends a Read message.

3. CPU A executes b = 1:

(1) CPU A's cache holds b in the Exclusive state, so it writes b = 1 directly into the cache line.

(2) After that, CPU A receives CPU B's Read message for b.

(3) CPU A responds with b = 1 from its cache, sends the Read Response, and changes the cache line's state to Shared.

(4) CPU B receives the Read Response and puts b into its cache.

(5) CPU B sees that b is now 1 and exits the loop.

4. CPU B executes assert(a == 1), which fails, because CPU A's change to a is still in its Store Buffer and has not become visible.

The CPU cannot prevent this kind of reordering automatically, because it cannot know which orderings the program relies on; instead it provides memory barrier instructions with which software tells the CPU to prevent the reordering, for example:
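CPU A's code with a barrier between the two stores, sketched with smp_mb(), the Linux kernel's full memory barrier primitive used in the referenced paper:

```c
/* CPU A */
a = 1;
smp_mb();   /* full barrier: marks/drains the Store Buffer before later stores */
b = 1;
```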

smp_mb() causes the CPU to flush the contents of the Store Buffer into the cache. With the memory barrier instruction added, the execution flow becomes:

1. CPU A executes a = 1:

(1) CPU A does not have a in its cache and needs to modify it, so it sends a Read Invalidate message.

(2) CPU A puts its modification of a (a = 1) into the Store Buffer.

2. CPU B runs while (b == 0) continue:

(1) CPU B does not have b in its cache, so it sends a Read message.

3. CPU A runs smp_mb():

(1) CPU A marks all entries currently in its Store Buffer (here, a = 1).

4. CPU A executes b = 1:

(1) CPU A's cache holds b in the Exclusive state, but because the Store Buffer contains a marked entry (a), it does not update the cache line directly; it puts b = 1 into the Store Buffer (as an unmarked entry) and sends an Invalidate message.

(2) After that, CPU A receives CPU B's Read message for b.

(3) CPU A responds with b = 0 from its cache, sends the Read Response, and changes the cache line's state to Shared.

(4) CPU B receives the Read Response and puts b into its cache.

(5) CPU B keeps looping, because it sees that b is still 0.

(6) CPU A receives the response to its earlier Read Invalidate (for a) and flushes the marked entry from the Store Buffer into its cache; that cache line is now in the Modified state.

(7) CPU B receives CPU A's Invalidate message (for b), invalidates its cache line holding b, and replies with Invalidate Acknowledge.

(8) CPU A receives the Invalidate Acknowledge and flushes b from the Store Buffer into its cache.

(9) CPU B tries to read b again, but b is no longer in its cache, so it sends a Read message.

(10) CPU A receives CPU B's Read message, sets the cache line holding b to Shared, and responds with a Read Response carrying b = 1 from its cache.

(11) CPU B receives the Read Response of b = 1 and puts it into a cache line in the Shared state.

5. CPU B sees that b = 1 and exits the while (b == 0) continue loop.

6. CPU B executes assert(a == 1):

(1) a is not in CPU B's cache, so CPU B sends a Read message.

(2) CPU A reads a = 1 from its cache and responds with a Read Response.

(3) CPU B receives a = 1, and assert(a == 1) succeeds.

The Store Buffer is usually small; if a CPU keeps writing, its Store Buffer can fill up, and then the CPU stalls again. Store Buffer entries are expected to be flushed into the CPU cache fairly quickly once the corresponding Invalidate Acknowledges arrive. However, the other CPUs may be too busy to respond quickly to Invalidate messages with Invalidate Acknowledges, which can leave the Store Buffer full and the CPU stalled. To address this, each CPU can be given an Invalidate Queue that buffers the Invalidate messages to be processed.

5.2.5. Simple CPU model – Decouple CPU Invalidate and Store Buffer – Invalidate Queues

With the Invalidate Queues, the CPU structure looks like this:

With an Invalidate Queue, a CPU can reply to an Invalidate message as soon as it has placed the message in the queue, instead of after actually processing it, so the sender's Store Buffer can be flushed much sooner. In addition, before sending an Invalidate for a cache line, a CPU must check its own Invalidate Queue for a pending Invalidate of the same line; if there is one, it must first process that queued message.

Similarly, the Invalidate Queue introduces out-of-order execution.

5.2.6. Simple CPU models – further out of order due to Invalidate Queues – require memory barriers

Suppose we have two variables a and b, not in the same cache line, both initially 0. CPU A (whose cache holds a in the Shared state and b in the Exclusive state) executes:
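The writer side, sketched in C as before:

```c
/* CPU A; a is Shared and b is Exclusive in its cache */
a = 1;
smp_mb();
b = 1;
```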

CPU B (whose cache holds a in the Shared state) executes:
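And the reader side, so far without a barrier (sketch):

```c
/* CPU B; a is Shared in its cache */
while (b == 0)
    continue;
assert(a == 1);
```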

1. CPU A executes a = 1:

(1) CPU A's cache holds a (Shared). CPU A puts its modification of a (a = 1) into the Store Buffer and sends an Invalidate message.

2. CPU B runs while (b == 0) continue:

(1) CPU B does not have b in its cache, so it sends a Read message.

(2) CPU B receives CPU A's Invalidate message, places it in its Invalidate Queue, and replies immediately.

(3) CPU A receives the reply to its Invalidate message and flushes a = 1 from the Store Buffer into its cache.

3. CPU A runs smp_mb():

(1) CPU A's Store Buffer has already been flushed into the cache, so the barrier passes straight through.

4. CPU A executes b = 1:

(1) CPU A's cache holds b in the Exclusive state, so it updates the cache line directly.

(2) CPU A receives CPU B's Read message, updates the state of the cache line holding b to Shared, and sends a Read Response containing the latest value of b.

(3) CPU B receives the Read Response; the value of b is 1.

5. CPU B exits the loop and starts to execute assert(a == 1):

(1) Since the Invalidate for a is still sitting unprocessed in CPU B's Invalidate Queue, CPU B still sees a = 0 in its own cache, and the assert fails.

Therefore, to deal with this reordering, CPU B's code also needs a memory barrier; here the barrier must wait not only for the CPU's Store Buffer to drain but also for its Invalidate Queue to be fully processed. With the barrier added, CPU B executes the following code:
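A sketch of CPU B's code with the barrier added:

```c
/* CPU B */
while (b == 0)
    continue;
smp_mb();   /* waits until the Invalidate Queue has been fully processed */
assert(a == 1);
```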

With that, in step 5 above, CPU B exits the loop and, because of the barrier, waits until its Invalidate Queue has been processed before executing assert(a == 1): (1) the pending Invalidate for a is applied, so a is no longer in CPU B's cache; (2) CPU B therefore sends a Read message and sees the latest value, a = 1; (3) assert(a == 1) succeeds.

5.2.7. Simple CPU model – More granular memory barrier

As mentioned earlier, in our CPU model the smp_mb() memory barrier instruction does two things: it waits for the CPU's Store Buffer to drain, and it waits for the CPU's Invalidate Queue to be fully processed. But the barriers in the code executed by CPU A and CPU B do not each need both operations: CPU A's barrier only needs to wait for the Store Buffer, and CPU B's barrier only needs to wait for the Invalidate Queue:

As a result, CPUs expose finer-grained barrier instructions: an instruction that waits for the CPU's Store Buffer to drain is called a write memory barrier, and an instruction that waits for the CPU's Invalidate Queue to be processed is called a read memory barrier.
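In Linux kernel terms these are smp_wmb() and smp_rmb(); rewriting our example with the finer-grained barriers gives a sketch like this:

```c
/* CPU A: only needs its Store Buffer drained before the store to b */
a = 1;
smp_wmb();
b = 1;

/* CPU B: only needs its Invalidate Queue processed before the load of a */
while (b == 0)
    continue;
smp_rmb();
assert(a == 1);
```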

5.2.8. Simple CPU model – Summary

We started from a simple CPU architecture and built it up step by step, describing where the need for memory barriers comes from. This can be summarized in the following simple flow:

  1. Every time the CPU accesses memory directly it stalls. To reduce CPU stalls, CPU caches were added.
  2. CPU caches create cache inconsistency across multiple CPUs, so MESI, a simple cache coherence protocol, is used to coordinate the caches of different CPUs.
  3. Some mechanisms of the MESI protocol were then optimized to further reduce CPU stalls:
  4. Putting updates into the Store Buffer means that the Invalidate messages an update sends out no longer stall the CPU waiting for Invalidate Acknowledges.
  5. The Store Buffer introduced instruction (code) reordering, which calls for a memory barrier instruction that forces the current CPU to stall until its Store Buffer has been fully flushed. This barrier instruction is generally called a write barrier.
  6. To speed up the flushing of the Store Buffer, the Invalidate Queue was added, which lets Invalidate messages be acknowledged immediately; this in turn introduced further reordering, which calls for a barrier that waits for the Invalidate Queue to be processed, generally called a read barrier.

5.3. CPU instructions are out of order

CPU instructions can also be executed out of order; here we will only mention one relatively common mechanism: instruction-level parallelism.

5.3.1. Increase CPU execution efficiency – CPU Pipeline mode

Modern CPUs execute instructions in pipeline mode. Because the CPU consists of different components, the execution of an instruction can be split into stages, each involving different components. This decouples the components so that each can work independently, without waiting for one instruction to finish completely before handling the next.

Execution is generally divided into the following stages: instruction fetch (Instruction Fetch, IF), instruction decode (Instruction Decode, ID), execute (Execute, EXE), memory access (Memory, MEM), and write-back (Write-back, WB).

5.3.2. Further reduce CPU Stall – CPU Out of order Execution Pipeline

Sometimes an instruction's data is not ready yet, as in the following example:
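A minimal sketch consistent with the description below; the variable names are illustrative:

```c
b = a;   /* must wait for a to be loaded into a register */
c = 1;   /* independent of a, so the CPU can execute this first */
```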

If the data a is not ready, that is, not yet loaded into a register, we do not need to stall waiting for a to load; we can execute c = 1 first. The CPU's out-of-order execution pipeline is based on this idea:

As shown in the figure, CPU execution stages are divided into:

  1. Instruction Fetch: instructions are fetched in batches, followed by instruction analysis: detecting loops and dependencies, branch prediction, and so on.
  2. Instruction Decode: instruction decoding, much the same as in the simple pipeline above.
  3. Reservation stations: an instruction whose inputs are ready is dispatched to a Functional Unit (FU); one whose inputs are not ready watches the bypass network, and when its data becomes ready a signal is sent back to the reservation station so the instruction can be dispatched to the FU.
  4. Functional Unit: executes the instruction.
  5. Reorder Buffer: keeps instructions in their original program order; instructions are added at one end when they are dispatched and removed from the other end when they complete. In this way instructions finish (retire) in the order in which they were dispatched.

This structure guarantees that writes enter the Store Buffer in the same order as the original instructions, but loads and computation happen in parallel. We have already seen that in our simple multi-CPU architecture, MESI plus the Store Buffer and Invalidate Queue can cause stale values to be read, and out-of-order parallel loading and processing makes this worse. Moreover, this design can only detect dependencies between instructions within a single thread and keep the execution order of such dependent instructions correct; it cannot perceive dependencies between instructions across threads, nor what batched instruction fetch and branch prediction will do to them. So reordering still occurs, and it can likewise be prevented with the memory barriers described above.

5.4. Actual CPUs

Actual CPUs come in many varieties, with different architecture designs and different cache coherence protocols, and therefore different kinds of reordering; describing each one individually would be far too complicated. Instead, a standard taxonomy is used to abstract reordering on different CPUs: the first operation is M, the second operation is N, and we ask whether the two can be reordered with each other. (This is very similar to Doug Lea's description of the JMM; in fact, the Java memory model is based on this design as well.) Refer to this table:

Let’s start by saying what each column means:

  1. Loads Reordered After Loads: the first operation is a load, the second is also a load; can they be reordered?
  2. Loads Reordered After Stores: the first operation is a load, the second is a store; can they be reordered?
  3. Stores Reordered After Stores: the first operation is a store, the second is also a store; can they be reordered?
  4. Stores Reordered After Loads: the first operation is a store, the second is a load; can they be reordered?
  5. Atomic Instructions Reordered With Loads: one operation is an atomic instruction (a group of operations executed as a unit, for example an instruction that modifies two words at once) and the other is a load; can the two be reordered with each other?
  6. Atomic Instructions Reordered With Stores: one operation is an atomic instruction and the other is a store; can the two be reordered with each other?
  7. Dependent Loads Reordered: if a load depends on the result of another load, can the two be reordered?
  8. Incoherent Instruction Cache/Pipeline: whether the instruction cache/pipeline can get out of sync with memory, so that instruction execution appears reordered (for example, with freshly modified code).

An example is the x86 architecture used in our everyday PCs, which exhibits only "Stores Reordered After Loads" and "Incoherent Instruction Cache/Pipeline". The first four columns correspond exactly to Java's four memory barriers: LoadLoad, LoadStore, StoreStore, and StoreLoad. This is why on x86 only StoreLoad needs to be implemented with a real barrier instruction.

5.5. Compiler out of order

Besides CPU reordering, there is also reordering at the software level, caused by compiler optimization; some of the ideas are actually similar to the CPU pipeline optimizations described above. The compiler analyzes your code and reorders statements that do not depend on each other; statements with no mutual dependency can be rearranged freely. But again, the compiler can only think and analyze from a single-threaded perspective; it does not know the dependencies and relationships of your program in a multi-threaded environment. Another simple example: suppose there is no CPU reordering, and there are two variables x = 0 and y = 0. Thread 1 executes:
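A sketch of thread 1, with the variable names from the text:

```c
/* thread 1; x and y both initially 0 */
x = 1;
y = 1;
```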

Thread 2 executes:
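And a plausible sketch of thread 2; the exact original snippet is not shown, but per the assertion discussed below it reads x after observing y:

```c
/* thread 2 */
while (y == 0)
    continue;
assert(x == 1);
```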

It is possible for thread 2's assert to fail, because the compiler may reorder x = 1 with y = 1.

Compiler reordering can be prevented by adding compiler barrier statements (whose exact form differs across compilers and operating systems). For example, thread 1 executes:
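A sketch with a compiler barrier between the two stores; barrier() here stands for the common GCC/Linux-kernel idiom, an assumption about the exact statement the text has in mind:

```c
/* a common GCC compiler-barrier definition, as used in the Linux kernel */
#define barrier() __asm__ __volatile__("" ::: "memory")

x = 1;
barrier();   /* compiler barrier only: forbids compile-time reordering of the stores */
y = 1;
```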

This way, x = 1 and y = 1 will not be reordered with each other by the compiler.

Also, in practice the memory barriers we use are hardware memory barriers, that is, barriers implemented by hardware CPU instructions, and such hardware barriers also implicitly act as compiler barriers. Compiler barriers, often called software memory barriers, only constrain the compiler. For example, volatile in C++ is different from volatile in Java: C++ volatile merely prevents compiler reordering, but it cannot prevent CPU reordering.

That covers where reordering comes from and what memory barriers do. Next, we will get down to business and begin our journey through the Java 9+ memory model. Before that, one more note: why it is best not to write your own ad-hoc code to verify JMM conclusions, but to use a professional framework such as jcstress to test them.
