A personal convention: I declare that all my articles are original work. Wherever another article is referenced, it will be marked; if I have missed any, corrections are welcome. If you find this article plagiarized elsewhere online, please report it and feel free to open an issue in the GitHub repository. Thank you for your support!

This article draws on a large number of articles, documents, and papers, and the subject is genuinely complicated. My level is limited and my understanding may fall short in places; if you disagree with anything, please leave a comment. This series will continue to be updated, and questions, errors, and omissions pointed out in the comments are welcome.

JMM related documents:

  • Java Language Specification Chapter 17
  • The JSR-133 Cookbook for Compiler Writers – Doug Lea’s
  • Using JDK 9 Memory Order Modes – Doug Lea’s

Memory barriers, CPU and memory model related:

  • Weak vs. Strong Memory Models
  • Memory Barriers: a Hardware View for Software Hackers
  • A Detailed Analysis of Contemporary ARM and x86 Architectures
  • Memory Model = Instruction Reordering + Store Atomicity
  • Out-of-Order Execution

X86 cpus

  • x86 wiki
  • Intel® 64 and IA-32 Architectures Software Developer Manuals
  • Formal Specification of the x86 Instruction Set Architecture

ARM CPU

  • ARM wiki
  • aarch64 Cortex-A710 Specification

Various understandings of consistency:

  • Coherence and Consistency

Aleksey's articles on the JMM:

  • Aleksey Shipilev – Don’t misunderstand the Java Memory Model (Part 1)
  • Aleksey Shipilev – Don’t misunderstand the Java Memory Model (part 2)

Many Java developers use Java's concurrency and synchronization mechanisms, such as volatile, synchronized, and Lock. Many have also read Chapter 17 of the JLS, Threads and Locks (docs.oracle.com/javase/spec…), which covers synchronization, wait/notify, sleep & yield, and the memory model. But I believe most people are like me: after a first reading you only know that the rules are what they are, without a clear understanding of why they are specified that way and not some other way. At the same time, when you combine this with the Hotspot implementation and read the Hotspot source code, you find that because of javac's static compilation optimizations and the JIT optimizations of C1 and C2, the behavior of the final code does not quite match the understanding we form from the specification. Such inconsistencies lead to misunderstandings when we try to learn the Java Memory Model (JMM) and its design through actual code. I myself keep trying to understand the Java memory model, rereading the JLS and the analyses of various experts. This series will organize my personal insights from reading those specifications and analyses, together with some JCStress experiments, to help you understand the Java memory model and the API abstractions from Java 9 onward. I must still emphasize that the starting point of the memory model's design is to abstract things away so that you do not have to care about the underlying details; it involves a great deal, my level is limited, and my understanding may not be fully in place. I will try to give the reasoning and references for each claim; please do not take everything here on faith, and if you disagree, please refute it with concrete examples in the comments.

1. Understand “specification” and “implementation”

First of all, following Aleksey Shipilev's way of framing things, we must clearly distinguish Specification from Implementation. The Java Language Specification (JLS) is the specification of the Java language; every JDK implementation that can compile and run the Java language must implement the functionality it specifies. But an actual implementation, such as the Hotspot JVM, is a concrete implementation, and there is some distance between the specification and the implementation. First, consider this code:
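The original snippet is not reproduced here; a minimal sketch of the kind of method being discussed (method name and values are mine) is:

    int calculate() {
        int x = 1;
        int y = 2;
        int result = x + y;
        return result;
    }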

The actual code that HotSpot eventually compiles and runs with JIT optimisation and CPU optimisation is:
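A sketch of the JIT-optimized equivalent (written as Java for illustration; the real output is machine code):

    int calculate() {
        return 3; // constants folded; the local variables x, y, result no longer exist
    }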

Putting the constant 3 into a register and returning it behaves the same as the original code, so it is reasonable to omit the useless local-variable operations. You might object: "No, when I set a breakpoint and step to this point, I can see the local variables x, y, result." That is because when you run the JVM in debug mode, the JIT is not enabled by default and the code is simply interpreted, so you can see the local variables. In normal execution, however, if the method is hot, those local variables will not actually exist after JIT optimization.

Another example: Hotspot has a lock coarsening optimization (which we will test later). Consider this code:
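A sketch of the kind of code being discussed, assuming two adjacent synchronized blocks on the same monitor (field names are mine):

    void writer() {
        synchronized (this) {
            x = 1;
        }
        synchronized (this) {
            y = 1;
        }
    }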

According to the JLS, the operations x = 1 and y = 1 cannot be reordered across the monitor boundaries. But Hotspot can merge the two adjacent synchronized blocks into one, roughly as sketched below, and within the merged region x = 1 and y = 1 can in fact be reordered; we will verify this later.
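A sketch of what lock coarsening may effectively turn it into:

    void writer() {
        synchronized (this) {  // the two adjacent blocks are merged into one
            x = 1;
            y = 1;             // inside the merged region the two stores may be reordered
        }
    }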

The actual behavior also varies from one JVM implementation to another, and even the same JVM implementation may behave differently on different operating systems and hardware. Take this example:
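A sketch of the kind of example meant here, assuming a shared non-volatile long written by one thread and read by another (names are mine):

    class LongTearing {
        long x;                       // non-volatile long, initially 0

        void writer() { x = -1L; }    // a single 64-bit write

        long reader() { return x; }   // r1: on a 32-bit JVM this read may be torn
    }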

Normally, r1 should take only one of the two values {-1, 0}. However, execution on some 32-bit JVMs can be problematic; in an x86_32 environment, for example, you may observe results such as {-1, 0, -4294967296, 4294967295}.

So, if we tried to cover the underlying JMM design and the Hotspot implementation and the JIT optimizations all at once, there would be far too much involved, layer upon layer, and I really cannot cover it all, nor can I guarantee that my understanding is 100% accurate. If we dug too deep into the HotSpot implementation, we would drift from the theme of this series: what we mainly care about is the design of the Java memory model specification itself, and then the minimal set of points we need to know and pay attention to in actual use. That is what this series sorts out. To make sure this minimal set is accurate, a lot of actual test code is included, and you can run it to check whether the results match your understanding of the JMM.

2. What is the memory model

Any language that accesses memory needs a memory model describing how memory may be accessed: in what ways I may write to memory, in what ways I may read from it, and how the different ways of writing and reading behave. Of course, if your program is simply serial, what you read is always the most recently written value, and you don't really need a memory model. It is usually in a concurrent environment that a memory model is needed.

The Java memory model specifies the reasonable values that can be observed when memory is read or written in different ways in a Java multithreaded environment.

The Java memory model is also defined in terms of which reorderings of Java instructions are allowed and which are not; this is, in fact, the main content of the Java memory model in Chapter 17 of the JLS. Constraining reordering is one way of constraining the reasonable values that can be observed in memory, that is, of defining the possible result set for a given piece of source code.

Let's warm up with two simple introductory examples: atomic access and word tearing.

3. Atomic access

Atomic access: a single write or read of a field is itself atomic and indivisible. One thing you probably don't pay much attention to: according to JLS Chapter 17, writes and reads of non-volatile long and double fields are not guaranteed to be atomic. Thanks to the fact that most systems today are 64-bit, these accesses are atomic in practice most of the time; but the Java specification does not guarantee it, and on a 32-bit system it cannot be guaranteed. Quoting directly from JLS Chapter 17:

For the purposes of the Java programming language memory model, a single write to a non-volatile long or double value is treated as two separate writes: one to each 32-bit half. This can result in a situation where a thread sees the first 32 bits of a 64-bit value from one write, and the second 32 bits from another write. Writes and reads of volatile long and double values are always atomic.

In short, a write to a non-volatile long or double may be performed as two separate 32-bit writes, so it is not atomic; reads and writes of volatile long and double values are always atomic.

To illustrate our atomicity here, I’ll cite an example from JCStress:
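A sketch in the style of the JCStress test being referred to (class name and outcome descriptions are mine; it assumes JCStress's J_Result holder for a single long):

    import org.openjdk.jcstress.annotations.*;
    import org.openjdk.jcstress.infra.results.J_Result;

    @JCStressTest
    @Outcome(id = "0",  expect = Expect.ACCEPTABLE, desc = "Read before the write")
    @Outcome(id = "-1", expect = Expect.ACCEPTABLE, desc = "Read after the write")
    @Outcome(id = {"-4294967296", "4294967295"}, expect = Expect.ACCEPTABLE_INTERESTING, desc = "Torn read: only half of the 64-bit value is seen")
    @State
    public class PlainLongAtomicity {
        long x;                                      // non-volatile long

        @Actor public void writer()           { x = -1L; }
        @Actor public void reader(J_Result r) { r.r1 = x; }
    }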

We ran the code here using a Java 8 32bit JVM (32-bit machines are no longer supported after Java 9), and the result was:

As you can see, it’s not just -1 and 0 that we specify in our code, but also some intermediate results.

4. Word tearing

Word tearing means that updating one field, or one element in an array, affects the value of another field or another array element. For example, suppose a processor cannot write a single byte and its smallest write unit is an int; updating an element of a byte array on such a processor by reading the whole containing int, updating the corresponding byte, and writing the whole int back would be problematic. Java guarantees there is no word tearing: fields and array elements are independent, and updating one field or element cannot affect the read or update of any other field or element.

To illustrate what word tearing looks like, an unfortunate example is the (thread-unsafe) BitSet. The abstraction of a BitSet is a set of bits (0/1); the underlying implementation is a long array, with each long holding 64 bits, so updating a single bit means reading the containing long, modifying the bit, and writing the whole long back. The interface updates bit by bit, but the underlying update is at the granularity of a long (because the storage is a long array). Obviously, without synchronization, concurrent updates of different bits in the same long can step on each other, which is exactly a word-tearing problem:
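A sketch of such a test (class name and outcomes are mine): two actors set different bits of the same BitSet, whose underlying long word is updated with an unsynchronized read-modify-write, so one update can be lost.

    import java.util.BitSet;
    import org.openjdk.jcstress.annotations.*;
    import org.openjdk.jcstress.infra.results.ZZ_Result;

    @JCStressTest
    @Outcome(id = "true, true", expect = Expect.ACCEPTABLE, desc = "Both updates survived")
    @Outcome(id = {"true, false", "false, true"}, expect = Expect.ACCEPTABLE_INTERESTING, desc = "One update was lost: word tearing on the underlying long")
    @State
    public class BitSetWordTearing {
        final BitSet bits = new BitSet();

        @Actor public void actor1() { bits.set(0); }
        @Actor public void actor2() { bits.set(1); }

        @Arbiter public void check(ZZ_Result r) {
            r.r1 = bits.get(0);
            r.r2 = bits.get(1);
        }
    }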

The result is:

This unfortunate example shows what word tearing is; Java itself guarantees there is none. If, instead of the BitSet above, we update a boolean array (one element per flag), the result can only ever be true:
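A sketch of the boolean-array version (names are mine); because Java forbids word tearing, updating one element can never affect its neighbor:

    import org.openjdk.jcstress.annotations.*;
    import org.openjdk.jcstress.infra.results.ZZ_Result;

    @JCStressTest
    @Outcome(id = "true, true", expect = Expect.ACCEPTABLE, desc = "No word tearing: both elements are updated independently")
    @Outcome(id = {"true, false", "false, true"}, expect = Expect.FORBIDDEN, desc = "Would mean word tearing")
    @State
    public class BooleanArrayNoTearing {
        final boolean[] flags = new boolean[2];

        @Actor public void actor1() { flags[0] = true; }
        @Actor public void actor2() { flags[1] = true; }

        @Arbiter public void check(ZZ_Result r) {
            r.r1 = flags[0];
            r.r2 = flags[1];
        }
    }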

The result will only ever be true, true.

Next we enter a more painful chapter: memory barriers. But don't worry too much. From my personal experience, memory barriers are hard to understand because most material on the internet does not start from Java, which deliberately shields you from the low-level details, so a direct explanation is hard to convince yourself with and it is easy to guess and then misunderstand. So this article will not start from Doug Lea's four Java memory barrier abstractions (LoadLoad, StoreStore, LoadStore, and StoreLoad), which are still in use today but whose design is somewhat dated for modern CPUs; nowadays acquire, release, and fence are used more. I hope that by extracting the easier-to-digest parts of several articles on the underlying details, I can help you understand memory barriers better and more easily.

5. Memory barriers

5.1. Why is a memory barrier needed

A Memory Barrier, also known as a Memory Fence, and a membar, for simplicity’s sake, all mean the same thing. A memory barrier is used to prevent the ordering of instructions out of order (or reordering).

So why are there instructions out of order? This is mainly due to CPU disorder (CPU disorder also includes CPU memory disorder and CPU instructions disorder) and compiler disorder. Memory barriers can be used to prevent this disorder. Memory barriers are generally referred to as hardware memory barriers if they apply to both the compiler and the CPU, and software memory barriers if they apply only to the compiler. We’ll focus on CPU out-of-order here, but we’ll cover compiler reordering briefly at the end.

5.2. The CPU memory is out of order

Starting with CPU caches and cache consistency protocols, we started to analyze why there is disorder in the CPU. Let’s assume a simple CPU model, but keep in mind that the actual CPU is much more complex than the simple CPU model listed here

5.2.1. Simple CPU Model – starting point for CPU caching – reduce CPU Stall

As we’ll see here, many modern CPU designs start with reducing CPU Stall. What is a CPU Stall? To take a simple example, suppose the CPU needs to read data directly from memory (ignoring other structures such as CPU caches, buses and bus events, etc.) :

The CPU issues a read request and cannot do anything else until memory responds. This part of the CPU is stalling. If the CPU is always reading directly from memory, it takes a long time for the CPU to access memory directly. It may take hundreds of instruction cycles, that is, hundreds of instruction cycles each time the CPU is stalling and doing nothing. It is generally necessary to introduce several caches to reduce Stall: caches are small pieces of memory next to the processor, located between the processor and memory.

For now we do not care about multi-level caches or whether multiple CPUs share a cache; we simply consider the following architecture. When the value at an address needs to be read, the cache is consulted first: if it is present, that is a hit and the value is read directly; if it is absent, that is a miss. Similarly, if a value needs to be written to an address that is already in the cache, memory does not need to be accessed. Most programs show high locality:

  • If a processor reads or writes to a memory address, it is likely to read or write to the same address again soon.
  • If a processor reads or writes a memory address, it is likely that it will soon read or write nearby addresses as well.

Because of locality, a cache typically operates not on a single word at a time but on a group of adjacent words, called a cache line.

However, the existence of caches makes updating memory harder: when one CPU needs to update the memory behind a cache line, it must invalidate the other CPUs' copies of that cache line. Cache Coherence Protocols were introduced to keep the cached data consistent across CPUs.

5.2.2. Simple CPU Model – a simple cache consistency protocol (real cpus use more complex than this) – MESI

Modern cache coherence protocols and algorithms are complex, and a cache line can have dozens of different states. We do not need to explore such complex algorithms here; we will just introduce a classic and simple cache coherence protocol, the 4-state MESI protocol. MESI refers to the four states of a cache line:

  • Modified: the cache line has been modified and must eventually be written back to main memory. Until then, no other processor may cache this line.
  • Exclusive: the cache line has not been modified, but other processors may not load it into their caches.
  • Shared: the cache line has not been modified and may be loaded into other processors' caches.
  • Invalid: the cache line contains no meaningful data.

As in our earlier CPU cache structure diagram, assuming all CPUs share the same bus, the following messages are sent on the bus:

  1. Read: contains the physical address of the cache line to be read.
  2. Read Response: contains the data requested by a previous Read. The data may come from memory or from another cache; for example, if the requested line is in the Modified state in another cache, the Read Response must come from that cache.
  3. Invalidate: contains the physical address of the cache line to invalidate. All other caches must remove this cache line and respond with an Invalidate Acknowledge message.
  4. Invalidate Acknowledge: the response sent after removing the corresponding cache line.
  5. Read Invalidate: a combination of Read and Invalidate, containing the physical address of the cache line to be read. It both reads the cache line (requesting a Read Response) and asks the other caches to remove the line and respond with Invalidate Acknowledge.
  6. Writeback: contains the memory address and the data to write back. It also allows a cache line in the Modified state to be evicted to make room for other data.

The relationship between cache row state transitions and events:

This is only a diagram and we won't dig into it, because MESI is a very compact protocol: when implementing it there are many additional problems that MESI itself does not solve, and going into them would drag the reader into thinking about how to keep the protocol correct in various corner cases that MESI does not actually address. In practice, CPU coherence protocols are much more complex than MESI, but they are generally extensions of it.

A simple MESI example:

1. CPU A sends a Read to read the data at address A. It receives the Read Response, stores the data into its cache, and sets the corresponding cache line to Exclusive.

2. CPU B sends a Read for address A. CPU A detects the address conflict and responds with a Read Response; the cache line for address A is now loaded in both A's and B's caches in the Shared state.

3. CPU B is about to write to A, so it sends an Invalidate and, after receiving CPU A's Invalidate Acknowledge, changes its line's state to Exclusive. When CPU A receives the Invalidate, it sets the state of the line containing A to Invalid.

4. CPU B modifies the data in the cache line containing address A and sets the cache line state to Modified.

5. CPU A sends a Read for address A. CPU B detects the address conflict and responds with a Read Response; the cache line for address A is again loaded in both caches in the Shared state.

As you can see from the MESI protocol, sending an Invalidate message requires the current CPU to wait for an Invalidate Acknowledge from another CPU. To avoid this Stall, the Store Buffer was introduced

5.2.3. Simple CPU model – avoid the stall of waiting for Invalidate Acknowledge – the Store Buffer

To avoid Stall, add a Store Buffer between the CPU and the CPU cache, as shown below:

With a Store Buffer, the CPU sends the Invalidate message, puts the update into the Store Buffer, and continues without waiting for the Invalidate Acknowledges. Once all Invalidate Acknowledges have been received, the update is moved from the Store Buffer into the corresponding cache line in the CPU's cache. But the Store Buffer brings new problems:

Let’s say we have two variables, a and B, that are not in the same cache line and both start with 0, A is now in the cache line of CPU A, and B is now in the cache line of CPU B:

Suppose CPU B wants to execute the following code:
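A sketch of the code meant here (Java-flavored; a and b are the shared variables described above):

    class StoreBufferProblem {
        int a = 0, b = 0;

        void cpuB() {
            a = 1;            // a is not in CPU B's cache; the new value goes to the store buffer
            b = a + 1;        // without store forwarding, the read of a may still see 0
            assert b == 2;    // can fail in this simple CPU model
        }
    }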

We definitely expect b to be equal to 2 at the end. But is it really going to happen? Let’s take a closer look at the following steps:

1. CPU B executes a = 1:

(1) CPU B does not have a in its cache and needs to modify it, so it issues a Read Invalidate message.

(2) CPU B puts its modification (a = 1) into the Store Buffer.

(3) CPU A receives the Read Invalidate, marks the cache line containing a as Invalid, and responds with Read Response (a = 0) and Invalidate Acknowledge.

2. CPU B executes b = a + 1:

(1) CPU B receives the Read Response from CPU A, in which a is still 0.

(2) CPU B stores the result of a + 1 (0 + 1 = 1) into the cache line that already contains b.

3. CPU B executes assert(b == 2), which fails.

The root cause of this error is that when loading into the cache we did not consider the newer value still sitting in the Store Buffer. So we add a step: when reading, also consult the Store Buffer for a newer value. With that in place, the example above is guaranteed to end with b equal to 2:

5.2.4. Simple CPU model – Avoid out-of-order execution from Store Buffer – memory barrier

Let's look at another example. Suppose we have two variables a and b that are not in the same cache line, both starting at 0. Suppose CPU A (whose cache contains the line for b in the Exclusive state) executes the cpuA() code in the sketch below, and CPU B executes the cpuB() code:
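A sketch of both CPUs' code (Java-flavored):

    class StoreBufferReordering {
        int a = 0, b = 0;

        void cpuA() {
            a = 1;    // goes into CPU A's store buffer (cache line not owned yet)
            b = 1;    // hits CPU A's Exclusive cache line and becomes visible first
        }

        void cpuB() {
            while (b == 0) { /* spin */ }
            assert a == 1;   // may fail: a's update may still be sitting in CPU A's store buffer
        }
    }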

If everything executed in program order, we would expect CPU B's assert(a == 1) to succeed. But consider the following execution flow:

1. CPU A executes a = 1:

(1) CPU A does not have a in its cache and needs to modify it, so it issues a Read Invalidate message.

(2) CPU A puts its modification (a = 1) into the Store Buffer.

2. CPU B executes while (b == 0) continue:

(1) CPU B does not have b in its cache, so it issues a Read message.

3. CPU A executes b = 1:

(1) CPU A already has the cache line for b in the Exclusive state, so it updates b = 1 directly in its cache.

(2) After that, CPU A receives CPU B's Read message for b.

(3) CPU A responds with b = 1 from its cache, sends the Read Response, and changes the cache line state to Shared.

(4) CPU B receives the Read Response and puts b into its cache.

(5) CPU B sees that b is now 1 and exits the loop.

4. CPU B executes assert(a == 1), which fails, because a's update is still sitting in CPU A's Store Buffer and has not become visible.

The CPU cannot prevent this kind of reordering automatically, but it provides memory barrier instructions so that software can tell the CPU where reordering must be prevented, for example:
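A sketch of the barrier placement; VarHandle.fullFence() is used here only as a Java-level stand-in for the hardware smp_mb() instruction discussed in the text:

    import java.lang.invoke.VarHandle;

    class StoreBufferWithBarrier {
        int a = 0, b = 0;

        void cpuA() {
            a = 1;
            VarHandle.fullFence();   // stand-in for smp_mb(): the store buffer must drain before later stores
            b = 1;
        }

        void cpuB() {
            while (b == 0) { /* spin */ }
            assert a == 1;
        }
    }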

smp_mb() causes the CPU to flush the contents of its Store Buffer into the cache before later stores take effect. With the memory barrier instruction added, the execution flow becomes:

1. CPU A executes a = 1:

(1) CPU A does not have a in its cache and needs to modify it, so it issues a Read Invalidate message.

(2) CPU A puts its modification (a = 1) into the Store Buffer.

2. CPU B executes while (b == 0) continue:

(1) CPU B does not have b in its cache, so it issues a Read message.

3. CPU A executes smp_mb():

(1) CPU A marks all the entries currently in its Store Buffer.

4. CPU A executes b = 1:

(1) CPU A has the cache line for b in the Exclusive state, but because there is a marked entry (a) in the Store Buffer, it does not update the cache directly; it puts b = 1 into the Store Buffer and sends an Invalidate message.

(2) After that, CPU A receives A Read message from CPU B about B.

(3)CPU A responds to b = 0 in the cache, sends A Read Response message, and changes the cache line status to Shared

(4)CPU B receives the Read Response message and puts B into the cache

(5)CPU B keeps looping because CPU B sees that B is still 0

(6) CPU A receives the response (Invalidate Acknowledge) to its earlier Read Invalidate for a, and flushes the marked entry (a = 1) from the Store Buffer into the cache; that cache line is now in the Modified state.

(7) CPU B receives CPU A's Invalidate message for b, invalidates its cache line for b, and replies with Invalidate Acknowledge.

(8) CPU A receives the Invalidate Acknowledge and flushes b from the Store Buffer into its cache.

(9) CPU B reads b again, but b is no longer in its cache, so it sends a Read message.

(10)CPU A receives A Read message from CPU B, sets the cache line status of B to shared, and returns A Read Response of B = 1 in the cache

(11)CPU B receives a “Read Response”, which indicates that B = 1 and puts it in the cache line with the state of “shared”

5.CPU B knows that B = 1 and exits the while (B == 0) continue loop

6. CPU B executes assert(a == 1): (1) a is not in CPU B's cache, so it sends a Read message; (2) CPU A reads a = 1 from its cache and responds with a Read Response; (3) CPU B executes assert(a == 1) successfully.

The Store Buffer is usually small, and if it fills up the CPU still stalls. Store Buffer entries are expected to drain into the cache fairly quickly once the corresponding Invalidate Acknowledges arrive, but the other CPUs may be too busy to respond quickly to Invalidate messages, which can leave the Store Buffer full and the CPU stalled. Therefore, an Invalidate Queue is introduced on each CPU to buffer the Invalidate messages that remain to be processed.

5.2.5. Simple CPU model – Decouple CPU Invalidate and Store Buffer – Invalidate Queues

With the Invalidate Queues, the CPU structure looks like this:

With an Invalidate Queue, a CPU can place an incoming Invalidate into the queue and acknowledge it immediately, so the sending CPU can flush its Store Buffer much sooner. In addition, before sending an Invalidate for a cache line, a CPU must check its own Invalidate Queue for a pending Invalidate of the same line; if there is one, it must process it first.

Similarly, the Invalidate Queue introduces out-of-order execution.

5.2.6. Simple CPU models – further out of order due to Invalidate Queues – require memory barriers

Suppose we have two variables a and b that are not in the same cache line, both starting at 0. Suppose CPU A (whose cache contains the line for a in the Shared state and the line for b in the Exclusive state) executes the cpuA() code in the sketch below, and CPU B (whose cache contains the line for a in the Shared state) executes the cpuB() code:
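A sketch of the two CPUs' code, again with VarHandle.fullFence() standing in for smp_mb():

    import java.lang.invoke.VarHandle;

    class InvalidateQueueProblem {
        int a = 0, b = 0;

        void cpuA() {
            a = 1;
            VarHandle.fullFence();   // smp_mb(): wait for CPU A's store buffer to drain
            b = 1;
        }

        void cpuB() {
            while (b == 0) { /* spin */ }
            assert a == 1;           // may still fail: the Invalidate for a may sit unprocessed in CPU B's invalidate queue
        }
    }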

1. CPU A executes a = 1:

(1) a is in CPU A's cache in the Shared state, so CPU A puts its modification (a = 1) into the Store Buffer and sends an Invalidate message.

2. CPU B executes while (b == 0) continue:

(1) CPU B does not have b in its cache, so it issues a Read message.

(2) CPU B receives CPU A's Invalidate message, places it into its Invalidate Queue, and acknowledges immediately.

(3) CPU A receives the acknowledgement of its Invalidate and flushes the entry from the Store Buffer into its cache.

3.CPU A runs smp_MB () :

(1) CPU A has flushed the Store Buffer into the CPU cache, so it passes directly

4.CPU A runs b = 1.

(1) CPU A has the cache line for b in the Exclusive state, so it updates b = 1 directly in its cache.

(2)CPU A receives A Read message from CPU B, updates the cache line status of B to Shared, and then sends A Read Response containing the latest value of B

(3)CPU B receives a Read Response with the value of B being 1

5.CPU B exits the loop and starts to execute assert(a == 1)

(1) Since the Invalidate message of A is still not processed in the Invalidate queue, CPU B still sees a = 0, and the assert fails

Therefore, to fix this reordering, we also need to put a memory barrier into the code executed by CPU B; here the memory barrier must wait not only for the CPU's Store Buffer to drain but also for its Invalidate Queue to be processed. With the barrier added, CPU B executes the following code:
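A sketch of CPU B's code with the barrier added, replacing cpuB() in the previous sketch (fullFence() again standing in for smp_mb()):

    void cpuB() {
        while (b == 0) { /* spin */ }
        VarHandle.fullFence();   // smp_mb(): process the invalidate queue before reading a
        assert a == 1;
    }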

Thus, in step 5 above, when CPU B exits the loop, the barrier first waits for its Invalidate Queue to be fully processed before executing assert(a == 1): (1) the pending Invalidate for a is processed, so a's cache line is invalidated; (2) a is therefore no longer in the cache, CPU B sends a Read message and sees the latest value a = 1, and the assert succeeds.

5.2.7. Simple CPU model – More granular memory barrier

As we mentioned earlier, in our CPU model the smp_mb() memory barrier instruction does two things: wait for the Store Buffer to drain and wait for the Invalidate Queue to be processed. However, for the barriers in the code executed by CPU A and CPU B above, both operations are not needed at the same time: CPU A's barrier only needs to wait for its Store Buffer to drain, and CPU B's barrier only needs to wait for its Invalidate Queue to be processed.

As a result, CPUs provide finer-grained barrier instructions: the instruction that waits for the CPU's Store Buffer to drain is called a write memory barrier, and the instruction that waits for the CPU's Invalidate Queue to be processed is called a read memory barrier.

5.2.8. Simple CPU model – Summary

Here we start with a simple CPU architecture, step by step, to describe some simple CPU architecture and why memory barriers are needed, can be summarized as the following simple flowchart:

  1. Each time a CPU accesses the memory directly, the CPU is stalling. To reduce CPU Stall, CPU caching was added.
  2. CPU caching creates cache inconsistencies between multiple cpus, so MESI is a simple CPU cache consistency protocol to coordinate cache consistency between different cpus
  3. Some mechanisms in the MESI protocol were optimized to further reduce CPU Stall:
  4. By putting updates into the Store Buffer, the Invalidate messages emitted by the update do not have to be stalling for an Invalidate Response.
  5. The Store Buffer introduces instructions (code) out of order, requiring memory barrier instructions that force the current CPU Stall to wait until all of the Store Buffers are flushed. This memory barrier instruction is commonly referred to as a write barrier.
  6. To speed up the draining of the Store Buffer, the Invalidate Queue was added; it introduces yet another kind of reordering, which requires a memory barrier instruction that forces the CPU to wait until its Invalidate Queue has been processed. This memory barrier instruction is commonly referred to as a read barrier.

5.3. CPU instructions are out of order

CPU instructions can also be executed out of order, but we will only mention a relatively common one – instruction parallelization.

5.3.1. Increase CPU execution efficiency – CPU Pipeline mode

Modern CPUs execute instructions in an instruction pipeline. Because the CPU has different functional components, the execution of one instruction can be divided into stages handled by different components; this decouples the components so that each can keep working without waiting for the previous instruction to finish completely before the next one starts.

Execution is generally divided into the following stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EXE), Memory access (MEM), and Write-back (WB).

5.3.2. Further reduce CPU Stall – CPU Out of order Execution Pipeline

Sometimes an instruction's data is not yet ready, as in the following example:
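A sketch of such a situation (names are mine):

    class IndependentInstructions {
        int a, b, c;

        void run() {
            b = a + 1;   // must wait until a has been loaded (e.g. after a cache miss)
            c = 1;       // independent of a: the CPU can execute this first instead of stalling
        }
    }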

If data A is not ready and has not been loaded into the register, then we do not need to Stall and wait for loading A. We can do c = 1 first. The CPU Out of order execution Pipeline is based on this idea:

As shown in the figure, CPU execution stages are divided into:

  1. Instructions Fetch: a batch of Instructions are fetched in batches, and instruction analysis is carried out to analyze the cycles and dependencies, branch prediction, etc
  2. Instruction Decode: Instruction Decode, much the same as the previous pipeline mode
  3. Reservation stations: if an instruction's inputs are ready, it is dispatched to a Functional Unit (FU) for processing; if not, the reservation station monitors the bypass network, and when the data becomes ready it signals so that the instruction can be dispatched to the FU.
  4. Functional Unit: processes the instruction.
  5. Reorder Buffer: instructions are kept in their original order; they are added to one end of the list when dispatched and removed from the other end when completed, so that instructions retire in the order they were dispatched.

This structure ensures that the Store Buffer is written in the same order as the original instructions, but loads and computation happen in parallel. We have already seen that in our simple multi-CPU architecture the MESI cache, the Store Buffer, and the Invalidate Queue can prevent the latest values from being read, and out-of-order parallel loading and processing makes this worse. Moreover, this design can only detect dependencies between instructions within the same thread and keep such dependent instructions in order; it cannot perceive dependencies between instructions of different threads in a multi-threaded program, nor can batched instruction fetch and branch prediction. So reordering still happens, and it can also be avoided with the memory barriers described above.

5.4. Actual CPU

Real CPUs vary widely, with different architectures and different cache coherence protocols, and thus different kinds of reordering; describing each individually would be too complicated. So a standard set of categories is used to describe reordering on different CPUs (the first operation is M, the second operation is N: can the two be reordered? This is very similar to Doug Lea's description of the JMM; in fact the Java memory model is also designed around it). Refer to this table:

Let’s start by saying what each column means:

  1. Loads Reordered After Loads: the first operation is read, the second operation is read, whether it is out of order.
  2. Loads Reordered After Stores: the first operation is read, the second operation is write, whether it is out of order.
  3. Stores Reordered After Stores: The first operation is written, the second operation is also written, whether the order will be out of order.
  4. Stores Reordered After Loads: the first operation is write and the second operation is read, whether it is out of order.
  5. Atomic Instructions Reordered With Loads: whether an atomic instruction (a group of operations that take effect as one, such as an instruction that modifies two words at once) and a read can be reordered with each other.
  6. Atomic Instructions Reordered With Stores: whether an atomic instruction (a group of operations that take effect as one, such as an instruction that modifies two words at once) and a write can be reordered with each other.
  7. Dependent Loads Reordered: whether a read that depends on the result of another read can be reordered before it.
  8. Incoherent Instruction Cache/Pipeline: whether the instruction cache/pipeline can be out of sync with stores to the instruction stream (i.e. whether self-modifying code needs explicit synchronization).

An example is the x86 architecture commonly used in our own PCs: only Stores Reordered After Loads and Incoherent Instruction Cache/Pipeline occur. The first four categories map directly to the four Java memory barriers LoadLoad, LoadStore, StoreLoad, and StoreStore, which is why in an x86 environment only the StoreLoad barrier needs an actual CPU instruction.

5.5. Compiler out of order

In addition to CPU reordering, at the software level there is reordering caused by compiler optimization, and some of its ideas are quite similar to the CPU pipeline optimizations above: the compiler analyzes your code and reorders statements that do not depend on each other. But again, the compiler can only think and analyze from a single-threaded perspective; it does not know the dependencies and relationships of your program in a multi-threaded environment. Another simple example: suppose there is no CPU reordering, two variables x = 0 and y = 0, and thread 1 and thread 2 execute the code in the sketch below:
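A sketch of the two threads (the exact code of thread 2 is not shown in the original text; this is a plausible shape):

    class CompilerReordering {
        int x = 0, y = 0;

        void thread1() {
            x = 1;
            y = 1;   // the compiler may emit this store before x = 1
        }

        void thread2() {
            while (y == 0) { /* spin */ }
            assert x == 1;   // may fail if the compiler reordered the two stores
        }
    }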

It is possible for thread 2's assert to fail, because the compiler may reorder x = 1 and y = 1.

Compiler reordering can be avoided by inserting compiler barrier statements, which differ between compilers and operating systems (in GCC-style C, for example, an empty asm statement with a memory clobber). If thread 1 places such a compiler barrier between x = 1 and y = 1, then x = 1 and y = 1 will no longer be reordered by the compiler.

In practice, when we talk about memory barriers we usually mean hardware memory barriers, i.e. barriers implemented by CPU instructions, and such hardware memory barriers also implicitly act as compiler barriers. A compiler barrier, often called a software memory barrier, only constrains the compiler. For example, volatile in C++ is different from volatile in Java: C++ volatile only prevents compiler reordering, but it cannot prevent CPU reordering.

That’s where the disorder comes from, and what the memory barrier does. Next, we’ll get down to business and begin our journey through the Java 9+ memory model. Before we do that, one more thing to note: Why is it best not to write your own code to verify some of the JMM’s conclusions, but to use a professional framework to test them

6. Why is it best not to write your own code to verify some of the JMM’s conclusions

As we know from the previous series of analyses, the problem of out-of-order programs is complex. Assume that a piece of code, without any restrictions, all possible output results are the complete set shown in the following figure:

Under the constraints of the Java memory model, the possible results are limited to a subset of all out-of-order results:

Under the constraints of the Java memory model, CPU disorder varies on different CPU architectures. Some scenarios have CPU disorder and some do not, but all are within the scope of the JMM, so it is reasonable that all possible result sets are restricted to different subsets of the JMM:

Under the constraints of the Java memory model, compilers on different operating systems produce differently ordered code for the JVM itself, and the underlying system calls are defined differently, so the same Java code may behave slightly differently on different operating systems; but all of this stays within the limits of the JMM, so the possible result sets are again different subsets:

Finally, the underlying code is executed differently under different execution methods and JIT compilation, which further results in the fragmentation of the result set:

So, if you write your own code and try it on your own computer and operating system, the result set you get is only one subset of what the JMM allows, and there will probably be allowed out-of-order results that you never see. Also, some reorderings occur only rarely or only after JIT optimization has kicked in, so writing your own ad-hoc experiment is really not recommended.

So what should you do? Use jcstress, the OpenJDK framework for testing concurrency visibility. It cannot simulate different CPU architectures and operating systems for you, but it does rule out problems such as only exercising one execution mode (interpreted, C1, C2) or testing with too little stress. A corresponding JCStress code example is attached to all subsequent chapters for you to use.

7. Progressively stronger visibility guarantees and the corresponding Java 9+ memory model APIs

Here we follow Aleksey's approach and summarize the different levels of Java memory visibility guarantees and the corresponding APIs. In Java 9+, ordinary access to ordinary variables (non-volatile, non-final fields) is called Plain access. Plain access has no barriers around the accessed address (barriers required by particular GCs, such as the pointer barriers needed by generational GC, are not considered here; they exist only at the GC level and have no effect on visibility), and all of the reordering described earlier can happen to it. So what stronger guarantees does the Java 9+ memory model define, and which APIs correspond to them?

7.1. Coherence and Opaque

In the context of cache coherence protocols, Coherence means the consistency of a single memory location. But Coherence does not translate cleanly and is easily confused with Consistency, so for the following terms I will stick to Doug Lea's and Aleksey's definitions.

So what is coherence here? A simple example: suppose an object field int x starts at 0; one thread executes the writer code in the sketch below, and another thread executes the reader code (r1 and r2 are local variables):
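A sketch of the two threads:

    class CoherenceExample {
        int x = 0;

        void writer() {
            x = 1;
        }

        void reader() {
            int r1 = x;
            int r2 = x;
            // the interesting result is r1 = 1, r2 = 0
        }
    }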

Under the Java memory model, possible outcomes include:

  1. r1 = 1, r2 = 1
  2. r1 = 0, r2 = 1
  3. r1 = 1, r2 = 0
  4. r1 = 0, r2 = 0

The third result is the interesting one: understood in program order, we first saw x = 1 and then saw x go back to 0. From the earlier analysis we know this can happen, for example because the compiler reorders the two reads. If we do not want this third result, the property we need is Coherence.

Coherence’s definition, and I quote:

The writes to the single memory location appear to be in a total order consistent with program order.

That is, writes to individual memory locations appear to occur in an overall order consistent with program order. If x changes from 0 to 1 globally, each thread will only see x change from 0 to 1, but not from 1 to 0.

As mentioned earlier, the Java memory model does not guarantee Coherence for Plain reads. But if you run the test code above, you will not get the third result, because Hotspot's semantic analysis treats the two loads of x as dependent on the same location and limits this reordering.

That’s why I mentioned in the previous chapter why it’s best not to write your own code to verify some of the JMM’s conclusions. Although the limitations of the Java memory model allow a third result of 1, 0, this is not possible with this example.

Here we trick the Java compiler into creating this disorder by using an awkward example:

Let’s not go too far into how this works, but directly into the results:

The out-of-order result does appear, and if you run the example yourself, you will see that it occurs only after the JIT C2-compiled actor2 method is executed.

So how do you avoid this reordering? Using volatile would certainly work, but volatile is a heavier operation than we need; Opaque access is enough. Opaque disables the relevant Java compiler optimizations but does not involve any memory barriers, much like volatile in C++. Let's test it:
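A sketch of the Opaque version using VarHandle (class and handle names are mine):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import org.openjdk.jcstress.annotations.*;
    import org.openjdk.jcstress.infra.results.II_Result;

    @JCStressTest
    @Outcome(id = "1, 0", expect = Expect.FORBIDDEN, desc = "Coherence would be violated")
    @Outcome(id = {"0, 0", "0, 1", "1, 1"}, expect = Expect.ACCEPTABLE, desc = "All other results")
    @State
    public class OpaqueCoherence {
        int x;
        static final VarHandle X;
        static {
            try {
                X = MethodHandles.lookup().findVarHandle(OpaqueCoherence.class, "x", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        @Actor public void writer() {
            X.setOpaque(this, 1);
        }

        @Actor public void reader(II_Result r) {
            r.r1 = (int) X.getOpaque(this);
            r.r2 = (int) X.getOpaque(this);
        }
    }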

Again we focus on the ACCEPTABLE_INTERESTING, FORBIDDEN, and UNKNOWN outcomes: with Opaque access the coherence-violating result no longer appears.

7.2. Causality and Acquire/Release

On top of Coherence, we generally need Causality in certain scenarios

By now you will have come across two common terms, happens-before and synchronizes-with. We are not going to start from these two somewhat obscure concepts (we will not explain them in this chapter) but from an example. Suppose an object field int x starts at 0 and another field int y also starts at 0, and the two fields are not in the same cache line (the JCStress framework used later pads the cache line for us automatically). One thread executes the thread1 code in the sketch below, and another thread executes the thread2 code (r1 and r2 are local variables):
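A sketch of the two threads:

    class CausalityExample {
        int x = 0, y = 0;

        void thread1() {
            x = 1;
            y = 1;
        }

        void thread2() {
            int r1 = y;
            int r2 = x;
            // the interesting result is r1 = 1, r2 = 0
        }
    }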

This example is very similar to the out-of-order analysis we used earlier in the CPU cache. In the Java memory model, the possible results are:

  1. r1 = 1, r2 = 1
  2. r1 = 0, r2 = 1
  3. r1 = 1, r2 = 0
  4. r1 = 0, r2 = 0

Similarly, the third result is the interesting one: the second thread sees the update to y but does not see the update to x. This was analyzed in detail in the CPU cache section above. From that analysis, to rule out the third result we need to add memory barriers, namely a write barrier in thread 1 and a read barrier in thread 2, as in the following sketch:
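A sketch with the barriers added; storeStoreFence() and loadLoadFence() are used here as Java-level stand-ins for the write and read barriers discussed in the CPU section:

    import java.lang.invoke.VarHandle;

    class CausalityWithBarriers {
        int x = 0, y = 0;

        void thread1() {
            x = 1;
            VarHandle.storeStoreFence();   // "write barrier": earlier stores drain before y = 1
            y = 1;
        }

        void thread2() {
            int r1 = y;
            VarHandle.loadLoadFence();     // "read barrier": process pending invalidations before reading x
            int r2 = x;
            // r1 = 1, r2 = 0 can no longer be observed
        }
    }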

To recap: thread 1 executes a write barrier after x = 1 and before y = 1 to ensure that everything in the Store Buffer has been flushed into the cache, so that the store y = 1 cannot become visible while earlier updates are still hidden in the Store Buffer. Thread 2 executes a read barrier after int r1 = y to process everything in its Invalidate Queue, ensuring there is no stale data left in its cache. Thus, if thread 2 sees the update to y, it must also see the update to x.

Let’s take it a step further: We think of the write barrier and the subsequent Store (y = 1) as packing the previous update and sending it out at this point, and the read barrier and the previous Load (int R1 = y) as a receiving point, where the outgoing packet is opened and read in if it is received. Therefore, as shown below:

At the transmit point, all results up to and including the transmit point are packed up. If, by the time the receive point's code executes, the packet has been received, then all instructions after the receive point can see everything in the packet, i.e. everything at and before the transmit point. Causality (sometimes called Causal Consistency) has different meanings in different contexts; here we only mean: you can define a series of writes, and if a read sees the last of those writes, then all reads after that read will see that write and all writes before it. This is a partial order, not a total order; a more precise definition comes in a later section.

In Java, Causality is not guaranteed for Plain access and Opaque access, because Plain does not have any memory barrier, Opaque only has compiler barrier.

First, Plain:

The result is:

Then Opaque:

Note: as we saw earlier, x86 CPUs keep most accesses in order by themselves, and we will see later exactly which of those ordering guarantees provide Causality here; as a result, on x86 Opaque access already appears causally consistent, as shown below (AMD64 is an x86 implementation). However, if we switch to a more weakly ordered CPU, we can see that Opaque access does not guarantee causal consistency. Here is my result on aarch64 (an ARM implementation):

Also, there is an interesting point, which is that all the out-of-order occurs when C2 is compiled.

So how do we ensure Causality? Again, we do not need to reach for a heavy operation like volatile; Release/Acquire mode is enough. Release/Acquire guarantees Coherence + Causality. Release and acquire must occur in pairs (one acquire matches one release): the release can be seen as the transmit point mentioned above and the acquire as the receive point. We can then implement the code as follows:
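A sketch of a Release/Acquire version with VarHandle (names and outcome descriptions are mine):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import org.openjdk.jcstress.annotations.*;
    import org.openjdk.jcstress.infra.results.II_Result;

    @JCStressTest
    @Outcome(id = "1, 0", expect = Expect.FORBIDDEN, desc = "Causality would be violated")
    @Outcome(id = {"0, 0", "0, 1", "1, 1"}, expect = Expect.ACCEPTABLE, desc = "All other results")
    @State
    public class ReleaseAcquireCausality {
        int x, y;
        static final VarHandle Y;
        static {
            try {
                Y = MethodHandles.lookup().findVarHandle(ReleaseAcquireCausality.class, "y", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        @Actor public void thread1() {
            x = 1;                    // plain store, packed into the "packet"
            Y.setRelease(this, 1);    // release store: the transmit point
        }

        @Actor public void thread2(II_Result r) {
            r.r1 = (int) Y.getAcquire(this);   // acquire load: the receive point
            r.r2 = x;                          // plain load
        }
    }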

Then, continuing on the aARCH64 machine, the result is:

As you can see, with Release/Acquire, Causality is guaranteed. Note that the transmit point and the receive point must be chosen correctly. For example, the two placements described next are wrong (see the sketch after them):

Example 1: the transmit point only packs up the updates before it; the update x = 1 comes after the transmit point, so it is not in the packet, and the 1, 0 result can still appear.

Example 2: the packet is opened at the receive point, so only reads after it are guaranteed to see the packet's contents; the read of x here comes before the receive point, so it may not see the update in the packet, and the 1, 0 result can still appear.
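A sketch of the two incorrect placements just described, reusing the hypothetical class from the previous sketch (only the changed methods are shown):

    // Example 1: the release store is issued before x = 1,
    // so x = 1 is not inside the "packet" and r1 = 1, r2 = 0 can still appear.
    @Actor public void thread1() {
        Y.setRelease(this, 1);
        x = 1;
    }

    // Example 2: x is read before the acquire load,
    // so that read is not covered by the "packet" and r1 = 1, r2 = 0 can still appear.
    @Actor public void thread2(II_Result r) {
        r.r2 = x;
        r.r1 = (int) Y.getAcquire(this);
    }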

For this reason, let’s take a look at Doug Lea’s Java memory barrier design to see which memory barriers designed in Java are used. In Doug Lea’s very early and classic article, he introduced the Java memory model and its memory barrier design, proposing four barriers:

1.LoadLoad

If you have two completely unrelated reads (loads) with no dependency between them (so they could be executed out of order), you can prevent their reordering with a LoadLoad barrier placed between them: Load(x); LoadLoad; Load(y) ensures that Load(y) will not be executed before Load(x).

2.LoadStore

If there is a read (Load) followed by a completely unrelated write (Store) (so they could be executed out of order), they can be kept in order with a LoadStore barrier: Load(x); LoadStore; Store(y) ensures that Store(y) will not be executed before Load(x).

3.StoreStore

If there are two completely unrelated writes (stores) with no dependency between them (so they could be executed out of order), they can be kept in order with a StoreStore barrier: Store(x); StoreStore; Store(y) ensures that Store(y) will not be executed before Store(x).

4.StoreLoad

If there is a write (Store) followed by a completely unrelated read (Load) (so they could be executed out of order), they can be kept in order with a StoreLoad barrier: Store(x); StoreLoad; Load(y) ensures that Load(y) will not be executed before Store(x).

So how are Release/Acquire implemented with these memory barriers? We can deduce it from the earlier abstraction. The release is a store, and everything before it must be packed up, so neither earlier loads nor earlier stores may move past it; therefore LoadStore and StoreStore barriers must be placed before the release store. Similarly, the acquire is a load, and everything after it must be able to see the packet's contents, so neither later loads nor later stores may move before it; therefore LoadLoad and LoadStore barriers must be placed after the acquire load.

However, as we will see in the next chapter, this four-barrier design is somewhat dated (because of how CPUs and the C++ language have evolved), and internally the JVM uses acquire, release, and fence more often. Acquire and release are essentially Release/Acquire as described above. The relationship of these three to the traditional four-barrier design is:

Now that we know the Release/Acquire memory barriers: why does x86 not show this reordering even without them? Refer back to the earlier CPU reordering table:

From it we know that x86 does not reorder Store with Store, Load with Load, or Load with a later Store, so Causality is naturally guaranteed.

7.3. Consensus and Volatile

Finally we come to the familiar volatile, which is essentially Release/Acquire plus a guarantee of Consensus. Consensus means that all threads see memory updates in the same order, i.e. there is one global order of memory updates that every thread observes. For example: suppose an object field int x starts at 0 and int y also starts at 0, and the two fields are not in the same cache line (the JCStress framework pads the cache line for us). The two threads execute the code in the sketch below:
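A sketch of the two threads:

    class ConsensusExample {
        int x = 0, y = 0;
        int r1, r2;

        void thread1() {
            x = 1;
            r1 = y;
        }

        void thread2() {
            y = 1;
            r2 = x;
        }
        // the interesting result is r1 = 0, r2 = 0
    }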

Under the Java memory model, there are also four possible outcomes:

  1. r1 = 1, r2 = 1
  2. r1 = 0, r2 = 1
  3. r1 = 1, r2 = 0
  4. r1 = 0, r2 = 0

The fourth result is the interesting one: it violates Consensus, because the two threads see the updates in different orders (the first thread read y as 0, so it believes the x update happened before the y update, while the second thread read x as 0, so it believes the y update happened before the x update). Without reordering, x and y could not both be read as 0, because each thread writes before it reads. But, as with everything shown so far, reordering makes the fourth result possible. Can Release/Acquire prevent it? No. If access to x and y is in Release/Acquire mode, thread 1 may execute:
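A sketch of thread 1 in Release/Acquire mode (reusing hypothetical VarHandles X and Y for the two fields):

    @Actor public void thread1(II_Result r) {
        X.setRelease(this, 1);              // release: barriers are placed before this store
        r.r1 = (int) Y.getAcquire(this);    // acquire: barriers are placed after this load
        // nothing sits between the release store and the following acquire load,
        // so the two can be reordered
    }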

There is no memory barrier between x = 1 and int r1 = y.

Similarly, thread 2 might execute:

Or:

Thus, we may see a fourth result. Let’s test the code:
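A sketch of the full Release/Acquire test (names are mine):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import org.openjdk.jcstress.annotations.*;
    import org.openjdk.jcstress.infra.results.II_Result;

    @JCStressTest
    @Outcome(id = "0, 0", expect = Expect.ACCEPTABLE_INTERESTING, desc = "Both threads read 0: Consensus violated")
    @Outcome(id = {"0, 1", "1, 0", "1, 1"}, expect = Expect.ACCEPTABLE, desc = "Other interleavings")
    @State
    public class ReleaseAcquireDekker {
        int x, y;
        static final VarHandle X, Y;
        static {
            try {
                MethodHandles.Lookup l = MethodHandles.lookup();
                X = l.findVarHandle(ReleaseAcquireDekker.class, "x", int.class);
                Y = l.findVarHandle(ReleaseAcquireDekker.class, "y", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        @Actor public void thread1(II_Result r) {
            X.setRelease(this, 1);
            r.r1 = (int) Y.getAcquire(this);
        }

        @Actor public void thread2(II_Result r) {
            Y.setRelease(this, 1);
            r.r2 = (int) X.getAcquire(this);
        }
    }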

The test results are:

To guarantee Consensus, we only need to ensure that thread 1's code and thread 2's code are not reordered against each other, by adding a StoreLoad barrier to the existing barriers; that is, thread 1 executes:
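A sketch of thread 1 with the extra barrier, reusing the hypothetical test class above; VarHandle.fullFence() is used because Java exposes no pure StoreLoad fence:

    @Actor public void thread1(II_Result r) {
        X.setRelease(this, 1);
        VarHandle.fullFence();              // adds the StoreLoad ordering between the store and the load
        r.r1 = (int) Y.getAcquire(this);
    }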

Thread 2 executes:
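And thread 2, symmetrically:

    @Actor public void thread2(II_Result r) {
        Y.setRelease(this, 1);
        VarHandle.fullFence();
        r.r2 = (int) X.getAcquire(this);
    }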

This prevents the reordering, and it is essentially volatile access: volatile access = Release/Acquire + StoreLoad.

The result is:

This raises the question, is the StoreLoad barrier added after Volatile Store or before Volatile Load? Let’s do this experiment:

Keep Volatile Store and change Volatile Load to Plain Load, i.e.
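A sketch of this variant (volatile stores, plain loads), reusing the same hypothetical test class:

    @Actor public void thread1(II_Result r) {
        X.setVolatile(this, 1);   // volatile store
        r.r1 = y;                 // plain load
    }

    @Actor public void thread2(II_Result r) {
        Y.setVolatile(this, 1);   // volatile store
        r.r2 = x;                 // plain load
    }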

Test results:

As you can see from the results, Consensus is still maintained. Keep Volatile Load and change Volatile Store to Plain Store:
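And the opposite variant (plain stores, volatile loads):

    @Actor public void thread1(II_Result r) {
        x = 1;                              // plain store
        r.r1 = (int) Y.getVolatile(this);   // volatile load
    }

    @Actor public void thread2(II_Result r) {
        y = 1;                              // plain store
        r.r2 = (int) X.getVolatile(this);   // volatile load
    }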

Test results:

It’s out of order again.

Therefore, we can conclude that the StoreLoad barrier is added after volatile writes; this can also be seen in the JVM's underlying source code, analyzed later.

7.4. The role of final

In Java, we create objects by calling class constructors. We might also put values in the constructors that initialize fields, for example:
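A sketch of such a class (field names and values chosen to match the 1, 2, 3 results discussed below):

    public class FinalFieldExample {
        int x;
        int y;
        int z;

        public FinalFieldExample() {
            this.x = 1;
            this.y = 2;
            this.z = 3;
        }
    }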

We can create an object by calling the constructor:
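A sketch of the call, publishing the reference through an ordinary variable f:

    FinalFieldExample f = new FinalFieldExample();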

We combine these steps and use pseudocode to show what the underlying implementation actually does:
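A sketch of that pseudocode (allocate() is a placeholder for the VM's object allocation):

    // 1. allocate memory for the object
    objRef = allocate(FinalFieldExample.class);
    // 2. initialize field x
    objRef.x = 1;
    // 3. initialize field y
    objRef.y = 2;
    // 4. initialize field z
    objRef.z = 3;
    // 5. publish: store the reference where other threads can read it
    f = objRef;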

There is no memory barrier between these steps. According to semantic analysis, 5 depends on 1, so the order of 1 and 5 cannot change; 1, 2, 3 and 4 depend on each other, so the order among them cannot change; but there is no relationship between 2, 3, 4 and 5, so their execution order relative to 5 can change. If 5 executes before any of 2, 3, 4, another thread may see the object before the constructor has finished, i.e. with x, y and z still at their initial values. Let's test it:
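A sketch of the test (names and outcome descriptions are mine; III_Result holds the three ints):

    import org.openjdk.jcstress.annotations.*;
    import org.openjdk.jcstress.infra.results.III_Result;

    @JCStressTest
    @Outcome(id = "-1, -1, -1", expect = Expect.ACCEPTABLE, desc = "The object is not yet visible")
    @Outcome(id = "1, 2, 3",    expect = Expect.ACCEPTABLE, desc = "The fully constructed object is visible")
    @Outcome(expect = Expect.ACCEPTABLE_INTERESTING, desc = "A partially constructed object is visible")
    @State
    public class UnsafePublication {
        FinalFieldExample f;   // plain field, plain publication

        @Actor public void publish() {
            f = new FinalFieldExample();
        }

        @Actor public void read(III_Result r) {
            FinalFieldExample o = f;
            if (o == null) {
                r.r1 = -1; r.r2 = -1; r.r3 = -1;
            } else {
                r.r1 = o.x; r.r2 = o.y; r.r3 = o.z;
            }
        }
    }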

On the x86 platform, you will only see two results, namely -1, -1, -1 (no object initialization is seen) and 1, 2, 3 (object initialization is seen, and no out-of-order), as shown below (AMD64 is an x86 implementation) :

This is because, as mentioned earlier, x86 is a relatively strongly ordered CPU and does not reorder these stores. Exactly which x86 ordering property guarantees this, we will see shortly.

As before, we switch to a less consistent CPU (ARM), and here we see some exciting results, as shown below (AARCH64 is an ARM implementation) :

So how do we ensure that we see the result of the constructor execution? Using the previous memory barrier design, we can change the fifth step of the pseudocode to setRelease, which is:
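A sketch of the change, assuming a hypothetical VarHandle F for the publishing field f:

    // steps 1-4 unchanged ...
    // 5. publish with a release store instead of a plain store
    F.setRelease(this, objRef);
    // setRelease puts LoadStore + StoreStore barriers before the store,
    // so steps 2, 3, 4 cannot be reordered after the publication.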

setRelease places LoadStore and StoreStore barriers before the store, and the StoreStore barrier prevents 2, 3 and 4 from being reordered after 5.

Try it on the aARCH64 machine.

As you can see from the result, you can only see the result after either no initialization or full constructor execution.

Going a step further, we actually only need the StoreStore barrier here, which brings us to the Java final keyword: final means that a StoreStore barrier is added immediately after the field's initializing write, which is equivalent to adding a StoreStore barrier before the constructor finishes, ensuring that as long as we can see the object, the object's constructor has finished. Test code:
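A sketch of the same example with every field declared final (the publication and reader sides stay the same as in the earlier test):

    public class FinalFieldExample {
        final int x;
        final int y;
        final int z;

        public FinalFieldExample() {
            this.x = 1;
            this.y = 2;
            this.z = 3;
        }
    }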

Going one step further: since 2, 3 and 4 stay in order relative to each other in the pseudocode, we only need to guarantee that 4 executes before 5, and then 2 and 3 also execute before 5. So we only need to make z final, adding that one StoreStore barrier, instead of declaring every field final or adding memory barriers ourselves:
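A sketch with only the last-written field declared final:

    public class FinalFieldExample {
        int x;
        int y;
        final int z;   // only the field written last in the constructor is final

        public FinalFieldExample() {
            this.x = 1;
            this.y = 2;
            this.z = 3;  // the StoreStore barrier implied by the final field follows this store
        }
    }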

Then, we continued testing with aARCH64 and the results were still correct:

Finally, we need to note that final is just a StoreStore barrier after the update. If you expose this during the constructor, you will still see that final values are not initialized. Let’s test:
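A sketch of letting this escape from the constructor (the static field used for the leak is mine):

    public class FinalEscape {
        static FinalEscape leaked;   // another thread reads this field

        final int x;

        public FinalEscape() {
            leaked = this;   // "this" escapes before the final field is written
            this.x = 1;      // the final-field barrier comes only after this store
        }
    }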

This time we can see that final is not initialized on an x86 machine:

Finally, to see why x86 can be implemented without a memory barrier in this example, refer to the previous CPU diagram:

x86 itself does not reorder Store with Store, so the guarantee comes for free.

Finally, here is the summary table:

8. Analysis of underlying JVM implementation

8.1. Definition of OrderAccess in the JVM

The JVM uses memory barriers in various ways:

  1. Implement Java’s various syntactic elements (volatile, final, synchronized, etc.)
  2. Implementing various JDK apis (VarHandle, Unsafe, Thread, etc.)
  3. Memory barriers needed for GC: Consider whether GC multithreading and application threads (called mutators in GC algorithms) work stop-the-world or concurrent
    1. Object reference barriers: for example, in a generational GC with a copying algorithm, a young GC usually copies live objects from one survivor region to the other. If we do not want to stop the world (STW) during the copy but instead run concurrently with the application threads, memory barriers are needed, for example to coordinate access to references that may have been moved.
    2. Maintenance barriers: for example, in region-based GC algorithms we need to maintain cross-region reference tables and usage tables for each region, such as card tables. If we want the application threads to modify and access these concurrently with the GC threads, rather than stopping the world, memory barriers are also needed.
  4. The JIT also requires memory barriers: similarly, memory barriers are needed to determine whether an application thread should interpret the code or execute the JIT-compiled version.

These memory barriers, different CPUS, different operating systems, the underlying need different code implementation, unified interface design is:

Source code address: orderAccess.hpp

Different CPU, different operating system implementation is not the same, combined with the previous CPU out of order table:

Let’s look at the implementation of Linux + x86:

Source code address: orderAccess_linux_x86.hpp

For x86, since Load with Load, Load with Store, and Store with Store are already kept in order by the CPU, as long as the compiler does not reorder, the StoreStore, LoadLoad and LoadStore barriers come built in; so here their implementations are just compiler barriers. acquire is equivalent to adding LoadLoad and LoadStore barriers after a load, which on x86 also needs only a compiler barrier. release is equivalent to adding LoadStore and StoreStore barriers before a store, which on x86 also needs only a compiler barrier. This gives the following table:

Let’s take a look at the implementation of Linux AARCH64, which we often use:

Source code address: orderAccess_linux_aarch64.hpp

As the earlier table showed, ARM reorders Load with Load, Load with Store, Store with Store, and Store with Load. For aarch64 the JVM does not use CPU instructions directly but the memory barrier intrinsics wrapped by C++: a read memory barrier (__atomic_thread_fence(__ATOMIC_ACQUIRE)), a write memory barrier (__atomic_thread_fence(__ATOMIC_RELEASE)), and a full read-and-write memory barrier (__sync_synchronize()). The C++ wrappers map closely onto the simple CPU memory barriers we discussed earlier. acquire acts as the receive point that unpacks the packet: in the simple CPU model it effectively blocks until the Invalidate Queue has been processed, ensuring the CPU's cache holds no stale data. release acts as the transmit point that packs up the previous updates and sends them out: in the simple CPU model it effectively blocks until the Store Buffer has drained into the CPU cache. Therefore acquire and release are implemented with the read and write memory barriers respectively.

LoadLoad and LoadStore only need the read memory barrier: blocking after the first load until the Invalidate Queue has been processed ensures the current CPU holds no stale data, and there is no need to wait for the current CPU's Store Buffer to drain.

StoreStore only needs the write memory barrier: blocking after the first store until the Store Buffer has drained into the cache is enough to ensure that the first store reaches the cache before the second. StoreLoad is special: the second load must see the latest value of the store, which means the update cannot merely sit in the Store Buffer, and pending invalidations cannot be left unprocessed in the Invalidate Queue; therefore a full read-and-write memory barrier is required.

8.2. Volatile and final memory barrier source code

Let's look at the code that inserts memory barriers for volatile, using ARM as an example. We can trace the bytecode that reads a field to see what happens when the field being loaded is volatile or final, and the bytecode that writes a field to see what happens when the field being stored is volatile or final.

For field access, the JVM also has fast paths and slow paths, so let’s just look at the code for fast paths:

Corresponding source code:

Source code address: templateTable_arm.cpp

9. Some QA

9.1. Why is final sometimes used on method local variables

final on a local variable (not the same thing as final on a field, discussed earlier) is purely semantic and not there for performance: it is a marker. You can declare many local variables in a method, and declaring one final makes the semantics clearer when you are certain it will not change.

JDK developers often use a final local variable for exactly this. Suppose we have the following code:
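A sketch of the kind of method meant here (names are mine); the field x is read at three separate points:

    class Consumer {
        int x;   // may be updated concurrently by other threads

        void consume() {
            if (x > 0) {          // reads x
                use(x);           // reads x again
            }
            finish(x);            // reads x again
        }

        void use(int v) { }
        void finish(int v) { }
    }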

Assuming the compiler does no optimization, each of those statements reads the field x again. With compiler optimization, however, the code may be transformed so that the field x is read only once and cached in a register or local.

In that optimized form the field x is read only once. The problem is that, depending on whether the code is interpreted or which JIT optimization it has gone through, the possible result set differs if x is being updated concurrently. To avoid this ambiguity, if we are sure the method only wants to read x once, we can write it that way explicitly ourselves:
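A sketch of the explicit single-read version:

    void consume() {
        int lx = x;       // read x exactly once into a local
        if (lx > 0) {
            use(lx);
        }
        finish(lx);
    }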

To indicate that lx does not change (and also to indicate that we only want to read x once), add final to it:
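And with final added:

    void consume() {
        final int lx = x;   // final documents that lx never changes: x is read exactly once
        if (lx > 0) {
            use(lx);
        }
        finish(lx);
    }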
