A note on originality: all articles here are my own original work. Any material referenced from other articles is explicitly marked; if I have missed a reference, criticism is welcome. If you find this article plagiarized elsewhere, please report it and file an issue against the GitHub repository. Thank you for your support!

This article draws on a large number of articles, documents, and papers, and the subject is genuinely complicated. My level is limited and my understanding may fall short; if you disagree with anything, please leave a comment. This series will be updated continuously, and questions and corrections are welcome.

If you prefer the single-page version, see: The most hardcore Java new memory model analysis and experiments (single-page version, continuously updated with Q&A). If you prefer this split version, here is the table of contents:

  • 1. What is the Java memory model
  • 2. Atomic access and word tearing
  • 3. Core understanding of memory barriers (CPU + compiler)
  • 4. New Java memory access modes and experiments
  • 5. Analysis of the JVM's underlying memory barrier source code

JMM related documents:

  • Java Language Specification Chapter 17
  • The JSR-133 Cookbook for Compiler Writers – Doug Lea
  • Using JDK 9 Memory Order Modes – Doug Lea

Memory barriers, CPU and memory model related:

  • Weak vs. Strong Memory Models
  • Memory Barriers: a Hardware View for Software Hackers
  • A Detailed Analysis of Contemporary ARM and x86 Architectures
  • Memory Model = Instruction Reordering + Store Atomicity
  • Out-of-Order Execution

x86 CPUs

  • x86 wiki
  • Intel® 64 and IA-32 Architectures Software Developer Manuals
  • Formal Specification of the x86 Instruction Set Architecture

ARM CPUs

  • ARM wiki
  • aarch64 Cortex-A710 Specification

Various understandings of consistency:

  • Coherence and Consistency

Aleksey’s JMM:

  • Aleksey Shipilev – Don’t misunderstand the Java Memory Model (Part 1)
  • Aleksey Shipilev – Don’t misunderstand the Java Memory Model (Part 2)

Many Java developers use Java’s concurrency and synchronization mechanisms, such as volatile, synchronized, and Lock. Many have also read Chapter 17 of the JLS, “Threads and Locks” (docs.oracle.com/javase/spec…), covering synchronization, wait/notify, sleep & yield, and the memory model. But I believe most people are like me: on first reading it feels like watching from the sidelines; afterwards you know that these are the rules, but you have no clear sense of why they are specified this way and not another way.

At the same time, when we combine the specification with the HotSpot implementation and its source code, we find that, because of javac’s static compilation optimizations and the JIT optimizations of C1 and C2, the final behavior of the code does not quite match our understanding of it from the specification alone. Such inconsistencies lead to misunderstandings when we try to learn the Java Memory Model (JMM) and validate its design with actual code.

I have kept trying to understand the Java memory model, rereading the JLS and the analyses of various experts. This series collects my personal insights from reading those specifications and analyses, together with experiments using JCStress, to help you understand the Java memory model and the API abstractions introduced in Java 9 and later. Let me emphasize again: the starting point of any memory model design is to abstract away the underlying details so that you do not have to care about them. The subject involves a great deal, my level is limited, and my analysis may fall short; I will try to lay out the reasoning and references for every claim. Please do not take everything here on faith, and if you disagree, refute it with concrete examples and leave a comment.

6. Why is it best not to write your own code to verify some of the JMM’s conclusions

As we know from the previous parts of this series, out-of-order execution is a complicated problem. Suppose that for a piece of code, without any constraints, all possible outcomes form the complete set shown in the figure below:

Under the constraints of the Java memory model, the possible results are limited to a subset of all out-of-order results:

Within those JMM constraints, CPU reordering varies across architectures: some scenarios reorder on one CPU and not on another, but all stay within the scope of the JMM, so the possible result sets on different CPUs are different, equally valid subsets of the JMM set:

Likewise, compilers on different operating systems emit JVM code that executes in different orders, the underlying system calls are defined differently, and Java code may behave slightly differently across operating systems; but all of this stays within the limits of the JMM, so it is equally valid:

Finally, the underlying code is executed differently under different execution methods and JIT compilation, which further results in the fragmentation of the result set:

So if you write your own code and experiment on your own computer and operating system, the result set you observe is only a subset of what the JMM allows, and some out-of-order results will likely never appear. Moreover, some reorderings occur very rarely, or only after JIT optimization has kicked in, so writing your own ad-hoc experiments is really not recommended.

So what should you do? Use JCStress, the official framework for testing concurrent visibility. It cannot simulate different CPU architectures or operating systems, but it does let you cover the different execution modes (interpreted, C1-compiled, C2-compiled) and rules out insufficient test pressure and iteration counts. A corresponding JCStress example accompanies every experiment in this series for you to use.

7. Layered, progressive visibility guarantees and the corresponding Java 9+ memory model APIs

Here we follow Aleksey’s approach and summarize the different levels of memory-visibility guarantees in Java and their corresponding APIs. In Java 9+, access to an ordinary variable (neither volatile nor final) is defined as Plain access: an ordinary access with no barriers around it (barriers required by particular GCs, such as the pointer barriers of generational collectors, are not considered here; they exist only at the GC level and have no effect on visibility), and all of the previously discussed reorderings may occur. So what guarantees does the Java 9+ memory model offer beyond this, and which APIs correspond to them?

7.1. Coherence and Opaque

In the CPU cache coherence protocol, Coherence means a kind of consistency in that context, but the term does not translate cleanly into “consistency”. So for the terms that follow, I will stick to Doug Lea’s and Aleksey’s definitions.

So what is coherence here? To take a simple example: Suppose some object field int x starts with 0 and a thread executes:

Another thread executes (r1 and r2 are local variables):
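The original code snippets were not preserved here, so below is a minimal sketch of the two threads consistent with the description (the class and method names are my own):

```java
// Minimal sketch of the coherence example; class/method names are illustrative.
public class CoherenceExample {
    static int x = 0; // plain (non-volatile) shared field

    // Thread 1: a single plain write
    static void writer() {
        x = 1;
    }

    // Thread 2: two plain reads of the same memory location
    static int[] reader() {
        int r1 = x;
        int r2 = x;
        return new int[]{r1, r2};
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(CoherenceExample::writer);
        Thread t2 = new Thread(() -> {
            int[] r = reader();
            // Coherence forbids only the combination r1 == 1 && r2 == 0
            System.out.println("r1 = " + r[0] + ", r2 = " + r[1]);
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```

Note that a plain two-thread run like this is only for illustration; as argued above, JCStress is the right tool for actually observing the race.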

Under the Java memory model, possible outcomes include:

  1. r1 = 1, r2 = 1
  2. r1 = 0, r2 = 1
  3. r1 = 1, r2 = 0
  4. r1 = 0, r2 = 0

The third result is the interesting one: programmatically, it means we first saw x = 1 and then saw x go back to 0. Of course, from the earlier analysis we know this is actually compiler reordering. If we do not want this third result, what we need is Coherence.

Coherence’s definition, and I quote:

The writes to the single memory location appear to be in a total order consistent with program order.

That is, the writes to a single memory location appear to occur in a total order consistent with program order. Once x has changed from 0 to 1 globally, every thread can only ever see x change from 0 to 1, never from 1 back to 0.

As mentioned above, the Java memory model does not guarantee Coherence for Plain reads. Yet if you run the test code above, you will not get the third result, because semantic analysis in the HotSpot virtual machine treats the two loads of x as mutually dependent, which suppresses the reordering:

This is exactly why, in the previous chapter, I advised against writing your own code to verify the JMM’s conclusions: although the Java memory model permits the third result (1, 0), you cannot produce it with this example.

So here we trick the Java compiler into producing this reordering with a rather contrived example:

Let’s not go too far into how this works, but directly into the results:

The out-of-order result does appear, and if you run the example yourself, you will see that it occurs only after the JIT (C2) has compiled the actor2 method.

So how do we avoid this reordering? Volatile would certainly work, but it is a heavyweight operation here; Opaque access is enough. Opaque disables the Java compiler optimizations but involves no memory barrier, much like volatile in C++. Let’s test it:
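The test code is missing from this copy of the article; below is a sketch of what Opaque access looks like with the Java 9+ VarHandle API (the class and field names are my own):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch of Opaque mode via VarHandle (Java 9+). Opaque forbids the compiler
// from eliminating or reordering the access itself, but emits no hardware
// memory barrier.
public class OpaqueExample {
    int x;

    static final VarHandle X;
    static {
        try {
            X = MethodHandles.lookup()
                    .findVarHandle(OpaqueExample.class, "x", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void write(int v) { X.setOpaque(this, v); }        // opaque store
    int read()        { return (int) X.getOpaque(this); } // opaque load

    public static void main(String[] args) {
        OpaqueExample o = new OpaqueExample();
        o.write(1);
        System.out.println(o.read());
    }
}
```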

In the results, the ACCEPTABLE_INTERESTING (out-of-order) case no longer appears.

7.2. Causality and Acquire/Release

On top of Coherence, in certain scenarios we generally also need Causality.

By now you will have met two common terms, happens-before and synchronizes-with order. We will not start from these obscure concepts (they are not explained in this chapter) but from an example. Suppose an object field int x is initialized to 0 and int y is initialized to 0, and the two fields are not in the same cache line (the JCStress framework used later pads cache lines for us automatically). One thread executes:

Another thread executes (r1 and r2 are local variables):
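The two thread bodies are missing from this copy; here is a sketch consistent with the description (plain accesses, names my own):

```java
// Sketch of the causality example: thread 1 writes x then y,
// thread 2 reads y then x. All accesses are plain.
public class CausalityExample {
    static int x = 0;
    static int y = 0; // JCStress would pad x and y onto separate cache lines

    static void actor1() { // thread 1
        x = 1;
        y = 1;
    }

    static int[] actor2() { // thread 2 (r1, r2 are local variables)
        int r1 = y;
        int r2 = x;
        return new int[]{r1, r2};
    }
}
```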

This example is very similar to the out-of-order analysis we used earlier in the CPU cache. In the Java memory model, the possible results are:

  1. r1 = 1, r2 = 1
  2. r1 = 0, r2 = 1
  3. r1 = 1, r2 = 0
  4. r1 = 0, r2 = 0

Again, the third result is the interesting one: thread 2 sees the update of y but not the update of x. This was analyzed in detail in the earlier chapter on CPU cache reordering, where we avoided the third outcome by adding memory barriers, namely:

As well as

To recap: thread 1 executes a write barrier after x = 1 and before y = 1 to ensure that all store-buffer entries are flushed to the cache, so that updates before y = 1 cannot remain invisible merely because they are sitting in the store buffer. Thread 2 executes a read barrier after int r1 = y to process all entries in the invalidate queue, ensuring there is no stale data in the current cache. Thus, if thread 2 sees the update to y, it must also see the update to x.

Let’s abstract a step further: think of the write barrier plus the following Store (y = 1) as a launch point that packs up every update before it and sends the packet out, and of the read barrier plus the preceding Load (int r1 = y) as a receive point that, if the packet has arrived, opens it and reads everything in. As shown below:

At the launch point, all results up to and including the launch point are packed. If the packet has been received by the time the receive point’s code executes, then every instruction after the receive point sees the packet’s entire contents, i.e., everything at and before the launch point. Causality, sometimes called Causal Consistency, means different things in different contexts; here we mean only this: you can define a sequence of writes such that if a read sees the last of them, then every read after that read also sees that write and all writes before it. This is a partial order, not a total order; it is defined more precisely in a later section.

In Java, neither Plain nor Opaque access guarantees Causality, because Plain implies no memory barrier at all and Opaque implies only a compiler barrier.

First, Plain:

The result is:

Then Opaque:

One thing to note here: as we saw earlier, x86 CPUs are naturally quite strongly ordered, and we will see later exactly which of its in-order guarantees provides Causality here. As a result, Opaque access already shows causally consistent results on x86, as shown below (AMD64 is an x86 implementation). However, if we switch to a more weakly ordered CPU, we can see that Opaque access does not guarantee causal consistency. Here is my result on AArch64 (an ARM implementation):

Another interesting point: all of the reordering occurs under C2 compilation.

So how do we ensure Causality? Again, we do not need a heavyweight operation like volatile; Release/Acquire mode is enough. Release/Acquire guarantees Coherence + Causality. Release and acquire must occur in pairs (one acquire corresponds to one release): release can be seen as the launch point described above, and acquire as the receive point. Then we can implement the code as follows:
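The Release/Acquire version is missing from this copy; here is a sketch using the Java 9+ VarHandle API (class and field names are my own). The release store of y is the launch point, the acquire load of y is the receive point:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch: x is plain; y is accessed in Release/Acquire mode via a VarHandle.
public class ReleaseAcquireExample {
    int x;
    int y;

    static final VarHandle Y;
    static {
        try {
            Y = MethodHandles.lookup()
                    .findVarHandle(ReleaseAcquireExample.class, "y", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void actor1() {
        x = 1;                 // plain write, packed by the release below
        Y.setRelease(this, 1); // release store: LoadStore+StoreStore before it
    }

    int[] actor2() {
        int r1 = (int) Y.getAcquire(this); // acquire load: LoadLoad+LoadStore after it
        int r2 = x;                        // if r1 == 1, r2 must also be 1
        return new int[]{r1, r2};
    }
}
```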

Then, continuing on the AArch64 machine, the result is:

As you can see, using Release/Acquire guarantees Causality. Note that the launch point and receive point must be chosen carefully; for example, if we move them as follows, it goes wrong:

Counterexample 1: the launch point packs only the updates before it. Here x = 1 comes after the launch point, so it is not in the packet, and the 1, 0 result can still appear.

Counterexample 2: the packet is opened at the receive point, so only reads after the receive point see its contents. Here the read of x sits before the receive point, so it may still miss the update in the packet, and the 1, 0 result can still appear.

With this in mind, let’s look at Doug Lea’s memory barrier design for Java to see which barriers are involved. In his very early and classic article (The JSR-133 Cookbook), he described the Java memory model’s barrier design in terms of four barriers:

1.LoadLoad

If there are two reads (Loads) with no dependency between them (i.e., they may be executed out of order), a LoadLoad barrier between them prevents the reordering (that is, Load(y) cannot execute before Load(x)):

2.LoadStore

If there are an unrelated read (Load) and write (Store) (i.e., they may be executed out of order), a LoadStore barrier prevents the reordering (that is, Store(y) cannot execute before Load(x)):

3.StoreStore

If there are two writes (Stores) with no dependency between them (i.e., they may be executed out of order), a StoreStore barrier prevents the reordering (that is, Store(y) cannot execute before Store(x)):

4.StoreLoad

If there are an unrelated write (Store) and read (Load) (i.e., they may be executed out of order), a StoreLoad barrier prevents the reordering (that is, Load(y) cannot execute before Store(x)):

So how is Release/Acquire implemented with these barriers? We can derive it from the packing abstraction above. The release (launch) point is a Store, and everything before it must be packed: neither earlier Loads nor earlier Stores may sink below it. Hence LoadStore and StoreStore barriers must be placed before the Release store. Symmetrically, the acquire (receive) point is a Load, and everything after it must be able to see the packet’s contents: neither later Loads nor later Stores may float above it. Hence LoadLoad and LoadStore barriers must be placed after the Acquire load.

However, as we will see in the next chapter, this four-barrier design is somewhat dated (owing to developments in CPUs and in the C++ language), and internally the JVM more often uses acquire, release, and fence. The acquire and release there correspond roughly to the Release/Acquire modes described here. The relationship of these three to the traditional four barriers is:
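Java 9+ also exposes these fences directly as static methods on VarHandle. The mapping shown in the comments is my own rough summary, not an authoritative specification:

```java
import java.lang.invoke.VarHandle;

// Rough sketch of the fence APIs (Java 9+):
//   VarHandle.acquireFence() ~ LoadLoad + LoadStore (nothing below hoists above)
//   VarHandle.releaseFence() ~ StoreStore + LoadStore (nothing above sinks below)
//   VarHandle.fullFence()    ~ all four barriers, including StoreLoad
public class FenceMapping {
    static int x;
    static int y;

    static void releaseStyleWrite() {
        x = 1;
        VarHandle.releaseFence(); // the write to x cannot sink below this fence
        y = 1;
    }

    static int acquireStyleRead() {
        int r1 = y;
        VarHandle.acquireFence(); // the read of x cannot hoist above this fence
        return r1 + x;
    }
}
```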

Now that we know the Release/Acquire barriers: why does x86 not exhibit this reordering even without them? Refer to the earlier CPU reordering table:

From this we see that x86 does not reorder Store with Store, Load with Load, or Load with Store, so Causality is guaranteed naturally.

7.3. Consensus and Volatile

Finally we come to the familiar volatile, which is essentially Release/Acquire plus a guarantee of Consensus. Consensus means that all threads see the same global order of memory updates. For example: suppose object fields int x and int y both start at 0, and the two fields are not in the same cache line (the JCStress framework pads cache lines for us automatically). One thread executes:

And the other thread executes:
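The two thread bodies are missing from this copy; a sketch consistent with the description (plain accesses, names my own):

```java
// Sketch of the Consensus (Dekker-style) example: each thread stores to
// one field and then loads the other. All accesses are plain here.
public class ConsensusExample {
    static int x = 0;
    static int y = 0; // JCStress would pad x and y onto separate cache lines

    static int actor1() { // thread 1
        x = 1;
        return y;         // r1
    }

    static int actor2() { // thread 2
        y = 1;
        return x;         // r2
    }
}
```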

Under the Java memory model, there are also four possible outcomes:

  1. r1 = 1, r2 = 1
  2. r1 = 0, r2 = 1
  3. r1 = 1, r2 = 0
  4. r1 = 0, r2 = 0

The fourth result is the interesting one, because it violates Consensus: the two threads see the updates in different orders (thread 1 reads y as 0, so it believes the update of x happened before the update of y; thread 2 reads x as 0, so it believes the update of y happened before the update of x). Without reordering, x and y could not both read as 0, because each thread updates before it reads. But as in all the earlier examples, reordering makes this result possible. Can Release/Acquire prevent it? If all accesses to x and y use Release/Acquire mode, thread 1 may effectively execute:

There is no memory barrier between x = 1 and int r1 = y.
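In barrier terms (a sketch; only the barrier placement matters here), thread 1’s two accesses under Release/Acquire look like this:

```
[LoadStore|StoreStore]   // barriers BEFORE the release store
x = 1                    // release store of x
int r1 = y               // acquire load of y
[LoadLoad|LoadStore]     // barriers AFTER the acquire load
// No barrier separates the store to x from the load of y,
// so these two accesses may still reorder with each other.
```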

Similarly, thread 2 might execute:

Or:

Thus, we may see a fourth result. Let’s test the code:

The test results are:

To ensure Consensus, we only need to prevent thread 1’s store from reordering with its subsequent load (and likewise in thread 2) by adding a StoreLoad barrier to the existing barriers; that is, thread 1 executes:

Thread 2 executes:

This prevents the reordering, and it is essentially what volatile access is: volatile access = Release/Acquire + StoreLoad.
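A sketch of the volatile version (each thread does a volatile store followed by a volatile load; names my own):

```java
// Sketch: with volatile access, the r1 == 0 && r2 == 0 outcome is forbidden,
// because the StoreLoad barrier after each volatile store prevents the
// following load from reordering above it.
public class VolatileConsensus {
    static volatile int x = 0;
    static volatile int y = 0;

    static int actor1() {
        x = 1;        // volatile store (Release barriers + StoreLoad after it)
        return y;     // volatile load (Acquire barriers after it)
    }

    static int actor2() {
        y = 1;
        return x;
    }
}
```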

The result is:

This raises a question: is the StoreLoad barrier added after the volatile store or before the volatile load? Let’s run an experiment:

Keep the volatile store and change the volatile load to a plain load, i.e.:

Test results:

As the results show, Consensus is still maintained. Now keep the volatile load and change the volatile store to a plain store:

Test results:

It’s out of order again.

Therefore we can conclude that the StoreLoad barrier is attached to the volatile write, which the later analysis of the JVM’s underlying source code will confirm.

7.4. The role of final

In Java, we create objects by invoking a class constructor, and the constructor may assign initial values to fields, for example:

We can create an object by calling the constructor:

We combine these steps and use pseudocode to show what the underlying implementation actually does:
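The pseudocode itself is missing from this copy; here is a hedged reconstruction (the allocation call and names are my guesses; the value chain matches the “1, 2, 3” result and the dependency remarks discussed next):

```
objRef = allocate(SomeClass.class);  // 1: allocate memory for the object
objRef.x = 1;                        // 2: depends on 1
objRef.y = objRef.x + 1;             // 3: depends on 2
objRef.z = objRef.y + 1;             // 4: depends on 3
instance = objRef;                   // 5: publish the reference (depends only on 1)
```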

There is no memory barrier between these steps. By semantic analysis, 5 depends on 1, so 1 and 5 cannot be reordered; 2, 3, and 4 depend on one another, so their relative order cannot change either. But there is no dependency between 2, 3, 4 and 5, so they may be reordered with 5. If 5 executes before any of 2, 3, or 4, another thread may obtain the reference before the constructor has finished and see x, y, z still at their initial values. Let’s test:

On the x86 platform you will only see two results: -1, -1, -1 (the object is not seen at all) and 1, 2, 3 (the fully initialized object, with no reordering), as shown below (AMD64 is an x86 implementation):

This is because, as mentioned before, x86 is a strongly ordered CPU and does not reorder here. Exactly which x86 ordering property rules out the reordering, we will see shortly.

As before, we switch to a more weakly ordered CPU (ARM), and here we see some exciting results, as shown below (AArch64 is an ARM implementation):

So how do we ensure that we see the completed constructor? Using the barrier design above, we can change step 5 of the pseudocode to a setRelease, i.e.:
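A hedged sketch of the changed pseudocode (INSTANCE stands for an assumed VarHandle over the publishing field; all names are my own):

```
objRef = allocate(SomeClass.class);  // 1
objRef.x = 1;                        // 2
objRef.y = objRef.x + 1;             // 3
objRef.z = objRef.y + 1;             // 4
INSTANCE.setRelease(objRef);         // 5: release store = LoadStore + StoreStore before it
```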

The StoreStore barrier here prevents 2, 3, and 4 from reordering with 5.

Try it on the AArch64 machine:

As the results show, we see either no initialization at all or the fully executed constructor.

Going a step further: we actually only need the StoreStore barrier here, which brings us to Java’s final keyword. A final field implies a StoreStore barrier immediately after its write, which effectively places a StoreStore barrier before the constructor returns, ensuring that any thread that can see the object reference sees a fully constructed object. Test code:
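The test code is missing from this copy; a sketch of final-field publication (class and field names my own):

```java
// Sketch: the implicit StoreStore barrier at the end of the constructor
// (the final-field "freeze") guarantees that a thread which sees `instance`
// sees fully initialized fields.
public class FinalFieldExample {
    final int x;
    final int y;
    final int z;

    FinalFieldExample() {
        x = 1;
        y = x + 1;
        z = y + 1;
        // freeze action / StoreStore barrier takes effect here, before publication
    }

    static FinalFieldExample instance;

    static void publish() { instance = new FinalFieldExample(); }

    static int[] readAll() {
        FinalFieldExample o = instance;
        return o == null ? null : new int[]{o.x, o.y, o.z};
    }
}
```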

One more step: since 2, 3, and 4 depend on one another in the pseudocode, we only need to guarantee that 4 executes before 5; then 2 and 3 must also execute before 5. So we only need to declare z final to get the StoreStore barrier, instead of declaring every field final or adding memory barriers:
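A sketch of this variant (names my own): only z is final, and the data dependencies carry x and y along:

```java
// Sketch: only z is final. The StoreStore barrier implied by the final write
// to z, combined with the dependency chain y = x + 1, z = y + 1, is enough
// to publish x and y safely as well.
public class OnlyZFinal {
    int x;
    int y;
    final int z;

    OnlyZFinal() {
        x = 1;
        y = x + 1; // depends on x
        z = y + 1; // depends on y; final => StoreStore barrier after this write
    }
}
```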

Then, continuing the test on AArch64, the results are still correct:

Finally, note that final only adds a StoreStore barrier after the field’s write. If you let this escape during the constructor, other threads may still see final fields uninitialized. Let’s test:
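A sketch of such unsafe publication (names my own): the reference escapes before the final-field freeze takes effect, so the guarantee is lost:

```java
// Sketch: `this` escapes before the constructor ends, so another thread
// reading `instance` may observe x == 0 despite x being final.
public class LeakyFinal {
    final int x;
    static LeakyFinal instance;

    LeakyFinal() {
        instance = this; // BAD: the reference escapes before x is written
        x = 1;           // the final-field freeze only takes effect at the end
    }
}
```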

This time, even on an x86 machine, we can see uninitialized final fields:

Finally, let’s see why x86 needs no memory barrier in this example; refer to the earlier CPU reordering table:

x86 itself does not reorder Store with Store, so the guarantee comes for free.

Finally, here is the summary table:
