The memory barrier is the lowest layer of concurrency support above the hardware, sitting beneath the operating system or JVM. Below it is hardware support; above it are the various encapsulations that the operating system or JVM builds on top of the memory barrier. Memory barriers are a standard, and vendors may implement them differently.

This article is intended only to help you understand the concurrency mechanisms provided by the JVM. First, the visibility and reordering problems are derived from the semantics of volatile. Next, we explain how these problems arise and why memory barriers are needed. Then we discuss the memory barrier standard, vendor support for memory barriers, and how memory barriers solve these problems, using volatile as the example. Finally, I'll cover a few encapsulations the JVM builds on top of memory barriers. To aid understanding, some fundamentals at the hardware architecture level (especially CPU architecture) are briefly discussed, but their implementation mechanisms are not explored in depth.

The implementation of memory barriers involves a great deal of knowledge at the hardware architecture level and requires the cooperation of the operating system or JVM to take effect; it cannot be understood from any one level alone. This article integrates a large amount of knowledge from these three levels and is long. The hope is that the basic problems of memory barriers can be explained clearly in a single article.

If anything is missing or wrong, corrections are welcome!

Volatile variable rule

A good example for eliciting memory barriers is the volatile variable rule.

For the volatile keyword itself, see monkey's earlier article on how the volatile keyword works. The volatile variable rule describes the partial-ordering semantics of volatile variables. As a refresher, we start from the volatile variable rule.

Definition

Volatile variable rule: a write to a volatile variable happens-before every subsequent read of that variable.

The volatile variable rule is merely a standard; it requires JVM implementations to guarantee the partial-order semantics of volatile variables. Combined with the program order rule and transitivity, this partial-order semantics usually serves two functions:

  • Maintain visibility
  • Disable reordering (a volatile read disallows reordering of later operations before it; a volatile write disallows reordering of earlier operations after it)

Supplement:

  • Program order rule: if operation A precedes operation B in program order, then within the thread, operation A is performed before operation B.
  • Transitivity: If operation A is performed before operation B, and operation B is performed before operation C, then operation A must be performed before operation C.

In the following paragraphs, when only visibility is involved, “visibility” is stated explicitly; when both are involved, we speak of partial order. Reordering inevitably introduces visibility issues, so there is no reordering-only scenario. A sketch of the three rules working together follows below.
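
To make these rules concrete, here is a minimal sketch (class and field names are made up for illustration): the write to data precedes the volatile write to ready (program order rule), which happens-before the volatile read of ready (volatile variable rule), which precedes the read of data (program order rule); by transitivity, the reader is guaranteed to print 42.

public class HappensBefore {
    static int data = 0;
    static volatile boolean ready = false;

    static void writer() {             // thread A
        data = 42;                     // (1) program order: (1) before (2)
        ready = true;                  // (2) volatile write
    }

    static void reader() {             // thread B
        while (!ready) { }             // (3) volatile read: (2) happens-before (3)
        System.out.println(data);      // (4) by transitivity, sees data == 42
    }
}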

Correct usage

Previous articles have covered the use of volatile variable rules many times.

The simple usage relies on the volatile variable rule to ensure visibility of the volatile variable itself:

  • In “How many ways can you write a singleton in an interview?”, the DCL-based lazy variant 3 marks the singleton field volatile to guarantee the singleton's visibility.

The complex usage combines the volatile variable rule with the program order rule and transitivity to guarantee partial ordering between the variable itself and the surrounding variables:

  • In “Source | ReentrantLock and AQS (1): lock and unlock”, exclusiveOwnerThread relies on the volatile variable state to guarantee its partial order relative to state.
  • In “Source | CopyOnWriteArrayList”, CopyOnWriteArrayList relies on the volatile variable array to provide partial-order semantics.

Visibility and reordering

Memory barriers exist to solve the visibility and reordering problems mentioned above. What exactly is visibility? What is reordering? And why are they problems?

Visibility

Definition

The definition of visibility is common across concurrent scenarios. Take multithreading: when one thread changes the value of a shared variable, other threads immediately become aware of the change.

From a performance perspective, there is no need to synchronize a changed value immediately after every change: if a value is modified several times before it is used, only the last modification needs to be synchronized, and any synchronization before that is wasted performance. The actual definition of visibility is therefore weaker: it only ensures that once one thread changes the value of a shared variable, other threads get the latest change before using it.

Visibility can be considered the weakest form of “consistency” (weak consistency): it guarantees that the data users see is consistent, but not that the stored data is consistent at every moment (strong consistency). The “cache visibility” issue discussed below is referred to in some articles as “cache consistency”.

Source of the problem

One of the simplest visibility problems comes from the cache architecture inside the computer:

Caching significantly narrows the gap between fast CPUs and slow memory. Take the three-level cache architecture as an example:

  • L1 Cache is closest to CPU, has the smallest capacity (such as 32K or 64K), and has the highest speed. Each core has an L1 Cache.
  • L2 Cache has a larger capacity (for example, 256K) and lower speed. Generally, each core has an independent L2 Cache.
  • The L3 Cache is closest to memory, has the largest capacity (such as 12MB), and is the slowest. The cores in the same CPU socket share one L3 Cache.

More precisely, each core has two L1 caches: one for data (L1d Cache) and one for instructions (L1i Cache).

Everything was perfect in the single-core era. In the multicore era, however, visibility problems arise. A bad case goes like this:

  1. Core0 and Core1 hit the same memory address, so their respective L1 caches each hold a copy of the same data.
  2. At the beginning, Core0 and Core1 both read this data in harmony.
  3. Suddenly, Core0 breaks the peace and modifies the data, so the data in the two caches now differs; more precisely, the copy in Core1's L1 Cache has become stale (invalid).

In the single-core era there was only Core0: Core0 modified, Core0 read, no problem. In the multicore era, however, after Core0 modifies the data, Core1 does not know that its copy is invalid and foolishly keeps using it. At best the result is a wrong calculation; at worst, an infinite loop or a program crash.
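
A minimal sketch of this failure mode in Java (the class name is made up; whether the loop actually hangs depends on the JIT and hardware, but it commonly does):

public class StaleRead {
    static boolean stop = false;  // no volatile: the reader may keep a stale cached copy

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // spin: nothing forces a fresh read of stop,
                // so this loop may never terminate
            }
        });
        reader.start();
        Thread.sleep(1000);
        stop = true;  // this update may never become visible to the reader
    }
}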

The actual visibility problem extends in two directions:

  • In addition to the three-level cache, vendors' hardware architectures contain a wide variety of other caches, all with similar visibility issues. For example, registers act as a cache between the CPU and the L1 Cache.
  • In the multithreaded memory models of various high-level languages, including Java, maintaining a per-thread cache inside the thread stack is a common optimization, but it obviously falls apart once CPU-level cache visibility issues come into play.

This is just the simplest visibility problem, and does not involve reordering, etc.

Reordering can also cause visibility problems; conversely, visibility problems at the cache level can make it look as though reordering has occurred.

Reordering

Definition

Reordering is not strictly defined. As a whole, it can be divided into two types:

  • True reordering: the compiler or the underlying hardware (CPU, etc.) reorders instructions according to certain rules for “optimization” purposes (even if the result sometimes looks out of order).
  • Pseudo-reordering: instructions merely appear to have been reordered, due to cache synchronization order and the like.

Reordering was an excellent optimization in the single-core era too, and under a single core there are sufficient measures to guarantee its correctness. In the multicore era, reordering remains a good performance tool as long as worker threads share no data, or share only immutable data. However, once worker threads share mutable data, reordering makes them appear to behave randomly, because the result of reordering is not fixed.

The concept of reordering can be confusing the first time around. Patience, patience.

Source of the problem

The reordering problem happens all the time and comes from three scenarios:

  1. Compiler compile-time optimization
  2. Out-of-order optimization of processor execution
  3. Cache synchronization order (resulting in visibility issues)

Scenarios 1 and 2 are true reordering; scenario 3 is pseudo-reordering. Since scenario 3 is also a visibility issue, for continuity we discuss it first.

Visibility resulting in pseudo-reordering

Cache synchronization order is essentially a visibility issue.

Assume that, in program order, variable v1 is updated first and then variable v2, and set true reordering aside:

  1. Core0 updates v1 in its cache, then updates v2 in its cache (the two variables sit on different cache lines, so the lines are not written back to memory together when evicted).
  2. Core0 reads v1 (assume an LRU policy governs eviction, so v1 is now the more recently used line).
  3. Core0's cache fills up, and v2, the least recently used line, is written back to memory.
  4. Core1's cache still holds an old copy of v1; Core1 now loads v2 from memory into its cache.

Reordering refers to program order. If the instruction is executed in a different order than the program order, it indicates that the instruction has been reordered.

At this point, although the “update v1” event occurred before “update v2”, Core1 sees only the latest value of v2, not the latest value of v1. This is pseudo-reordering caused by visibility: no actual reordering took place, but it looks as though it did.

As you can see, cache visibility problems not only cause visibility issues, they also cause pseudo-reordering. Therefore, solving the visibility problem at the cache level also solves the pseudo-reordering problem.
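
A minimal sketch of the scenario above (names made up): nothing in either method is reordered, yet a reader can observe the new v2 together with the stale v1.

public class PseudoReorder {
    static int v1 = 0;
    static int v2 = 0;

    static void writer() {  // runs on Core0, program order: v1 first
        v1 = 1;
        v2 = 1;
    }

    static void reader() {  // runs on Core1
        // Possible observation: v2 == 1 but v1 == 0 -- the stale cached v1
        // makes the updates appear reordered, though none were.
        if (v2 == 1 && v1 == 0) {
            System.out.println("pseudo-reordering observed");
        }
    }
}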

The MESI protocol

Back to the examples from the visibility problem and the definition of visibility. To solve this problem, just use the definition of visibility: After Core0 changes v, let Core1 get the latest value of v before using it.

One option is to synchronize the value to every other cache containing v each time v is modified; another is to synchronize only the final modification, just before it is used. The latter performs better. How to achieve it:

  1. After Core0 modifies v, it sends a signal marking v in Core1's cache as invalid, and writes the modified value back to memory.
  2. Core0 may modify v multiple times, sending just one signal each time (the signal locks the bus between the caches); v in Core1's cache remains invalidated.
  3. Before Core1 uses v, it finds that its cached v has been invalidated, learns that v has been modified, and reloads v from another cache or from memory.

This is the basic principle of MESI (Modified, Exclusive, Shared, Invalid). The description is not rigorous, but it is enough for understanding cache visibility (more commonly called cache consistency).

The MESI protocol addresses the visibility problem at the CPU cache level.

Here is a brief look at the MESI protocol cache state machine:

States (a toy code model follows the list):

  • M (Modified): the local processor has modified the cache line (a dirty line); its contents differ from those in memory, and this cache holds the only copy (private).
  • E (Exclusive): the cache line's contents are the same as in memory, and no other processor caches this line.
  • S (Shared): the cache line's contents are the same as in memory, and other processors may also hold copies of this line.
  • I (Invalid): the cache line is invalid and cannot be used.
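
To make the state machine above concrete, here is a toy model of the transitions for a single cache line; this is an illustrative simplification (real protocols involve bus snooping and more events), not how hardware is implemented:

enum MesiState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID;

    MesiState onLocalWrite() {
        // A local write dirties the line; other copies must be invalidated first.
        return MODIFIED;
    }

    MesiState onLocalRead() {
        // An invalid line must be reloaded; assume another cache holds a copy,
        // so it comes back as Shared (it would be Exclusive if no one else did).
        return this == INVALID ? SHARED : this;
    }

    MesiState onRemoteWrite() {
        // Another core's write invalidates the local copy.
        return INVALID;
    }

    MesiState onRemoteRead() {
        // A Modified line is written back, then downgraded; Exclusive becomes Shared.
        return (this == MODIFIED || this == EXCLUSIVE) ? SHARED : this;
    }
}
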
The remaining problems

Now that the MESI protocol exists, are the visibility semantics of volatile unnecessary? Of course not. Three problems remain:

  • Not all hardware architectures provide the same consistency guarantees, and JVMs need unified semantics for volatile (besides, MESI only addresses the CPU cache level, not the other levels).
  • Visibility problems are not limited to the CPU cache; they also arise in the memory model maintained by the JVM itself. Marking variables volatile solves visibility at the JVM level.
  • If true reordering is set aside, MESI does solve the visibility problem at the CPU cache level; however, true reordering can also cause visibility problems.

For now, call the first problem the “memory visibility” problem; memory barriers solve it, as discussed later.

Compiler compile-time optimization

The JVM also has visibility issues in the memory model it maintains itself; marking a variable volatile and disabling caching of it solves visibility at the JVM level. Compiler reordering is handled along the same lines.

Why does the compiler reorder? Its purpose is the same as out-of-order execution: rather than wait for a blocking instruction (such as one waiting for a cache flush) to complete, it is better to execute other instructions first. Compared with processor out-of-order execution, compiler reordering can achieve larger-scale and better out-of-order optimization.

Because of the same purpose and similar principle as processor out-of-order execution, the implementation principle of compiler reordering is not discussed here.

Fortunately, reordering at the compiler level can be controlled by the compiler. With volatile, reordering at the compiler level can be disabled.

Out-of-order optimization of processor execution

The out-of-order optimization at the processor level saves a lot of wait time and improves the performance of the processor.

“Out of order” may sound lawless, but it actually follows certain rules: two instructions may be reordered as long as there is no data dependency between them. Leaving aside the precise definition of data dependency, it can be understood as: two instructions may be reordered as long as the result of single-threaded, sequential execution of the program is unaffected.

Without out-of-order optimization, the instruction execution process of the processor is as follows:

  1. Instruction fetch.
  2. If the input operands are available (for example, already in registers), the instruction is dispatched to the appropriate functional unit. If one or more operands are unavailable in the current clock cycle (usually because they must be fetched from main memory), the processor waits until they are available.
  3. The instruction is executed in the appropriate functional unit.
  4. The functional unit writes the result back to the register.

The execution process of out-of-order optimization is as follows:

  1. Instruction fetch.
  2. The instruction is dispatched to an instruction queue (also called an execution buffer or reservation station).
  3. The instruction waits in the queue until its data operands are available. It is then allowed to leave the queue before earlier, older instructions. (Here is where the out-of-order happens.)
  4. The instruction is assigned to an appropriate functional unit and executed.
  5. The result is placed in a queue (the reorder buffer).
  6. The result of the instruction is written to the register only after all instructions before it have written their results. (Here the out-of-order results are put back in order.)

Of course, out-of-order optimization needs many more supporting techniques, such as register renaming and branch prediction, but the above is enough for a general understanding. A memory barrier implementation based on this note is described later.

Out-of-order optimization did not affect correctness in the single-core era. But in the multicore era, multithreading achieves true parallelism across cores, and problems arise once threads share data. Look at a classic piece of code:

public class OutofOrderExecution {
    private static int x = 0, y = 0;
    private static int a = 0, b = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(new Runnable() {
            public void run() {
                a = 1;
                x = b;
            }
        });
        Thread t2 = new Thread(new Runnable() {
            public void run() {
                b = 1;
                y = a;
            }
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("(" + x + ", " + y + ")");
    }
}

Regardless of compiler reordering and cache visibility issues, what might the above code output be?

The most obvious results are (0,1), (1,0), or (1,1): threads t1 and t2 may run one after the other in either order, or execute alternately.

However, the result could also be (0,0), which seems counterintuitive. It is the result of processor out-of-order execution: the two lines of code inside thread t1 have no data dependency, so x = b may be reordered before a = 1; meanwhile, y = a in thread t2 executes before a = 1 in thread t1. A possible execution order:

  1. t1: x = b
  2. t2: b = 1
  3. t2: y = a
  4. t1: a = 1

Here, code is equated with instructions; that is not rigorous, but it does not hinder understanding.

It might seem fair to call the above reordering (out-of-order) problem a “visibility” problem as well. However, this kind of reordering is far more harmful than a mere visibility problem, because not all instructions are simple reads or writes: the partial-initialization example mentioned in “How many ways can you write a singleton in an interview?” and in the volatile keyword article shows how such unsafe publication is caused by reordering. Therefore, it is not appropriate to file reordering under “visibility”; we can only say that reordering causes visibility problems.

That is, it is not enough to solve the memory visibility problem alone; you also need to specifically address the processor reordering problem.

Of course, some processors do not reorder instructions, or can restrict out-of-order execution based on data dependencies across cores. In those cases, volatile serves only to unify the reordering semantics.

The memory barrier

A Memory Barrier is the same thing as a Memory Fence.

Visibility and reordering problems at the compiler level are solved by the volatile flag. Memory barriers solve visibility and reordering problems at the hardware level.

The monkey has not yet verified the following analysis; it is a judgment based on logic and system-design considerations. I'll fill in the verification later.

Standard

Let’s start with two instructions:

  • Store: Flushes the data cached by the processor to memory.
  • Load: Copies data stored in memory to the processor cache.

Barrier types:

  • LoadLoad Barriers. Example: Load1; LoadLoad; Load2. Ensures that Load1's data is loaded before the data of Load2 and all subsequent load instructions.
  • StoreStore Barriers. Example: Store1; StoreStore; Store2. Ensures that Store1's data is flushed to memory (becoming visible to other processors) before Store2 and all subsequent store instructions.
  • LoadStore Barriers. Example: Load1; LoadStore; Store2. Ensures that Load1's data is loaded before Store2 and all subsequent store instructions flush their data to memory.
  • StoreLoad Barriers. Example: Store1; StoreLoad; Load2. Ensures that Store1's data is flushed to memory and visible to other processors before Load2 and all subsequent load instructions execute; it makes all memory-access instructions (stores and loads) before the barrier complete before any memory-access instruction after the barrier executes.

StoreLoad Barriers have the effect of the other three barriers combined, and the corresponding instruction is often called mfence; it is supported by most processors today. However, this barrier is considerably more expensive than the others.
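
For a JVM-level reference point: since Java 9, java.lang.invoke.VarHandle exposes static fence methods that roughly correspond to these barrier types (loadLoadFence, storeStoreFence, fullFence, and so on). A minimal sketch of a publish/consume pattern using them (the class and fields are made up):

import java.lang.invoke.VarHandle;

public class FenceDemo {
    static int data = 0;
    static boolean ready = false;  // deliberately not volatile

    static void writer() {
        data = 42;
        VarHandle.storeStoreFence();  // StoreStore: data is stored before ready
        ready = true;
    }

    static void reader() {
        if (ready) {
            VarHandle.loadLoadFence();  // LoadLoad: ready is loaded before data
            System.out.println(data);
        }
        // VarHandle.fullFence() would correspond to the StoreLoad (mfence) case.
    }
}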

With the exception of mfence, however, different CPU architectures implement memory barriers in very different ways and to very different degrees. The strong memory model of Intel CPUs is relatively simple compared with the weak, complex memory model of the DEC Alpha, whose cache is not only layered but also partitioned. Below we discuss the memory barrier implementation of the x86 architecture, the most common one in multithreaded programming.

If you look it up, you’ll see that every article on memory barriers tells a different story. However, it’s important to understand the fundamentals and continue digging as needed.

Either way, memory barrier implementations are designed around out-of-order execution. The basic principle of out-of-order execution was explained in the note above: the core is a sequence buffer, and an instruction is allowed to leave the sequence buffer and start executing before earlier, older instructions as long as its data operands are available. For memory visibility semantics, memory barriers can be implemented with an idea similar to the MESI protocol. As for the mechanism behind the reordering semantics, the monkey has not researched further; one feasible idea is as follows:

  • When the CPU receives a barrier instruction, it does not put the barrier instruction into the sequence buffer; instead, it puts the barrier instruction and all subsequent instructions into a FIFO queue (instructions are fetched in batches, otherwise there would be nothing to reorder).
  • Let all instructions already in the sequence buffer finish their out-of-order execution.
  • Take the barrier instruction from the FIFO queue and execute it (flushing caches and so on, for the memory visibility semantics).
  • Move the remaining instructions in the FIFO queue into the sequence buffer.
  • Resume normal out-of-order execution.

Take the sfence instruction on x86 as an example: stores before the sfence execute first, then the stores after the sfence; out-of-order execution of load instructions is unaffected, except where new data dependencies arise because store reordering across the sfence is disabled. Details below.

Memory barriers for x86 architecture

The x86 architecture does not implement all of the memory barrier types.

Store Barrier

The sfence directive implements Store Barrier, which is equivalent to StoreStore Barriers.

It forces all store instructions before the sfence to execute before the sfence, sends the cache-invalidation signal, and flushes the store buffer into the CPU's L1 Cache; all store instructions after the sfence execute after the sfence itself has executed. That is, reordering of stores across the sfence is prohibited, making all memory updates that happened before the Store Barrier visible.

“Visible” here means both that the modified value is visible (memory visibility) and that the result of the operation is visible (reordering disabled). The same applies below.

The memory barrier standard speaks of coherence between cache and memory, but in practice the same applies to the multilevel caching between registers and cache, and even registers and memory. The x86 architecture uses a variant of the MESI protocol to guarantee coherence between the three cache levels and memory, so the memory barrier only needs to guarantee coherence between the store buffer (which can be thought of as a layer of cache between the registers and the L1 Cache) and the L1 Cache. The same applies below.

Load Barrier

The lfence directive implements a Load Barrier, which is equivalent to LoadLoad Barriers.

It forces all load instructions after the lfence to execute after the lfence, and to wait until the load buffer has been drained by the CPU before executing (these are the reloads triggered when a cache line is found to be invalidated). That is, reordering of loads across the lfence is prohibited; together with the Store Barrier, this makes all memory updates that happened before the Store Barrier visible to all loads after the Load Barrier.

Full Barrier

The mfence directive implements Full Barrier, which is equivalent to StoreLoad Barriers.

The mfence instruction combines the functions of sfence and lfence: it forces all store and load instructions before the mfence to execute before the mfence, and all store and load instructions after it to execute after the mfence itself has executed. That is, reordering of stores and loads across the mfence is prohibited, making all operations that happened before the Full Barrier visible to all operations after it.

How volatile solves memory visibility and processor reordering

At the compiler level, volatile is simply a flag: caching and reordering of the variable are disabled at compile time.

If the hardware architecture itself guarantees memory visibility (for example, a single-core processor, or a sufficiently consistent memory model), volatile is an empty flag and no barriers are inserted for the visibility semantics.

Likewise, if the hardware architecture itself does not reorder at the processor level, has strong enough reordering semantics (it can analyze data dependencies between cores), or is a single-core processor, volatile is an empty flag and no memory barriers are inserted for the reordering semantics.

Otherwise, taking the x86 architecture as an example, the JVM handles volatile variables as follows:

  • After a write to volatile variable v, insert an sfence. Stores before the sfence (including the write of v) are not reordered after it, and stores after it are not reordered before it: store reordering across the sfence is disabled. Moreover, all modifications before the sfence are written back to the cache and marked invalid in other CPUs' caches.
  • Before a read of volatile variable v, insert an lfence. Loads after the lfence (including the read of v) are not reordered before it, and loads before it are not reordered after it: load reordering across the lfence is disabled. Moreover, after the lfence, invalidated cache lines are refreshed first to obtain the latest modified values; together with the sfence, this guarantees memory visibility.

On other platforms, the JVM uses mfence instead of sfence and lfence for stronger semantics.

Together, they implement the rule of volatile variables in the happens-before relationship.
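
Putting the x86 model above into code form, a sketch with the conceptual barrier positions marked as comments (the actual instructions a JIT emits vary; HotSpot on x86, for example, often uses a lock-prefixed instruction after a volatile store instead):

public class VolatileBarriers {
    static int payload = 0;                 // plain field
    static volatile boolean flag = false;

    static void writer() {
        payload = 42;   // plain store
        flag = true;    // volatile store; conceptually followed by sfence:
                        // the payload store cannot be reordered after it
    }

    static void reader() {
        if (flag) {     // volatile load; conceptually preceded by lfence:
                        // the payload load cannot be reordered before it
            System.out.println(payload);  // sees 42 if flag was seen as true
        }
    }
}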

Additional encapsulation of memory barriers by the JVM

Besides volatile, common JVM implementations offer other encapsulations built on memory barriers. With the help of memory barriers, these encapsulations also obtain the memory barrier semantics for visibility and reordering.

Piggybacking

In the JVM, piggybacking usually means combining the happens-before program order rule with some other ordering rule (typically the monitor lock rule or the volatile variable rule) to order accesses to a variable that is not itself protected by a lock.

This article extends piggybacking to a much broader sense: leveraging any existing mechanism to obtain some of its properties. Of course, not every property can be piggybacked on; atomicity, for example, cannot. But by the earlier analysis of memory barriers, visibility and reordering can be. A sketch follows below.
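
A minimal sketch of piggybacking in this broader sense (all names made up): the plain fields of Config piggyback on the volatile write/read inside AtomicReference for their visibility and ordering.

import java.util.concurrent.atomic.AtomicReference;

public class PiggybackPublish {
    static class Config {
        int timeoutMs;   // plain, non-volatile fields
        String host;
    }

    static final AtomicReference<Config> current = new AtomicReference<>();

    static void publish() {
        Config c = new Config();
        c.timeoutMs = 500;        // plain writes...
        c.host = "localhost";
        current.set(c);           // ...made visible by the volatile write inside set()
    }

    static void use() {
        Config c = current.get(); // volatile read: piggybacking guarantees that
        if (c != null) {          // the plain writes above are visible here
            System.out.println(c.host + ":" + c.timeoutMs);
        }
    }
}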

The following discussion is still based on the x86 architecture.

The final keyword

If a field of an instance is declared final, the JVM inserts an sfence after the initialization of the final field.

Static final fields of a class are initialized in the <clinit>() method; their visibility is guaranteed by the JVM's class-loading process.

Instance final fields are initialized in the <init>() method. The sfence prohibits reordering of the stores around it, and ensures that the memory updates made before the final field's initialization are visible.

Let’s talk about partial initialization

These nice properties are called “initialization security.” It ensures that for a correctly constructed object, all threads will see the correct values that the constructor sets for each final field of the object, regardless of how the object is published.

This narrows the visibility guarantee from “memory updates before the final field's initialization” to “the final field's initialization itself”. The monkey did not find the exact reason, and with only one JDK at hand it is inconvenient to verify. Probably it is because the JVM does not require implementations to preserve the order of field-initialization instructions when generating the <init>() method.

Initialization safety offers a new way to solve the partial-initialization problem: if all fields of the object being published are final, reordering of the store of the object's initial reference before the end of construction is prevented. So, in “How many ways can you write a singleton in an interview?”, the third lazy variant can also drop volatile and rely on the sfence semantics of final:

// Complete lazy singleton
// ThreadSafe
public class Singleton1_3 {
  private static Singleton1_3 singleton = null;

  public final int f1 = 1;   // final: avoids the partial-initialization problem
  public final int f2 = 2;

  private Singleton1_3() {}

  public static Singleton1_3 getInstance() {
    if (singleton == null) {
      synchronized (Singleton1_3.class) {
        // must be a complete instance
        if (singleton == null) {
          singleton = new Singleton1_3();
        }
      }
    }
    return singleton;
  }
}

Note that initialization safety only addresses the partial-initialization aspect of safe publication; it is unrelated to other safe-publication issues or to visibility after publication.

CAS

On the x86 architecture, CAS is translated into “lock cmpxchg...”. cmpxchg is the CAS assembly instruction. The lock signal in the CPU architecture guarantees visibility and disallows reordering.

The lock prefix is a special signal that is executed as follows:

  • Lock the bus and the cache.
  • Force all instructions before the lock signal to have executed by that point, and synchronize the related caches.
  • Execute the instruction under the lock prefix (such as cmpxchg).
  • Release the locks on the bus and the cache.
  • Force all instructions after the lock signal to execute after that point, and synchronize the related caches.

Thus, the LOCK signal, while not a memory barrier, has the semantics of mfence (and, of course, exclusivity).

Compared with a memory barrier, the lock signal additionally locks the bus and the cache, and is therefore more costly.
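
A minimal sketch of CAS in use via AtomicInteger, whose compareAndSet compiles down to lock cmpxchg on x86; the retry loop is the standard pattern:

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger count = new AtomicInteger(0);

    public int increment() {
        for (;;) {
            int current = count.get();
            int next = current + 1;
            // compareAndSet succeeds only if count still equals current;
            // the lock-prefixed cmpxchg also acts as a full barrier.
            if (count.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}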

The lock

The JVM's built-in locks are implemented through monitors provided by the operating system. Whatever a monitor's implementation principle, a monitor is a mutually exclusive resource, and modifying a mutually exclusive resource requires at least one CAS operation. Therefore, locks also need the lock signal, and thus have mfence semantics.

The mfence semantics of the lock implement the monitor lock rule in the happens-before relationship.
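
A minimal sketch of the monitor lock rule that this implements (names made up): the unlock at the end of writer() happens-before a later lock of the same monitor in reader(), so the plain write is visible.

public class MonitorRule {
    private final Object lock = new Object();
    private int value;  // plain field, guarded by lock

    void writer() {
        synchronized (lock) {
            value = 42;
        } // unlock: happens-before any subsequent lock of the same monitor
    }

    int reader() {
        synchronized (lock) { // lock: sees all writes made before the prior unlock
            return value;
        }
    }
}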

CAS has the same mfence semantics, so it necessarily provides the same partial-order guarantees as locks, although the JVM does not explicitly require this.


Reference:

  • Understanding system architecture from Java perspective (1)CPU context switch
  • Understanding system architecture from Java perspective (2) CPU cache
  • Understanding system architecture from Java perspective (3) Pseudo-sharing
  • Volatile: What do you know about volatile?
  • Out-of-order execution
  • (Repost) CPU out-of-order execution principle
  • Research on Java memory access reordering
  • Talk about out-of-order execution and memory barriers
  • Memory barrier
  • Talk about atomic variables, locks, memory barriers
  • High concurrency (35) Java memory model (3) Understanding the memory barrier

This article is published under the Creative Commons Attribution-ShareAlike 4.0 International License. Attribution and a link to this article must be retained.