The origin of the CPU cache

CPUs have developed rapidly, while memory and disk speeds have not kept pace. As a result, reading and writing data in memory is slow relative to the CPU.

To solve this problem, CPU manufacturers built three levels of cache (L1, L2, and L3) into the CPU to bridge the gap between I/O speed and CPU speed. These cache levels reduce the interaction between the CPU and memory.

Register: a limited number of data storage units in the CPU

Cache lines: the unit of transfer between cache and memory, usually 64 bytes. For example, if L1 is 512 KB and a cache line is 64 bytes, L1 has (512 * 1024) / 64 = 8192 cache lines.

The closer a storage device is to the CPU, the faster it is and the smaller its capacity.

  • By speed: register > L1 > L2 > L3 > memory
  • By capacity: register < L1 < L2 < L3 < memory

The process by which the CPU reads data (a simplified code sketch follows this list):

  1. If the data is already in a register, it can be read directly.
  2. If the data is in L1, the corresponding cache line is locked first, then the data is placed into a register, and finally the line is unlocked.
  3. If the data is in L2, L1 is checked first. On an L1 miss, the corresponding cache line in L2 is locked, the data is copied into L1, the L1 read from the previous step is performed, and the line is then unlocked.
  4. Reading from L3 works the same way.
  5. If the data has to come from main memory, a bus lock is requested first, a read request is sent to memory, the returned data is stored in L3, the steps above are repeated, and the bus lock is released once everything is done.
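
As a rough mental model only (real hardware works on 64-byte cache lines and handles the locking itself), the lookup order above can be sketched in Java; the class and method names below are made up for illustration:

import java.util.HashMap;
import java.util.Map;

// A highly simplified model of the L1 -> L2 -> L3 -> main memory lookup order.
// It only illustrates "check the closer level first, fill it on a miss";
// it does not model cache lines, locking, or eviction.
public class CacheHierarchySketch {

    private final Map<Long, Long> l1 = new HashMap<>();
    private final Map<Long, Long> l2 = new HashMap<>();
    private final Map<Long, Long> l3 = new HashMap<>();
    private final Map<Long, Long> mainMemory = new HashMap<>();

    long read(long address) {
        if (l1.containsKey(address)) {          // L1 hit: fastest path
            return l1.get(address);
        }
        if (l2.containsKey(address)) {          // L2 hit: copy the value up into L1
            long value = l2.get(address);
            l1.put(address, value);
            return value;
        }
        if (l3.containsKey(address)) {          // L3 hit: copy the value up into L2 and L1
            long value = l3.get(address);
            l2.put(address, value);
            l1.put(address, value);
            return value;
        }
        long value = mainMemory.getOrDefault(address, 0L);  // miss everywhere: go to memory
        l3.put(address, value);                 // fill every level on the way back
        l2.put(address, value);
        l1.put(address, value);
        return value;
    }
}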

The locality principle of CPU memory access:

There are two aspects: temporal locality and spatial locality.

  • Temporal locality: if a piece of data is being accessed, it is likely to be accessed again in the near future (for example, in loops or recursion); colloquially, this is caching hot data.
  • Spatial locality: when a piece of data is accessed, data adjacent to it is likely to be accessed soon as well.

Let’s look at an example of spatial locality:

import java.util.Date;

public class MyTest {

    private static int[][] arr = new int[1024 * 1024][6];

    static {
        for (int i = 0; i < 1024 * 1024; i++) {
            for (int j = 0; j < 6; j++) { arr[i][j] = j; }
        }
        System.err.println("Arr 2d array initialization complete");
    }

    public static void main(String[] args) {
        // Read row by row: the addresses are sequential, so a single cache line
        // fetched from memory covers several adjacent elements -> fast.
        int sum = 0;
        long startTime = new Date().getTime();
        for (int i = 0; i < 1024 * 1024; i++) {
            for (int j = 0; j < 6; j++) { sum += arr[i][j]; }
        }
        System.err.println("Time consuming: " + (new Date().getTime() - startTime) + "ms, result: " + sum);

        // Read column by column: the addresses are not sequential, so each memory
        // access brings back only one useful element -> slow.
        sum = 0;
        startTime = new Date().getTime();
        for (int i = 0; i < 6; i++) {
            for (int j = 0; j < 1024 * 1024; j++) { sum += arr[j][i]; }
        }
        System.err.println("Time consuming: " + (new Date().getTime() - startTime) + "ms, result: " + sum);
    }
}

Results:

Arr 2d array initialization complete
Time consuming: 15ms, result: 15728640
Time consuming: 152ms, result: 15728640

MESI Cache consistency protocol

We talked about the CPU cache above. In a multi-core, multi-CPU setup, does this raise a cache consistency problem?

Yes. The early solution was to add a bus lock, but a bus lock wastes the processing power of multiple cores and CPUs. The later solution is what we call the MESI Cache consistency protocol.

Each of the four letters of MESI stands for a state used to mark a cache line, as follows:

  • M (Modified): the cache line is valid; the data has been modified and is inconsistent with main memory, and the data exists only in the current cache line.
  • E (Exclusive): the cache line is valid and the data is consistent with main memory; the data exists only in the current cache.
  • S (Shared): the cache line is valid and the data is consistent with main memory; the data also exists in other caches.
  • I (Invalid): the cache line is invalid, that is, its data is garbage.
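
A minimal Java sketch of the four states (it only documents the states named above; it does not model the real hardware protocol, and the enum name is made up):

// The four MESI states a cache line can be in (conceptual only).
public enum MesiState {
    MODIFIED,  // valid, differs from main memory, only this cache holds the data
    EXCLUSIVE, // valid, same as main memory, only this cache holds the data
    SHARED,    // valid, same as main memory, other caches may also hold the data
    INVALID    // the line is stale and must not be used
}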

Let’s use i++ as an example to see how it works:

Single-core operation: the CPU first reads i from main memory into its cache and marks the cache line E (exclusive); it then performs i + 1 in a register and writes the result back to the cache, changing the state from E (exclusive) to M (modified); finally it writes the value back to main memory and, once that completes, sets the line for i back to E (exclusive).

Multi-core operation:

  1. CPU1 reads data i from main memory into its cache and sets the line's state to E (exclusive).
  2. CPU2 also reads i from main memory. CPU1 detects the address conflict (via bus sniffing), and the copies of i on both CPU1 and CPU2 are set to S (shared).
  3. CPU1 then performs i + 1 in a register, writes the result back to its cache, changes the state of i from S (shared) to M (modified), and broadcasts on the bus that the data has been modified.
  4. The other CPUs that sniff this message set their corresponding cache lines to I (invalid) and send back a response (the equivalent of an ACK).
  5. Once CPU1 has received the responses, it writes the data back to main memory and sets the line for i to E (exclusive). The other CPUs must then go back to main memory to re-read the value.

Since the state transition depends on ACKs from the other CPUs, could the CPU block because an ACK takes too long to arrive?

To avoid this problem, store buffers were introduced. The CPU writes the value into a store buffer and then moves on to other instructions; the data is finally committed only after the other CPUs have returned their invalidate acknowledgements. But this optimization creates new problems:

  1. There is no way to control when the store buffer will be flushed.
  2. While executing other instructions, the current CPU keeps reading from its own store buffer, but that data has not yet been committed, which can lead to incorrect results.

JMM memory model

The JMM (Java Memory Model) is an abstract concept: a set of rules or specifications that defines how the variables in a program may be accessed.

  1. The JMM specifies that all variables are stored in main memory, a shared area accessible to all threads.
  2. Each thread has its own working memory, and threads are isolated from one another. All operations on a variable must take place in working memory: the thread first copies the variable from main memory into its working memory, performs its reads and writes there, and writes the result back to main memory when the operation is complete.

As shown in the figure below:

If multiple threads operate on the same variable and write it back to main memory at the same time, is there a concurrency problem that makes the threads unsafe?

The JMM provides eight atomic operations to solve this problem (which must be performed in the following order)

  1. Lock: acts on a variable in main memory, marking it as exclusively owned by one thread.
  2. Read: acts on a variable in main memory, transferring its value out of main memory for the subsequent load operation.
  3. Load: acts on a variable in working memory, putting the value obtained by read into the working-memory copy of the variable.
  4. Use: acts on a variable in working memory, passing its value to the execution engine.
  5. Assign: acts on a variable in working memory, assigning a value received from the execution engine to the working-memory variable.
  6. Store: acts on a variable in working memory, transferring its value to main memory for the subsequent write operation.
  7. Write: acts on a variable in main memory, putting the value obtained by store into the main-memory variable.
  8. Unlock: acts on a variable in main memory, releasing it from the locked state so that it can be locked by other threads.

As shown in the figure below:
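
As a rough illustration (the mapping from bytecode to these abstract operations is conceptual and handled by the JVM, not something you write yourself), a simple increment of a shared field can be annotated with the operations it corresponds to; the class name below is made up:

public class JmmOperationsSketch {

    private static int shared = 0;   // stored in main memory, accessible to all threads

    public static void main(String[] args) {
        // Conceptually, the statement below breaks down into the operations above:
        //   read   - the value of shared is transferred out of main memory
        //   load   - the value is placed into this thread's working memory
        //   use    - the working-memory value is handed to the execution engine
        //   assign - the computed value (old value + 1) is put back into working memory
        //   store  - the working-memory value is transferred toward main memory
        //   write  - the value is written into the main-memory variable
        // lock and unlock only come into play when synchronized is used.
        shared = shared + 1;
        System.out.println(shared);
    }
}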

Let’s look at atomicity, visibility, and ordering in concurrent programming:

  • Atomicity: The operation is not interrupted, and in a multi-threaded environment, the operation once started is not affected by other threads.

  • Visibility: Operations performed by one thread on a shared variable can be immediately sensed by other threads.

  • Ordering: the program executes in the order the code is written, whether single-threaded or multi-threaded (once instructions are reordered, execution is no longer strictly sequential).

So how does the JMM model solve the problems of atomicity, visibility, and order of concurrent programming?

Atomicity issues: apart from the atomic reads and writes of basic data types that the JVM itself guarantees, atomicity can be achieved with synchronized and Lock.

Visibility issues: The volatile keyword guarantees visibility. When a shared variable is volatile, it ensures that the value changed by one thread can be seen by other threads in a timely manner. Synchronized and Lock also guarantee visibility because they guarantee that only one thread can operate on a shared resource at a time.

Ordering problems: the volatile keyword guarantees ordering, and so can synchronized and Lock.
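
To make the atomicity point concrete, here is a small sketch contrasting a plain counter++ (a read-modify-write that is not atomic, so increments can be lost) with the same work done under synchronized; the class and field names are made up, and java.util.concurrent.atomic.AtomicInteger would work equally well:

public class AtomicityDemo {

    private static int unsafeCounter = 0;
    private static int safeCounter = 0;
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                unsafeCounter++;               // read-modify-write, not atomic
                synchronized (LOCK) {
                    safeCounter++;             // protected by a lock, effectively atomic
                }
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // unsafeCounter is often less than 200000; safeCounter is always 200000.
        System.out.println("unsafe: " + unsafeCounter + ", safe: " + safeCounter);
    }
}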

Happens-before principle

As mentioned above, relying solely on synchronized and volatile to ensure atomicity, visibility, and ordering would be a lot of trouble for developers. The JDK designers thought of this and provided a set of happens-before rules to help guarantee atomicity, visibility, and ordering during program execution:

  1. Program order rule: within a single thread, code appears to execute in program order.
  2. Monitor lock rule: for the same lock, an unlock operation happens-before every subsequent lock operation.
  3. Volatile variable rule: a write to a volatile variable happens-before every subsequent read of that variable, which guarantees the timely visibility of volatile.
  4. Transitivity rule: if operation A happens-before operation B, and B happens-before C, then A happens-before C.
  5. Thread start rule: a thread's start() method happens-before every action in the started thread.
  6. Thread interruption rule: a call to a thread's interrupt() method happens-before the interrupted thread detects the interruption.
  7. Thread termination rule: every action in a thread happens-before the detection that the thread has terminated (for example, via join()).
  8. Object finalization rule: the completion of an object's initialization (the end of its constructor) happens-before the start of its finalize() method.
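
As a small sketch of rules 5 and 7 (thread start and thread termination): whatever the main thread writes before calling start() is visible inside the new thread, and whatever the new thread writes is visible after join() returns. The class name below is made up:

public class HappensBeforeDemo {

    private static int data = 0;

    public static void main(String[] args) throws InterruptedException {
        data = 42;                       // written before start()
        Thread worker = new Thread(() -> {
            // Thread start rule: guaranteed to see data == 42 here.
            data = data + 1;
        });
        worker.start();
        worker.join();
        // Thread termination rule: after join(), the worker's write is visible.
        System.out.println(data);        // always prints 43
    }
}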

Understanding instruction reordering

The Java language allows instructions to be executed in a different order from the code, as long as the final result of the program is the same as if it had executed in order; this process is called instruction reordering.

What’s the point of reordering?

Given the characteristics of the CPU (multi-level caches, multiple cores), the JVM rearranges machine instructions appropriately so that they better match the CPU's execution characteristics and make the most of it.

The principle behind instruction reordering: as-if-serial

No matter how instructions are reordered, the result of single-threaded execution must not change.
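
The classic illustration: inside each thread the code below respects as-if-serial, but across threads the two writes may appear out of order (or simply not be visible yet). Whether the anomaly actually shows up depends on the JIT and the CPU, so treat this as a sketch rather than a guaranteed reproduction; declaring flag as volatile would forbid it. The class name is made up:

public class ReorderingSketch {

    private static int a = 0;
    private static boolean flag = false;   // making this volatile would forbid the anomaly

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            a = 1;          // (1)
            flag = true;    // (2) may appear to the reader as if it happened before (1)
        });
        Thread reader = new Thread(() -> {
            if (flag && a == 0) {
                // Possible outcome without volatile/synchronization.
                System.out.println("reordering (or stale read) observed");
            }
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}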

Volatile keyword

Volatile is a lightweight synchronization mechanism provided by the JVM that does two things:

  • Ensure that volatile shared variables are visible to other threads in a timely manner. This means that when a thread makes a change to a volatile shared variable, other threads are immediately aware of it.
  • Disables instruction reordering optimization

So how does volatile do these two things?

  1. Timely visibility

Let’s look at a simple piece of code:

    private volatile static int a = 1;

When writing to a volatile shared variable, the JVM sends the CPU an instruction with a Lock prefix. When the CPU recognizes the Lock prefix, it triggers the cache consistency protocol we talked about above: through the bus sniffing mechanism, the other processors are notified that the shared variable has been modified, their copies of the variable are invalidated, and the data is written back to main memory promptly.
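
A common way to observe this in practice (behavior can vary by JVM and platform): without volatile, the worker thread below may spin forever on its cached copy of the flag; with volatile, it stops promptly. The class name is made up:

public class VolatileVisibilityDemo {

    private static volatile boolean running = true;   // try removing volatile to compare

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) {
                // busy-wait; with volatile the updated value is seen promptly
            }
            System.out.println("worker stopped");
        });
        worker.start();
        Thread.sleep(1000);
        running = false;     // volatile write: other threads see it in time
        worker.join();
    }
}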

  2. Disables instruction reordering optimization

We all know that the JMM allows the compiler and CPU to reorder instructions for performance optimization without changing the correct semantics, but what if you want to prevent reordering? The answer is that you can add memory barriers.

Volatile prevents instruction reordering by means of memory barriers.

Intel hardware level memory barriers:

  1. lfence: a Load Barrier
  2. sfence: a Store Barrier
  3. mfence: a full barrier with the capabilities of both lfence and sfence
  4. The lock prefix is not a memory barrier per se, but it can act like one.

Different hardware implements memory barriers in different ways. The JMM shields these hardware differences, and the JVM generates the appropriate machine code for each hardware platform. The JVM provides the following memory barriers:

The Java compiler inserts memory barrier instructions at the appropriate places when generating the instruction sequence to prevent particular types of processor reordering. To implement the memory semantics of volatile, the JMM restricts certain kinds of compiler and processor reordering and specifies a volatile reordering rule table for compilers:

“NO” in the table means that the reordering is forbidden. To implement volatile's memory semantics, the compiler inserts memory barriers into the instruction sequence when generating bytecode to prevent particular types of processor reordering. Since it is practically impossible for the compiler to find an optimal arrangement that minimizes the total number of barriers, the JMM takes a conservative approach:

  1. Insert a StoreStore barrier before each volatile write;
  2. Insert a StoreLoad barrier after each volatile write;
  3. Insert a LoadLoad barrier after each volatile read;
  4. Insert a LoadStore barrier after each volatile read operation.

Note that a volatile write has a memory barrier inserted both before and after it, while a volatile read has two memory barriers inserted after it.

StoreStore barrier: prohibits normal writes above it from being reordered with the volatile write below it.

StoreLoad barrier: prohibits the volatile write above it from being reordered with volatile reads/writes below it.

LoadLoad barrier: prohibits the volatile read above it from being reordered with normal reads below it.

LoadStore barrier: prohibits the volatile read above it from being reordered with normal writes below it.
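
To put the conservative strategy in context, the comments below mark where the JMM conceptually places barriers around volatile accesses. You never write these barriers yourself; the JIT emits them (on x86 the StoreLoad case typically becomes a lock-prefixed instruction). The class and field names are made up:

public class VolatileBarrierSketch {

    private int a;                 // normal field
    private volatile int v;        // volatile field

    public void write(int value) {
        a = value;                 // normal write
        // StoreStore barrier: the normal write above cannot be reordered below the volatile write
        v = value;                 // volatile write
        // StoreLoad barrier: the volatile write cannot be reordered with later volatile reads/writes
    }

    public int read() {
        int r = v;                 // volatile read
        // LoadLoad barrier: normal reads below cannot be reordered above the volatile read
        // LoadStore barrier: normal writes below cannot be reordered above the volatile read
        return r + a;              // the normal read of a stays after the barriers
    }
}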

As shown in the figure below: