This article is adapted from Chapter 2 of The Art of Java Concurrent Programming by Fang Tengfei, Wei Peng, and Cheng Xiaoming.

An atom is "the smallest particle that cannot be divided further", and an atomic operation is "an operation, or series of operations, that cannot be interrupted". Implementing atomic operations on multiprocessor systems is more complicated. Let's look at how atomic operations are implemented in Intel processors and in Java.

Terminology

Before we look at how atomic operations work, let’s look at the terminology:

| Term | Explanation |
| --- | --- |
| Cache line | The smallest unit of storage the cache operates on. |
| Compare and swap (CAS) | A CAS operation takes two values: an expected old value (the value before the operation) and a new value. During the operation, the current value is compared with the expected old value; if it has not changed, it is replaced with the new value, otherwise nothing is replaced. |
| CPU pipeline | A CPU pipeline works like an assembly line in a factory: inside the CPU, five or six circuit units with different functions form an instruction-processing pipeline, and each x86 instruction is split into five or six steps executed by these units in turn, which increases the CPU's instruction throughput. |
| Memory order violation | A memory order violation is usually caused by false sharing, where multiple CPUs modify different parts of the same cache line at the same time, invalidating one CPU's work. When a memory order violation occurs, the CPU must flush its pipeline. |
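
As a mental model, CAS behaves like the following Java method executed as a single indivisible step (a sketch of the semantics only; the real operation is one hardware instruction such as CMPXCHG, not Java code):

    class CasSemantics {
        // The semantics of compare-and-swap, written as if the whole method
        // body ran atomically. A real processor executes this as one
        // indivisible instruction.
        static boolean compareAndSwap(int[] cell, int expected, int newValue) {
            if (cell[0] == expected) { // has the value changed since we read it?
                cell[0] = newValue;    // no: install the new value
                return true;
            }
            return false;              // yes: report failure so the caller can retry
        }
    }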

How does the processor implement atomic operations

32-bit IA-32 processors implement atomic operations between multiple processors through cache locking or bus locking. First, the processor automatically guarantees the atomicity of basic memory operations: it guarantees that reading or writing a single byte of system memory is atomic, meaning that while one processor reads a byte, no other processor can access that byte's memory address. Pentium 6 and newer processors also automatically guarantee the atomicity of 16-, 32-, and 64-bit operations performed by a single processor within one cache line. However, the processor cannot automatically guarantee atomicity for complex memory operations, such as accesses that cross the bus width, cross multiple cache lines, or cross page tables. For these cases, the processor provides bus locking and cache locking mechanisms to ensure atomicity.

In Intel's 2019 documentation, this part remains largely unchanged; see Section 8.1, "LOCKED ATOMIC OPERATIONS", on page 2957 of the Intel manual linked at the end of this article.

Use bus locks to ensure atomicity

The first mechanism guarantees atomicity through a bus lock. If multiple processors read and modify a shared variable at the same time (i++ is the classic read-modify-write operation), the shared variable is operated on by multiple processors simultaneously, the read-modify-write is not atomic, and the shared variable can end up with a value different from the expected one. For example, if i = 1 and i++ is performed twice, we expect 3, but the result might be 2, as shown in Figure 2-3.

The reason is that both processors may read variable i from their own caches at the same time, each increment it by one, and each write the result back to system memory. To make the read-modify-write of a shared variable atomic, CPU1 must ensure that while it reads and modifies the shared variable, CPU2 cannot operate on the cache that holds the shared variable's memory address. Processors solve this with a bus lock: the processor asserts the LOCK# signal on the bus, and while this signal is asserted, requests from other processors are blocked, so the locking processor has exclusive use of the shared memory.
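
The lost update is easy to reproduce in Java; below is a minimal sketch (the class name and iteration counts are mine), where the plain i++ is the non-atomic read-modify-write just described:

    import java.util.ArrayList;
    import java.util.List;

    public class LostUpdate {
        static int i = 0; // shared variable, incremented without any locking

        public static void main(String[] args) throws InterruptedException {
            List<Thread> threads = new ArrayList<>();
            for (int t = 0; t < 2; t++) {
                Thread thread = new Thread(() -> {
                    for (int k = 0; k < 100_000; k++) {
                        i++; // read, add one, write back: three steps, not one
                    }
                });
                threads.add(thread);
                thread.start();
            }
            for (Thread thread : threads) {
                thread.join();
            }
            // Expected 200000, but one thread's write can overwrite the
            // other's concurrent update, so the result is usually smaller.
            System.out.println(i);
        }
    }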

See Section 8.1.2, "Bus Locking", on page 2959 of the Intel manual.

Use cache locks to ensure atomicity

The second mechanism guarantees atomicity through cache locking. Often we only need the operation on one particular memory address to be atomic, but a bus lock locks all communication between the CPU and memory, so while the lock is held no other processor can operate on the data at any other memory address either. Bus locking is therefore expensive, and in certain situations current processors use cache locking instead of bus locking as an optimization.

Frequently used memory is cached in the processor's L1, L2, and L3 caches, so atomic operations can be performed directly in the processor's internal cache without asserting a bus lock. On Pentium 6 and current processors, complex atomicity can be achieved with "cache locking". A cache lock means that if the memory area being operated on is cached in the processor's cache line and is locked for the duration of the LOCK operation, then when the locked operation writes back to memory, the processor does not assert the LOCK# signal on the bus; instead it modifies the memory address internally and relies on the cache coherence mechanism to guarantee atomicity. The cache coherence mechanism prevents two or more processors from simultaneously modifying the same cached memory area: when another processor writes back data in a cache line that has been locked, that cache line is invalidated. In the example in Figure 2-3, when CPU1 modifies i in its cache line using a cache lock, CPU2 cannot simultaneously cache the cache line holding i.

But there are two situations in which the processor does not use cache locking:

  • In the first case, the processor falls back to a bus lock when the data being operated on cannot be cached inside the processor, or when it spans more than one cache line.
  • In the second case, some processors do not support cache locking: Intel 486 and Pentium processors assert a bus lock even when the locked memory region is in the processor's cache line.

Both mechanisms are reached through instructions with a LOCK prefix that Intel processors provide, for example the bit test-and-modify instructions BTS, BTR, and BTC, the exchange instructions XADD and CMPXCHG, and other arithmetic and logic instructions on memory operands (such as ADD and OR). The memory area operated on by a LOCK-prefixed instruction is locked, so other processors cannot access it at the same time.

See Section 8.1.4, "Effects of a LOCK Operation on Internal Processor Caches", in the Intel manual.

How does Java implement atomic operations

Atomic operations in Java can be implemented through locking and through spin (loop) CAS.

Use spin CAS to implement atomic operations

CAS operations in the JVM use the CMPXCHG (compare and exchange) instruction provided by the processor. The basic idea of a spin CAS implementation is to retry the CAS operation in a loop until it succeeds. The following code implements a thread-safe CAS-based counter, safeCount, and a non-thread-safe counter, count.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public class Counter {

        private AtomicInteger atomicI = new AtomicInteger(0);
        private int i = 0;

        public static void main(String[] args) {
            final Counter cas = new Counter();
            List<Thread> ts = new ArrayList<Thread>(600);
            long start = System.currentTimeMillis();
            for (int j = 0; j < 100; j++) {
                Thread t = new Thread(new Runnable() {
                    @Override
                    public void run() {
                        for (int i = 0; i < 10000; i++) {
                            cas.count();
                            cas.safeCount();
                        }
                    }
                });
                ts.add(t);
            }
            for (Thread t : ts) {
                t.start();
            }
            // Wait for all threads to complete
            for (Thread t : ts) {
                try {
                    t.join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            System.out.println(cas.i);             // usually less than 1,000,000
            System.out.println(cas.atomicI.get()); // always exactly 1,000,000
            System.out.println(System.currentTimeMillis() - start);
        }

        /**
         * Thread-safe counter implemented with spin CAS.
         */
        private void safeCount() {
            for (;;) {
                int i = atomicI.get();
                // ++i only increments the local copy; the CAS installs i + 1
                boolean suc = atomicI.compareAndSet(i, ++i);
                if (suc) {
                    break;
                }
            }
        }

        /**
         * Non-thread-safe counter.
         */
        private void count() {
            i++;
        }
    }

Since Java 1.5, the JDK has provided classes that support atomic operations, such as AtomicBoolean (a boolean value updated atomically), AtomicInteger (an int value updated atomically), and AtomicLong (a long value updated atomically). These atomic wrapper classes also provide useful utility methods, such as atomically incrementing or decrementing the current value by one.
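
For instance, these utility methods replace the hand-written spin loop above; a short sketch (the class name is mine, the API calls are standard java.util.concurrent.atomic):

    import java.util.concurrent.atomic.AtomicInteger;

    public class AtomicUtilityDemo {
        public static void main(String[] args) {
            AtomicInteger counter = new AtomicInteger(0);
            counter.incrementAndGet();         // atomically add 1, return new value: 1
            counter.getAndIncrement();         // atomically add 1, return old value: 1
            counter.addAndGet(5);              // atomically add 5, return new value: 7
            counter.decrementAndGet();         // atomically subtract 1, return 6
            System.out.println(counter.get()); // prints 6
        }
    }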

Three major problems with CAS implementation of atomic operations

Some concurrency frameworks in Java also use spin CAS to implement atomic operations, such as the xfer method of the LinkedTransferQueue class. Although CAS solves the atomic-operation problem very efficiently, it still has three major problems: the ABA problem, long spin times with high overhead, and the fact that it guarantees atomicity for only one shared variable.

  1. The ABA problem. CAS checks whether the value has changed and updates it if it has not. But if a value that was originally A changes to B and then back to A, CAS will find that the value "has not changed" even though it actually has. The solution to the ABA problem is to use version numbers: append a version number to the variable and increment it on every update, so A→B→A becomes 1A→2B→3A. Since Java 1.5, the JDK's atomic package has provided the class AtomicStampedReference to address ABA issues. Its compareAndSet method first checks whether the current reference equals the expected reference and whether the current stamp equals the expected stamp, and only if both are equal does it atomically set the reference and the stamp to the given new values:
    public boolean compareAndSet(V   expectedReference, // expected reference
                                 V   newReference,      // new reference
                                 int expectedStamp,     // expected stamp
                                 int newStamp)          // new stamp
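
A minimal sketch of how the stamp defeats the A→B→A pattern (variable names are mine; the API is the standard AtomicStampedReference):

    import java.util.concurrent.atomic.AtomicStampedReference;

    public class AbaDemo {
        public static void main(String[] args) {
            // value "A" with initial stamp (version number) 0
            AtomicStampedReference<String> ref =
                    new AtomicStampedReference<>("A", 0);

            int stamp = ref.getStamp();        // 0
            String value = ref.getReference(); // "A"

            // Another thread changes A -> B -> A, bumping the stamp each time.
            ref.compareAndSet("A", "B", 0, 1);
            ref.compareAndSet("B", "A", 1, 2);

            // A plain CAS on the value alone would succeed here; the stamped
            // CAS fails because the stamp we saw (0) no longer matches (2).
            boolean success = ref.compareAndSet(value, "C", stamp, stamp + 1);
            System.out.println(success); // false
        }
    }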
  2. Long spin times and high overhead. If spin CAS is unsuccessful for a long time, it imposes a significant execution cost on the CPU. If the JVM could support the pause instruction provided by the processor, efficiency would improve somewhat. The pause instruction has two effects: first, it delays the pipelined execution of subsequent instructions so that the CPU does not consume too many execution resources (the amount of delay depends on the implementation version; on some processors it is zero); second, it avoids the CPU pipeline flush caused by a memory order violation when exiting the loop, which improves CPU execution efficiency. (Java 9 later exposed this hint to application code as Thread.onSpinWait().)

  3. Atomicity is guaranteed for only one shared variable. When operating on a single shared variable, spin CAS can guarantee atomicity, but when operating on several shared variables at once it cannot, so a lock is needed instead. Another trick is to merge the multiple shared variables into one: for example, with two shared variables i = 2 and j = a, merge them into ij = 2a and then use CAS to operate on ij. Since Java 1.5, the JDK has provided the AtomicReference class to guarantee atomicity between referenced objects, so several variables can be placed in one object and updated with a single CAS, as the sketch after this list shows.
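Here is a minimal sketch of that merging trick using AtomicReference (the Pair holder class and the field choices are mine, for illustration):

    import java.util.concurrent.atomic.AtomicReference;

    public class MergedCasDemo {
        // Hypothetical immutable holder that merges "i" and "j" into one object.
        static final class Pair {
            final int i;
            final String j;
            Pair(int i, String j) { this.i = i; this.j = j; }
        }

        public static void main(String[] args) {
            AtomicReference<Pair> ref = new AtomicReference<>(new Pair(2, "a"));

            // Update both "variables" with a single CAS on the holder object.
            for (;;) {
                Pair current = ref.get();
                Pair updated = new Pair(current.i + 1, current.j + "b");
                if (ref.compareAndSet(current, updated)) {
                    break; // both fields changed together, atomically
                }
            }
            System.out.println(ref.get().i + ", " + ref.get().j); // 3, ab
        }
    }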

Use locks to implement atomic operations

The locking mechanism ensures that only the thread holding the lock can operate on the locked memory region. The JVM implements many kinds of locks, including biased locks, lightweight locks, and mutex locks. Interestingly, apart from biased locking, the JVM implements locking with spin CAS: a thread acquires the lock with CAS when it enters a synchronized block and releases it with CAS when it exits.
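
As an illustration of the idea only, here is a minimal spin lock built on CAS in plain Java (a sketch, not the JVM's internal implementation, which operates on the object header's mark word rather than an AtomicBoolean):

    import java.util.concurrent.atomic.AtomicBoolean;

    public class SpinLock {
        private final AtomicBoolean locked = new AtomicBoolean(false);

        public void lock() {
            // Spin with CAS until we flip the flag from false to true.
            while (!locked.compareAndSet(false, true)) {
                // busy-wait; a real implementation would back off or park
            }
        }

        public void unlock() {
            locked.set(false); // release so another spinning thread can enter
        }
    }

Threads calling lock() loop until their CAS succeeds, which is exactly the spin-until-success pattern described above.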

Recommended reading

Intel® 64 and IA-32 Architectures Software Developer's Manual

Personal CSDN blog: blog.csdn.net/jiankunking

Personal Juejin blog: juejin.cn/user/255931…

Personal GitHub: github.com/jiankunking

Personal blog: jiankunking.com