An overview of CAS

CAS (Compare-and-Swap) is a common technique for implementing concurrent algorithms. Many classes in the Java concurrency package (java.util.concurrent) use CAS, and CAS is also a frequent interview question. This article introduces the principle of CAS in depth.


An example

Before we introduce CAS, let’s take a look at an example.

/**
 * @author joonwhee
 * @date 2019/7/6
 */
public class VolatileTest {
    
    public static volatile int race = 0;

    private static final int THREADS_COUNT = 20;
    
    public static void increase() {
        race++;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[THREADS_COUNT];
        for (int i = 0; i < THREADS_COUNT; i++) {
            threads[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < 10000; i++) {
                        increase();
                    }
                }
            });
            threads[i].start();
        }
        while (Thread.activeCount() > 1) {
            Thread.yield();
        }
        System.out.println(race);
    }
}

Note: if you run this code in IDEA, the program never terminates, because the while loop spins forever: Thread.activeCount() never drops to 1. Adding Thread.currentThread().getThreadGroup().list(); to the code prints the current thread state as follows:

java.lang.ThreadGroup[name=main,maxpri=10]
    Thread[main,5,main]
    Thread[Monitor Ctrl-Break,5,main]

As you can see, in addition to the main thread, there is a Monitor Ctrl-Break thread, which IDEA uses to monitor the Ctrl-Break interrupt signal.

The solution to the endless loop: if you are using IDEA, run the program in DEBUG mode, or use the following code instead.

import java.util.concurrent.CountDownLatch;

/**
 * @author joonwhee
 * @date 2019/7/6
 */
public class VolatileTest {
    
    public static volatile int race = 0;

    private static final int THREADS_COUNT = 20;

    private static CountDownLatch countDownLatch = new CountDownLatch(THREADS_COUNT);

    public static void increase() {
        race++;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[THREADS_COUNT];
        for (int i = 0; i < THREADS_COUNT; i++) {
            threads[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < 10000; i++) {
                        increase();
                    }
                    countDownLatch.countDown();
                }
            });
            threads[i].start();
        }
        countDownLatch.await();
        System.out.println(race);
    }
}

As the volatile example above shows, the code does not produce the expected result: the output differs on each run, and it is always a number less than 200,000.

By analyzing the bytecode, we can see why: volatile only guarantees visibility, not atomicity, and the increment is not an atomic operation (see the bytecode below). Under concurrency, putstatic may write a stale, smaller race value back to main memory, so some increments are lost. So, how do we solve this problem?
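In place of the original figure, here is the bytecode that javap -c prints for increase() (the constant-pool index #2 may differ in your build):

public static void increase();
  Code:
     0: getstatic     #2    // Field race:I   <- read race from main memory
     3: iconst_1
     4: iadd                                  //    add 1 on the operand stack
     5: putstatic     #2    // Field race:I   <- write the result back
     8: return

Another thread can interleave between getstatic and putstatic, which is exactly how increments get lost.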

Solutions:

The first solution that comes to mind is to modify the increase method with synchronized.

With synchronized, the increase method becomes effectively atomic, so we are guaranteed the correct result. However, every single increment now acquires a lock, so performance may suffer. Is there a better solution?
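A minimal sketch of that approach:

// With synchronized, the whole read-modify-write of race++ is mutually
// exclusive, so no increment can be lost.
public static synchronized void increase() {
    race++;
}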


The answer is yes: we can use the atomic operation classes in the Java concurrency package, java.util.concurrent.atomic (class names beginning with Atomic), as in the code below.

Let’s change the code in the example slightly: define race as an AtomicInteger, and replace "race++" with "race.getAndIncrement()". AtomicInteger.getAndIncrement() is an atomic operation, so we get the correct result every time, and it also brings a nice performance improvement (for this example, running under JDK 1.8.0_151).
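A sketch of the modified example (the class name AtomicTest and the lambda form are my adaptation; the structure otherwise mirrors the CountDownLatch version above):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicTest {

    // race is now an AtomicInteger instead of a volatile int.
    public static AtomicInteger race = new AtomicInteger(0);

    private static final int THREADS_COUNT = 20;

    private static final CountDownLatch countDownLatch = new CountDownLatch(THREADS_COUNT);

    public static void increase() {
        race.getAndIncrement(); // atomic replacement for race++
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[THREADS_COUNT];
        for (int i = 0; i < THREADS_COUNT; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10000; j++) {
                    increase();
                }
                countDownLatch.countDown();
            });
            threads[i].start();
        }
        countDownLatch.await();
        System.out.println(race); // always prints 200000
    }
}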


Following the call chain, we find that getAndIncrement calls Unsafe.getAndAddInt, which in turn calls compareAndSwapInt, i.e. CAS, the protagonist of this article. Next, let's introduce CAS.

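For reference, here is sun.misc.Unsafe.getAndAddInt as decompiled from JDK 8 (parameter names renamed for readability); note the CAS retry loop:

// sun.misc.Unsafe (JDK 8), decompiled; parameter names renamed for readability.
// The do-while loop retries the CAS until it succeeds.
public final int getAndAddInt(Object o, long offset, int delta) {
    int v;
    do {
        v = this.getIntVolatile(o, offset);   // volatile read of the current value
    } while (!this.compareAndSwapInt(o, offset, v, v + delta));
    return v;
}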


What is CAS?

CAS is short for Compare-And-Swap. A CAS operation involves three operands: the memory address V, the expected old value A, and the new value B to be written.

When the CAS instruction executes, the value at memory address V is changed to B if and only if the current value at V equals the expected value A; otherwise nothing is done. The entire compare-and-swap is a single atomic operation.
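A minimal sketch of these semantics using AtomicInteger.compareAndSet(A, B):

import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {

    public static void main(String[] args) {
        AtomicInteger v = new AtomicInteger(5);

        // Current value equals the expected value A (5), so it is swapped to B (10).
        System.out.println(v.compareAndSet(5, 10)); // true
        System.out.println(v.get());                // 10

        // Expected value A (5) no longer matches the current value (10): no change.
        System.out.println(v.compareAndSet(5, 20)); // false
        System.out.println(v.get());                // 10
    }
}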


Source code analysis

As mentioned above, Unsafe.compareAndSwapInt is what ultimately gets called. Looking more closely, it is a native method; its underlying implementation in the HotSpot source is shown below.
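Roughly, from JDK 8 HotSpot's unsafe.cpp (a paraphrase from memory, not a verbatim quote):

// The native implementation of compareAndSwapInt resolves the field's
// memory address and delegates to Atomic::cmpxchg.
UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapInt(JNIEnv *env, jobject unsafe, jobject obj, jlong offset, jint e, jint x))
  UnsafeWrapper("Unsafe_CompareAndSwapInt");
  oop p = JNIHandles::resolve(obj);
  jint* addr = (jint *) index_oop_from_field_offset_long(p, offset);
  return (jint)(Atomic::cmpxchg(x, addr, e)) == e;
UNSAFE_END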

You can see that the Atomic::cmpxchg method is called. Atomic::cmpxchg is implemented per platform; the Linux_x86 and Windows_x86 implementations are as follows.

Implementation of Linux_x86:
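(Reproduced from memory from JDK 8 HotSpot's atomic_linux_x86.inline.hpp, in place of the original screenshot; treat it as a sketch.)

// LOCK_IF_MP emits the lock prefix only on multiprocessor systems.
inline jint Atomic::cmpxchg(jint exchange_value, volatile jint* dest, jint compare_value) {
  int mp = os::is_MP();
  __asm__ volatile (LOCK_IF_MP(%4) "cmpxchgl %1,(%3)"
                    : "=a" (exchange_value)
                    : "r" (exchange_value), "a" (compare_value), "r" (dest), "r" (mp)
                    : "cc", "memory");
  return exchange_value;
}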

Windows_x86 implementation:
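(Likewise reproduced from memory from JDK 8 HotSpot's atomic_windows_x86.inline.hpp; a sketch, not a verbatim quote.)

inline jint Atomic::cmpxchg(jint exchange_value, volatile jint* dest, jint compare_value) {
  // alternative for InterlockedCompareExchange
  int mp = os::is_MP();
  __asm {
    mov edx, dest
    mov ecx, exchange_value
    mov eax, compare_value
    LOCK_IF_MP(mp)
    cmpxchg dword ptr [edx], ecx
  }
}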

Let's analyze the Atomic::cmpxchg implementation.

mp is the return value of "os::is_MP()", an inline function that determines whether the current system is a multiprocessor.

  1. If the current system is multiprocessor, this function returns 1.
  2. Otherwise, 0 is returned.

LOCK_IF_MP(mp) decides, based on the value of mp, whether to add the lock prefix to the cmpxchg instruction.

  1. If mp indicates a multiprocessor system (that is, mp is 1), the lock prefix is added to the cmpxchg instruction.
  2. Otherwise, the lock prefix is not added.

This is an optimization: on a uniprocessor the lock prefix is unnecessary, and since locking degrades performance, the prefix is added only in multiprocessor environments. cmpxchg itself is the x86 assembly instruction that performs the compare-and-exchange of its operands.


The Intel manual describes the lock prefix as follows:

  1. Ensures that read-modify-write operations on memory execute atomically. On the Pentium and earlier processors, an instruction with the lock prefix locked the bus during execution, temporarily preventing other processors from accessing memory through the bus; obviously, this is expensive. Starting with the Pentium 4, Intel Xeon, and P6 processors, Intel significantly optimized the original bus lock: if the memory region accessed by a lock-prefixed instruction is already held in the processor's internal cache (that is, the cache line containing the region is in the exclusive or modified state) and the region fits entirely within a single cache line, the processor executes the instruction directly and simply keeps that cache line locked for the duration of the instruction. Other processors then cannot read or write the memory region the instruction accesses, which preserves atomicity. This mechanism is called cache locking. Cache locking greatly reduces the overhead of lock-prefixed instructions, but the bus is still locked when contention between processors is high or when the memory address accessed by the instruction is misaligned.
  2. Disallows reordering of this instruction with previous and subsequent read and write instructions.
  3. Flushes all data in the write buffer to memory.

The first point ensures that CAS is an atomic operation, while the memory-barrier effects of points 2 and 3 give CAS the memory semantics of both a volatile read and a volatile write.


Disadvantages of CAS:

Although CAS effectively solves the atomicity problem, it still has three major drawbacks.

  1. High overhead when the spin loop runs for a long time.
  2. Atomicity is guaranteed for only one shared variable.
  3. The ABA problem.

High overhead when the spin loop runs for a long time:

As we saw in getAndAddInt above, if the CAS fails, it simply keeps retrying. If the CAS stays unsuccessful for a long time, this spinning can impose a significant CPU cost.


Atomicity is guaranteed for only one shared variable:

When operating on a single shared variable, we can use a CAS loop to guarantee atomicity; but when an operation spans multiple shared variables, a CAS loop cannot make them atomic as a group. In that case we have to fall back to locks, as sketched below.
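A minimal sketch, assuming a hypothetical PairCounter class with two fields that must change together:

// Hypothetical example: two shared variables that must be updated together.
// A CAS loop cannot update both atomically, but a lock can guard them as a unit.
public class PairCounter {
    private int x;
    private int y;

    public synchronized void incrementBoth() {
        x++;  // both updates happen
        y++;  // under one lock
    }
}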


What are ABA problems? How to solve the ABA problem?

If memory address V is first read with value A, and the value is still A when we are about to write, can we conclude that no other thread has changed it in between?

No. If the value was changed to B in the meantime and then changed back to A, a CAS operation would conclude it was never modified. This loophole is called the "ABA" problem of CAS. To address it, the JDK provides a stamped atomic reference class, AtomicStampedReference, which guarantees CAS correctness by attaching a version (stamp) to the value. Before using CAS, therefore, consider whether the ABA problem affects the concurrency correctness of your program; if the ABA problem must be solved, switching to traditional mutual-exclusion synchronization may be more efficient than the atomic classes.
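A minimal sketch of detecting ABA with AtomicStampedReference (the AbaDemo class is my illustration; small Integer values are used so autoboxing reuses cached objects and the reference comparison behaves as expected):

import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {

    public static void main(String[] args) {
        AtomicStampedReference<Integer> ref = new AtomicStampedReference<>(1, 0);

        int stamp = ref.getStamp();          // snapshot: value 1, stamp 0

        // Simulate another thread performing A -> B -> A (1 -> 2 -> 1),
        // incrementing the stamp on every update.
        ref.compareAndSet(1, 2, 0, 1);
        ref.compareAndSet(2, 1, 1, 2);

        // A plain CAS on the value alone would succeed (it is 1 again),
        // but the stamped CAS fails because the stamp is now 2, not 0.
        boolean swapped = ref.compareAndSet(1, 3, stamp, stamp + 1);
        System.out.println(swapped);         // prints: false
    }
}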