
CLH algorithm guide

Technical extension

SMP (Symmetric Multiprocessor Architecture)

  • In a symmetric multiprocessor (SMP) architecture, multiple CPUs in a server work symmetrically, and every CPU takes the same time to access any memory address. Its defining characteristic is sharing: CPU, memory, and I/O are all shared resources.

  • The advantage of SMP is uniform memory access, but the disadvantage is that these shared resources can become performance bottlenecks. As the number of CPUs increases, every CPU must access the same memory, which can cause memory access conflicts and waste CPU resources. Ordinary PCs fall into this category.

NUMA (Non-Uniform Memory Access)

  • Non-uniform memory access (NUMA) divides the CPUs into modules. Each CPU module consists of several CPUs and has its own local memory and I/O slots, and the modules communicate with one another over an interconnect. Accessing local memory is much faster than accessing remote memory (memory belonging to other nodes in the system), which is where the name "non-uniform memory access" comes from.

  • The advantage of NUMA is that it solves the scaling problem of the original SMP design, but the disadvantage is that system performance does not grow linearly as the number of CPUs increases, because the latency of accessing remote memory is far higher than that of accessing local memory.

Spin locks and mutexes

CLH is a spin lock, so let’s see what a spin lock is.

Spin lock
  • A spin lock is also a mutual exclusion lock, except that a thread that fails to grab the lock spins while waiting for the lock to be released. While waiting, the thread does not go to sleep; it busy-waits, wasting CPU cycles.

  • Spin locks are therefore suited to cases where the lock is held only briefly. A minimal sketch follows below.
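
To make busy waiting concrete, here is a minimal test-and-set spin lock sketch in Java (the class name SimpleSpinLock and its use of AtomicBoolean are illustrative assumptions, not part of the CLH code later in this article):

// SimpleSpinLock.java (illustrative sketch)
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal test-and-set spin lock: a thread that fails to grab the lock
// keeps looping (busy waiting) instead of going to sleep.
public class SimpleSpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // Spin until we atomically flip the flag from false to true
        while (!locked.compareAndSet(false, true)) {
            // busy wait: burns CPU cycles, so only worthwhile for short critical sections
        }
    }

    public void unlock() {
        // Release the lock so one of the spinning threads can grab it
        locked.set(false);
    }
}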

Mutex
  • A mutex here means the traditional mutual exclusion lock: when multiple threads compete for the lock concurrently, a thread that fails to grab it goes to sleep (sleep-waiting). When the lock is released, a sleeping thread is woken up and acquires it.

  • Disadvantage: this sequence requires thread switching, which costs many CPU instructions and time. If switching threads takes longer than the lock is actually held, a spin lock is the better choice. A mutex is therefore suited to cases where the lock is held for a long time. A sketch follows below.
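
For contrast, a sleep-waiting mutex can be sketched in Java with the standard ReentrantLock (the MutexCounter class below is an illustrative assumption, not code from the original article). A thread that loses the race is parked by the JVM/OS until the lock is free, which involves thread switching:

// MutexCounter.java (illustrative sketch)
import java.util.concurrent.locks.ReentrantLock;

public class MutexCounter {
    // Threads that fail to acquire this lock are parked (sleep-waiting)
    // rather than spinning, at the cost of a context switch.
    private final ReentrantLock mutex = new ReentrantLock();
    private long count = 0;

    public void increment() {
        mutex.lock();        // a contending thread is suspended until the lock is free
        try {
            count++;         // critical section; may safely be long-running under a mutex
        } finally {
            mutex.unlock();  // waking a parked waiter requires a thread switch
        }
    }
}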


CLH locking mechanism

The CLH lock is a fair, starvation-free spin lock built on a logical queue. It is named after its inventors, Craig, Landin, and Hagersten. CLH is a high-performance spin lock based on a singly linked list that guarantees freedom from starvation and provides first-come, first-served fairness.

  • A thread applying for the lock spins on a variable of its predecessor node.

  • Once the predecessor node releases the lock, the current node stops spinning and acquires the lock.

CLH node model

  • Each QNode in the CLH queue contains a locked field: true means the thread has acquired the lock and not yet released it; false means the thread has released the lock.

  • The nodes are connected by an implicit linked list, so called because there is no explicit next pointer between them. Instead, the behavior of myNode is driven by changes in the node that myPred points to.

CLH lock principle

  • First, there is a tail pointer. The logical queue of waiting threads is built through this tail pointer, which guarantees first-come, first-served fairness among threads; the tail pointer can be seen as the bridge that links the logical queue together.
  • In addition, the tail pointer is an atomic reference type, which avoids thread-safety problems under concurrent updates by multiple threads.
  • Each thread waiting for the lock spins on a variable it holds privately (the locked field of its predecessor's node), which will be written by the previous thread.
  • Because a thread always obtains the variable written by the previous thread through the tail pointer, and the tail pointer is an atomic reference type, access to this variable is always thread-safe.

CLH lock analysis

  • CLH has the greater advantage on an SMP architecture.

  • On a NUMA architecture, if the current node and its predecessor are not in the same CPU module, spinning incurs extra cross-module overhead; MCS locks are better suited to NUMA architectures.

Locking logic

  1. Get the current thread's lock node; if it is null, initialize it.

  2. Atomically read the tail of the queue (e.g., via getAndSet) and make the current node the new tail; the previous tail becomes the current node's predecessor.

  3. If the previous tail is null, the current node is the first node and acquires the lock immediately.

  4. If the previous tail is not null, spin on the predecessor's locked value (while locked == true) until it becomes false.

Unlocking logic

  1. Get the lock node corresponding to the current thread. If the node is null or its locked value is false, there is nothing to unlock and the method returns immediately.

  2. Atomically try to reset the tail. If this fails, the current node is not the last node, so its locked field must be set to false to release the lock; if the current node is the tail node, the field does not need to be set.

There is also a tail pointer on CLHLock, which always points to the last node of the queue.

The CLHLock class diagram looks like this:

Simple code

// CLHLock.java

import java.util.concurrent.atomic.AtomicReference;

public class CLHLock {
    /** CLH lock node */
    private static class CLHNode {
        // Lock state: the default false means the thread has not acquired the lock;
        // true means the thread has acquired the lock or is waiting for it.
        // volatile keeps the locked state visible between threads.
        volatile boolean locked = false;
    }

    // The last CLHNode in the queue
    // [note] Java's AtomicReference guarantees the tail is updated atomically
    private final AtomicReference<CLHNode> tailNode;
    // The predecessor of the current thread's node
    private final ThreadLocal<CLHNode> predNode;
    // The current thread's node
    private final ThreadLocal<CLHNode> curNode;

    // CLHLock constructor, used to do some initialization logic when creating a new CLH lock
    public CLHLock() {
        // The tail points to an empty CLH node during initialization
        tailNode = new AtomicReference<>(new CLHNode());
        // Initialize the current CLH node
        curNode = new ThreadLocal<CLHNode>() {
            @Override
            protected CLHNode initialValue() {
                return new CLHNode();
            }
        };
        // Initialize the predecessor reference. Note that it does not store a CLHNode yet; it starts as null
        predNode = new ThreadLocal<>();
    }

    /** Acquire the lock */
    public void lock() {
        // Fetch the node stored in the current thread's ThreadLocal.
        // Initially it is always a fresh CLHNode whose state is false.
        CLHNode currNode = curNode.get();
        // Set the lock state to true, indicating a valid state:
        // the lock has been acquired or is being waited for
        currNode.locked = true;
        // When a thread arrives, the tail node is swapped out and becomes the predecessor
        // of the current thread, and the current thread's node becomes the new tail.
        // [note] Under multi-threaded concurrency, AtomicReference#getAndSet prevents races.
        // Together with the predNode.set(preNode) statement below, this builds a logical
        // chain of waiting threads, which prevents thread starvation.
        CLHNode preNode = tailNode.getAndSet(currNode);
        // Store the old tail (the previous thread's node) in the current thread's predecessor ThreadLocal
        // [thinking] Could this line be removed, and if so, what would it affect?
        predNode.set(preNode);
        // [1] If the predecessor's locked state is false, the lock is acquired without spin waiting.
        // [2] If the predecessor's locked state is true, the previous thread has acquired the lock
        //     or is waiting for it, so spin wait.
        while (preNode.locked) {
            System.out.println("Thread " + Thread.currentThread().getName() + " failed to acquire the lock, spin waiting...");
        }
        // The current thread has acquired the lock
        System.out.println("Thread " + Thread.currentThread().getName() + " acquired the lock!");
    }

    /** Release the lock */
    public void unLock() {
        // Get the current thread's node
        CLHNode node = curNode.get();
        // Perform the unlock: setting locked to false lets the successor node that is
        // spin-waiting in lock() acquire the lock,
        // instead of having all spinning threads compete for it at once
        node.locked = false;
        System.out.println("Thread " + Thread.currentThread().getName() + " released the lock!");
        // What is the purpose of the following two lines?
        CLHNode newCurNode = new CLHNode();
        curNode.set(newCurNode);
        // [optimization] This can improve GC efficiency and save memory. Consider: why?
        // curNode.set(predNode.get());
    }
}
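
For reference, a small driver that exercises the CLHLock above from several threads might look like the sketch below (the CLHLockDemo class and the counter workload are assumptions for illustration; they are not part of the original article). Note that the println calls inside lock() make the output very verbose; they only trace the spinning.

// CLHLockDemo.java (illustrative sketch)
public class CLHLockDemo {
    private static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        CLHLock lock = new CLHLock();
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) {
                lock.lock();
                try {
                    counter++;            // protected by the CLH lock
                } finally {
                    lock.unLock();
                }
            }
        };
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(task, "T" + i);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // With correct mutual exclusion the final value is 4 * 1000 = 4000
        System.out.println("Final counter = " + counter);
    }
}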

Code process

  1. When a thread needs to acquire the lock, it creates a new QNode with locked set to true, indicating that it wants the lock.
  2. The thread then calls getAndSet on the tail field, making itself the tail of the queue and obtaining a reference to its predecessor, myPred.
  3. The thread then spins on the predecessor node's locked field until the predecessor releases the lock.
  4. When a thread releases the lock, it sets the locked field of its own node to false and may reuse its predecessor's node for the next acquisition.

The algorithm is also space-efficient. If there are N threads and L locks, and each thread holds at most one lock at a time, the required storage is O(L + N): the N threads contribute N myNode objects and the L locks contribute L tail pointers.

A disadvantage of this algorithm is poor performance on NUMA architectures, because each thread spins on its predecessor's node, and performance suffers if that memory is located far away. On a cache-coherent architecture such as SMP, however, it works very well.

Process description

  • Thread A needs to acquire the lock, so its myNode locked field is set to true and tail points to thread A's node. Then thread B queues up behind thread A, and tail points to thread B's node.

  • Thread A and thread B each spin on their own myPred node; once the locked field of the myPred node becomes false, the thread can acquire the lock and continue executing its business logic.

  • Clearly, thread A's myPred node has locked == false, so thread A acquires the lock first, as shown below.

As you can see from the code, the lock method contains a while loop that waits for the predecessor node's locked field to become false; this is the spin wait. The unLock method simply sets the thread's own locked field to false.

Advantages and disadvantages of CLH

The main disadvantage is poor performance on NUMA architectures, where each thread has its own local memory: if the predecessor node's memory is far away, spinning to check the predecessor's locked field becomes slow. On an SMP architecture, however, this approach is very effective. One solution for NUMA architectures is the MCS queue lock.
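
For comparison, below is a rough sketch of an MCS-style queue lock (a summary of the standard algorithm, not code from this article; the class and field names are assumptions). Unlike CLH, each thread spins on a flag in its own node, and the lock holder hands off by writing into its successor's node, which keeps the spinning on local memory under NUMA:

// MCSLock.java (illustrative sketch of the standard MCS algorithm)
import java.util.concurrent.atomic.AtomicReference;

public class MCSLock {
    private static class MCSNode {
        volatile MCSNode next;
        volatile boolean blocked = true;   // true = still waiting for the lock
    }

    private final AtomicReference<MCSNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<MCSNode> myNode = ThreadLocal.withInitial(MCSNode::new);

    public void lock() {
        MCSNode node = myNode.get();
        node.blocked = true;
        node.next = null;
        MCSNode pred = tail.getAndSet(node);   // append our node to the queue
        if (pred != null) {
            pred.next = node;                  // link in so the predecessor can wake us
            while (node.blocked) {
                // spin on our OWN node (local memory), not the predecessor's
            }
        }
        // pred == null means the queue was empty and the lock is acquired immediately
    }

    public void unlock() {
        MCSNode node = myNode.get();
        if (node.next == null) {
            // No visible successor: try to reset the tail and release outright
            if (tail.compareAndSet(node, null)) {
                return;
            }
            // A successor is linking itself in; wait until it becomes visible
            while (node.next == null) {
                // busy wait
            }
        }
        node.next.blocked = false;             // hand the lock to the successor
        node.next = null;
    }
}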