This is the sixth day of my participation in the November Gwen Challenge. Check out the event details: The Last Gwen Challenge 2021

The Art of Multiprocessor Programming is an important book on concurrent programming. In this series I share my personal notes and understanding from reading it, together with some practical experience, for readers who want a deeper understanding of the material.

Spin locking and contention

3. The queue lock

Apart from generality, the previous backoff-based lock implementations have the following two problems (a minimal sketch of such a lock follows this list):

  • CPU cache-coherence traffic: although backoff produces less traffic than TASLock, multiple threads polling the lock state still generate cache-coherence traffic.
  • Reduced critical-section utilization: the lock may already be released while every waiting thread is still sleeping because its backoff delay has grown too large.
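
For reference, here is a minimal sketch of such a backoff lock: a test-and-test-and-set lock with exponential backoff. The class name and delay bounds are illustrative, not taken from the earlier article:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal test-and-test-and-set lock with exponential backoff.
// MIN_DELAY / MAX_DELAY are illustrative values.
public class BackoffLock {
	private static final int MIN_DELAY = 1;    // milliseconds
	private static final int MAX_DELAY = 1024; // milliseconds
	private final AtomicBoolean state = new AtomicBoolean(false);

	public void lock() {
		int limit = MIN_DELAY;
		while (true) {
			while (state.get()) {
				// spin on the locally cached value first
			}
			if (!state.getAndSet(true)) {
				return; // acquired the lock
			}
			// Contention: back off for a random delay, doubling the bound.
			// Problem 2 above: while every contender sleeps here, the lock
			// may already be free with nobody attempting to take it.
			try {
				Thread.sleep(ThreadLocalRandom.current().nextInt(limit) + 1);
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt();
			}
			limit = Math.min(MAX_DELAY, 2 * limit);
		}
	}

	public void unlock() {
		state.set(false); // problem 1: this write invalidates every spinner's cached copy
	}
}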

Threads can be placed in a queue to solve both problems:

  • In a queue, each thread determines whether the lock has been released by checking whether its predecessor has finished, rather than by polling the lock state. Each thread therefore spins on a different memory location, reducing the CPU cache-coherence traffic caused by the lock state being modified on release
  • Without sleeping, a thread is notified by its predecessor as soon as the lock is released and can attempt to acquire it immediately, improving critical-section utilization

Finally, the queue provides FIFO fairness.

3.1. Array-based locks

We can implement the queue with an array. The design is as follows:

  • Storage required:
    • A boolean array: true means the thread in the corresponding slot has acquired the lock, false means it has not
    • An atomic variable holding the latest slot. Each lock() increments it by 1 and takes the result modulo the array size; the resulting index is the thread's position in the boolean array, and the boolean at that position indicates whether the thread has acquired the lock. Note that the array size therefore bounds how many threads can contend for the lock at the same time
    • A ThreadLocal recording the position in the boolean array occupied by the current thread
  • Locking process:
    • Increment the atomic variable and take it modulo the array size to get current
    • Store current in the ThreadLocal
    • Spin while the boolean at position current is false
  • Unlock process:
    • Read the current thread's position, mine, from the ThreadLocal
    • Set position mine of the boolean array to false
    • Set position (mine + 1) modulo the array size to true (the modulo prevents going out of bounds)

The source code is:

import java.util.concurrent.atomic.AtomicInteger;

// Lock here is a simple interface declaring only lock() and unlock()
public class ArrayLock implements Lock {
	// each thread's slot in the flags array
	private final ThreadLocal<Integer> mySlotIndex = ThreadLocal.withInitial(() -> 0);
	// the next slot to hand out
	private final AtomicInteger tail = new AtomicInteger(0);
	// flags[i] == true means the thread in slot i holds the lock
	private final boolean[] flags;
	private final int capacity;

	public ArrayLock(int capacity) {
		this.capacity = capacity;
		this.flags = new boolean[capacity];
		// slot 0 starts as the lock holder; otherwise the first lock() would spin forever
		this.flags[0] = true;
	}

	@Override
	public void lock() {
		int current = this.tail.getAndIncrement() % capacity;
		this.mySlotIndex.set(current);
		// note: plain reads and writes of flags have visibility problems,
		// addressed in the optimizations below
		while (!this.flags[current]) {
		}
	}

	@Override
	public void unlock() {
		int mine = this.mySlotIndex.get();
		this.flags[mine] = false;
		this.flags[(mine + 1) % capacity] = true; // hand the lock to the next slot
	}
}

There are several optimizations we can apply to this implementation:

  1. Spin waiting does not need a hard busy-loop: Thread.onSpinWait() lowers CPU usage and compiles to spin-optimized CPU instructions on different architectures.
  2. Each slot of the boolean array needs cache-line padding to prevent false sharing from causing excessive cache-line invalidations.
  3. Updates to the boolean array need to be volatile. Plain writes can propagate late, so other threads contending for the same lock notice the release later and spin more times than necessary.
  4. Taking a value modulo 2^n is equivalent to a bitwise AND with 2^n - 1. To exploit this, we round the given capacity up to the nearest power of two greater than it (see the sketch after this list).
  5. The array read this.flags[current] should be hoisted outside the spin loop to avoid the cost of re-reading the array element on every iteration.
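
To make point 4 concrete, here is a small standalone sketch (the class and method names are illustrative, not from the original) showing the bit-smearing round-up used below and the equivalence of modulo and bitwise AND for powers of two:

// Sketch: modulo by a power of two equals bitwise AND with (power - 1),
// and a capacity can be rounded up to a power of two by bit smearing.
public class PowerOfTwoDemo {
	static int roundUp(int capacity) {
		capacity |= capacity >>> 1;
		capacity |= capacity >>> 2;
		capacity |= capacity >>> 4;
		capacity |= capacity >>> 8;
		capacity |= capacity >>> 16;
		return capacity + 1; // always strictly greater than the input
	}

	public static void main(String[] args) {
		System.out.println(roundUp(5));   // 8
		System.out.println(roundUp(4));   // 8 (an exact power of two is rounded up too)
		System.out.println(13 % 8);       // 5
		System.out.println(13 & (8 - 1)); // 5: same result, no division
	}
}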

The optimized source code is:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicInteger;
import jdk.internal.vm.annotation.Contended;

public class ArrayLock implements Lock {
	private final ThreadLocal<Integer> mySlotIndex = ThreadLocal.withInitial(() -> 0);
	private final AtomicInteger tail = new AtomicInteger(0);
	private final ContendedBoolean[] flags;
	private final int capacity;

	// Each flag is padded to its own cache line to avoid false sharing.
	// @Contended only takes effect with -XX:-RestrictContended (and the
	// jdk.internal.vm.annotation package must be exported to your code).
	private static class ContendedBoolean {
		@Contended
		private boolean flag;
	}

	// VarHandle for volatile reads and writes of ContendedBoolean.flag
	private static final VarHandle FLAG;

	static {
		try {
			// initialize the handle
			FLAG = MethodHandles.lookup().findVarHandle(ContendedBoolean.class, "flag", boolean.class);
		} catch (Exception e) {
			throw new Error(e);
		}
	}

	public ArrayLock(int capacity) {
		// round capacity up to a power of two so % can be replaced with &
		capacity |= capacity >>> 1;
		capacity |= capacity >>> 2;
		capacity |= capacity >>> 4;
		capacity |= capacity >>> 8;
		capacity |= capacity >>> 16;
		capacity += 1;
		this.flags = new ContendedBoolean[capacity];
		for (int i = 0; i < this.flags.length; i++) {
			this.flags[i] = new ContendedBoolean();
		}
		this.capacity = capacity;
		this.flags[0].flag = true; // slot 0 starts as the lock holder
	}

	@Override
	public void lock() {
		int current = this.tail.getAndIncrement() & (capacity - 1);
		this.mySlotIndex.set(current);
		// read the array element once, outside the loop
		ContendedBoolean contendedBoolean = this.flags[current];
		// volatile read so the JIT cannot hoist the flag out of the loop
		while (!(boolean) FLAG.getVolatile(contendedBoolean)) {
			Thread.onSpinWait(); // spin hint, e.g. PAUSE on x86
		}
	}

	@Override
	public void unlock() {
		int mine = this.mySlotIndex.get();
		FLAG.setVolatile(this.flags[mine], false);
		FLAG.setVolatile(this.flags[(mine + 1) & (capacity - 1)], true);
	}
}
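
To see the lock in action, here is a quick hypothetical harness (not part of the original; note that the number of threads contending at once must not exceed the slot count):

// Hypothetical harness: four threads repeatedly enter a critical
// section guarded by the optimized ArrayLock above.
public class ArrayLockDemo {
	private static int counter = 0;

	public static void main(String[] args) throws InterruptedException {
		ArrayLock lock = new ArrayLock(4); // rounded up internally to 8 slots
		Runnable task = () -> {
			for (int i = 0; i < 10_000; i++) {
				lock.lock();
				try {
					counter++; // protected by the lock
				} finally {
					lock.unlock();
				}
			}
		};
		Thread[] threads = new Thread[4];
		for (int i = 0; i < threads.length; i++) {
			threads[i] = new Thread(task);
			threads[i].start();
		}
		for (Thread t : threads) {
			t.join();
		}
		System.out.println(counter); // expected: 40000
	}
}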

However, even with these optimizations, this lock can still perform poorly when many threads contend for it heavily. We will analyze and optimize this in a later article.