Atomic operations are a familiar concept. From the user’s perspective, replacing heavyweight lock-based synchronization with atomic operations can improve program performance. From the implementation perspective, atomic operations are the building blocks for heavier synchronization primitives such as locks and barriers.


To implement atomic operations, we need to consider single-processor single-core systems and multi-processor or multi-core systems separately.

On a single-processor, single-core system, an operation is atomic as long as its instruction sequence is not interrupted. (For memory reads and writes, the address must also be aligned; a misaligned access may be split into multiple memory accesses and is therefore not a single read or write, so it is not atomic.) For simple atomic operations, the CPU provides single instructions such as INC and XCHG. Complex atomic operations require multiple instructions, and context switches, such as task switching and interrupt handling, can occur during their execution and break atomicity. Therefore a spinlock[1] is needed to ensure that the instruction sequence is not disturbed in the middle of execution.


However, if the system has multiple processors or multiple cores, keeping the instruction sequence uninterrupted on one core is no longer enough: the implementation of an atomic operation must also ensure it is not affected by other cores of the same processor or by other processors. When instructions executing on another core access memory that overlaps with the memory the current atomic operation accesses, the correctness of the atomic operation is destroyed.

The x86 architecture provides the instruction prefix LOCK for this. LOCK ensures that the prefixed instruction is not interfered with by other processors or CPU cores. Before the Pentium Pro, LOCK was implemented by locking the bus, preventing memory access from all other CPU cores. As you can imagine, this is very inefficient. Starting with the Pentium Pro, LOCK only blocks other CPUs’ access to the cache line containing the relevant memory.


Today, most x86 processors support CAS[2] in hardware, guaranteeing the correctness of atomic operations on multi-processor and multi-core systems. The CAS implementation likewise does not need to lock the bus; it only blocks other CPUs’ access to the cache line containing the relevant memory. Similarly, the MIPS and ARM architectures support LL/SC[3]. LL/SC does not suffer from the ABA problem that CAS has, and atomic operations such as CAS and FAA can be implemented on top of LL/SC.


Before going any further, you need to understand the MESI cache coherence protocol[4]. There are other variants of MESI, but only MESI itself is explained briefly here. Each cache line is in one of four states:

- Modified: the cache line is held only by this CPU core and has not been written back to memory (see [5] if you are not familiar with cache coherence).
- Exclusive: the cache line is held only by this CPU core and is consistent with memory.
- Shared: the cache line is shared by multiple cores and is consistent with memory.
- Invalid: the cache line is invalid.

In a multi-processor system, processors communicate directly through fast interconnects such as Intel’s QPI and AMD’s HyperTransport. The cores within one processor communicate through a shared cache, such as the L3 cache. The messages exchanged between CPU cores include read messages and their responses, and invalidate messages and their responses.

When a thread running on a CPU core reads a cache line in the M, E, or S state, it simply reads the contents. If the line is in the I state, the core broadcasts a read message to the other cores, updates the cache line after receiving their read responses, and sets the state to S.

For writes: if the cache line is in the M state, the core writes to it directly. If it is in the E state, the core writes and changes the state to M. If it is in the S state, the core must first broadcast invalidate messages to the other cores, enter the E state once the invalidate responses are collected, perform the write, and then enter the M state. If it is in the I state, the core broadcasts read and invalidate messages to the other cores, updates the cache line after the read responses are collected, enters the E state after the invalidate responses are collected, performs the write, and enters the M state.


In this way, only other CPUs’ access to the affected block of memory is blocked, rather than the entire bus being locked.

In earlier CPU models (the Intel486 and Pentium series), multi-core systems prevented a memory region from being modified by other cores by locking the bus. In recent CPU models, however, if the data is cached in a cache line and the memory type is write-back, cache locking can be implemented through the cache coherence mechanism instead of locking the entire bus.



Similarly, you can probably guess how CAS and LL/SC are implemented on top of this mechanism.

In the next article, I’m going to talk about memory barriers.