This is a note from the Geek Time Java performance tuning column (not original). Damn good column.

Background

In concurrent programming, when multiple threads access the same shared resource, we must consider how to maintain the atomicity of the data.

Prior to JDK 1.5, synchronized was the only built-in lock, implemented implicitly by the JVM. It was based on the underlying operating system's Mutex Lock, so every acquisition and release of the lock triggered a switch between user mode and kernel mode, which increased system overhead. Under lock contention its performance was poor, which is why it is often referred to as a heavyweight lock.

The Lock interface was added in JDK 1.5 to provide locking as well, but it requires explicit lock acquisition and release, and its synchronization is implemented in pure Java rather than by the JVM. In JDK 1.5, synchronized performed much worse than Lock, especially when a single thread repeatedly acquired the same lock.

As a result, synchronized was optimized in JDK 1.6, and in some scenarios it even outperforms Lock.

Principle of the synchronized lock

synchronized can modify both code blocks and methods, as decompilation shows. Synchronized code blocks are implemented with the monitorenter and monitorexit bytecode instructions: after the monitorenter instruction the thread holds the Monitor object, and after the monitorexit instruction the thread releases the Monitor object.

When synchronized modifies a method, the ACC_SYNCHRONIZED access flag marks the method as synchronized. When the method is invoked, the JVM checks whether the flag is set; if it is, the executing thread first acquires the Monitor object, which no other thread can acquire while the method runs, and releases it when the method completes.
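As a minimal sketch (class and method names invented for illustration), both forms can be inspected after compiling:

public class SyncDemo {
    private final Object lock = new Object();
    private int counter;

    // Compiled to a monitorenter/monitorexit pair around the block body
    // (plus an extra monitorexit on the exception path).
    public void incrementWithBlock() {
        synchronized (lock) {
            counter++;
        }
    }

    // No monitorenter appears in the bytecode; instead the method's
    // access flags include ACC_SYNCHRONIZED, and the JVM acquires the
    // monitor of `this` implicitly on entry and releases it on exit.
    public synchronized void incrementWithMethod() {
        counter++;
    }
}

Running javap -c -v SyncDemo shows the monitorenter/monitorexit instructions for the block version and the ACC_SYNCHRONIZED flag for the method version.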

The Monitor concept

A monitor (管程) is a synchronization construct. Operating systems provide the semaphore and the mutex. When developing directly against a raw mutex, you must be careful with the mutex's down and up operations, otherwise deadlocks can occur. To make concurrent programs easier to write correctly, the monitor was proposed as a higher-level synchronization primitive built on top of the mutex and the semaphore. The monitor belongs to the programming-language level: C does not support monitors, while Java does.

Critical section

The methods and code blocks modified by the synchronized keyword are the critical sections of the Monitor mechanism.

Synchronization in the JVM is implemented by entering and exiting a Monitor object. Every object instance has an associated Monitor, which is created and destroyed together with the object. In HotSpot the Monitor is implemented in C++ by ObjectMonitor (ObjectMonitor.hpp/cpp):

ObjectMonitor() {
   _header       = NULL;
   _count        = 0;      // number of threads that have grabbed or are grabbing the lock
   _waiters      = 0;      // number of waiting threads
   _recursions   = 0;      // reentry count of the owner thread
   _object       = NULL;   // the Java object this monitor is attached to
   _owner        = NULL;   // the thread that currently holds the monitor
   _WaitSet      = NULL;   // threads that called wait() are added to _WaitSet
   _WaitSetLock  = 0;
   _Responsible  = NULL;
   _succ         = NULL;
   _cxq          = NULL;   // ContentionList: recently arrived contending threads
   FreeNext      = NULL;
   _EntryList    = NULL;   // threads blocked waiting to acquire the lock
   _SpinFreq     = 0;
   _SpinClock    = 0;
   OwnerIsThread = 0;
}

When multiple threads access the same piece of synchronized code, the contending threads are kept in the _cxq (ContentionList) and _EntryList collections: every thread blocked trying to acquire the lock is added to these lists. A thread whose monitorenter fails is queued there to wait, because another thread already holds the lock; a thread that calls wait() enters the _WaitSet instead. When a thread then tries to obtain the object's Monitor, the Monitor relies on the underlying operating system's Mutex Lock for mutual exclusion: if the thread successfully acquires the Mutex, it holds it and no other thread can obtain it. A thread that loses the competition re-enters the ContentionList and is suspended.

If a thread calls wait(), it releases the Mutex it currently holds and enters the _WaitSet collection, waiting to be notified (woken up) later. The Mutex is also released when the thread exits the synchronized method or block normally.
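A minimal sketch of these transitions (class and field names invented for illustration): the consumer below releases the monitor inside wait() and moves to the WaitSet; notify() moves it back to the entry queue to re-compete for the lock.

public class WaitNotifyDemo {
    private final Object lock = new Object();
    private boolean ready = false;

    public void consumer() throws InterruptedException {
        synchronized (lock) {          // contend via _cxq/_EntryList
            while (!ready) {
                lock.wait();           // release the monitor, enter the _WaitSet
            }
            // here the condition holds and we own the monitor again
        }
    }

    public void producer() {
        synchronized (lock) {
            ready = true;
            lock.notify();             // move one waiter from _WaitSet back to contention
        }
    }
}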

In this implementation, because Monitor relies on the underlying operating system implementation, there is a switch between user and kernel state, which increases the performance overhead.

Lock upgrade optimization

In JDK 1.6, the concepts of biased locking, lightweight locking, and heavyweight locking were introduced to reduce the context switching caused by lock contention; it is the Java object header that makes lock upgrading possible.

Let’s start with the Java object header. In the JDK 1.6 JVM, an object instance in the heap is divided into three parts: the object header, instance data, and alignment padding. The Java object header consists of the Mark Word, a pointer to the class (Klass pointer), and, for arrays, the array length.

The Mark Word records information about the object and its lock; on a 64-bit JVM it is 64 bits long.
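The exact layout varies by JVM version; a commonly cited, simplified layout of the 64-bit Mark Word by lock state is:

Lock state        | Biased flag | Lock flag bits | Main contents of the Mark Word
Unlocked          | 0           | 01             | object hash code, GC age
Biased lock       | 1           | 01             | owner thread ID, epoch, GC age
Lightweight lock  | -           | 00             | pointer to the lock record in the owner thread's stack
Heavyweight lock  | -           | 10             | pointer to the heavyweight Monitor (ObjectMonitor)
GC mark           | -           | 11             | GC mark information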

Lock upgrading depends mainly on the lock flag bits and the biased-lock flag in the Mark Word. A synchronized lock starts out as a biased lock; as contention grows fiercer, the biased lock is upgraded to a lightweight lock and finally to a heavyweight lock.

Biased locking

Biased locking mainly optimizes the case where the same thread applies for the same lock repeatedly. When a thread accesses the synchronized code or method again, it simply checks whether the Mark Word in the object header holds a biased lock pointing to its own thread ID, without entering the Monitor to compete for the object. When an object is first used as a synchronization lock and a thread grabs it, the lock flag bits remain 01, the biased-lock flag is set to 1, and the ID of the thread that grabbed the lock is recorded, indicating that the object has entered the biased-lock state.
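This state can be observed in the object header with the OpenJDK JOL tool. A sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath (the dependency and the startup-delay workaround are assumptions, not from the original article):

import org.openjdk.jol.info.ClassLayout;

public class BiasedLockDemo {
    public static void main(String[] args) throws InterruptedException {
        // Many JDK 8 JVMs enable biased locking only after a startup delay;
        // sleeping past it (or running with -XX:BiasedLockingStartupDelay=0)
        // lets the object start out biasable.
        Thread.sleep(5000);

        Object lock = new Object();
        System.out.println(ClassLayout.parseInstance(lock).toPrintable());

        synchronized (lock) {
            // The Mark Word now records this thread's ID with the biased bit set.
            System.out.println(ClassLayout.parseInstance(lock).toPrintable());
        }
    }
}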

The biased lock is revoked as soon as another thread competes for the lock. Revocation must wait for a global safepoint, at which the thread holding the biased lock is suspended (without suspending it, the JVM could not reliably determine whether the thread still holds the biased lock or is executing the critical section). The JVM then checks whether that thread is still executing the synchronized method or block: if it is, the lock is upgraded; otherwise the lock is preempted by the other thread.

Under high concurrency, when a large number of threads compete for the same lock at once, biased locks keep getting revoked, and each revocation involves a stop-the-world pause. In that case enabling biased locking brings more overhead than it saves, and it can be disabled with the JVM parameter -XX:-UseBiasedLocking.

Lightweight lock

When another thread competes for a lock that is already biased, it performs a CAS operation to acquire the lock. If the CAS succeeds, the thread ID in the Mark Word is simply replaced. If it fails, the lock is under real contention and is upgraded to a lightweight lock.

Lightweight locks suit scenarios where threads execute synchronized blocks alternately, so that most locks are contended only briefly, if at all, during the whole synchronization cycle.

Spin locks and heavyweight locks

If the lightweight lock's CAS fails, the thread does not immediately park and block; instead it spins, retrying the CAS for a while. The JVM allows the number of spins to be configured, but a large value is not recommended, because prolonged CAS retries keep the CPU busy without doing useful work.

If the lock still cannot be acquired after the spin retries, the synchronization lock is upgraded to a heavyweight lock and the lock flag bits change to 10.

Other optimizations

Lock elimination and lock coarsening via dynamic compilation

When a synchronized block is dynamically compiled, the JIT uses escape analysis to determine whether the lock object used by the block can only ever be accessed by a single thread and is never published to other threads. If so, the lock is removed entirely (lock elimination); a sketch follows.
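A classic illustration (the method name is invented): the StringBuffer below never escapes the method, so the JIT can elide the locks taken inside its synchronized append calls.

public class LockElisionDemo {
    // `sb` is a local variable that never escapes this method, so escape
    // analysis can prove only one thread ever touches it; the JIT may then
    // eliminate the locking done inside StringBuffer's synchronized methods.
    public static String concat(String a, String b) {
        StringBuffer sb = new StringBuffer();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }
}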

Similarly, when the JIT compiler finds several adjacent synchronized blocks using the same lock instance, it merges them into one larger synchronized block (lock coarsening), avoiding the performance cost of a thread repeatedly acquiring and releasing the same lock. A sketch of the shape the JIT looks for is below.
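A sketch (names invented): repeated locking on the same object in a tight sequence can be coarsened into one critical section.

public class LockCoarseningDemo {
    private final Object lock = new Object();
    private int a, b, c;

    // Three back-to-back acquire/release cycles on the same lock instance.
    public void beforeCoarsening() {
        synchronized (lock) { a++; }
        synchronized (lock) { b++; }
        synchronized (lock) { c++; }
    }

    // Conceptually what the JIT may produce: one acquire/release instead of three.
    public void afterCoarsening() {
        synchronized (lock) {
            a++;
            b++;
            c++;
        }
    }
}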

Reduce lock granularity

ConcurrentHashMap before JDK 1.8 used Segment locking to reduce lock granularity; since JDK 1.8 it uses CAS plus synchronized on individual buckets instead.
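A minimal sketch of the idea behind segment/lock striping (a toy counter table, not ConcurrentHashMap's actual implementation): threads touching different stripes never contend.

import java.util.concurrent.locks.ReentrantLock;

public class StripedCounter {
    private static final int STRIPES = 16;
    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];
    private final long[] counts = new long[STRIPES];

    public StripedCounter() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    // Only the stripe the key hashes to is locked, so threads working on
    // different keys usually proceed in parallel.
    public void increment(Object key) {
        int i = (key.hashCode() & 0x7fffffff) % STRIPES;
        locks[i].lock();
        try {
            counts[i]++;
        } finally {
            locks[i].unlock();
        }
    }
}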

Pessimistic locks and optimistic locks

Both synchronized and Lock are pessimistic locks. Under high concurrency, fierce lock contention causes threads to block, and a large number of blocked threads leads to context switching and increases system overhead.

An optimistic lock does not suspend the losing thread at the operating-system level the way a pessimistic lock does; the failed operation simply returns, and the program can either retry it or abort. There are no deadlocks, starvation, or other liveness failures, and threads interfere with each other far less than under pessimistic locking. More importantly, optimistic locks carry no contention overhead, so they perform better.

CAS (compare-and-swap) is the core algorithm of optimistic locking. It takes three operands: V (the variable to update), E (the expected value), and N (the new value). The variable is set to the new value only if its current value equals the expected value. If they differ, another thread has already updated the variable, so the current thread does nothing and the real value of V is returned.

// Excerpts from java.util.concurrent.atomic.AtomicInteger (JDK 8)

// Update the value with a CAS operation
public final boolean compareAndSet(int expect, int update) {
    return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}

// CAS-based increment by 1
public final int getAndIncrement() {
    return unsafe.getAndAddInt(this, valueOffset, 1);
}

// CAS-based decrement by 1
public final int getAndDecrement() {
    return unsafe.getAndAddInt(this, valueOffset, -1);
}

CAS relies on the processor's underlying instructions to implement atomic operations. How do those underlying instructions achieve atomicity?

A processor communicates with physical memory far more slowly than it executes internally, so processors have their own caches: during execution, frequently used data is kept in the processor's L1, L2, and L3 caches to speed up repeated reads.

The processor provides bus locking and cache locking mechanisms to ensure atomicity of complex memory operations.

With bus locking, when a processor wants to operate on a shared variable it asserts a LOCK signal on the bus, and no other processor can touch that variable: the locking processor has exclusive access. But bus locking blocks all other processors' requests for the shared variable while it is held, which increases system overhead.

Later processors therefore added a cache locking mechanism: when one processor operates on a shared variable resident in its cache, it notifies the other processors to invalidate their cached copies of that data or re-read it (cache-coherence protocols such as MESI). Modern processors support cache locking.

Because the CAS is retried in a for loop (inside AtomicInteger itself in Java 7, and inside Unsafe's getAndAddInt in Java 8), CPU overhead under heavy contention is high. In JDK 1.8, Java added a new atomic class, LongAdder. LongAdder outperforms AtomicInteger and AtomicLong in high-concurrency scenarios, at the cost of more memory.

The principle of LongAdder is to reduce concurrency on a single shared variable by spreading the update pressure across multiple variables: the contending writer threads' updates are dispersed into an array of cells, with different threads hitting different slots. Each thread performs CAS only on the value in its own slot. When the value is read, the base shared variable and all the scattered cell values are summed, returning an approximately accurate value.

LongAdder consists of a base variable and a Cell[] array. When there is a single writer thread and no contention, LongAdder uses base directly as the atomic variable and modifies it with CAS. When multiple writer threads contend, every thread except the one occupying base writes its updates into a slot of the Cell[] array instead. The final result is:

value = base + Σ Cell[i]   (summing over all slots i of the Cell array)

Note that the value LongAdder returns while updates are in flight is only approximately accurate, though the value read after all threads have finished is exact. So in scenarios with strict real-time accuracy requirements, LongAdder is not a drop-in replacement for AtomicInteger or AtomicLong: a value retrieved immediately during concurrent updates may be inexact, while the value read after all threads complete is guaranteed accurate. This trade-off suits statistics, where a live display only needs an approximate figure but the final tally must be exact.
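A minimal usage sketch (thread and iteration counts are arbitrary): LongAdder handles the hot concurrent counter, and sum() is read once the writers are done, so the result is exact.

import java.util.concurrent.atomic.LongAdder;

public class LongAdderDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder counter = new LongAdder();

        Thread[] writers = new Thread[8];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    counter.increment(); // CAS on base, or on this thread's cell under contention
                }
            });
            writers[i].start();
        }
        for (Thread t : writers) {
            t.join();
        }

        // sum() = base + all cells; exact here because all writers have finished
        System.out.println(counter.sum()); // 800000
    }
}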

Limitations of CAS optimistic locking

CAS optimistic locking is limited in ordinary use: it guarantees atomicity only for an operation on a single variable. When an operation spans multiple variables, CAS by itself cannot help, whereas the pessimistic locks described in the previous sections can simply lock the whole code block. A sketch of this contrast follows.
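A sketch of the limitation (an invented class, for illustration only): moving money between two accounts must change two variables together, which a single CAS cannot do, while a lock over the whole block can.

import java.util.concurrent.atomic.AtomicLong;

public class TransferDemo {
    private final AtomicLong from = new AtomicLong(100);
    private final AtomicLong to = new AtomicLong(0);
    private final Object lock = new Object();

    // Broken: each CAS-backed update is atomic on its own, but another
    // thread can observe the state between the two updates.
    public void transferWithCas(long amount) {
        from.addAndGet(-amount);
        // <-- the invariant (from + to == 100) is violated here
        to.addAndGet(amount);
    }

    // Works: a pessimistic lock makes the two updates one critical section.
    public void transferWithLock(long amount) {
        synchronized (lock) {
            from.addAndGet(-amount);
            to.addAndGet(amount);
        }
    }
}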

In scenarios where writes outnumber reads, most threads' CAS attempts fail and keep retrying, so a large number of threads occupy the CPU for long stretches and the performance overhead is high. The LongAdder class added in JDK 1.8, discussed above, addresses exactly this problem by trading space for time.

The CAS ABA problem is solved with version management, for example the JDK's AtomicStampedReference, which pairs each reference with a version stamp.
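A minimal sketch using AtomicStampedReference: the CAS succeeds only if both the reference and the stamp match, so an A → B → A sequence is detected because the stamp has moved on.

import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        AtomicStampedReference<String> ref =
                new AtomicStampedReference<>("A", 0);

        int stamp = ref.getStamp();           // observe value "A" at stamp 0

        // Another thread performs A -> B -> A, bumping the stamp each time.
        ref.compareAndSet("A", "B", ref.getStamp(), ref.getStamp() + 1);
        ref.compareAndSet("B", "A", ref.getStamp(), ref.getStamp() + 1);

        // A plain value comparison would succeed ("A" == "A"), but the stale
        // stamp makes this CAS fail, exposing the ABA change.
        boolean swapped = ref.compareAndSet("A", "C", stamp, stamp + 1);
        System.out.println(swapped);          // false
    }
}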

Performance comparison

In read-heavy scenarios (far more reads than writes), ReentrantReadWriteLock, StampedLock, and optimistic locking give the best read/write performance. In write-heavy scenarios, optimistic locking performs best, and the other four locks (synchronized, ReentrantLock, ReentrantReadWriteLock, StampedLock) perform similarly. When reads and writes are roughly balanced, both read-write locks and optimistic locking outperform synchronized and ReentrantLock.

Reset the lock

During garbage collection's stop-the-world (STW) phase, the lock state is reset if no Java thread is contending for the lock.

Flow chart of the full lock-upgrade process (the column's diagram, not reproduced here)

Conclusion

This article covered the principle of the synchronized lock, from the Monitor down to the operating system's Mutex Lock; the JDK 1.6 lock-upgrade path from biased lock through lightweight lock to heavyweight lock, along with lock elimination, lock coarsening, and reduced lock granularity; and optimistic locking based on CAS and LongAdder. Finally, the performance comparison and optimization of locks were introduced.

References

Time.geekbang.org/column/arti…

www.cnblogs.com/shemlo/p/11…