This is Why's 53rd original article


First, a word about AtomicLong

I won’t go into the details of AtomicLong.

The purpose of this section is to warm up and ask a question about the CAS operation and the volatile keyword.

I don't know why the source code is written this way; if any friend knows the answer, I hope you'll share a pointer or two.

A fist salute to you, my friend.

To set this question up properly, I have to use the book The Art of Java Concurrency Programming as the lead-in.

First of all, in section 2.3 of the book, "Implementation Principles of Atomic Operations," two mechanisms are mentioned for how the processor implements atomic operations:

  • Use bus locks to ensure atomicity.

  • Use cache locks to ensure atomicity.

A bus lock uses the LOCK# signal provided by the processor. When one processor asserts this signal on the bus, requests from other processors are blocked, so that processor can monopolize shared memory.

Bus locking is a rather blunt way to ensure atomicity, which makes its overhead high.

As a result, current processors use cache locking as an optimization in certain situations.

The concept of a cache lock can be read in the book:


Figure 2-3, as mentioned, looks like this:


The key, actually, is the Lock prefix instruction.

A memory area operated on by a Lock-prefixed instruction is locked, and cannot be accessed by other processors at the same time.

According to the IA-32 Architecture Software Developer's Manual, a Lock-prefixed instruction causes two things on multi-core processors:

  • It writes the data of the current processor's cache line back to system memory.
  • That write-back invalidates the data cached for that address in other CPUs.

As for the volatile keyword, there is no doubt that it uses the Lock prefix.

So do CAS operations on the JVM also use a Lock-prefixed instruction?

Yes, they do.

CAS operations in the JVM are implemented with the processor's CMPXCHG instruction, which is also a Lock-prefixed instruction.


Ok, let’s look at one method:

java.util.concurrent.locks.AbstractQueuedLongSynchronizer#compareAndSetState


This method belongs to AQS and performs a CAS operation. For now, I only care about the part I've boxed.
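For readers who can't make out the image, here is the method as it appears in the JDK 8 sources (quoted from memory, so treat it as a sketch), Javadoc included:

    /**
     * Atomically sets synchronization state to the given updated
     * value if the current state value equals the expected value.
     * This operation has memory semantics of a volatile read and write.
     */
    protected final boolean compareAndSetState(long expect, long update) {
        return unsafe.compareAndSwapLong(this, stateOffset, expect, update);
    }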

This operation has the memory semantics of both a volatile read and a volatile write.

And what is this operation?

It's the unsafe.compareAndSwapLong call on line 344, which is a native method:

public final native boolean compareAndSwapLong(Object var1, long var2, long var4, long var6);

Why does this operation have the memory semantics of volatile reads and writes?

Here’s what the book says:

The final implementation of this native method in OpenJDK (openjdk-7-fcs-src-b147-27_JUN_2011) is at openjdk\hotspot\src\os_cpu\windows_x86\vm\atomic_windows_x86.inline.hpp (for the Windows operating system on an x86 processor).

The Intel manual describes the Lock prefix as follows.

  • Ensures that read-modify-write operations on memory are executed atomically. On Pentium and earlier processors, a Lock-prefixed instruction locked the bus during its execution, temporarily preventing other processors from accessing memory through the bus. Obviously, this is expensive. Starting with the Pentium 4, Intel Xeon, and P6 processors, Intel instead uses cache locking to guarantee the atomicity of instruction execution, which greatly reduces the execution overhead of Lock-prefixed instructions.

  • Forbids reordering this instruction with the read and write instructions before and after it.

  • Flushes all data in the write buffer to memory.

Points 2 and 3 are memory-barrier effects, and together they provide the memory semantics of both volatile reads and volatile writes.

Okay, if you still have doubts about the book, let me take you to the official documentation:

https://docs.oracle.com/javase/8/docs/api/


The part I’ve framed:

compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.

The reason, again, is the Lock-prefixed instruction.

Ok, so here we can conclude:

compareAndSet has the memory semantics of both a volatile read and a volatile write.

So here’s the problem!

This operation is also called in AtomicLong:


The AtomicLong value is volatile:
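Putting the two screenshots together, the relevant JDK 8 source reads roughly like this (quoted from memory):

    // java.util.concurrent.atomic.AtomicLong (JDK 8)
    private volatile long value;

    public final boolean compareAndSet(long expect, long update) {
        return unsafe.compareAndSwapLong(this, valueOffset, expect, update);
    }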


Since compareAndSwapLong already gives the operation the memory semantics of both volatile reads and volatile writes, why does value still need to be declared volatile?

A friend raised the same question, and the result is that neither of us knows the answer:


I wonder whether it depends on the platform: perhaps this holds on x86 but not on other architectures. I have no proof, though.

If you know why, please point me in the right direction.

Ok, so speaking of CAS, here comes a classic interview question:

What are the problems with CAS implementing atomic operations?

  • ABA problem.

  • Long spin times carry a high execution overhead.

  • Atomicity is guaranteed for operations on only one shared variable.

If you don't know, or don't fully understand, the three points above, then I suggest you make sure you do after reading this article; they belong to the FAQ series for interviews.

I'm going to focus on the loop-time overhead here: if a spin CAS keeps failing for a long time, it imposes a significant execution burden on the CPU.

“AtomicLong is based on spin CAS and has some performance issues, blah blah……”

If I were the interviewer, I would just smile at you, letting you think your answer was perfect.


I know why you'd answer that: you've read a few blog posts and skimmed the standard interview questions, and they all say AtomicLong is based on spin CAS.

But, my friend, while you can say that, the answer isn't complete: JDK 7 and JDK 8 differ here.

JDK 7's AtomicLong is based on spin CAS, as in the following method:
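The screenshot shows getAndIncrement; in the JDK 7 sources it reads roughly like this (quoted from memory):

    // java.util.concurrent.atomic.AtomicLong (JDK 7)
    public final long getAndIncrement() {
        while (true) {                        // the spin
            long current = get();             // volatile read of the current value
            long next = current + 1;
            if (compareAndSet(current, next)) // a single CAS attempt
                return current;
        }
    }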


The while(true) is the spin, and it relies entirely on the compareAndSet method:


The native compareAndSwapLong method called here corresponds to the Lock-prefixed instruction CMPXCHG.

In JDK 8, some AtomicLong methods still spin, but they no longer rely solely on the CMPXCHG instruction.
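In rough JDK 8 source terms (again from memory), getAndIncrement delegates to Unsafe#getAndAddLong, which spins in a do-while:

    // java.util.concurrent.atomic.AtomicLong (JDK 8)
    public final long getAndIncrement() {
        return unsafe.getAndAddLong(this, valueOffset, 1L);
    }

    // sun.misc.Unsafe (JDK 8)
    public final long getAndAddLong(Object o, long offset, long delta) {
        long v;
        do {
            v = getLongVolatile(o, offset);                     // volatile read
        } while (!compareAndSwapLong(o, offset, v, v + delta)); // CAS until success
        return v;
    }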


You can see that there is still a do-while loop, and that compareAndSwapLong is still called:


The Lock-prefixed instruction corresponding to this method is XADD.

From the perspective of Java code, it’s all spin, it’s all compareAndSwapLong. There’s no difference.

However, on x86, JDK 8 treats this compareAndSwapLong path specially: the JIT replaces the CAS loop with the XADD assembly instruction.

The XADD instruction is fetch-and-add.

The CMPXCHG instruction is compare-and-swap.

And XADD performs better than CMPXCHG.

Check out this article on Oracle's website:

https://blogs.oracle.com/dave/atomic-fetch-and-add-vs-compare-and-swap

Pay particular attention to the comments below the article. I've captured two of them for everyone to savor:


And then this:


Bottom line: the article makes sense, and we (Dave and Doug) are thinking about it. We'll probably do something in the JIT and replace the CAS operation with the LOCK:XADD instruction on x86.

(My earlier understanding of this point was wrong; a friend pointed it out and I have since corrected it.)


So, from JDK 8 on, AtomicLong's methods are improved into a dual XADD + CMPXCHG approach.

A plain CAS is a single compare-and-swap: it executes exactly once, returning true on success and false on failure. A spin CAS, on the other hand, performs the compare-and-swap inside an infinite loop that keeps going until it returns true.
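A minimal sketch of the difference, assuming an AtomicLong named counter:

    AtomicLong counter = new AtomicLong();

    // plain CAS: exactly one attempt, true on success, false on failure
    long cur = counter.get();
    boolean ok = counter.compareAndSet(cur, cur + 1);

    // spin CAS: retry the compare-and-swap until it succeeds
    do {
        cur = counter.get();
    } while (!counter.compareAndSet(cur, cur + 1));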

So when talking about CAS, don't just say "the loop time is expensive." Remember to attach the two conditions: "spinning" and "under contention."

JDK 8 uses XADD in place of CAS precisely for its better performance.



From the analysis above, AtomicLong gained some performance from JDK 7 to JDK 8, but the change is minor.

One problem remains: although it achieves atomic addition and subtraction, under heavy contention the value being manipulated is a single piece of hot data that every thread has to compete for, which produces heavy conflicts on concurrent modification.

So, at bottom, its main problem is the sharing of hot data.

To solve this problem, Doug Lea introduced the LongAdder class in JDK 8.

The even better LongAdder

Let’s take a look at the introduction on the official website:


The screenshot above contains two paragraphs, which are the official brief introduction to LongAdder. Let me translate and interpret them for you.


Under multi-threaded contention, a set of variables is dynamically grown to reduce contention. The sum() method returns the sum of these variables at a given moment.

So we know that its return value, whether from the sum() method or the longValue() method, is not an exact value at that instant.

It means that the moment you get this value, the value has actually changed.

This is very important. Why is it so?

We can see this by comparing the AtomicLong and LongAdder increment methods:


AtomicLong's increment returns a value: the exact value right after the call. It is one atomic operation.

LongAdder's increment returns nothing, so you can only call sum() to read the current value.

If what you want is "increment first, then read the value," with LongAdder that combination is not atomic.
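A small sketch of the contrast:

    AtomicLong atomic = new AtomicLong();
    long exact = atomic.incrementAndGet(); // one atomic step: the exact value after my increment

    LongAdder adder = new LongAdder();
    adder.increment();                     // returns void
    long roughly = adder.sum();            // other threads may have updated in between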

Therefore, when multiple threads call it concurrently, the sum method cannot be relied on to return an exact value, unless you lock.

The instructions for this method are as follows:


As for why it can’t return an exact value, this is related to its design, which will be discussed later.


LongAdder performs better than its big brother AtomicLong when updating (writing) a piece of shared data from many threads, for example when collecting statistics. At low concurrency the two classes are about the same; at high concurrency, however, LongAdder's throughput is significantly higher, at the cost of more space. It is the classic idea of trading space for time.

This paragraph continues the description begun in the first one.

Because it has no exact return value under multi-threaded concurrency, when you need to act on the returned value you have to think carefully: do you need an exact value, or approximate, statistics-grade data?

For example, if you are building a sequence number generator, you need an exact return value, so AtomicLong is more appropriate.

If you are using it as a counter, in a write-heavy, read-light scenario, for example counting how many times an interface is accessed, where you don't need an exact value at every moment, then use LongAdder.
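A hypothetical sketch of that counter scenario (the class and method names are invented for illustration):

    // write-heavy, read-light: many threads bump the counter, sum() is read occasionally
    class InterfaceStats {
        private final LongAdder hits = new LongAdder();

        void onRequest() {
            hits.increment();  // hot path, called by many threads
        }

        long approximateHits() {
            return hits.sum(); // fine if it is slightly stale
        }
    }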

In general, AtomicLong guarantees an exact value at every moment, while LongAdder guarantees an exact value eventually; in high-concurrency scenarios, LongAdder has better write performance than AtomicLong.

The following three questions will be discussed:

  • How does LongAdder solve the problem of concurrent modification conflict caused by multi-thread operation hotspot value?

  • Why can’t LongAdder’s sum method return an exact value in high-concurrency scenarios?

  • Why does LongAdder have better write performance than AtomicLong in high concurrency scenarios?

Let’s take a look at a picture first. It doesn’t matter if you can’t understand it. Let’s have a general impression:


Next, let's explore the source code; there are no secrets under the source code.

From the source we can see that the add method is the key:
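From the JDK 8 sources, LongAdder#add reads roughly like this (quoted from memory; the ①②③ marks discussed below are added as comments):

    public void add(long x) {
        Cell[] as; long b, v; int m; Cell a;
        if ((as = cells) != null || !casBase(b = base, b + x)) {
            boolean uncontended = true;
            if (as == null || (m = as.length - 1) < 0 ||    // ① cells null or empty?
                (a = as[getProbe() & m]) == null ||         // ② no Cell for this thread?
                !(uncontended = a.cas(v = a.value, v + x))) // ③ CAS on that Cell failed?
                longAccumulate(x, null, uncontended);
        }
    }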


There are cells, base and so on, so let’s look at these member variables before we explain the add method.

These variables are in Striped64.

LongAdder is a subclass of Striped64:


Four of these variables are as follows:


  • NCPU: the number of CPUs, which bounds the size of the cells array.

  • cells: an array whose size is a power of 2 when non-null. It holds Cell objects.

  • base: the base value. When there is no contention, updates are CASed directly onto base. It also acts as a fallback during initialization: since cells can only be initialized once, threads that lose the initialization race accumulate their values onto base.

  • cellsBusy: a lock flag used while cells is being initialized or resized.
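As a sketch, the corresponding declarations in the JDK 8 Striped64 source are roughly:

    // java.util.concurrent.atomic.Striped64 (JDK 8)
    static final int NCPU = Runtime.getRuntime().availableProcessors();
    transient volatile Cell[] cells;  // non-null once contention appears; length is a power of 2
    transient volatile long base;     // accumulated directly when uncontended
    transient volatile int cellsBusy; // spin-lock flag for creating/resizing cells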

The "set of variables" mentioned earlier in the documentation is this cells array.


Ok, let’s go back to the add method:


When cells has not been initialized, either this is an early call or contention is low enough that every CAS on base has succeeded so far.

The casBase method performs the CAS operation.

When casBase returns false due to contention, execution enters the if branch.

This if branch contains four conditions, which amount to three checks:


  • The check marked ① tests again whether the cells array is null or has size 0 (as is the local variable holding cells).

  • The check marked ② tests whether the current thread can map to a Cell in the cells array, by taking the thread's probe value modulo the array size (getProbe() & m).

  • The check marked ③ tests whether the CAS operation on the obtained Cell object succeeds.

When the cells array is populated and the slot computed by getProbe() & m holds a Cell, the CAS is attempted once more on that Cell object.

If the above conditions are not met, the longAccumulate method is entered.

This method operates on the cells array. If you think of the array as having three states, uninitialized, initializing, and initialized, the three branches below correspond to them:


  • In branch ①, cells is already initialized, so the value can be accumulated into a Cell, or the array can be expanded.

  • In branch ②, cells is uninitialized and not yet locked, so the thread CASes the cellsBusy flag to try to take the lock. Once the lock is held, it can perform the initialization.

  • In branch ③, cells is being initialized by another thread, so the value is CASed onto the base field instead.

These three branches sit inside an infinite loop.

So if cells has not been initialized yet, then thanks to the lock flag, no matter how high the concurrency is, only one thread gets to initialize it, and while cells is being initialized or expanded, the other threads' values are accumulated onto base.
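The sum method then simply adds base and every non-null cell; from the JDK 8 sources (quoted from memory):

    public long sum() {
        Cell[] as = cells; Cell a;
        long sum = base;
        if (as != null) {
            for (int i = 0; i < as.length; ++i) {
                if ((a = as[i]) != null)
                    sum += a.value; // each cell may still be changing as we read it
            }
        }
        return sum;
    }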

This is how the sum method works.

You see, this is really the idea of striped (segmented) operation. I don't know whether it reminds you of ConcurrentHashMap; it wouldn't be surprising if it did, because both were written by Doug Lea.

As a footnote, cells are initialized to size 2:


And the maximum size of cells is the number of CPU cores:


The Cell class is decorated with the @Contended annotation to address false sharing:
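The Cell class itself is tiny; roughly, from the JDK 8 Striped64 source (the Unsafe setup is omitted):

    @sun.misc.Contended static final class Cell {
        volatile long value;
        Cell(long x) { value = x; }
        final boolean cas(long cmp, long val) {
            return UNSAFE.compareAndSwapLong(this, valueOffset, cmp, val);
        }
        // UNSAFE / valueOffset initialization omitted
    }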


Speaking of false sharing, I'm reminded of a conjecture I made earlier in the article "A Technical Problem That Has Been Bothering Me for 122 Days, and I Think I Know the Answer":


In that investigation I also tried this annotation as a fix for false sharing, but the final experiments showed that false sharing was not the cause.

After that article was published, many friends gave me feedback with their views, and more and more mysterious problems turned up along the way, but in the end, all of these problems pointed to the same thing: the JIT.

Anyway, back to the LongAdder in this article.

In general, when there is no contention, LongAdder behaves just like AtomicLong. It's when contention appears that LongAdder shows its stuff. Now we can go back to the diagram and see what's going on:


Ok, now let’s go back to the three questions raised earlier:

  • How does LongAdder solve the problem of concurrent modification conflict caused by multi-thread operation hotspot value?

  • Why can’t LongAdder’s sum method return an exact value in high-concurrency scenarios?

  • Why does LongAdder have better write performance than AtomicLong in high concurrency scenarios?

They are really all one and the same question.

Because LongAdder splits the hot value and spreads it across individual Cells to operate on, the conflict is essentially dispersed into the Cells. That is how the problem of concurrent modification conflicts is solved.

When contention occurs, sum = base + all cells. Under high concurrency, by the time you read sum, some cell is most likely being modified by another thread. How could a value that keeps changing in real time give you an exact answer? Of course, you could lock to obtain the current exact value, but if that's your scenario, why use LongAdder at all? Wouldn't AtomicLong serve you better?

Why does LongAdder have better write performance than AtomicLong in high concurrency scenarios?

Use that clever head of yours for a moment, my friend.

With AtomicLong, whether or not there is contention, every thread writes to the same shared value, and spins on conflict.

LongAdder behaves just like AtomicLong when there is no contention; when contention appears, the conflicts are split across different Cells, which makes writes faster.

A little thought

The title of this article is "I found the secret of high concurrency in LongAdder, and it's just two words……".

So what are these two words?

The two words are: split up. I think distributed systems and high concurrency are both rooted in the idea of splitting.

And not just in this article's LongAdder.

Microservices, database and table sharding, read-write splitting…… all of these are splitting, spreading out concentrated pressure.

When performance is poor, we often say "just throw machines at it"; that, too, is a split.

Of course, there are problems as well as benefits.

For example, thorny distributed transactions, data aggregation queries, and similar requirements.

Let me give you an example that I’ve come across.

Before writing this article, I looked at the source code for LongAdder, learned about its structure, and the difference between it and AtomicLong, and I was reminded of a requirement I had worked on earlier.

It was an account service: one big merchant's account was a hot account with very frequent transactions.

The balance of that account is exactly a piece of shared hot data.

What we did was split that hot account into multiple shadow accounts, so the pressure was spread out.

In fact, this idea is in exactly the same vein as LongAdder.
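Here is a purely hypothetical sketch of the idea (all names are invented for illustration; the real system naturally also involved persistence, routing rules, and so on):

    // spread updates of one hot account over N shadow accounts, LongAdder-style
    class ShadowedAccount {
        private final AtomicLong[] shadows;

        ShadowedAccount(int n) {
            shadows = new AtomicLong[n];
            for (int i = 0; i < n; i++) shadows[i] = new AtomicLong();
        }

        void add(long delta) {
            // pick a shadow per thread to spread the contention
            int idx = (int) (Thread.currentThread().getId() % shadows.length);
            shadows[idx].addAndGet(delta);
        }

        long balance() { // approximate while transactions are in flight
            long sum = 0;
            for (AtomicLong s : shadows) sum += s.get();
            return sum;
        }
    }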

What’s the problem with splitting in this scenario?

One of the problems is that the total balance of the account is the sum of the shadow accounts, and each shadow account's balance keeps changing, so we cannot guarantee that the balance is an exact, real-time value.

But the merchant didn't care about that. All they cared about was that the end-of-day balance was accurate and the daily reconciliation checked out.

We improved performance while still meeting the requirement.

Another simple way to think about it: take the statement "implement atomic increment and decrement" as a requirement.

In my humble opinion, the AtomicLong class does exactly that. Once delivered, it runs and it meets the need, and it even comes with the ability to return you the exact value every time.

LongAdder is a more elegant implementation of the same requirement. It iterates on the original: the functionality stays the same, nothing extra is added, but for certain scenarios it simply works better.

The idea the two of them convey to me is not the oft-repeated "ship it first, make it run, iterate later."

It is: it already runs, but there are faster, more elegant approaches, and we are able to pursue them.

This is something we need to learn from.

One last word (for attention)

If you spot anything wrong, please leave a comment pointing it out and I will fix it.

Thank you for reading. I insist on writing original content, and I warmly welcome and appreciate your attention.

I am Why, a would-be man of letters delayed by writing code. I'm not a big shot, but I love sharing. A warm and interesting nice guy from Sichuan.