Table of contents

1. CPU cache basics

2. Cache hit

3. Cache consistency

4. Typical use cases

5. False sharing in queues

Preface

CPU cache knowledge is considered table stakes for interviews at big companies, and it is genuinely important: the better you master this topic, the more points it will earn you.

A little history:

In the early decades of computing, main memory was incredibly slow and expensive, but CPUs weren’t particularly fast either. Since the 1980s the gap has widened rapidly: microprocessor clock speeds have climbed quickly, while memory access times have improved far less dramatically. As the gap grew, it became increasingly clear that a new kind of fast memory was needed to bridge it.

Before 1980: CPUs had no cache

1980–1995: CPUs gained caches, eventually two levels

Today: some CPUs have an L4, some have an L0, and L1, L2, and L3 are standard

Let’s get into it.

CPU caching basics

A register is a small storage area inside the central processing unit (CPU) used to hold instructions, data, and addresses. Registers have very limited capacity but can be read and written extremely quickly. In computer architecture, registers hold the intermediate results of calculations at a given point in time and speed up programs by giving the CPU rapid access to that data.

Registers sit at the top of the memory hierarchy and are the fastest memory the CPU can read and write. A register is usually measured by the number of bits it can hold, for example an 8-bit register or a 32-bit register. Within the CPU, components built from registers include the instruction register (IR), the program counter, and the accumulator. Registers are now usually implemented as a register file, but on some machines they have also been implemented with individual flip-flops, high-speed core memory, thin-film memory, and other techniques.

The term can also refer to the group of registers that instruction outputs and inputs can address directly; these are better known as architectural registers. For example, the x86 instruction set defines a set of eight 32-bit registers, but a CPU that implements the x86 instruction set may contain many more than eight registers internally.

CPU cache

In computer systems, the CPU cache (simply “the cache” in this article) is used to reduce the average time the processor takes to access memory. In the pyramid-shaped memory hierarchy it sits in the second layer from the top, just below the CPU registers. Its capacity is much smaller than main memory, but its speed can approach the processor frequency.

When the processor issues a memory access request, it first checks whether the requested data is in the cache. If it is (a hit), the data is returned directly without accessing memory; if it is not (a miss), the corresponding data must first be loaded from memory into the cache and then returned to the processor.

Caching is effective mainly because programs exhibit locality of memory access while running, both spatial locality and temporal locality. By exploiting this locality, a cache can achieve very high hit rates.

From the processor’s point of view, the cache is a transparent component, so programmers usually cannot manipulate it directly. However, code can be optimized around the characteristics of the cache to make better use of it.
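To make locality concrete, here is a minimal sketch (my own illustration, not one of the original article’s benchmarks) that sums a large 2D array twice. The row-major loop walks memory sequentially and benefits from spatial locality; the column-major loop jumps across rows, misses far more often, and is typically several times slower:

public class LocalityDemo {
    static final int N = 4096;
    static int[][] matrix = new int[N][N];

    // Row-major traversal: consecutive accesses touch neighbouring bytes,
    // so each loaded cache line is fully used (good spatial locality).
    static long sumRowMajor() {
        long sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += matrix[i][j];
        return sum;
    }

    // Column-major traversal: consecutive accesses are N ints apart,
    // so almost every access lands on a different cache line (poor locality).
    static long sumColMajor() {
        long sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += matrix[i][j];
        return sum;
    }

    public static void main(String[] args) {
        long t = System.currentTimeMillis();
        sumRowMajor();
        System.out.println("row-major:    " + (System.currentTimeMillis() - t) + " ms");
        t = System.currentTimeMillis();
        sumColMajor();
        System.out.println("column-major: " + (System.currentTimeMillis() - t) + " ms");
    }
}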

Today’s computers generally have three levels of cache (L1, L2, L3). Here is the structure:

Among them:

  • The L1 cache comes in two kinds: an instruction cache and a data cache. The L2 and L3 caches do not distinguish between instructions and data.
  • L1 and L2 are private to each CPU core, while L3 is shared by all cores.
  • The closer a cache level is to the CPU, the smaller and faster it is; the farther away, the slower it is.
  • After the caches comes main memory, and after main memory comes the hard disk.

Take a look at their speed:

Take a look at the processor I’m working on, even though it’s a bit underpowered:

From its specifications we can see:

L1 is roughly 27–36 times faster than main memory. L1 and L2 are measured in KB while L3 is measured in MB, and L1 is split into a 32 KB data cache and a 32 KB instruction cache. So why is there no L4?

So let’s look at a picture

This chart, from Anandtech’s review of Haswell, is useful because it illustrates the performance impact of adding a large (128 MB) L4 cache alongside the regular L1/L2/L3 structure. Each step represents a new cache level. The red line is the chip with the L4 – note that for large data sets it is still almost twice as fast as the other two Intel chips. But a larger cache requires more transistors, is slower and more expensive, and increases the size of the chip.

Is there L0?

The answer is yes. Modern CPUs also typically have a very small “L0” cache, often just a few kilobytes in size, used to store micro-operations (µOPs). Both AMD and Intel use such a cache: Zen has a 2,048-µOP cache, while Zen 2 has a 4,096-µOP cache. These tiny cache pools operate under the same general principles as L1 and L2, but represent an even smaller pool of memory that the CPU can access with lower latency than L1. Companies often tune these caches against one another. Zen 1 and Zen+ (Ryzen 1xxx, 2xxx, 3xxx APUs) have a 64 KB, 4-way set associative L1 instruction cache and a 2,048-µOP L0 cache. Zen 2 (Ryzen 3xxx desktop CPUs, Ryzen Mobile 4xxx) has a 32 KB, 8-way set associative L1 instruction cache and a 4,096-µOP cache. Doubling the set associativity and the µOP cache size allowed AMD to cut the L1 cache size in half.

Having said all that, how does a CPU cache actually work?

The purpose of the CPU cache, from the CPU’s point of view: “I am so fast that going to main memory for every piece of data is far too expensive, so I keep my own pool of the data I want most.” What gets loaded into the cache? The data involved in computation and the program code.

What if I don’t find the data I’m looking for in L1’s pool? That’s a cache miss.

What happens next? The lookup continues to L2. Some processors use an inclusive cache design (data stored in L1 is duplicated in L2), while others are exclusive (the two caches never hold the same data). If the data is not found in L2 either, the CPU continues down the chain to L3 (usually still on the same die), then L4 (if present), and finally main memory (DRAM).

This leads to a question: how can the lookup be efficient? The CPU certainly cannot walk through the cache entry by entry.

CPU cache hit

cache line

Cache lines are also called cache blocks: the CPU loads data from memory in fixed-size chunks. In general the cache line size is 64 bytes = 16 32-bit integers (some CPUs use 32 bytes or 128 bytes). Here is the cache line size of my computer’s processor:

In the figure above, my processor’s L1 data cache is 32 KB:

32 KB / 64 B = 512 cache lines

Cache line to memory mapping strategies:

  • Hash mapping: (memory address % number of cache lines) * 64. Hash conflicts are likely to occur.

  • N-way set associative: cache lines are grouped into sets of N lines, and a line within a group is then located using the address fields described below.

As the figure above shows, the 32 KB L1 data cache is divided into 8 ways, so each way is 4 KB.

How do you address it?

As we saw earlier, most cache lines are 64 bytes:

  • Tag: each cache line has an independently stored 24-bit (3-byte) tag, which holds the upper 24 bits of the memory address.
  • Index: the next 6 bits (3/4 of a byte) of the memory address hold the index of the cache line within the way; 6 bits can index 2^6 = 64 cache lines.
  • Offset: the final 6 bits after the index give the byte offset of the data within the 64-byte cache line.
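As a rough sketch of the lookup arithmetic, using the 24/6/6-bit split described above (the bit widths come from this example and vary between processors), an address can be decomposed like this:

public class CacheAddress {
    static final int OFFSET_BITS = 6;  // 2^6 = 64 bytes per cache line
    static final int INDEX_BITS  = 6;  // 2^6 = 64 cache lines per way (4 KB / 64 B)

    public static void main(String[] args) {
        long address = 0x3_ABCD_1234L;  // an arbitrary example address

        long offset = address & ((1L << OFFSET_BITS) - 1);                  // byte within the line
        long index  = (address >>> OFFSET_BITS) & ((1L << INDEX_BITS) - 1); // which line in the way
        long tag    = address >>> (OFFSET_BITS + INDEX_BITS);               // compared with the stored tag

        System.out.printf("tag=%x index=%d offset=%d%n", tag, index, offset);
    }
}

The tag extracted here is what gets compared against the tag stored alongside the cache line in the lookup process below.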

Specific process:

  1. Use the index to locate the corresponding cache block (set).
  2. Compare the tag with the tag stored for that cache block; the result is either a hit or a miss.
  3. On a hit, the in-block offset locates the target word within the block, and that word is then rewritten.
  4. On a miss there are two possible policies, called write allocate and no-write allocate, depending on the system design. With write allocate, the missed data is first read into the cache just as on a read miss, and the write is then performed on the loaded word. With no-write allocate, the data is written directly back to memory.

What if one of the caches is full?

The least recently used cache line is evicted, a policy commonly known as LRU (least recently used).
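Real hardware implements an approximate LRU in silicon, but as a software illustration of the eviction idea, Java’s LinkedHashMap in access order gives a tiny LRU sketch:

import java.util.LinkedHashMap;
import java.util.Map;

// A toy LRU "cache" with a fixed number of slots: the least recently
// accessed entry is evicted when a new one would exceed the capacity.
public class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruSketch(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruSketch<Integer, String> cache = new LruSketch<>(2);
        cache.put(1, "a");
        cache.put(2, "b");
        cache.get(1);             // touch 1 so it becomes most recently used
        cache.put(3, "c");        // evicts 2, the least recently used
        System.out.println(cache.keySet()); // prints [1, 3]
    }
}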

Having analyzed the L1 cache, you can analyze L2 and L3 in the same way; we won’t repeat that here.

Cache consistency

The following is adapted from Wikipedia:

To keep data consistent with lower levels of the hierarchy (such as main memory), updates must be propagated in time. This propagation is governed by the write policy, of which there are two kinds: write-back and write-through.

Combining the write policy with the miss allocation policy mentioned above gives the following table:

From the table above we can see:

Write-back: on a cache hit, memory is not updated immediately, which reduces memory write operations. The allocation policy is usually write allocate.

  • How do we know that a cached block has been modified since it was loaded? Each cache line carries a dirty bit that records whether it has been updated since it was loaded. (As mentioned above, the CPU loads data block by block, not byte by byte.)

Write-through:

  • Write-through means that whenever the cache receives a write instruction, it writes the data straight through to memory. If the address is also present in the cache, the cached copy must be updated at the same time. Because this design causes a large number of memory writes, a write buffer is used to reduce stalls in the hardware; it is usually no more than four cache blocks in size. Write buffers can also be used with write-back caches for the same purpose.

  • Write-through is easier to implement than write-back, and data consistency is easier to maintain.

  • The allocation policy is usually no-write allocate.

For a two-level cache system, the first-level cache might use write-through to simplify the implementation, while the second-level cache uses write-back to ensure data consistency.
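As a software analogy only (not how the hardware is wired), here is a minimal sketch of a single write-back, write-allocate cache line with a dirty bit: writes only mark the line dirty, and memory is updated when the dirty line is evicted:

import java.util.HashMap;
import java.util.Map;

public class WriteBackSketch {
    static Map<Long, Long> memory = new HashMap<>();   // stands in for main memory

    static class Line {
        long tag;
        long value;
        boolean valid;
        boolean dirty;   // set when the line is modified after being loaded
    }

    static Line line = new Line();   // a single-line "cache" for illustration

    static long read(long addr) {
        ensureLoaded(addr);
        return line.value;
    }

    static void write(long addr, long value) {
        ensureLoaded(addr);          // write allocate: load on a write miss
        line.value = value;
        line.dirty = true;           // memory is NOT updated yet (write-back)
    }

    static void ensureLoaded(long addr) {
        if (line.valid && line.tag == addr) return;    // hit
        if (line.valid && line.dirty) {
            memory.put(line.tag, line.value);          // write the old line back on eviction
        }
        line.tag = addr;
        line.value = memory.getOrDefault(addr, 0L);    // load from memory
        line.valid = true;
        line.dirty = false;
    }

    public static void main(String[] args) {
        write(0x40, 7);
        System.out.println(memory.containsKey(0x40L)); // false: not written back yet
        read(0x80);                                    // evicts the dirty line
        System.out.println(memory.get(0x40L));         // 7: written back on eviction
    }
}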

The MESI protocol:

Here is a page (MESI Interactive Animations) : www.scss.tcd.ie/Jeremy.Jone…

I recommend playing with the animation at that URL; it makes it easy to see how each CPU cache and main memory read and write data.

Here’s a quick walkthrough: we have a variable x = 0, and the processor has two cores, cpu0 and cpu1:

  • cpu0 reads x: cpu0 first looks for x in its own cache and misses. The address bus, which connects the CPUs and main memory, probes the other CPU’s cache and main memory at the same time and compares versions; x is read from main memory and its value is delivered to cpu0’s cache over the data bus.
  • cpu0 writes x+1: cpu0 takes x from its own cache and adds 1 (this does not update main memory and does not update cpu1’s cache).
  • cpu1 reads x: cpu1 misses in its own cache, so it probes cpu0’s cache and main memory over the address bus and compares versions; if the versions are the same, the value is taken from main memory first, otherwise the newer cached version is used – here cpu0 holds the newer x = 1.
  • cpu1 writes x+1: cpu1 takes x = 1 from its own cache and adds 1. (This updates main memory; it does not update cpu0’s cache, but the other CPUs are notified via an RFO request.)

You can try out the other cases yourself.

Notification protocol:

The snooping (“snoopy”) protocol. This protocol is essentially a bus-based notification technique: the CPU cache uses it to learn the state of data held in other caches. If a piece of data is shared, changes to its state can be broadcast to the other CPU caches. The protocol requires every CPU cache to be able to “snoop” on these data-event notifications and react accordingly.

The states of the MESI protocol:

Modified, Exclusive, Shared, Invalid.

Follow the animation. It’s not that complicated.
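For reference, the meaning of each state can be summarized in a few lines (a descriptive sketch, not a hardware model):

// The four MESI cache-line states, summarized with comments (illustrative only).
public enum MesiState {
    MODIFIED,  // this core's cache holds the only copy, which differs from main memory (dirty)
    EXCLUSIVE, // this core's cache holds the only copy, which matches main memory (clean)
    SHARED,    // the line may also be present in other caches; it matches main memory
    INVALID    // the line holds no valid data and must be re-fetched before use
}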

To expand:

MOESI: MOESI is a full cache coherence protocol that contains all the states commonly used in other protocols. In addition to the four MESI states, it has a fifth, “Owned” state, which represents data that is both modified and shared. This avoids having to write modified data back to main memory before sharing it; the data must still be written back eventually, but the write-back can be deferred.

MESIF: data in the Forward state is clean and can be discarded without notification.

AMD uses MOESI, Intel uses MESIF

I won’t go any further here

Let’s walk through some use cases.

Case 1:

public class CpuCache {

    static int LEN = 64 * 1024 * 1024;
    static int arr[] = new int[LEN]; // 64M ints

    public static void main(String[] args) {
        long currAddTwo = System.currentTimeMillis();
        addTwo();
        System.out.println(System.currentTimeMillis() - currAddTwo);
        long currAddEight = System.currentTimeMillis();
        addEight();
        System.out.println(System.currentTimeMillis() - currAddEight);
    }

    // Touch every 2nd element (stride 2)
    private static void addTwo() {
        for (int i = 0; i < LEN; i += 2) {
            arr[i] *= i;
        }
    }

    // Touch every 8th element (stride 8)
    private static void addEight() {
        for (int i = 0; i < LEN; i += 8) {
            arr[i] *= i;
        }
    }
}

Can you guess how the two printed times differ, and by what factor?

Looking only at iteration counts: if addTwo does 4n iterations, addEight does n.

But don’t forget that the CPU loads data in 64-byte cache lines: both loops touch every cache line of the array, so the stride-2 and stride-8 versions take about the same time. On my machine the two times are roughly equal:

48
36

False sharing:

Using Martin’s example, with a few modifications, the code looks like this:

public class FalseShare implements Runnable {
    public static int NUM_THREADS = 2; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;
    private static VolatileLong[] longs;

    public FalseShare(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        Thread.sleep(1000);
        System.out.println("starting....");
        if (args.length == 1) {
            NUM_THREADS = Integer.parseInt(args[0]);
        }

        longs = new VolatileLong[NUM_THREADS];
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new VolatileLong();
        }
        final long start = System.currentTimeMillis();
        runTest();
        System.out.println("duration = " + (System.currentTimeMillis() - start));
    }

    private static void runTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new FalseShare(i));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
            // System.out.println(t);
        }
    }

    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; //, p7, p8, p9;
    }
}

The logic of the code: by default, multiple threads each modify a different element of the same array. Each element is of type VolatileLong, which has a single long member value plus six unused long padding members. value is declared volatile so that changes to it are visible to all threads.

When I set the thread count to 4: with the padding line declaring 6 longs (p1–p6), the test runs in 13 s; with only 4 padding longs it runs in 9 s; and with the padding line commented out it runs in 24 s. Something about this result looks fishy.

Let’s go over the definition of false sharing:

In a Java program, array elements are stored contiguously and therefore cached together, and adjacent member variables of a Java object are likewise loaded into the same cache line. If multiple threads operate on different member variables that happen to sit on the same cache line, false sharing occurs.

The Disruptor project (github.com/LMAX-Exchan… ) illustrates this scenario:

A thread running on core 1 wants to update variable X, while a thread running on core 2 wants to update variable Y, but both of these frequently modified variables lie on the same cache line. The two threads take turns sending RFO messages to claim ownership of the cache line.

On the surface, X and Y are operated on by separate threads and the two operations have no relationship at all. They merely share a cache line, yet all of the contention comes from that sharing.

Back to the example above. Simply put, we have an array whose element objects carry a header of 8 bytes (32-bit JVM) or 12 bytes (64-bit JVM); adding 6 long fields = 48 bytes of padding pushes different objects onto different cache lines, so the cache line is no longer shared and no longer triggers a constant stream of RFO messages, reducing contention. So why did our measurement come out the other way round, with 6 padding longs taking longer than 4?

The reason is that my machine has only 2 cores. With the thread count set to 2, the 6-long version drops to 4 s, and commenting out the padding line gives 10 s.

So cache line padding lets each object occupy its own cache line as far as possible, reducing cache line synchronization between cores.
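Manual long padding is also fragile, because the JVM is free to lay out fields differently than declared. Since Java 8 there is an annotation-based alternative, sun.misc.Contended (moved to jdk.internal.vm.annotation.Contended in Java 9), which asks the JVM to pad the annotated field onto its own cache line; for application classes it only takes effect when the JVM is started with -XX:-RestrictContended. A minimal sketch, assuming Java 8:

import sun.misc.Contended;   // Java 8; jdk.internal.vm.annotation.Contended in Java 9+

public class ContendedDemo {

    public static class PaddedLong {
        // The JVM pads this field onto its own cache line, so two PaddedLong
        // instances that sit next to each other in an array no longer share one.
        @Contended
        public volatile long value = 0L;
    }

    public static void main(String[] args) throws InterruptedException {
        final PaddedLong[] longs = { new PaddedLong(), new PaddedLong() };

        long start = System.currentTimeMillis();
        Thread[] threads = new Thread[longs.length];
        for (int i = 0; i < threads.length; i++) {
            final int index = i;
            threads[i] = new Thread(() -> {
                for (long n = 500_000_000L; n != 0; n--) {
                    longs[index].value = n;   // each thread hammers its own element
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // Run with: java -XX:-RestrictContended ContendedDemo
        System.out.println("duration = " + (System.currentTimeMillis() - start));
    }
}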

False sharing in queues

In the JDK’s LinkedBlockingQueue there is a reference to the head of the queue and a reference to the tail. This queue is often used in asynchronous programming, and the two references are frequently modified by different threads, yet they are quite likely to sit on the same cache line, producing false sharing. The more threads and the more cores, the greater the negative impact on performance.

But false sharing should not be optimized away just for optimization’s sake. Grizzly ships a LinkedTransferQueue that differs from the one in JDK 7 in that it uses PaddedAtomicReference to improve concurrent performance. In fact this is a misguided coding technique and does not really make sense.

Netty once used PaddedAtomicReference in place of the plain Node, attacking the queue’s false sharing problem with padding, but this was later dropped.

The essence of AtomicReference and LinkedTransferQueue is optimistic locking, and optimistic locking performs poorly under heavy contention. Optimistic locks should be used in low-contention scenarios; tuning optimistic locks for highly contended scenarios is the wrong direction, because if heavy contention is expected, pessimistic locks should be used instead.

As for PaddedAtomicReference: if padding is really needed, why not just use a lock plus volatile? And if padding is not needed, PaddedAtomicReference has no advantage over a plain AtomicReference. Either way, using PaddedAtomicReference is a questionable coding technique.

So JDK 1.8 removed the padding-related logic from LinkedTransferQueue that was still present in 1.7.


With 50 threads competing for 10 objects, LinkedBlockingQueue was several times faster than LinkedTransferQueue on JDK 1.7, while on 1.8 the two are almost equally fast.

Finally, Disruptor

  • Ring array structure

To avoid garbage collection, it uses an array rather than a linked list. An array is also friendlier to the processor’s caching mechanism.

  • Element location

The array length is 2^n, so an element can be located with a single bit operation, which speeds up indexing (see the sketch below). Sequence numbers grow monotonically, and there is no need to worry about overflow: the index is a long, and even at 1 million operations per second it would take roughly 300,000 years to exhaust.

  • Lock-free design

Each producer or consumer thread first claims the position in the array that it may operate on, and then writes or reads data directly at that position.

Setting aside the ring structure of the array, here is how the lock-free design works: the whole process is made thread-safe with CAS on atomic sequence variables.
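Here is the sketch referred to above – a simplification of the idea, not the Disruptor’s actual source. Because the buffer length is a power of two, sequence & (length - 1) replaces the modulo, and a slot is claimed with a CAS on an atomic sequence counter (the hypothetical claimNext below ignores wrap-around checks against consumers):

import java.util.concurrent.atomic.AtomicLong;

public class RingIndexSketch {
    static final int BUFFER_SIZE = 1024;              // must be a power of two
    static final Object[] ring = new Object[BUFFER_SIZE];
    static final AtomicLong cursor = new AtomicLong(-1);

    // Claim the next sequence number with CAS instead of a lock.
    static long claimNext() {
        long current, next;
        do {
            current = cursor.get();
            next = current + 1;
            // (A real implementation would also check that consumers
            // have freed the slot before wrapping around.)
        } while (!cursor.compareAndSet(current, next));
        return next;
    }

    // sequence & (size - 1) equals sequence % size when size is a power of two,
    // but the AND is a single cheap instruction and never goes negative.
    static int indexOf(long sequence) {
        return (int) (sequence & (BUFFER_SIZE - 1));
    }

    public static void main(String[] args) {
        long seq = claimNext();
        ring[indexOf(seq)] = "event-" + seq;
        System.out.println("sequence " + seq + " -> slot " + indexOf(seq));
    }
}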

Consumer waiting strategy:

  • BlockingWaitStrategy: for scenarios where CPU resources are scarce and throughput and latency are not important
  • BusySpinWaitStrategy: spinning minimizes the system calls caused by thread switching and so reduces latency; recommended for scenarios where threads are pinned to fixed CPUs
  • PhasedBackoffWaitStrategy: spin + yield + a custom strategy; for scenarios where CPU resources are scarce and throughput and latency are not important
  • SleepingWaitStrategy: spin + yield + sleep; a good trade-off between performance and CPU usage, but latency is uneven
  • TimeoutBlockingWaitStrategy: lock with a timeout; for scenarios where CPU resources are scarce and throughput and latency are not important (log4j2 uses this strategy by default)
  • YieldingWaitStrategy: spin + yield + spin; a good trade-off between performance and CPU usage, with more even latency
Putting it together, here is a simple single-producer example:

import com.lmax.disruptor.*;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;

import java.util.concurrent.ThreadFactory;

public class DisruptorMain
{
    public static void main(String[] args) throws Exception
    {
        // Elements in the queue
        class Element {

            private String value;

            public String get() {
                return value;
            }

            public void set(String value) {
                this.value = value;
            }
        }

        // Thread factory used to create the disruptor's event-processing threads
        ThreadFactory threadFactory = new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                return new Thread(r, "simpleThread");
            }
        };

        // The event factory used to pre-fill the RingBuffer when it is initialized
        EventFactory<Element> factory = new EventFactory<Element>() {
            @Override
            public Element newInstance() {
                return new Element();
            }
        };

        // The handler that handles the Event
        EventHandler<Element> handler = new EventHandler<Element>() {
            @Override
            public void onEvent(Element element, long sequence, boolean endOfBatch)
            {
                System.out.println("Element: " + element.get());
            }
        };

        // Blocking wait strategy
        BlockingWaitStrategy strategy = new BlockingWaitStrategy();

        // Specify the size of the RingBuffer
        int bufferSize = 16;

        // Create the disruptor in single-producer mode
        Disruptor<Element> disruptor = new Disruptor<>(factory, bufferSize, threadFactory, ProducerType.SINGLE, strategy);

        // Set the EventHandler
        disruptor.handleEventsWith(handler);

        // Start the disruptor threads
        disruptor.start();

        RingBuffer<Element> ringBuffer = disruptor.getRingBuffer();

        for (int l = 0; true; l++)
        {
            // Get the index of the next available slot
            long sequence = ringBuffer.next();
            try
            {
                // Get the element in that slot
                Element event = ringBuffer.get(sequence);
                // Set the value of the element at this position
                event.set(l + "rs");
            }
            finally
            {
                System.out.println("push" + sequence);
                ringBuffer.publish(sequence);
            }
            Thread.sleep(10);
        }
    }
}

That’s it for the test cases – a deeper explanation will come in the next article, which will be about volatile. Let me catch my breath!

Corrections and discussion are very welcome!

Feel free to follow my public account and my Juejin page (search: Tingyu Notes), where my latest articles appear first!

Here is my personal website: ransongv587.com:996

Popular recommendations:

  • Add the two numbers together

  • ThreadPoolExecutor thread pool

  • Add JVM parameters to microservices as they prepare to go live

  • To get into a big tech company, you have to know the basics of CPU caching

Finally, I have put together a recent interview guide, the “Java Interview Customs Manual”, covering Java core technology, the JVM, Java concurrency, SSM, microservices, databases, data structures, and more. You can get it on GitHub at github.com/Tingyu-Note… More is on the way.