Preface

In the previous article on optimizing LuaJIT, we saw an order-of-magnitude difference in performance caused by CPU pipeline stalls. At the time, that conclusion was only inferred from machine-code analysis; this time we ran some small experiments to verify it.

Before we get started, some background. As described in < What does the CPU provide >, the CPU provides the ability to run machine instructions. So how does the CPU actually execute those instructions?

How does the CPU execute machine instructions

A machine instruction goes through several steps inside the CPU. Logically, it can be divided into five stages:

  1. Instruction fetch
  2. Instruction decode
  3. Execute
  4. Memory access
  5. Write back

Pipelining

Take four consecutive instructions A, B, C and D, for example. The CPU does not wait for A to finish completely before starting B. Instead, as soon as A's fetch stage is done and A moves on to decoding, the CPU starts fetching B, and so on, forming a pipeline.
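
Idealized, the overlap looks like this, with one stage per cycle (real pipelines have many more stages):

    cycle   1      2      3      4      5      6      7      8
    A       fetch  decode exec   mem    wb
    B              fetch  decode exec   mem    wb
    C                     fetch  decode exec   mem    wb
    D                            fetch  decode exec   mem    wb

Once the pipeline is full, one instruction completes every cycle, even though each individual instruction still takes five cycles.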

Ideally, we want to make full use of the CPU hardware, meaning every unit in the pipeline is busy all the time. In reality, however, for various reasons, the CPU cannot keep the pipeline completely full.

For example:

  1. Jump instructions: execution does not always proceed in a fixed order; a jump may transfer control to another instruction.

    Such a jump instruction, for example, might need to jump to the instruction at address 400553:

je    400553

For such branch instructions, the CPU uses branch prediction: it predicts the direction of the current branch based on previous outcomes, so as to minimize pipeline stalls.
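
A classic way to feel branch prediction from C is to time the same branchy loop over unsorted versus sorted data: sorted input makes the branch direction highly predictable. This is a minimal sketch; the exact timings, and whether the effect shows up at all, depend on your CPU and on whether the compiler turns the branch into a conditional move.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (64 * 1024)
#define ROUNDS 2000

static unsigned char data[N];

/* Sum the elements >= 128; the conditional branch is what the
   branch predictor has to guess. */
static long sum_big(void) {
    long s = 0;
    for (int i = 0; i < N; i++)
        if (data[i] >= 128)
            s += data[i];
    return s;
}

static int cmp(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

int main(void) {
    long s;
    clock_t t0;

    srand(0);
    for (int i = 0; i < N; i++)
        data[i] = (unsigned char)(rand() % 256);

    /* Random data: the branch goes either way unpredictably. */
    t0 = clock();
    s = 0;
    for (int r = 0; r < ROUNDS; r++)
        s += sum_big();
    printf("unsorted: %.3fs (s=%ld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, s);

    /* Sorted data: long runs of not-taken then taken, so the
       predictor is almost always right. */
    qsort(data, N, 1, cmp);
    t0 = clock();
    s = 0;
    for (int r = 0; r < ROUNDS; r++)
        s += sum_big();
    printf("sorted:   %.3fs (s=%ld)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, s);
    return 0;
}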

  2. Data dependence: a subsequent instruction depends on the result of a previous one.

    For example, in the following two instructions, the operand r8 of the second instruction depends on the result of the first.

mov r8, QWORD PTR [rdi]
add r8, 0x1

In such cases, the CPU uses operand forwarding to minimize blocking and waiting.
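
As a small register-only illustration, in the pair below the result of the first add can be forwarded from the ALU output straight into the second add, instead of waiting for it to pass through the write-back stage:

add r8, 0x1
add r8, 0x2    # uses the r8 value just produced by the previous add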

Multiple issue

Modern CPU hardware is complex; in fact, there is more than one pipeline. As a simplified logical model, you can assume there are multiple pipelines that can execute instructions at the same time, although the hardware does not simply duplicate every unit of the pipeline.

Multiple issue can be understood as concurrency at the CPU hardware level. If two instructions have no ordering dependency on each other, they can be executed concurrently; the CPU only needs to ensure that the final result of execution is as expected. In fact, many performance optimizations are based on this principle: optimize the execution process while keeping the final result the same.
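
This concurrency can be seen from plain C, without inline assembly. In the sketch below, summing an array with one accumulator forms a single serial dependency chain, while two accumulators form two independent chains that a multiple-issue CPU can advance in parallel. The speedup depends on your CPU; compile with -O1 so the compiler neither vectorizes nor reassociates the floating-point additions.

#include <stdio.h>
#include <time.h>

#define N      4096       /* small enough to stay in the L1 cache */
#define ROUNDS 200000

/* One accumulator: every addition depends on the previous result,
   forming a single serial dependency chain. */
static double sum1(const double *x) {
    double s = 0;
    for (int i = 0; i < N; i++)
        s += x[i];
    return s;
}

/* Two accumulators: the two chains do not depend on each other,
   so the CPU can advance both at the same time. */
static double sum2(const double *x) {
    double s0 = 0, s1 = 0;
    for (int i = 0; i < N; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    return s0 + s1;
}

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    clock_t t0 = clock();
    double a = 0;
    for (int r = 0; r < ROUNDS; r++)
        a += sum1(x);
    clock_t t1 = clock();

    double b = 0;
    for (int r = 0; r < ROUNDS; r++)
        b += sum2(x);
    clock_t t2 = clock();

    printf("one chain:  %.3fs (sum=%.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, a);
    printf("two chains: %.3fs (sum=%.0f)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, b);
    return 0;
}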

Hands-on practice

Theory needs to be combined with practice; hands-on experience makes the principles much clearer.

This time we built a few test cases with C inline assembly to observe the differences.

The benchmark case

#include <stdio.h>

void test(long *a, long *b, long *c, long *d) {
    /* Zero the four registers used by the experiments. */
    __asm__ (
        "mov r8, 0x0;"
        "mov r9, 0x0;"
        "mov r10, 0x0;"
        "mov r11, 0x0;"
    );

    /* Empty loop: 2^32 iterations. */
    for (long i = 0; i <= 0xffffffff; i++) {
    }

    /* Store the registers through the pointer arguments, which the
       SysV ABI passes in rdi, rsi, rdx and rcx. */
    __asm__ (
        "mov [rdi], r8;"
        "mov [rsi], r9;"
        "mov [rdx], r10;"
        "mov [rcx], r11;"
    );
}

int main(void) {
    long a = 0;
    long b = 0;
    long c = 0;
    long d = 0;

    test(&a, &b, &c, &d);

    printf("a = %ldn", a);
    printf("b = %ldn", b);
    printf("c = %ldn", c);
    printf("d = %ldn", d);

    return 0;
}

We build and run it with the following commands; the run takes only 1.38 seconds. Note that -O1 is required, because the benchmark code itself is too expensive under -O0.

$ gcc -masm=intel -std=c99 -o asm -O1 -g3 asm.c
$ time ./asm
a = 0
b = 0
c = 0
d = 0

real    0m1.380s
user    0m1.379s
sys     0m0.001s

The code above mainly builds an empty for loop, which you can confirm in the assembly. Below is the disassembly of the test function, showing that the empty for loop was not optimized away by the compiler.
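
You can reproduce the disassembly with objdump in Intel syntax (assuming the binary is named asm, as above):

$ objdump -d -M intel ./asm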

000000000040052d <test>:
  40052d:  49 c7 c0 00 00 00 00    mov    r8,0x0
  400534:  49 c7 c1 00 00 00 00    mov    r9,0x0
  40053b:  49 c7 c2 00 00 00 00    mov    r10,0x0
  400542:  49 c7 c3 00 00 00 00    mov    r11,0x0
  400549:  48 b8 00 00 00 00 01    movabs rax,0x100000000
  400550:  00 00 00
  400553:  48 83 e8 01             sub    rax,0x1
  400557:  75 fa                   jne    400553 <test+0x26>
  400559:  4c 89 07                mov    QWORD PTR [rdi],r8
  40055c:  4c 89 0e                mov    QWORD PTR [rsi],r9
  40055f:  4c 89 12                mov    QWORD PTR [rdx],r10
  400562:  4c 89 19                mov    QWORD PTR [rcx],r11
  400565:  c3                      ret

Add two simple instructions

This time, we add an “add one” and a “write to memory” instruction inside the for loop:

    for (long i = 0; i <= 0xffffffff; i++) {
        __asm__ (
            "add r8, 0x1;"
            "mov [rdi], r8;"
        );
    }

The execution time is basically the same as the benchmark. This indicates that the two newly added instructions were executed “concurrently” with the empty for loop used by the benchmark, and therefore did not increase the execution time.

$ time ./asm
a = 4294967296
b = 0
c = 0
d = 0

real    0m1.381s
user    0m1.381s
sys     0m0.000s

Add a memory read

This is the pattern we ran into in the previous article when optimizing LuaJIT. The newly added memory read targets the same address as the existing memory write, forming a data dependency across loop iterations:

    for (long i = 0; i <= 0xffffffff; i++) {
        __asm__ (
            "mov r8, [rdi];"
            "add r8, 0x1;"
            "mov [rdi], r8;"
        );
    }

Looking at the execution time, this run is obviously much slower. Each iteration must load the value that the previous iteration just stored, so the loop body can no longer overlap with itself. Yes, the effect of pipeline stalls really is that dramatic 😅

$ time ./asm
a = 4294967296
b = 0
c = 0
d = 0

real    0m8.723s
user    0m8.720s
sys     0m0.003s

More instructions like this

This time we add three more similar groups of instructions, each forming the same kind of data dependency:

    for (long i = 0; i <= 0xffffffff; i++) {
        __asm__ (
            "mov r8, [rdi];"
            "add r8, 0x1;"
            "mov [rdi], r8;"
            "mov r9, [rsi];"
            "add r9, 0x1;"
            "mov [rsi], r9;"
            "mov r10, [rdx];"
            "add r10, 0x1;"
            "mov [rdx], r10;"
            "mov r11, [rcx];"
            "add r11, 0x1;"
            "mov [rcx], r11;"
        );
    }

Looking at the execution time, there is almost no difference from the previous case. Because the CPU executes instructions out of order, it does not just sit there and wait: the four dependency chains are independent of one another, so they make progress concurrently. Seen from another angle, pipeline stalls are not so terrible; while one instruction is blocked, the CPU can still get other work done.

$ gcc -masm=intel -std=c99 -o asm -O1 -g3 asm.c
$ time ./asm
a = 4294967296
b = 4294967296
c = 4294967296
d = 4294967296

real    0m8.675s
user    0m8.674s
sys     0m0.001s
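
To observe this at the hardware level, perf on Linux reports instructions per cycle (IPC) among other counters; the dependent versions should show a markedly lower IPC than the benchmark:

$ perf stat ./asm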

Conclusion

The billions of transistors in a modern CPU are not there for show; they implement very complex circuitry. Many optimization techniques that are common at the software level are also widely used inside CPU hardware.

Pipelining and multiple issue are, in my opinion, concepts that software engineers should understand deeply. They matter not only for understanding how machine instructions execute on the CPU, but also as typical patterns in system design. Recently I have been learning about distributed transactions, and the principles are in fact very similar to those of the CPU hardware.

It’s good to get your hands dirty.

Other points of note:

  1. The CPU’s L1 cache is split: instructions and data are cached separately, which makes cache misses less likely.
