Series

  1. iOS Assembly Tutorial (1): ARM64 Assembly Basics
  2. iOS Assembly Tutorial (2): Embedding Assembly Code in Xcode Projects
  3. iOS Assembly Tutorial (3): Assembly Sections and Data Access
  4. iOS Assembly Tutorial (4): Quickly Analyzing System Function Implementations with LLDB Dynamic Debugging
  5. iOS Assembly Tutorial (5): Objc Block Memory Layout and Assembly Representation

Preface

Machines with the ARM architecture have a relatively weak memory model: such CPUs have considerable freedom to reorder read and write instructions. To guarantee that a specific execution order produces the intended result, developers need to insert appropriate memory barriers into the code to keep instruction reordering from breaking the program's logic [1].

This article looks at the causes and side effects of CPU reordering, uses an experiment to demonstrate the impact of reordering on code logic, and then introduces a solution based on memory barriers, along with some points to note in iOS development.

Instruction reordering

Introduction

When a CPU with the ARM architecture executes a write instruction, if it does not hold exclusive ownership of the target cache line, it must negotiate with the other cores under the cache coherency protocol and wait until ownership is obtained before the instruction can complete. Likewise, a multiply instruction may have to wait while the multiplier is busy. In these cases, to speed up execution, the CPU will preferentially execute later instructions that have no dependency on the stalled ones.

A case in point

Take a look at the following simple program:

; void acc(int *counter, int *flag);
_acc:
ldr x8, [x0]
add x8, x8, #1
str x8, [x0]
ldr x9, [x1]
mov x9, #1
str x9, [x1]
ret

This code increments counter by 1 and sets flag to 1. Following the source order, the CPU reads the counter value (at x0) from memory, increments it, and writes it back; it then sets the flag value (at x1) to 1 and writes it back.
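For reference, a plain C function that compiles to roughly the assembly above might look like the following sketch (the name `acc` comes from the comment in the listing; the body is my reconstruction, not from the original article):

```c
/* A C equivalent of the assembly listing above: increment *counter,
 * then set *flag to 1. Nothing in this source order forces the CPU
 * to perform the two stores in this order at run time. */
void acc(int *counter, int *flag) {
    *counter += 1;   /* ldr x8, [x0]; add x8, x8, #1; str x8, [x0] */
    *flag = 1;       /* mov x9, #1; str x9, [x1] */
}
```

On a single thread the result is of course deterministic; the ordering question only matters when another core observes `*flag` and `*counter` concurrently.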

However, if the memory at x0 misses the cache, the CPU must wait for the cache line to load; or, when writing back, it may not yet hold exclusive ownership of the cache line and must wait in order to preserve multi-core cache coherency. Meanwhile, if the memory at x1 is already in a cache line, ldr x9, [x1] can be executed early. And since the operations on x9 and on the memory at x1 do not depend on the operations involving x8 and x0, the subsequent instructions can also run early. The CPU's out-of-order execution might therefore look like this:

ldr x9, [x1]
mov x9, #1
str x9, [x1]
ldr x8, [x0]
add x8, x8, #1
str x8, [x0]

Or, if both writes need to wait, both stores may be delayed:

ldr x9, [x1]
mov x9, #1
ldr x8, [x0]
add x8, x8, #1
str x9, [x1]
str x8, [x0]

Or, if the adder is busy, an entirely different execution order may result. Of course, reordered instructions can never depend on each other's results.

Side effects

Instruction reordering greatly improves CPU execution speed, but everything has two sides. Although reordering at the CPU level preserves the correctness of each computation, it can cause errors at the program-logic level. A common example is a spinlock-style wait: a bool flag is set at the end of an asynchronous task, and another thread spins on that flag to learn that the task has completed. If the store to the flag is reordered into the middle of the task's statements, the waiting thread can observe the flag before the results are ready, producing a logic error. The experiment below illustrates this side effect.

An experiment

In the following code we set up two threads: one performs the computation and sets the flags when it finishes; the other spins, waiting for the flags, and then reads the result.

We first define a structure that holds the results of the computation:

typedef struct FlagsCalculate {
    int a;
    int b;
    int c;
    int d;
    int e;
    int f;
    int g;
} FlagsCalculate;

To reproduce reordering errors quickly, we use multiple flag bits, stored in the members e, f, and g of the structure, with a, b, c, and d holding the computation results:

int getCalculated(FlagsCalculate *ctx) {
    while (ctx->e == 0 || ctx->f == 0 || ctx->g == 0);
    return ctx->a + ctx->b + ctx->c + ctx->d;
}

To trigger cache misses more readily, we use several global variables; to simulate a busy adder and multiplier, we use intensive arithmetic:

int mulA = 15;
int mulB = 35;
int divC = 2;
int addD = 20;

void calculate(FlagsCalculate *ctx) {
    ctx->a = (20 * mulA - mulB) / divC;
    ctx->b = 30 + addD;
    for (NSInteger i = 0; i < 10000; i++) {
        ctx->a += i * mulA - mulB;
        ctx->a *= divC;
        ctx->b += i * mulB / mulA - mulB;
        ctx->b /= divC;
    }
    ctx->c = mulA + mulB * divC + 120;
    ctx->d = addD + mulA + mulB + 5;
    ctx->e = 1;
    ctx->f = 1;
    ctx->g = 1;
}

Next we wrap them in pthread thread functions:

void* getValueThread(void *arg) {
    pthread_setname_np("getValueThread");
    FlagsCalculate *ctx = (FlagsCalculate *)arg;
    int val = getCalculated(ctx);
    assert(val == -276387);
    return NULL;
}

void* calValueThread(void *arg) {
    pthread_setname_np("calValueThread");
    FlagsCalculate *ctx = (FlagsCalculate *)arg;
    calculate(ctx);
    return NULL;
}

void newTest(void) {
    FlagsCalculate *ctx = (FlagsCalculate *)calloc(1, sizeof(struct FlagsCalculate));
    pthread_t get_t, cal_t;
    pthread_create(&get_t, NULL, getValueThread, (void *)ctx);
    pthread_create(&cal_t, NULL, calValueThread, (void *)ctx);
    pthread_detach(get_t);
    pthread_detach(cal_t);
}

Each call to newTest starts a new round of the experiment; if the flag stores are not executed out of order, the final result is always -276387. By running the experiment repeatedly and concurrently over a short period and watching for assertion failures, we can determine whether reordering has broken the logic:

while (YES) {
    newTest();
}

I added the above code to an empty iOS project and ran it on an iPhone XS Max. After about 10 minutes, an assertion failure was hit.

Clearly, the flags were all set early due to out-of-order execution, so the waiting thread read incorrect results. The theory above is thus confirmed by experiment.

The answer

You might break into a cold sweat and start recalling similar logic you have written over your career, much of it running in production without ever going wrong. Why is that?

In iOS development we usually use GCD as the multithreading framework. Such a high-level threading model already provides the necessary memory barriers to guarantee instruction ordering, so we can safely write logic like the above without worrying about reordering. That is also why pthreads were used for the experiment above.

If you are working with a lower-level threading model, however, you do need to watch for the side effects of instruction reordering. Below we show how to use memory barriers to keep reordering from affecting program logic.

Memory barriers

Introduction

A memory barrier is an instruction that explicitly guarantees that all memory operations before the barrier have completed (become visible) before any memory operation after the barrier executes; it does not constrain the order of other, non-memory instructions [3].

Therefore, we only need to place a memory barrier before setting the flags to ensure that all computation results have been written to memory first, which restores the correctness of the logic.

Placing memory barriers

We can insert a memory barrier in the form of inline assembly:

void calculate(FlagsCalculate *ctx) {
    ctx->a = (20 * mulA - mulB) / divC;
    ctx->b = 30 + addD;
    for (NSInteger i = 0; i < 10000; i++) {
        ctx->a += i * mulA - mulB;
        ctx->a *= divC;
        ctx->b += i * mulB / mulA - mulB;
        ctx->b /= divC;
    }
    ctx->c = mulA + mulB * divC + 120;
    ctx->d = addD + mulA + mulB + 5;
    __asm__ __volatile__("dmb sy" ::: "memory");
    ctx->e = 1;
    ctx->f = 1;
    ctx->g = 1;
}

Re-running the previous experiment, the assertion no longer fails: the memory barrier confines the effect of out-of-order CPU execution so that it cannot break the program's logic.
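As a side note (not from the original article): on toolchains that support C11, the same publish/observe pattern can be written portably with `<stdatomic.h>`, letting the compiler emit the appropriate barrier or ordered store for the target architecture. A hedged sketch, with the struct and function names invented for illustration:

```c
#include <stdatomic.h>

typedef struct {
    int a, b, c, d;        /* computation results */
    atomic_int e, f, g;    /* flags published with release stores */
} AtomicFlagsCalculate;

/* Publish results: the release store to each flag guarantees that all
 * earlier writes to a..d are visible before that flag reads as 1. */
void publish(AtomicFlagsCalculate *ctx) {
    ctx->a = 1; ctx->b = 2; ctx->c = 3; ctx->d = 4;  /* placeholder work */
    atomic_store_explicit(&ctx->e, 1, memory_order_release);
    atomic_store_explicit(&ctx->f, 1, memory_order_release);
    atomic_store_explicit(&ctx->g, 1, memory_order_release);
}

/* Consume: acquire loads pair with the release stores above, so the
 * reads of a..d below cannot observe stale values once the flags are set. */
int consume(AtomicFlagsCalculate *ctx) {
    while (atomic_load_explicit(&ctx->e, memory_order_acquire) == 0 ||
           atomic_load_explicit(&ctx->f, memory_order_acquire) == 0 ||
           atomic_load_explicit(&ctx->g, memory_order_acquire) == 0)
        ;
    return ctx->a + ctx->b + ctx->c + ctx->d;
}
```

Compared with a raw `dmb sy`, acquire/release atomics express the intent at the language level and often compile to cheaper ordered load/store instructions on ARM64.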

Volatile and memory barriers

We often hear that volatile is a memory barrier. If volatile really were equivalent to dmb, then qualifying the three flags with it should fix the experiment:

typedef struct FlagsCalculate {
    int a;
    int b;
    int c;
    int d;
    volatile int e;
    volatile int f;
    volatile int g;
} FlagsCalculate;

The result: the assertion failure is eventually raised anyway. Why? In C, volatile is only a compiler-level memory barrier; it guarantees that the compiler neither optimizes away nor reorders accesses to volatile-qualified variables, but it does not stop the CPU from reordering them at run time. In Java, by contrast, volatile also acts as a CPU-level memory barrier [4]. The keyword behaves differently in different environments, which is why it causes so much confusion.
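What C's volatile does guarantee is that every access actually reaches memory, so the compiler cannot cache the flag in a register and turn a spin-wait into an infinite loop under optimization. A minimal sketch of that compiler-level effect (the function names here are hypothetical, not from the article; the CPU-level reordering caveat from the experiment still applies):

```c
#include <pthread.h>

static volatile int done = 0;  /* volatile: compiler must reload on every read */
static int result = 0;

static void *worker(void *arg) {
    (void)arg;
    result = 42;  /* NOTE: on ARM, without a barrier the CPU may still
                     reorder this store after the store to `done` */
    done = 1;
    return NULL;
}

int wait_for_result(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    while (!done)
        ;  /* volatile forces a fresh read of `done` each iteration,
              so this loop terminates once the store becomes visible */
    pthread_join(t, NULL);  /* join fully synchronizes, so reading
                               `result` after it is always safe */
    return result;
}
```

Without the volatile qualifier, an optimizing compiler is entitled to hoist the read of `done` out of the loop and spin forever; with it, the loop works, yet the flag still provides no ordering guarantee for `result` on a weakly ordered CPU.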

In C, volatile is also used to ensure that inline assembly is neither optimized away nor moved. For example, when we place a compiler-level memory barrier with inline assembly, the __volatile__ qualifier on the assembly block guarantees that the compiler will not relocate the barrier:

__asm__ __volatile__("": : :"memory");

Conclusion

By now you should have a clearer understanding of instruction reordering and memory barriers, as well as the role of volatile. I hope this article has been helpful. Feel free to follow my public account, where I will continue publishing articles in this iOS internals series.

References

  1. Getting started with Cache Coherency
  2. CPU Reordering — What is actually being reordered?
  3. ARM Information Center – DMB, DSB, and ISB
  4. Summary of volatile and memory barriers