Hello, my friends.

Today we are going to talk about a core topic. This article takes about 15 minutes to read; read it carefully and you'll come away with something. Let's go!

Here’s what you’ll learn:

  • An interesting problem from StackOverflow

  • CPU pipelining mechanism and internal data flow

  • The three hazards of CPU pipelining

  • CPU branch prediction revealed

Interesting question

I found an interesting question on StackOverflow the other day while slacking off at work:

Stackoverflow.com/questions/1…

The asker had written a simple array-summation function in C++ and found that summing the array sorted versus unsorted differed in performance by a factor of about 6, which seemed very strange.

Let’s look at the code:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}

The code is fairly simple: first build a large array and fill it with random values taken modulo 256, so every element falls between 0 and 255; then, inside a loop, use an if condition to sum the elements that are >= 128.

To avoid noise from a single measurement, the author repeated the summation loop 100,000 times and found that the running times differ dramatically:

  • Summing the unsorted array takes 11.54 seconds in total

  • Summing the sorted array takes 1.93 seconds in total

Yes, calling std::sort() first is extra work, yet the sorted version still performs far better.

The asker then tried the same thing in Java; the numbers weren't identical to C++, but the effect was almost exactly the same.

So what is going on? I read through the answers under the post, and ah, so that's the principle behind it!

If you're still reading, you must be a technical person. Congratulations, you're already better than you were yesterday. Come on, let's get to the bottom of it.

The story of the car wash

Recently I drove my Jetta to the car wash. There were so many cars waiting in line.

I found the car wash process to be like this: spray, lather, scrub, wipe and blow-dry.

Cars lined up outside, including Audi A6L, BMW X5, Mercedes C200L and Jetta VS5.

When a car finishes one step, it moves on to the next step, and the step it just left takes in the next car.

I had assumed one car would go in, finish every step, and come out before the next car started, but the car wash actually runs as an assembly line.

Why an assembly line? To wash more cars per unit time, i.e., higher throughput, and earn more money!

If each car had to finish all the steps before the next one entered, only one of the five steps would be busy at any moment while the other four sat idle; the workers would be slacking off, customers would wait forever, and the shop would make no money.

There really is a lot of wisdom in everyday life. Assembly lines, YYDS!

Now, what does this have to do with the array summation? Don’t worry, it does matter.

The inner workings of the CPU

Let’s take a look at the internal structure of the CPU from a macro perspective:

From the two graphs, we can get the following information:

  • The core components of a CPU include various registers, the control unit (CU), the arithmetic logic unit (ALU), and the cache

  • There are three kinds of bus: address bus, data bus, and control bus

  • I/O devices and RAM interact with the CPU through the three buses

A C++ program is compiled into machine code for execution; that is, the program is translated into instructions one by one. To simplify the discussion, we'll use a CPU with a classic 5-stage pipeline:

  • Instruction Fetch (IF): fetch an instruction from main memory into the instruction register.

  • Instruction Decode (ID): immediately after the fetch, the instruction decoder splits and interprets the fetched instruction according to the predefined instruction format, identifying the instruction category and how its operands are to be obtained.

  • Execute (EX): after fetch and decode comes execution, whose task is to carry out the operation the instruction specifies and realize its function. To do this, the relevant parts of the CPU are connected together to perform the required operation.

  • Memory access (MEM): depending on the instruction, main memory may need to be accessed to read operands; the operand's address in main memory is obtained from the instruction's address code, and the operand is read out for the operation.

  • Writeback (WB): the final stage, which writes the result produced by the execute stage back to some form of storage.

IF, ID, EX, MEM, and WB form the CPU's 5-stage pipeline, much like the car wash line:

Yes, the process of processing instructions inside a CPU is very similar to a car wash. Let’s dig deeper!

Summary: CPU pipelining is a technique that speeds up a program by dividing each instruction into multiple steps and overlapping the steps of different instructions. Each step has its own independent circuitry; when one step finishes, the instruction moves on to the next step while the step it just left starts processing the following instruction. This is concurrency at the level of the CPU's hardware circuits.
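To make the gain concrete, here is a tiny back-of-the-envelope sketch (my own illustration, not from the original post): with a 5-stage pipeline, n instructions ideally need about 5 + n - 1 cycles instead of 5 * n, because the stages overlap.

#include <iostream>

int main() {
    const long long stages = 5;    // IF, ID, EX, MEM, WB
    const long long n = 1000;      // number of instructions

    long long sequential = stages * n;      // each instruction finishes all 5 stages before the next starts
    long long pipelined  = stages + n - 1;  // ideal pipeline: one instruction completes per cycle once full

    std::cout << "sequential: " << sequential << " cycles\n"; // 5000
    std::cout << "pipelined:  " << pipelined  << " cycles\n"; // 1004
}

In other words, once the pipeline is full it retires roughly one instruction per cycle, which is exactly the throughput win the car wash analogy describes.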

CPU pipeline throughput and latency

Let’s take a look at the changes in throughput after the introduction of pipelining:

Without pipelining, the whole execution path is one block of combinational logic, and a register is written after it finishes. The total time is 300 ps of combinational logic plus 20 ps of register write.

Throughput in this mode is 1/(300+20) ps ≈ 3.125 GIPS (billions of instructions per second).

With pipelining, the combinational logic is split into three parts, but each part now needs its own register write, which raises the latency of a single instruction from 320 ps to 360 ps.

Splitting into three parts adds two more register writes, i.e., an extra 40 ps.

The throughput now becomes 1/(100+20) ps ≈ 8.33 GIPS, about 2.67 times the throughput without pipelining.
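These figures are easy to reproduce with a few lines of arithmetic. The sketch below simply recomputes the article's own numbers (300 ps of combinational logic, 20 ps per register write, logic split into three 100 ps pieces); it is an illustration, not a benchmark.

#include <iostream>

int main() {
    const double logic_ps = 300.0, reg_ps = 20.0;

    // Throughput in instructions per picosecond; 1 instruction/ps = 1000 GIPS.
    double unpipelined = 1.0 / (logic_ps + reg_ps);        // one 320 ps step per instruction
    double pipelined   = 1.0 / (logic_ps / 3.0 + reg_ps);  // limited by the slowest 120 ps stage

    std::cout << "unpipelined: " << unpipelined * 1000 << " GIPS\n";  // ~3.125
    std::cout << "pipelined:   " << pipelined   * 1000 << " GIPS\n";  // ~8.33
    std::cout << "speedup:     " << pipelined / unpipelined << "x\n"; // ~2.67
}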

From this comparison, adding a bit of hardware and a bit of latency bought a large increase in throughput. But adding hardware is not a cure-all: the overhead of the extra register writes is plain to see.

The number of pipeline stages is also called the pipeline depth; the Intel Core i7, for example, uses a 16-stage pipeline. Increasing pipeline depth within a certain range can improve CPU throughput, but it also makes hardware design much harder and can even end up reducing throughput.

CPU pipeline hazards

Using pipelining to increase CPU throughput is a double-edged sword: while we raise throughput, we also take on hazards.

Hazards are the bumps and potholes in the road, the places where the pipeline stops flowing smoothly.

Structural hazards, data hazards, and control hazards are the three major hazards that pipeline design has to address.

Structural hazards

In essence, a structural hazard is a hardware conflict. Take the 5-stage pipeline as an example: the IF stage fetching an instruction and the MEM stage both need to access memory, but memory has only one address decoder and can serve only one read per clock cycle.

In other words, it's as if two steps of the car wash both needed the one and only water hose at the same moment: there's a conflict, and only one of them can spray or scrub.

One solution to the conflict between the MEM phase and the IF phase is to divide the memory into two parts: the memory that holds the instructions and the memory that holds the data, and give them their own address decoders, thus increasing the hardware resources to resolve the conflict.

Yes, storing instructions and data separately is the famous Harvard architecture, while storing instructions and data together is the von Neumann architecture (also known as the Princeton architecture).

Both architectures have their own advantages and disadvantages. Modern CPUs adopt a hybrid approach and introduce caches to reduce the speed mismatch between the CPU and memory, as shown in the figure below:

This hybrid structure resolves the pipeline's structural hazard nicely.

Data hazards

A data hazard arises when instructions have data dependencies, as in this code:

int a = 10;      // statement 1
int b = a + 10;  // statement 2
int c = b + a;   // statement 3

Statement 3 depends on the value of b, and b is computed by statement 2, so statement 3 depends on statement 2. Since each statement is translated into several instructions, some of those instructions end up depending on one another.

For example, instruction 3-3 has to wait for instruction 2-2 to finish its WB stage before its own EX stage can proceed; otherwise the result would be wrong.

One solution is to insert NOP operations: clock cycles in which nothing is done, until the data being depended on is ready. Resolving data hazards by pausing the pipeline this way is called pipeline bubbling (a pipeline bubble).

Pipeline bubbling is simple, but it costs efficiency. In practice it turns out that the result of the earlier instruction can be passed straight to the ALU for the next instruction, so the next instruction can proceed normally without inserting NOPs.

This technique of passing results along directly is called operand forwarding (or bypassing), and it is used together with pipeline bubbles, because forwarding alone cannot eliminate every NOP.

Summary: operand forwarding adds a bypass at the hardware level so that the result computed by one instruction can be fed directly to the next instruction, removing the unnecessary round trip of instruction 1 writing the register back and instruction 2 then reading that register.

The ADD instruction no longer needs to wait for the LOAD's WB to finish before executing its EX stage; the LOAD's result is forwarded straight to the ADD, removing one NOP so that only a single NOP is still needed, which improves pipeline efficiency.
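Since the article's figure isn't reproduced here, the following is my own toy timeline for a LOAD followed by a dependent ADD in the 5-stage pipeline; the cycle numbers are assumptions chosen to match the description above, and they only illustrate why forwarding removes one of the two bubbles.

#include <iostream>

int main() {
    // Cycle numbers assuming the LOAD is issued at cycle 1:
    //   LOAD: IF=1  ID=2  EX=3  MEM=4  WB=5   (its value is available at the end of MEM)
    //   ADD:  IF=2  ID=3  EX=?               (its EX stage needs the value LOAD produces)
    const int add_ex_ideal      = 4;  // back-to-back issue with no dependency
    const int add_ex_no_forward = 6;  // without forwarding: wait until LOAD's WB (cycle 5) has finished
    const int add_ex_forward    = 5;  // with a MEM->EX bypass: usable one cycle after LOAD's MEM

    std::cout << "bubbles without forwarding: " << add_ex_no_forward - add_ex_ideal << '\n'; // 2 NOPs
    std::cout << "bubbles with forwarding:    " << add_ex_forward    - add_ex_ideal << '\n'; // 1 NOP
}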

Control hazards

In a pipeline, multiple instructions execute in parallel. While instruction 1 is executing, the following instructions 2 and 3 may already have finished their IF and ID stages and be waiting to execute. If instruction 1 suddenly jumps somewhere else, the IF and ID work done for instructions 2 and 3 becomes useless.

When such a branch happens, the processor first has to flush the pipeline slots holding instructions 2 and 3, then jump to instruction 1's new target and start filling the pipeline again; this cost is called the branch penalty.

Branch instructions are inherently at odds with how a pipeline works: a branch changes the flow of instructions, while the pipeline wants to keep fetching instructions in order so it stays full. Yet branches are very common in real programs, so this is a problem every pipelined CPU must face.

Branch instructions fall into two kinds: unconditional branches and conditional branches.

An unconditional branch is always taken, and its target address is available as early as the instruction fetch stage. By designing the corresponding bypass circuits in the CPU, the computed target can be fed back into the pipeline earlier; this hardware approach is called shortening the branch delay.

For a conditional branch, however, the target cannot be obtained in the IF stage; it is only known in the EX stage, and this is what leads to branch prediction.

In other words, the earlier instruction has not made it through the pipeline yet, but which instruction comes next depends on its result. For efficiency, the pipeline cannot just stop: it has to make a choice, go left or go right, and pick the next instruction to execute.

CPU branch prediction

Branch prediction includes static branch prediction and dynamic branch prediction.

Static branch prediction picks the same outcome every time, like flipping a coin and always guessing heads. For the CPU pipeline, always guessing that the branch is not taken gives roughly 50% accuracy. This kind of prediction is simple but not very effective.

Dynamic branch prediction uses the previous outcomes and how accurate past predictions were to decide whether to predict the sequential (fall-through) path or the jump. The prediction can, of course, still succeed or fail.

For example, when predicting a jump:

  • When the prediction is correct, the branch target address should be found as soon as possible to avoid stalling the pipeline because of the control dependency.

  • When the prediction is wrong, the instructions that were prefetched and decoded are discarded, the state is restored, and instructions are fetched again from the other branch path.

The simplest dynamic branch predictors use 1 bit or 2 bits of state; in the 2-bit version, the two bits together record the current prediction direction and how strongly the predictor holds it. Many articles skim over this part, show a state-transition diagram, and wrap up in a hurry:

To be honest, it seemed like I got it, and also like I didn't, so I decided to take a closer look at how the 2-bit branch predictor actually works. Let's move on:

  • The two possible decisions are "not taken", meaning the branch falls through to the sequential path, and "taken", meaning the branch jumps to its target

  • The four states are: 00 (strongly not taken), 01 (weakly not taken), 10 (weakly taken), 11 (strongly taken)

Let's walk through how the 2-bit dynamic branch predictor's state machine transitions:

  • When the current state is 00 (strongly not taken), the next prediction must be "not taken". There are two possible outcomes: if the prediction succeeds, the state stays at 00; if it fails, the state moves to 01.

  • When the current state is 01 (weakly not taken), the next prediction is still "not taken". If the prediction succeeds, the state moves to 00; if it fails, the state moves to 10.

  • When the current state is 10 (weakly taken), the next prediction is "taken". If the prediction succeeds, the state moves to 11; if it fails, the state moves to 01.

  • When the current state is 11 (strongly taken), the next prediction must be "taken". If the prediction succeeds, the state stays at 11; if it fails, meaning the program actually took the sequential path, the state moves to 10.

If you've stuck with it this far, you're a real learner. Let's draw the complete state-transition diagram:

As the diagram shows, switching the prediction from the sequential branch to the jump branch requires two consecutive mispredictions, and switching from the jump branch back to the sequential branch likewise requires two consecutive mispredictions:

The small piece of storage that records the branch state and history is called the BTB (branch target buffer). It is very small and holds only the branch instruction address, the predicted target address, and the prediction bits. This part is fairly involved, so we won't expand on it here; we'll deal with it another time.

From the analysis above, the dynamic branch predictor has a certain fault tolerance and doesn't flip-flop easily: only two consecutive mispredictions will change its choice, so overall accuracy improves significantly.
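To pin the state machine down, here is a minimal sketch of a 2-bit saturating-counter predictor in C++; it is my own illustration of the four states described above, not code taken from any real CPU or from the article.

#include <cstdint>

struct TwoBitPredictor {
    // 0 = 00 strongly not taken, 1 = 01 weakly not taken,
    // 2 = 10 weakly taken,       3 = 11 strongly taken
    std::uint8_t state = 0;

    bool predict() const { return state >= 2; }  // states 10 and 11 predict "taken"

    void update(bool taken) {                    // nudge the counter one step toward the actual outcome
        if (taken)  { if (state < 3) ++state; }
        else        { if (state > 0) --state; }
    }
};

Because the counter saturates, a prediction starting from a strong state only flips after two mispredictions in a row, which is exactly the fault tolerance just described.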

Data from some articles show that the accuracy of the 2-bit predictor can reach more than 95% in most cases:

Revisiting the question

Back to the StackOverflow question about sorted versus unsorted timing, there are two key factors:

  • The array elements are completely random, so each branch outcome is independent of the previous ones

  • There is a loop with a conditional test inside it

Yes, random data + a loop + a condition hits the CPU pipeline right in its soft spot.

  • Branch prediction on the sorted array

  • Branch prediction on the unsorted array

When the array is sorted, dynamic branch prediction can combine the previous outcomes to make its call with very high accuracy. When the array is unsorted, the data is completely random, and dynamic prediction does about as well as static prediction, so the accuracy is not high.
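As a rough sanity check (again my own sketch, reusing the 2-bit predictor from the previous section), we can feed the predictor the outcomes of the loop's data[c] >= 128 test for an unsorted and then a sorted array and compare the hit rates:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

struct TwoBitPredictor {
    std::uint8_t state = 0;                       // 00..11, as described earlier
    bool predict() const { return state >= 2; }
    void update(bool taken) {
        if (taken)  { if (state < 3) ++state; }
        else        { if (state > 0) --state; }
    }
};

double hitRate(const std::vector<int>& data) {
    TwoBitPredictor p;
    long long hits = 0;
    for (int v : data) {
        bool taken = (v >= 128);                  // the branch inside the summation loop
        if (p.predict() == taken) ++hits;
        p.update(taken);
    }
    return static_cast<double>(hits) / data.size();
}

int main() {
    std::vector<int> data(32768);
    for (int& v : data) v = std::rand() % 256;

    std::cout << "unsorted hit rate: " << hitRate(data) << '\n';  // hovers around 0.5
    std::sort(data.begin(), data.end());
    std::cout << "sorted hit rate:   " << hitRate(data) << '\n';  // very close to 1.0
}

The unsorted run mispredicts roughly half the time, while the sorted run mispredicts only around the single 128 boundary, which is the whole story behind the 6x timing gap.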

A branch misprediction means the pipeline has to be flushed: the instructions already fetched and decoded (IF, ID) are discarded and the correct instructions are fetched instead, a process that costs roughly 10-20 clock cycles on current CPUs.

Conclusion

This article started from a StackOverflow question about summing a random array sorted versus unsorted.

It then used the simplest 5-stage CPU pipeline to explain the fundamentals, the three hazards pipelining brings, and their respective solutions, with a particular focus on control hazards.

It went further into the branch prediction used to handle control hazards and analyzed the basic principle of the 2-bit dynamic branch predictor.

Finally, coming back to the StackOverflow question, it showed how the different branch-prediction outcomes for a sorted versus an unsorted random array inside a loop lead to a huge difference in running time.

Thank you for your patience. See you next time!