Recently I have been collecting performance test cases for JavaScript functional programming, along with some memory-usage analysis.

A year ago (January 2017) I wrote an article called "Is JavaScript Functional Programming a Performance Problem?". In that article I benchmarked array higher-order functions against the native for-loop, and found that map/reduce-style functions were about 20 times slower.

A year and a half later, however, the V8 engine has made significant improvements, including to the performance of array higher-order functions, with good results.

You can see they’re pretty close.
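As a rough illustration (this is not the original benchmark; the array size, function names, and timing method here are my own choices), a minimal comparison might look like:

```javascript
// Minimal benchmark sketch: sum a large array with reduce vs. a for-loop.
const data = Array.from({ length: 1e6 }, (_, i) => i + 1);

function sumReduce(arr) {
  return arr.reduce((acc, x) => acc + x, 0);
}

function sumLoop(arr) {
  let sum = 0;
  for (let i = 0; i < arr.length; i++) sum += arr[i];
  return sum;
}

console.time('reduce');
const r1 = sumReduce(data);
console.timeEnd('reduce');

console.time('for-loop');
const r2 = sumLoop(data);
console.timeEnd('for-loop');

console.log(r1 === r2); // both compute the same sum
```

On recent V8 versions the two timings come out in the same ballpark, which was not the case in 2017.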

But I found a very interesting jsperf test case: https://jsperf.com/sorted-loop/2

// Sum the elements of the integer array that are >= 128
function perf(data) {
    var sum = 0;
    for (var i = 0; i < data.length; i++) {
        if (data[i] >= 128) { sum += data[i]; }
    }
    return sum;
}

Both test cases use this function; the only difference is the argument array: one is sorted, the other is not. Yet the sorted case runs about four times faster.
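A sketch of that setup (the array size and value range are my assumptions, following the classic Stack Overflow example):

```javascript
// Same summing function as in the test case.
function perf(data) {
  let sum = 0;
  for (let i = 0; i < data.length; i++) {
    if (data[i] >= 128) sum += data[i];
  }
  return sum;
}

// One unsorted array of random bytes, and a sorted copy of it.
const unsorted = Array.from({ length: 32768 }, () => (Math.random() * 256) | 0);
const sorted = [...unsorted].sort((a, b) => a - b);

// Sorting changes only the branch pattern, not the result.
console.log(perf(sorted) === perf(unsorted)); // true
```

Timing `perf(sorted)` against `perf(unsorted)` in a benchmark harness reproduces the gap, even though both calls do exactly the same arithmetic.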

We all know that searching a sorted array can be faster, for example with binary search. But binary search and linear search are two very different algorithms, so a performance gap there is expected. This test case is different: the algorithm is exactly the same, because it is the same function, and the machine code generated for both calls is identical. So why is there such a big performance gap?

So I did a Google search for fast array sorted, and sure enough found a Stack Overflow thread whose question and top answer each have over 20,000 upvotes, which should be worth checking out. Although the discussion there is in terms of C++ and Java, the principle carries over.

Although the source code is identical, what the CPU executes is not, due to a CPU optimization feature: branch prediction.

To make it easier to understand, the answer uses an analogy:

Consider a fork in a railroad track:

(Image By Mecanismo, source Wikimedia, Licensing CC-BY-SA 3.0)

Let's say it is the 19th century, before telephones and cell phones, and you are in charge of the junction. You don't know which way a train should go until it arrives. So you stop the train, ask the driver which direction it is headed, and then set the switch accordingly.

Another thing to note: trains have enormous inertia, so the driver must start braking from far away, and once you have set the switch correctly, it takes the train a long time to get back up to speed.

So is there a better way to reduce train waiting times?

A very simple one: set the switch to one direction in advance. Which direction? You guess:

  • If the guess is correct, the train goes straight through, costing no time.
  • If the guess is wrong, the train stops, backs up, you flip the switch to the other direction, and the train restarts and accelerates again.

If you are lucky and get it right every time, the train will never stop, it will just keep going! (You can buy a lottery ticket.)

If you’re wrong, you’ll waste a lot of time.

Now consider an if statement. An if statement produces a "branch", much like the railroad junction above:

A lot of people wonder: how can a CPU be like a train? Does the CPU also need to slow down and back up? Isn't a branch just a direct jump?

Modern CPU chips are very complex, and most use instruction pipelining to improve performance. A classic pipeline has five stages:

Instruction fetch (IF) -> decode (ID) -> execute (EX) -> memory access (MEM) -> write back (WB)

This greatly increases instruction throughput (the number of instructions completed per unit of time): by the time the first instruction finishes executing, the second has already been decoded and can execute immediately.

So how does the CPU do branch prediction? One of the simplest ways is to look at history: if a branch was taken 99% of the time in the past, the CPU guesses it will be taken again, and loads that branch's instructions into the pipeline ahead of time. If the prediction fails, the pipeline must be flushed and refilled, losing roughly 20 clock cycles.

If the array is sorted, the CPU's predictions will be very accurate. In our earlier code the condition is data[i] >= 128: whether the array is ascending or descending, branch prediction will be correct on either side of the partition point 128. If the array is unsorted, the pipeline constantly mispredicts and has to reload instructions.

So, if we know our array is unsorted and branch prediction is likely to fail, can we optimize the code to sidestep the CPU's branch prediction?

The answer is yes. We can remove the branch so that the CPU can keep the pipeline full without relying on branch prediction, using a bit-manipulation trick. Recall the earlier code:

if (data[i] >= 128) {
    sum += data[i];
}

That is: add up all the numbers greater than or equal to 128.

Since bitwise operations in JavaScript work on 32-bit integers, we can use a right shift to extract a number's sign.

For signed 32-bit numbers:

  • A nonnegative number shifted right by 31 bits is always 0.
  • A negative number shifted right by 31 bits is always -1, i.e. 0xFFFFFFFF.
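A quick check of this in JavaScript:

```javascript
// Arithmetic right shift by 31 extracts the sign of a 32-bit integer.
console.log(5 >> 31);           // 0: nonnegative -> 0
console.log(0 >> 31);           // 0
console.log(-5 >> 31);          // -1: negative -> all 32 bits set
console.log((-5 >> 31) === -1); // true
```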

Since the binary representation of -1 has all bits set:

0b1111_1111_1111_..._1111_1111_1111  // 32 bits

So ANDing -1 with any value leaves that value unchanged:

(-1 & x) === x

0 is the opposite of -1: all 32 bits are 0:

0b0000_0000_0000_..._0000_0000_0000  // 32 bits

In other words, 0 and -1 are bitwise complements of each other:

~-1 === 0
~0 === -1

Now we can decompose the earlier logic, "if greater than or equal to 128, then add it to the sum":

  • Subtract 128 from the number; there are only two possible outcomes: nonnegative or negative.
  • Shift the result right by 31 bits to get either 0 (nonnegative) or -1 (negative).

We want to add exactly the values whose shift result is 0 (that is, those >= 128):

  • Invert the shift result, turning numbers >= 128 into the mask -1 and numbers < 128 into 0.
  • AND the mask with the original number.

The code is:

const t = (data[i] - 128) >> 31;
sum += ~t & data[i];
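Put together as a full function (a sketch; it assumes the values are 32-bit integers, since JavaScript bitwise operators truncate their operands to 32 bits):

```javascript
// Branchless version of the sum: same result as the if version, no branch.
function perfBranchless(data) {
  let sum = 0;
  for (let i = 0; i < data.length; i++) {
    // t is 0 when data[i] >= 128, and -1 (all bits set) when data[i] < 128.
    const t = (data[i] - 128) >> 31;
    // ~t is the mask: -1 keeps the value, 0 zeroes it out.
    sum += ~t & data[i];
  }
  return sum;
}

console.log(perfBranchless([100, 127, 128, 200])); // 328
```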

This avoids branch prediction failures. Performance test:

You can see that the sorted and unsorted cases now have nearly identical performance, and both are significantly faster than the if-branch version on a random array. But both are also noticeably slower than the if-branch code on a sorted array. Is that because the bitwise version gets no benefit from branch prediction, and so loses to well-predicted branch code? Actually, no.

Even with a sorted array, the branch predictor will still mispredict around the threshold of 128 and lose clock cycles there, unless every element is at least 128 or every element is below it. The bitwise version, by contrast, never stalls the CPU at all.

Bitwise operations turning out slower than an if branch is not what many developers expect; many take it for granted that bitwise operations are the fastest. In fact, I wrote an article about this a long time ago:

  • JavaScript bit arithmetic, maybe not that fast

The reason the bitwise version is slower than the if branch here is that it is more cumbersome: it compiles to a longer sequence of machine instructions, which takes more cycles to complete, and so it loses to the well-predicted branch.

Let's conclude.

Bitwise operations give more stable performance because they eliminate branching, but they are less readable. Some even say that "all use of bitwise operations in business code is just showing off" and "code is written for people; bitwise operations are written for machines".