Xiaogu has a friend, Ah Q, who works in workshop No. 1 of a CPU. Something big happened to him recently.

The CPU I work in has 8 workshops, that is, 8 cores. Each of us can execute two threads at the same time, which makes 8 cores and 16 threads, and we are very fast.

In workshop No. 1, where I work, I am responsible for executing instructions. Alongside me are Little A, who fetches instructions, Xiao Pang, who decodes them, and Old K, who writes back the results. We each do our own job and together complete the execution of a program.

A simple loop

That day, we came across a piece of code:

    void array_add(int data[], int len) {
        for (int i = 0; i < len; i++) {
            data[i] += 1;
        }
    }

It took hundreds of iterations to finish this code, and the simple, repetitive work in each one wore me out.

Old K, who was responsible for writing back the results, was tired and sweating too. He joked: “Every time, I take out one number and write it back. If only I could take several numbers at a time and process them as a batch.”

Old K’s words woke me up. Right, why couldn’t we work in batches?

I kept turning the idea over in my mind as I went on working.

The busy day soon ended, and in the blink of an eye it was evening again. After the computer was turned off, I called everyone together.

“Guys, remember that loop we had during the day?”

“Which loop? We’ve executed quite a few loops today,” Little A said.

“The one that increments every element of the integer array by one.”

“I remember. What about that loop? Is there a problem?”

I glanced at Old K and said: “I was thinking about what Old K said today. In a loop like this, we take out one number, add 1, and write it back, one number per operation. That is far too inefficient. If we upgraded ourselves to take out several numbers at once and add 1 to all of them as a batch, wouldn’t that be a lot faster?”

Old K perked up as soon as he heard this. “That would be great! How do you plan to do it?”

“I haven’t decided yet. Any suggestions?”

Xiao Pang, who was in charge of decoding instructions, said, “We could add a new instruction dedicated to taking out several numbers at a time and adding 1 to each.”

“No, no, we can’t lock it down like that. Today it is add 1, but what if next time it is add 2? The instruction can’t be limited to 1.”

“What if a different value has to be added to each number?”

“If you put it that way, what if it’s subtraction instead of addition?”

“And…”

The discussion took off. None of us had expected that one small addition loop would suddenly raise so many questions.

Parallel computing

As the discussion went deeper, I realized it had gone beyond the scope of workshop No. 1. We needed to report it to the leader and bring representatives of all eight workshops together to discuss it.

As soon as the leader heard about the new technology to improve performance, he immediately became interested in it and organized a meeting to discuss the plan.

“Is everyone here? Ah Q, tell us the purpose of this meeting,” the leader said.

I stood up and began to tell the group about our problems and ideas.

“Well, we came across a loop the other day in workshop No. 1. The body of the loop simply increments each element of an array by 1. When we run it, we keep taking out one element, adding 1 to it, and writing it back. Adding one element at a time seems too slow. If we could take out several elements at a time and add to all of them at once, it would be much faster than going one by one.”

As soon as I finished, everyone began to whisper.

“So this is really parallel computing!” said Tiger from workshop No. 2, getting straight to the point.

Little Six from workshop No. 6 asked: “Ah Q, do you have a plan?”

“Not yet. That’s why we’re meeting today, because it’s a bit complicated and we need to come up with ideas.”

“It doesn’t seem that complicated.”

“The example I gave is just the simplest case. The value added in parallel may not be a fixed number; we may have to add one array to another array. The elements may not be integers but floating-point numbers. The operation may be subtraction or multiplication, or it may not be arithmetic at all but a logical operation.”

No sooner had I finished than we all began to whisper again.

“From everything you have listed, it sounds like we need to add a whole dedicated set of instructions for parallel computing,” Tiger said.

“That’s a big job.”

“Yes…”

At this point Little Six asked again: “When we compute, the data is first read into a register, but a register can only hold one number. How would we read in several at once?”

“We may need to add some larger registers, say 128-bit ones, each of which can hold four 32-bit integers.”

“Is this really necessary? We are a general-purpose CPU, not a chip dedicated to mathematical calculations. Why should we do these things?” the representative of workshop No. 4 objected.

I said, “That’s absolutely necessary. There are a lot of computing needs in image, video, audio processing, and so on, and we need to improve our ability to process that data.”

Seeing that we could not stop arguing, the leader slapped the table, and the meeting hall suddenly fell quiet.

“I think Ah Q has a point. We really do need to get better at this kind of data processing. But there is no need to make it complicated at first: just support parallel integer arithmetic. Don’t worry about adding registers either; you can borrow the registers of the floating-point unit, the FPU. Let’s settle that much now, and you can keep discussing the detailed plan.” With that, he left the meeting room.

A leader is a leader after all: a few words, and everything was clearly arranged for us.

SIMD

After another tense discussion, we finally settled on the plan.

We borrowed registers from the floating-point unit and gave them new names: MM0-MM7. Because each one is a 64-bit register, it can hold two 32-bit integers, four 16-bit integers, or eight 8-bit integers at the same time.
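The three ways of filling one 64-bit register can be pictured with a small C model. This is only a sketch: the `Reg64` union and the `add_u16x4` function are hypothetical names made up for illustration, unsigned lanes are assumed so that overflow wraps in a well-defined way, and the loop merely imitates what the hardware would do in a single instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit register: the same 8 bytes
   viewed as 2, 4, or 8 integer lanes. Not a real hardware interface. */
typedef union {
    uint64_t bits;     /* the whole 64-bit register   */
    uint32_t u32[2];   /* two 32-bit integers         */
    uint16_t u16[4];   /* four 16-bit integers        */
    uint8_t  u8[8];    /* eight 8-bit integers        */
} Reg64;

/* Add two registers lane by lane, treating each as four 16-bit integers.
   Hardware would perform the four additions at once; this loop only
   models the result. Each lane wraps around independently on overflow. */
Reg64 add_u16x4(Reg64 a, Reg64 b) {
    Reg64 r;
    assert(sizeof(Reg64) == 8);   /* all views share the same 64 bits */
    for (int lane = 0; lane < 4; lane++) {
        r.u16[lane] = (uint16_t)(a.u16[lane] + b.u16[lane]);
    }
    return r;
}
```

The point of the union is that the register itself does not know how many numbers it holds; the instruction that operates on it decides whether the 64 bits are two, four, or eight lanes.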

We have also added a new instruction set called MMX for performing integer operations in parallel.

We call this technique of processing multiple data items at the same time with a single instruction Single Instruction Multiple Data, or SIMD.
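To make the idea concrete, here is a rough sketch of the `array_add` loop from the beginning of the story, rewritten so that one instruction adds 1 to four integers. It does not use MMX itself; it assumes the later SSE2 integer intrinsics from `<emmintrin.h>`, a 32-bit `int`, and a compiler such as GCC or Clang that defines `__SSE2__`. The name `array_add_simd` is made up for this example, and without SSE2 the function simply falls back to the plain loop.

```c
#include <assert.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 integer intrinsics */
#endif

/* Adds 1 to every element, four 32-bit integers per step where SSE2
   is available. Assumes int is 32 bits wide. */
void array_add_simd(int data[], int len) {
    assert(len >= 0);
    int i = 0;
#if defined(__SSE2__)
    const __m128i ones = _mm_set1_epi32(1);   /* {1, 1, 1, 1} */
    for (; i + 4 <= len; i += 4) {
        /* load four ints (unaligned load, so any address works) */
        __m128i v = _mm_loadu_si128((const __m128i *)&data[i]);
        v = _mm_add_epi32(v, ones);            /* four additions at once */
        _mm_storeu_si128((__m128i *)&data[i], v);
    }
#endif
    /* scalar tail: leftover elements, or the whole array without SSE2 */
    for (; i < len; i++) {
        data[i] += 1;
    }
}
```

With an array of 10 elements, the vector loop handles the first 8 in two steps and the scalar tail handles the last 2.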

With this instruction set, we can deal with integer arithmetic problems much faster.

But gradually two very troubling problems emerged:

The first problem is that, because the registers are borrowed from the FPU, the FPU cannot be used while SIMD instructions are executing, and vice versa. A program that needs both has to go through the trouble of switching back and forth between the two modes.

Another, more important problem is that our instruction set can only handle parallel integer operations. Floating-point operations are becoming more and more common, especially in image and video processing and in some deep learning workloads, and our instruction set is of no use there.

We reported these problems to the leader. Seeing what we had already achieved, he finally agreed to let us continue upgrading.

This time we added a new extension, the SSE instruction set, together with XMM0-XMM7, eight 128-bit registers in total, so we no longer need to share registers with the FPU. With double the bit width, each register holds more data, and more data can be processed at once.

Later we kept upgrading. We not only supported parallel floating-point processing but also introduced a new generation of instructions, the AVX instruction set, which widened the registers once again to 256 bits. Our SIMD technology keeps advancing, and we handle data processing more and more powerfully!
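As a sketch of what a 128-bit register buys us, the function below adds one float array to another, four 32-bit floats per step. It assumes the SSE intrinsics from `<xmmintrin.h>` and a compiler that defines `__SSE__` (GCC and Clang do on x86-64); the name `add_arrays_f32` is invented for this illustration, and a scalar fallback is used where SSE is not available.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE single-precision intrinsics */
#endif

/* dst[i] = a[i] + b[i], four floats per step where SSE is available. */
void add_arrays_f32(float dst[], const float a[], const float b[], int len) {
    assert(len >= 0);
    int i = 0;
#if defined(__SSE__)
    for (; i + 4 <= len; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* four floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);   /* four floats from b */
        _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb));
    }
#endif
    /* scalar tail: leftover elements, or everything without SSE */
    for (; i < len; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

This covers two of the cases raised in the meeting: the operands are floating-point numbers, and each element gets a different value added to it because the second operand is itself an array.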
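With 256-bit registers, the same pattern processes eight floats per step. The sketch below adds a caller-chosen constant to every element, which also answers the old question "what if next time it is add 2?". It assumes the AVX intrinsics from `<immintrin.h>` and that the code is compiled with AVX enabled (for example `-mavx` on GCC or Clang, which defines `__AVX__`); otherwise it quietly falls back to the scalar loop. The name `add_constant_f32` is hypothetical.

```c
#include <assert.h>

#if defined(__AVX__)
#include <immintrin.h>   /* AVX intrinsics */
#endif

/* data[i] += k, eight floats per step where AVX is enabled at compile time. */
void add_constant_f32(float data[], int len, float k) {
    assert(len >= 0);
    int i = 0;
#if defined(__AVX__)
    const __m256 vk = _mm256_set1_ps(k);       /* k copied into 8 lanes */
    for (; i + 8 <= len; i += 8) {
        __m256 v = _mm256_loadu_ps(&data[i]);  /* load eight floats */
        _mm256_storeu_ps(&data[i], _mm256_add_ps(v, vk));
    }
#endif
    /* scalar tail: leftover elements, or everything without AVX */
    for (; i < len; i++) {
        data[i] += k;
    }
}
```

A binary built this way should only run on CPUs that actually support AVX; production code usually checks for the feature at run time before taking the wide path.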