Get started with GPU Compute on the Web

Background

A graphics processing unit (GPU) is an electronic subsystem in a computer that was originally dedicated to processing graphics. Over the past decade, however, it has evolved toward a more flexible architecture, letting developers exploit its massively parallel design to implement many kinds of algorithms, not just render 3D graphics. These capabilities are referred to as GPU Compute, and using a GPU as a coprocessor for general-purpose scientific computing is called general-purpose GPU (GPGPU) programming.

GPU Compute has contributed significantly to the recent machine learning boom: convolutional neural networks and other models can take advantage of it to run far more efficiently on GPUs. Because the current Web platform lacks GPU Compute capabilities, the W3C's "GPU for the Web" community group is designing an API, called WebGPU, to expose modern GPU features available on most devices.

Like WebGL, WebGPU is a low-level API. It is very powerful and quite verbose, but that's okay: what we're after is performance.

In this article, I'm going to focus on the GPU Compute part of WebGPU and, to be honest, I'm only scratching the surface so that you can start playing on your own. I will dig deeper into WebGPU rendering (canvas, texture, etc.) in an upcoming article.

WebGPU is currently available in Chrome 78 on macOS behind an experimental flag. You can enable it at chrome://flags/#enable-unsafe-webgpu. The API changes frequently and is not secure for now. Since GPU sandboxing isn't implemented yet for the WebGPU API, it is possible to read GPU data belonging to other processes! So don't browse the web with the flag enabled.

Access to the GPU

Accessing the GPU is easy in WebGPU. Calling navigator.gpu.requestAdapter() returns a JavaScript Promise that asynchronously resolves with a GPU adapter. Think of this adapter as the graphics card. It can either be integrated (on the same chip as the CPU) or discrete (usually a PCIe card that is more performant but uses more power).

Once you have the GPU adapter, call adapter.requestDevice() to get a promise that resolves with a GPU device you'll use to do some GPU computation.

const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

Both functions take options that let you be specific about the kind of adapter (power preference) and device (extensions, limits) you want. For simplicity, we'll use the default options in this article.

Write a buffer

Let’s take a look at how to write data to the GPU’s memory using JavaScript. Because of the sandbox model used in modern Web browsers, this process is not simple.

The example below shows how to write four bytes to buffer memory accessible from the GPU. It calls device.createBufferMappedAsync(), which takes the size of the buffer and its usage. Even though the usage flag GPUBufferUsage.MAP_WRITE is not required for this particular call, let's be explicit that we want to write to this buffer. The resulting promise resolves with a GPU buffer object and its associated raw binary array buffer.

The writing part should feel familiar if you've played with ArrayBuffer before: use a TypedArray and copy the values into it.

// Get a GPU buffer and an arrayBuffer for writing.
// Upon success the GPU buffer is put in the mapped state.
const [gpuBuffer, arrayBuffer] = await device.createBufferMappedAsync({
  size: 4,
  usage: GPUBufferUsage.MAP_WRITE
});

// Write bytes to buffer.
new Uint8Array(arrayBuffer).set([0, 1, 2, 3]);

At this point, the GPU buffer is mapped, meaning it is owned by the CPU and is accessible for reads and writes from JavaScript. For the GPU to be able to access it, it has to be unmapped, which is as simple as calling gpuBuffer.unmap().

The concept of mapped/unmapped is needed to prevent race conditions where the GPU and the CPU access memory at the same time.
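To make this ownership rule concrete, here's a toy model in plain JavaScript (ToyBuffer is entirely hypothetical, not a WebGPU class): while a buffer is mapped only the CPU side may touch it, and only after unmap() may the GPU side use it.

```javascript
// Toy model of the mapped/unmapped ownership rule (hypothetical class,
// not part of WebGPU). A mapped buffer belongs to the CPU; unmapping
// hands ownership to the GPU.
class ToyBuffer {
  constructor() {
    this.mapped = true; // created in the mapped state
  }
  cpuWrite(bytes) {
    if (!this.mapped) throw new Error("CPU access while unmapped");
    this.data = bytes;
  }
  unmap() {
    this.mapped = false; // the GPU may now access the memory
  }
  gpuRead() {
    if (this.mapped) throw new Error("GPU access while mapped");
    return this.data;
  }
}

const buf = new ToyBuffer();
buf.cpuWrite([0, 1, 2, 3]); // fine: buffer is mapped
buf.unmap();
console.log(buf.gpuRead()); // fine: buffer is now unmapped
```

Real WebGPU enforces the same exclusivity, just at the memory level rather than with exceptions.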

Read buffer

Now let’s see how to copy a GPU buffer to another GPU buffer and read it back.

Since the first GPU buffer is written to and then copied to a second GPU buffer, it needs the additional usage flag GPUBufferUsage.COPY_SRC. The second GPU buffer is created in an unmapped state this time, with the synchronous device.createBuffer(). Its usage flags are GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ, as it will be the destination of the copy from the first GPU buffer and will be read in JavaScript once the GPU copy command has executed.

// Get a GPU buffer and an arrayBuffer for writing.
// Upon success the GPU buffer is returned in the mapped state.
const [gpuWriteBuffer, arrayBuffer] = await device.createBufferMappedAsync({
  size: 4,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC
});

// Write bytes to buffer.
new Uint8Array(arrayBuffer).set([0, 1, 2, 3]);

// Unmap buffer so that it can be used later for copy.
gpuWriteBuffer.unmap();

// Get a GPU buffer for reading in an unmapped state.
const gpuReadBuffer = device.createBuffer({
  size: 4,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
});

Because the GPU is an independent coprocessor, all GPU commands are executed asynchronously. This is why there is a list of GPU commands that gets built up and sent in batches when needed. In WebGPU, the GPU command encoder returned by device.createCommandEncoder() is the JavaScript object that builds up a batch of "buffered" commands to be sent to the GPU at some point. The methods on GPUBuffer, on the other hand, are "unbuffered," meaning they execute atomically at the time they are called.

Once you have the GPU command encoder, call copyEncoder.copyBufferToBuffer() as shown below to add this command to the command queue for later execution. Finally, finish the encoded commands by calling copyEncoder.finish() and submit them to the GPU device command queue. That queue is retrieved via device.getQueue(), whose submit() method takes an array of GPU commands as argument and executes all the stored commands in order, atomically.

// Encode commands for copying buffer to buffer.
const copyEncoder = device.createCommandEncoder();
copyEncoder.copyBufferToBuffer(
  gpuWriteBuffer /* source buffer */,
  0 /* source offset */,
  gpuReadBuffer /* destination buffer */,
  0 /* destination offset */,
  4 /* size */
);

// Submit copy commands.
const copyCommands = copyEncoder.finish();
device.getQueue().submit([copyCommands]);

At this point, the GPU queue commands have been sent, but not necessarily executed. To read the second GPU buffer, call gpuReadBuffer.mapReadAsync(). It returns a promise that resolves with an ArrayBuffer containing the same values as the first GPU buffer once all queued GPU commands have executed.

// Read buffer.
const copyArrayBuffer = await gpuReadBuffer.mapReadAsync();
console.log(new Uint8Array(copyArrayBuffer));

You can try out this sample.

In a nutshell, this is what you need to remember about buffered memory operations:

  • GPU buffers must be unmapped to be used in device queue commits.

  • After mapping, the GPU buffer can be read and written using JavaScript.

  • GPU buffers are mapped when mapReadAsync(), mapWriteAsync(), createBufferMappedAsync(), and createBufferMapped() are called.

Shader programming

A program running on the GPU that only performs computations (and doesn't draw triangles) is called a compute shader. Compute shaders are executed in parallel by hundreds of GPU cores (which are smaller than CPU cores) operating together to crunch data. Their input and output are buffers in WebGPU.

To illustrate the use of compute shaders in WebGPU, we'll play with matrix multiplication, a common algorithm in machine learning.

In short, here’s what we’re going to do:

  1. Create three GPU buffers (two for the matrices to multiply and one for the result matrix)
  2. Describe the input and output of the compute shader
  3. Compile the compute shader code
  4. Set up the compute pipeline
  5. Submit the encoded commands to the GPU in a batch
  6. Read the result matrix GPU buffer

GPU buffer creation

For simplicity, the matrix is represented as a list of floating point numbers. The first element is the number of rows, the second is the number of columns, and the remaining elements are the actual numbers of the matrix.
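This flat layout can be sketched in plain JavaScript (the helper names below are my own, not part of the sample):

```javascript
// Encode a matrix in the flat layout described above:
// [rows, cols, ...values] in row-major order.
function encodeMatrix(rows, cols, values) {
  return new Float32Array([rows, cols, ...values]);
}

// Read the element at (row, col) from the flat representation.
function matrixAt(flat, row, col) {
  const cols = flat[1];
  return flat[2 + row * cols + col];
}

const m = encodeMatrix(2, 4, [1, 2, 3, 4, 5, 6, 7, 8]);
console.log(matrixAt(m, 1, 3)); // 8, the bottom-right element
```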

These three GPU buffers are storage buffers, since we need to store and retrieve data in the compute shader. That's why their usage flags include GPUBufferUsage.STORAGE. The result matrix buffer also has GPUBufferUsage.COPY_SRC, because it will be copied to another buffer for reading once all the GPU queue commands have executed.

const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();


// First Matrix

const firstMatrix = new Float32Array([
  2 /* rows */, 4 /* columns */,
  1, 2, 3, 4, 5, 6, 7, 8
]);

const [gpuBufferFirstMatrix, arrayBufferFirstMatrix] = await device.createBufferMappedAsync({
  size: firstMatrix.byteLength,
  usage: GPUBufferUsage.STORAGE,
});
new Float32Array(arrayBufferFirstMatrix).set(firstMatrix);
gpuBufferFirstMatrix.unmap();


// Second Matrix

const secondMatrix = new Float32Array([
  4 /* rows */, 2 /* columns */,
  1, 2, 3, 4, 5, 6, 7, 8
]);

const [gpuBufferSecondMatrix, arrayBufferSecondMatrix] = await device.createBufferMappedAsync({
  size: secondMatrix.byteLength,
  usage: GPUBufferUsage.STORAGE,
});
new Float32Array(arrayBufferSecondMatrix).set(secondMatrix);
gpuBufferSecondMatrix.unmap();


// Result Matrix

const resultMatrixBufferSize = Float32Array.BYTES_PER_ELEMENT * (2 + firstMatrix[0] * secondMatrix[1]);
const resultMatrixBuffer = device.createBuffer({
  size: resultMatrixBufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
});
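As a quick sanity check of the size arithmetic above (a standalone snippet, independent of WebGPU): the result buffer holds two header floats plus one float per result cell.

```javascript
// Result buffer size for the 2x4 by 4x2 example above: two header floats
// (row count, column count) plus rows(A) * cols(B) result values, 4 bytes each.
const rowsA = 2;  // firstMatrix[0]
const colsB = 2;  // secondMatrix[1]
const expectedSize = Float32Array.BYTES_PER_ELEMENT * (2 + rowsA * colsB);
console.log(expectedSize); // 24 bytes
```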

Binding group layout and binding groups

The concepts of binding group layout and binding group are specific to WebGPU. A binding group layout defines the input/output interface expected by a shader, while a binding group represents the actual input/output data for a shader.

In the example below, the binding group layout expects some storage buffers for the compute shader at numbered entries 0, 1, and 2. The binding group, defined for this binding group layout, associates GPU buffers with those entries: gpuBufferFirstMatrix to binding 0, gpuBufferSecondMatrix to binding 1, and resultMatrixBuffer to binding 2.

const bindGroupLayout = device.createBindGroupLayout({
  bindings: [
    {
      binding: 0,
      visibility: GPUShaderStage.COMPUTE,
      type: "storage-buffer"
    },
    {
      binding: 1,
      visibility: GPUShaderStage.COMPUTE,
      type: "storage-buffer"
    },
    {
      binding: 2,
      visibility: GPUShaderStage.COMPUTE,
      type: "storage-buffer"
    }
  ]
});

const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  bindings: [
    {
      binding: 0,
      resource: {
        buffer: gpuBufferFirstMatrix
      }
    },
    {
      binding: 1,
      resource: {
        buffer: gpuBufferSecondMatrix
      }
    },
    {
      binding: 2,
      resource: {
        buffer: resultMatrixBuffer
      }
    }
  ]
});

Compute shader code

The compute shader code for multiplying matrices is written in GLSL, a high-level shading language used in WebGL with a syntax based on the C programming language. Without going into detail, you'll find below the three storage buffers marked with the keyword buffer. The program uses firstMatrix and secondMatrix as inputs and resultMatrix as its output.

Note that each storage buffer uses a binding qualifier that corresponds to the same index defined in the binding group layout and the binding group declared above.

const computeShaderCode = `#version 450

layout(std430, set = 0, binding = 0) readonly buffer FirstMatrix {
    vec2 size;
    float numbers[];
} firstMatrix;

layout(std430, set = 0, binding = 1) readonly buffer SecondMatrix {
    vec2 size;
    float numbers[];
} secondMatrix;

layout(std430, set = 0, binding = 2) buffer ResultMatrix {
    vec2 size;
    float numbers[];
} resultMatrix;

void main() {
    resultMatrix.size = vec2(firstMatrix.size.x, secondMatrix.size.y);

    ivec2 resultCell = ivec2(gl_GlobalInvocationID.x, gl_GlobalInvocationID.y);
    float result = 0.0;
    for (int i = 0; i < firstMatrix.size.y; i++) {
        int a = i + resultCell.x * int(firstMatrix.size.y);
        int b = resultCell.y + i * int(secondMatrix.size.y);
        result += firstMatrix.numbers[a] * secondMatrix.numbers[b];
    }

    int index = resultCell.y + resultCell.x * int(secondMatrix.size.y);
    resultMatrix.numbers[index] = result;
}
`;
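To sanity-check the shader's logic on small inputs, here's a plain-JavaScript transcription of the same algorithm (my own CPU sketch, not part of the WebGPU sample). Matrices use the article's flat [rows, cols, ...values] layout.

```javascript
// CPU transcription of the compute shader above (illustrative sketch).
// first and second are flat matrices: [rows, cols, ...values].
function multiplyOnCpu(first, second) {
  const rowsA = first[0];
  const colsA = first[1];
  const colsB = second[1];
  const result = new Float32Array(2 + rowsA * colsB);
  result[0] = rowsA;
  result[1] = colsB;
  for (let x = 0; x < rowsA; x++) {     // plays gl_GlobalInvocationID.x
    for (let y = 0; y < colsB; y++) {   // plays gl_GlobalInvocationID.y
      let sum = 0;
      for (let i = 0; i < colsA; i++) {
        sum += first[2 + x * colsA + i] * second[2 + i * colsB + y];
      }
      result[2 + y + x * colsB] = sum;  // same index as in the shader
    }
  }
  return result;
}

const product = multiplyOnCpu(
  new Float32Array([2, 4, 1, 2, 3, 4, 5, 6, 7, 8]),
  new Float32Array([4, 2, 1, 2, 3, 4, 5, 6, 7, 8])
);
console.log(Array.from(product)); // [2, 2, 50, 60, 114, 140]
```

Each (x, y) iteration of the two outer loops corresponds to one parallel shader invocation on the GPU.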

Pipeline setup

WebGPU in Chrome currently uses bytecode instead of raw GLSL code, which means the computeShaderCode has to be compiled before the compute shader can run. Fortunately for us, the @webgpu/glslang package lets us compile computeShaderCode into a format Chrome's WebGPU accepts. This bytecode format is based on a safe subset of SPIR-V.

Note that the "GPU for the Web" W3C community group had still not decided on a shading language for WebGPU at the time of writing.

import glslangModule from 'https://unpkg.com/@webgpu/glslang/web/glslang.js';

The compute pipeline is the object that actually describes the compute operation we're going to perform. It is created by calling device.createComputePipeline(), which takes two arguments: the binding group layout we created earlier, and a compute stage defining the entry point of our compute shader (the main GLSL function) and the actual compute shader module compiled with glslang.compileGLSL().

const glslang = await glslangModule();

const computePipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout]
  }),
  computeStage: {
    module: device.createShaderModule({
      code: glslang.compileGLSL(computeShaderCode, "compute")
    }),
    entryPoint: "main"
  }
});

Submission of commands

After instantiating the binding group with our three GPU buffers and the compute pipeline with its binding group layout, it's time to use them.

Let's start a programmable compute pass encoder with commandEncoder.beginComputePass(). We'll use it to encode the GPU commands that will perform the matrix multiplication. Set its pipeline with passEncoder.setPipeline(computePipeline) and its binding group at index 0 with passEncoder.setBindGroup(0, bindGroup). Index 0 corresponds to the set = 0 qualifier in the GLSL code.

Now let's talk about how this compute shader is going to run on the GPU. Our goal is to execute this program in parallel, once for each cell of the result matrix. For a result matrix of size 2 by 4, for instance, we'd call passEncoder.dispatch(2, 4) to encode the execution command. The first argument "x" is the first dimension, the second "y" is the second dimension, and the last "z" is the third dimension, which defaults to 1 since we don't need it here. In the GPU compute world, encoding a command that executes a kernel function on a set of data is called dispatching.

In our case, "x" and "y" will be the number of rows of the first matrix and the number of columns of the second matrix, respectively. With that, we can now dispatch the compute call with passEncoder.dispatch(firstMatrix[0], secondMatrix[1]).

Each shader invocation can read the unique gl_GlobalInvocationID built-in, which tells it which result matrix cell to compute.
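As a rough sketch (ordinary JavaScript, not WebGPU code), dispatch(x, y) fans out into one invocation per (x, y) pair, each writing the flat index the shader computes:

```javascript
// Enumerate the result-matrix indices that a dispatch(x, y) call covers:
// one shader invocation per (ix, iy) pair, writing index iy + ix * y,
// which mirrors the shader's resultCell.y + resultCell.x * cols.
function invocationTargets(x, y) {
  const targets = [];
  for (let ix = 0; ix < x; ix++) {
    for (let iy = 0; iy < y; iy++) {
      targets.push(iy + ix * y);
    }
  }
  return targets;
}

console.log(invocationTargets(2, 2)); // [0, 1, 2, 3]
```

So a dispatch over the result matrix's dimensions covers each cell exactly once.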

const commandEncoder = device.createCommandEncoder();

const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(computePipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatch(firstMatrix[0] /* x */, secondMatrix[1] /* y */);
passEncoder.endPass();

To end the compute pass encoder, call passEncoder.endPass(). Then, create a GPU buffer to use as the destination when copying the result matrix buffer with copyBufferToBuffer. Finally, finish the encoded commands with commandEncoder.finish() and submit them to the GPU device queue by calling device.getQueue().submit() with the GPU commands as argument.

// Get a GPU buffer for reading in an unmapped state.
const gpuReadBuffer = device.createBuffer({
  size: resultMatrixBufferSize,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
});

// Encode commands for copying buffer to buffer.
commandEncoder.copyBufferToBuffer(
  resultMatrixBuffer /* source buffer */,
  0 /* source offset */,
  gpuReadBuffer /* destination buffer */,
  0 /* destination offset */,
  resultMatrixBufferSize /* size */
);

// Submit GPU commands.
const gpuCommands = commandEncoder.finish();
device.getQueue().submit([gpuCommands]);

Read the result matrix

Reading the result matrix is as easy as calling gpuReadBuffer.mapReadAsync() and logging the ArrayBuffer that the resulting promise resolves with.

In our code, the result recorded in the DevTools JavaScript console is “2,2,50,60,114,140.”

// Read buffer.
const arrayBuffer = await gpuReadBuffer.mapReadAsync();
console.log(new Float32Array(arrayBuffer));

Click here for an example

Performance investigation

So how does running matrix multiplication on a GPU compare to running it on a CPU? To find out, I wrote the program just described for the CPU. In my benchmarks, tapping the full power of the GPU became the obvious choice once matrices were larger than 256 by 256.

This article is just the beginning of my journey exploring WebGPU. Expect more articles soon featuring deeper dives into GPU Compute and into how rendering (canvas, texture, sampler) works in WebGPU.