
One reason CUDA organizes threads into blocks is that each block has its own shared memory. All threads in the same block can access this memory, and accessing it is much faster than accessing global memory.

import numpy as np

A = np.array([
    [1, 2, 1],
    [3, 1, 2],
    [1, 2, 0]
])

B = np.array([
    [1, 1, 0],
    [2, 1, 1],
    [3, 2, 1]
])

A quick review of matrix multiplication from linear algebra: to multiply A and B above, the number of columns of A must equal the number of rows of B. Element (i, j) of the product C is then the sum over k of A[i, k] * B[k, j], i.e. the dot product of row i of A with column j of B.
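As a quick sanity check in plain NumPy (a small sketch using the A and B defined above; np.matmul is the CPU reference we will later compare the CUDA kernels against):

# Reference result computed on the CPU with NumPy
C_ref = np.matmul(A, B)   # same as A @ B

# Element (0, 1) is the dot product of row 0 of A and column 1 of B
assert C_ref[0, 1] == sum(A[0, k] * B[k, 1] for k in range(A.shape[1]))
print(C_ref)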

So how do we speed up matrix multiplication through parallel computation in CUDA?

As the figure above suggests, we define a grid whose shape matches the output C. The thread at each grid position computes one element of C by multiplying the corresponding row elements of A with the corresponding column elements of B and summing the products. Each of these computations is completed independently in its own thread.

from numba import cuda

@cuda.jit
def matmul(A, B, C):
    # Absolute (row, column) position of this thread in the 2D grid
    i, j = cuda.grid(2)
    # Guard against threads that fall outside the output matrix
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
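To run this kernel we have to choose a block size and a grid size and launch it with Numba's bracket syntax. The sketch below assumes a CUDA-capable GPU is available; the 16 x 16 block size and the names threadsperblock, blockspergrid and C are illustrative choices, not part of the original example:

import math

C = np.zeros((A.shape[0], B.shape[1]))

# Threads per block and blocks per grid (illustrative values)
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(C.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(C.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# Numba transfers the NumPy arrays to the device for this call;
# explicit cuda.to_device copies are often preferred in real code
matmul[blockspergrid, threadsperblock](A, B, C)
print(C)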

This means that for every element of C, each value it needs from A and B (a full row of A and a full column of B) is loaded from slow global memory. Next we introduce a more efficient way to stage the data: shared memory.

Shared memory is also where blocks really pay off. We split the result matrix into square tiles, one per block, and each block has its own shared memory.

cuda.shared.array allocates an array in the block's shared memory. Its shape matches the block shape (and must be a compile-time constant), and all threads in the block can read and write it.

We create two shared-memory arrays for each block, which hold the tiles of A and B at the corresponding locations, respectively.

For example, in the figure above, sA stores the data of the orange part of matrix A, and sB stores the data of the orange part of matrix B.

# Tiles of A and B staged in fast per-block shared memory
# (TPB and float32, from "from numba import float32", are defined at module level)
sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

x, y = cuda.grid(2)       # absolute position of this thread in the grid
tx = cuda.threadIdx.x     # position of this thread within its block
ty = cuda.threadIdx.y
bpg = cuda.gridDim.x      # number of blocks per grid (along x)

Here TPB is the number of threads per block (along each side of the square block), and cuda.gridDim tells us how many blocks the grid has. Next we loop over the tiles along the shared dimension; in each step, every thread of the block copies one element of A and one element of B into shared memory.

for i in range(bpg):
    # Each thread preloads one element of the current A tile and one of the current B tile
    sA[tx, ty] = A[x, ty + i * TPB]
    sB[tx, ty] = B[tx + i * TPB, y]

After the tile data is in sA and sB, and all threads in the block have synchronized with cuda.syncthreads() so that no thread reads the shared tiles before they are fully written, we multiply the two shared tiles to accumulate an intermediate result.

The next iteration of the loop then moves one tile to the right in A and one tile down in B, loads those tiles into sA and sB, and again multiplies the two shared tiles to get another partial result.

# Multiply the two shared tiles, accumulating the partial dot product
for j in range(TPB):
    tmp += sA[tx, j] * sB[j, ty]

Summing these tile-by-tile partial products over all the tiles gives the final value of each element of C, which we then write back to global memory. You can try assembling the full kernel yourself.
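For reference, here is one way the pieces above can be put together, closely following the shared-memory matrix multiplication example in the Numba documentation. TPB = 16 and the name fast_matmul are choices made here, and for simplicity this sketch assumes the matrix dimensions are multiples of TPB:

from numba import cuda, float32

TPB = 16  # threads per block (each block is TPB x TPB)

@cuda.jit
def fast_matmul(A, B, C):
    # Shared-memory tiles of A and B
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x  # number of blocks per grid

    tmp = 0.
    for i in range(bpg):
        # Preload one tile of A and one tile of B into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until every thread has finished preloading its element
        cuda.syncthreads()

        # Compute the partial dot product on the shared tiles
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until every thread is done before the tiles are overwritten
        cuda.syncthreads()

    C[x, y] = tmp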