Introduction

Project address: github.com/hijkzzz/neu…

Beginners in deep learning often wonder how frameworks such as TensorFlow and PyTorch work under the hood. Without exception, these systems use CUDA for parallel acceleration. Here I use CUDA to implement a simple CNN that is easy to read and learn from, and it reaches 99.23% accuracy on the undistorted MNIST dataset.

Preliminary knowledge

CUDA

CUDA Programming For Beginners

Matrix derivative

Recommended: the Matrix Derivation notes shared by an expert on Zhihu

Design

Storage

thrust.github.io/

Before implementing the neural network, we need to design a storage class that holds parameters and data on the GPU, called Storage. For ease of implementation it uses CUDA's thrust::device_vector (similar to std::vector) to manage a dynamic array in video memory, plus a std::vector that records the shape; tensors are then built on top of Storage.
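
A minimal sketch of what such a Storage class could look like, assuming a flat float buffer plus a host-side shape vector (the member names are illustrative, not necessarily those in the repository):

```
// Illustrative sketch of a Storage class backed by thrust::device_vector.
#include <thrust/device_vector.h>
#include <vector>
#include <numeric>
#include <functional>

class Storage {
 public:
  explicit Storage(const std::vector<int>& shape, float value = 0.0f)
      : shape_(shape),
        data_(std::accumulate(shape.begin(), shape.end(), 1,
                              std::multiplies<int>()),
              value) {}

  // Raw device pointer, usable directly inside CUDA kernels.
  float* data() { return thrust::raw_pointer_cast(data_.data()); }
  const std::vector<int>& shape() const { return shape_; }

 private:
  std::vector<int> shape_;             // tensor shape, kept on the host
  thrust::device_vector<float> data_;  // contiguous buffer in video memory
};
```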

Matrix multiplication

Neural network implementations use matrix multiplication heavily, so a key part of CUDA acceleration is an efficient parallel matrix multiplication. Here I directly use the shared-memory accelerated matrix multiplication from the CUDA beginners' tutorial linked above; it could be optimized much further to greatly improve efficiency.
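
For reference, a sketch of the classic shared-memory tiled kernel (the tile size and row-major layout are assumptions, and this is not the project's exact code):

```
#define TILE 16

// C = A * B, where A is m x k and B is k x n (row-major).
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int m, int n, int k) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  for (int t = 0; t < (k + TILE - 1) / TILE; ++t) {
    // Each thread loads one element of A and one of B into shared memory.
    As[threadIdx.y][threadIdx.x] =
        (row < m && t * TILE + threadIdx.x < k)
            ? A[row * k + t * TILE + threadIdx.x] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] =
        (t * TILE + threadIdx.y < k && col < n)
            ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
    __syncthreads();

    for (int i = 0; i < TILE; ++i)
      acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
    __syncthreads();
  }

  if (row < m && col < n) C[row * n + col] = acc;
}
```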

Fully connected layer

Let X be the input data matrix, where each row is one sample. W is the parameter matrix, b is the bias vector, and L is the average loss over the samples. * denotes matrix multiplication (not element-wise multiplication), and ^T denotes transpose:

```
Fully connected
  Forward:  Y = X * W
  Backward: dL/dX = dL/dY * W^T
            dL/dW = X^T * dL/dY
Bias
  Forward:  Y = X + b
  Backward: dL/db = sum(dL/dY, 0)   sum the gradient over the samples
```
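
As a concrete example of the bias gradient above, a naive kernel that sums dL/dY over the sample dimension might look like this (names are illustrative):

```
// dL/db[j] = sum over samples i of dL/dY[i][j]; one thread per output column.
__global__ void bias_backward(const float* d_y, float* d_b,
                              int n_samples, int n_out) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= n_out) return;

  float acc = 0.0f;
  for (int i = 0; i < n_samples; ++i)
    acc += d_y[i * n_out + j];
  d_b[j] = acc;
}
```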

Convolution layer

hal.inria.fr/file/index/…

To make convolution easy to implement with matrix multiplication, I follow Caffe's convolution scheme, namely im2col:

The basic idea is to unfold the convolution into a matrix multiplication, so that it can be implemented efficiently with the parallel matrix multiplication above. Let F be the convolution kernel parameters with shape channel_out * channel_in * kernel_width * kernel_height, X an input sample with shape channel_in * width * height, and b the bias vector.

```
Convolution
  Forward:  col = im2col(im)
            Y = F * col
  Backward: dL/dF = dL/dY * col^T
            dL/d_col = F^T * dL/dY
            dL/d_im = col2im(dL/d_col)   scatter back according to how im2col expanded the input image
Bias (per channel)
  Forward:  Y = X + b
  Backward: dL/db = sum(sum(dL/dY, 2), 1)   sum over the entire channel
```
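
A naive im2col kernel, one thread per element of the col matrix; padding is omitted and the layout/names are assumptions rather than the repository's code:

```
// im:  channel_in x height x width (contiguous, no padding assumed)
// col: (channel_in * ksize * ksize) x (out_h * out_w)
__global__ void im2col_kernel(const float* im, float* col,
                              int channels, int height, int width,
                              int ksize, int stride, int out_h, int out_w) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int total = channels * ksize * ksize * out_h * out_w;
  if (idx >= total) return;

  // Decompose the flat index into (channel, kernel row/col, output row/col).
  int ow = idx % out_w;
  int oh = (idx / out_w) % out_h;
  int kw = (idx / (out_w * out_h)) % ksize;
  int kh = (idx / (out_w * out_h * ksize)) % ksize;
  int c  =  idx / (out_w * out_h * ksize * ksize);

  int in_row = oh * stride + kh;
  int in_col = ow * stride + kw;
  int col_row = (c * ksize + kh) * ksize + kw;
  col[col_row * (out_h * out_w) + oh * out_w + ow] =
      im[(c * height + in_row) * width + in_col];
}
```

For the backward pass, col2im does the reverse scatter, accumulating the gradients of overlapping windows (e.g. with atomicAdd) back into the input image layout.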

Maxpool

The backward pass of max pooling requires recording the position of the maximum element in each window during the forward pass, and then routing the incoming gradient directly back to that position.
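
A sketch of this idea, assuming non-overlapping windows (stride equal to the window size, dimensions divisible by it) and illustrative names: the forward kernel records each window's argmax, and the backward kernel routes the gradient straight back to it.

```
// Max pooling forward: record the argmax index of each window for the backward pass.
__global__ void maxpool_forward(const float* in, float* out, int* argmax,
                                int channels, int height, int width,
                                int psize, int out_h, int out_w) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= channels * out_h * out_w) return;

  int ow = idx % out_w;
  int oh = (idx / out_w) % out_h;
  int c  = idx / (out_w * out_h);

  float best = -1e30f;
  int best_pos = -1;
  for (int ph = 0; ph < psize; ++ph)
    for (int pw = 0; pw < psize; ++pw) {
      int pos = (c * height + oh * psize + ph) * width + ow * psize + pw;
      if (in[pos] > best) { best = in[pos]; best_pos = pos; }
    }
  out[idx] = best;
  argmax[idx] = best_pos;
}

// Backward: d_in must be zeroed beforehand; each output gradient goes
// straight back to the recorded position.
__global__ void maxpool_backward(const float* d_out, const int* argmax,
                                 float* d_in, int out_size) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < out_size) atomicAdd(&d_in[argmax[idx]], d_out[idx]);
}
```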

Activation functions

The forward and backward propagation of the activation functions are as follows (both applied element-wise):

```
ReLU
  Forward:  Y = relu(X)
  Backward: dL/dX = relu'(X) element_mul dL/dY      element-wise product
            where relu'(x) = 1 if x > 0 else 0
Sigmoid
  Forward:  Y = sigmoid(X)
  Backward: dL/dX = sigmoid'(X) element_mul dL/dY   element-wise product
            where sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
```
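
For example, the element-wise ReLU kernels can be as simple as the following sketch (one thread per element; names are illustrative):

```
// Element-wise ReLU forward: y = max(x, 0).
__global__ void relu_forward(const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = fmaxf(x[i], 0.0f);
}

// Element-wise ReLU backward: pass the gradient through only where x > 0.
__global__ void relu_backward(const float* x, const float* d_y,
                              float* d_x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d_x[i] = (x[i] > 0.0f) ? d_y[i] : 0.0f;
}
```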

Softmax

In practice, to prevent the denominator of softmax from overflowing, LogSoftmax is generally used instead. Let 1_n be an all-ones column vector:

```
LogSoftmax
  Forward:  Y = log_softmax(X) = X - log(exp(X) * 1_n) * 1_n^T
  Backward: dL/dX = dL/dY - ((dL/dY * 1_n) / (exp(X) * 1_n)) * 1_n^T element_mul exp(X)
```
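
A naive row-wise sketch (one thread per sample row, illustrative names); subtracting the row maximum before exponentiating is the standard trick that keeps exp() from overflowing:

```
// Row-wise log-softmax over a rows x cols matrix.
__global__ void log_softmax(const float* x, float* y, int rows, int cols) {
  int r = blockIdx.x * blockDim.x + threadIdx.x;
  if (r >= rows) return;

  float max_v = x[r * cols];
  for (int j = 1; j < cols; ++j) max_v = fmaxf(max_v, x[r * cols + j]);

  float sum = 0.0f;
  for (int j = 0; j < cols; ++j) sum += expf(x[r * cols + j] - max_v);

  float log_sum = logf(sum) + max_v;   // log(sum_j exp(x_j)), computed stably
  for (int j = 0; j < cols; ++j) y[r * cols + j] = x[r * cols + j] - log_sum;
}
```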

NLLLoss

NLLLoss is the average negative log-likelihood loss, implemented here to be used together with LogSoftmax. Let Y be the label matrix with one sample per row, and N the number of samples.

```
NLLLoss
  Forward:  L = mean(sum(-log_p element_mul Y, 1), 0)
            which can be rewritten as L = 1_N^T * ((-log_p element_mul Y) * 1_k) / N
  Backward: dL/d(log_p) = -Y / N
```

NLLLoss + LogSoftmax is the usual softmax loss. Substituting dL/d(log_p) into the LogSoftmax gradient gives the gradient of the softmax loss: (softmax(X) - Y) / N.
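
Putting the two pieces together, a sketch of a fused backward kernel for the softmax loss, assuming a dense one-hot label matrix (names are illustrative):

```
// Combined LogSoftmax + NLLLoss backward: dL/dX = (softmax(X) - Y) / N.
// One thread per sample row.
__global__ void softmax_loss_backward(const float* x, const float* labels,
                                      float* dx, int rows, int cols) {
  int r = blockIdx.x * blockDim.x + threadIdx.x;
  if (r >= rows) return;

  float max_v = x[r * cols];
  for (int j = 1; j < cols; ++j) max_v = fmaxf(max_v, x[r * cols + j]);

  float sum = 0.0f;
  for (int j = 0; j < cols; ++j) sum += expf(x[r * cols + j] - max_v);

  for (int j = 0; j < cols; ++j) {
    float p = expf(x[r * cols + j] - max_v) / sum;   // softmax(X)[r][j]
    dx[r * cols + j] = (p - labels[r * cols + j]) / rows;
  }
}
```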

RMSProp

To implement the optimizer as a separate module, we save the gradients during back propagation; the RMSProp algorithm then maintains an exponential moving average of the squared gradients and uses it to scale each update. Optimizers such as Adam can be implemented just as easily.
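
A sketch of the per-parameter RMSProp update; the l2 and beta terms mirror the hyper-parameters listed in the test section, while eps and the names are assumptions:

```
// RMSProp: keep an exponential moving average of squared gradients and
// scale each step by its square root.
__global__ void rmsprop_update(float* w, const float* grad, float* square_avg,
                               int n, float learning_rate, float l2,
                               float beta, float eps) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  float g = grad[i] + l2 * w[i];                           // L2 regularization
  square_avg[i] = beta * square_avg[i] + (1.0f - beta) * g * g;
  w[i] -= learning_rate * g / (sqrtf(square_avg[i]) + eps);
}
```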

Implementation

Source structure

```
src
  cuda            CUDA source code
  minist          MNIST demo
test
  cuda            CUDA unit tests
CMakeLists.txt    CMake build script
```

Due to space constraints, please read the actual code on GitHub. Each layer is encapsulated as a class, and the connect function is called to chain layers together.

Debugging and tuning

The Visual Profiler provided by CUDA can be used to easily identify application performance bottlenecks.

In my experiments, I found that 80% of the execution time was spent waiting on GPU I/O, so using pinned memory and merging transfers and memory allocations improved throughput by tens of times. Matrix multiplication still leaves considerable room for optimization, but overall the network processes the 60,000 MNIST samples in tens of seconds on a GTX 1070, which basically meets my goal.
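
As an illustration of the pinned-memory idea, a host-side sketch (buffer sizes and names are assumptions); in practice the pinned buffer is allocated once and reused across batches, which is exactly the "merge allocations" part:

```
#include <cuda_runtime.h>

// Pinned (page-locked) host memory enables true asynchronous copies,
// letting uploads overlap with kernels already queued on the stream.
float* alloc_pinned(size_t n_floats) {
  float* h = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&h), n_floats * sizeof(float));
  return h;  // allocate once, reuse for every batch
}

void upload_batch(const float* h_pinned, float* d_batch, size_t n_floats,
                  cudaStream_t stream) {
  // Returns immediately; the copy overlaps with compute on other streams.
  cudaMemcpyAsync(d_batch, h_pinned, n_floats * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
}
```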

One day of coding, two days of debugging: debugging is a hard but essential part of development, and mastering the right tools and methods pays off doubly. CUDA's Nsight and cuda-memcheck are great tools, and the classic printf-plus-commenting-out approach also works well.

Test

```
Network structure
  conv 1 32 5     relu
  maxpool 2
  conv 32 64 5    relu
  maxpool 2
  conv 64 128 3   relu
  fc 4*128 128    relu
  fc 128 10       relu
  softmax
  nllloss

Hyperparameters
  shuffle       = true
  batch_size    = 128
  learning_rate = 0.003
  l2            = 0.0001
  beta          = 0.99

Accuracy
  1 epoch     93%
  10 epochs   99.12%
  30 epochs   99.23%
  ~10 s / epoch (GTX 1070)
```

References

  1. High Performance Convolutional Neural Networks for Document Processing
  2. Convolutional Neural Network (CNN) Back Propagation Algorithm
  3. Matrix derivation
  4. Caffe
  5. CUDA Toolkit Documents