Preface:
In a previous introductory computer vision article, I recommended implementing a convolutional neural network from scratch using only a package like NumPy, without any framework. One of the main reasons is that in doing so you will understand how the convolution operation is optimized under the hood, and the mainstream method is GEMM. This blog post takes a closer look at what GEMM is and its pros and cons.
I spend most of my time thinking about how to make deep learning with neural networks faster and more efficient. In practice, that means focusing on a function called GEMM. It is part of the BLAS (Basic Linear Algebra Subprograms) library, first created in 1979, and I had never heard of it until I started trying to optimize neural networks. To explain why it is so important, here is a diagram from my friend Yangqing Jia's thesis:
All the layers beginning with FC (fully connected) or conv (convolution) are implemented using GEMM, and almost all of the time (95% of the GPU version, 89% of the CPU version) is spent on these layers.
So what is GEMM? It stands for GEneral Matrix to Matrix Multiplication, and it essentially does exactly what it says on the tin: it multiplies two input matrices together to get an output matrix. The difference between this and the kind of matrix operations I was used to in the 3D graphics world is that the matrices it works on are usually very large.
For example, a single layer in a typical network might need to multiply a 256-row, 1,152-column matrix by a 1,152-row, 192-column matrix to produce a 256-row, 192-column result. Naively, that requires 57 million (256 x 1,152 x 192) floating point operations, and modern architectures can have dozens of such layers, so I often see networks that need several billion FLOPs to compute a single frame. Here's a diagram I drew to help me visualize how it works:
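As a rough sketch of what that single-layer GEMM looks like in code, here is a minimal NumPy version; the shapes come from the example above, and the random data is just a stand-in for real activations and weights:

```python
import numpy as np

# Shapes from the example above: a 256x1152 input times a 1152x192 weight matrix.
A = np.random.rand(256, 1152).astype(np.float32)   # input activations
B = np.random.rand(1152, 192).astype(np.float32)   # layer weights
C = A @ B                                           # the GEMM: a 256x192 result

# Each output element needs 1,152 multiply-adds, giving the ~57 million
# operations mentioned above.
print(C.shape, 256 * 1152 * 192)
```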
Fully connected layers
The fully connected layer is the classic neural network layer that has been around for decades, and it is the easiest place to start with how GEMM is used. Each output value of an FC layer looks at every value in the input layer, multiplies each one by the corresponding weight for that input index, and sums the results to get its output. Based on the illustration above, it works like this:
There are “k” input values and “n” neurons, each of which has its own set of learned weights for every input value. There are “n” output values, one per neuron, each calculated by taking the dot product of that neuron’s weights with the input values.
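Here is a minimal sketch of that layer as a single matrix multiplication, with illustrative values for “k” and “n”:

```python
import numpy as np

k, n = 1152, 192                 # k input values, n neurons (illustrative)
x = np.random.rand(1, k)         # one input sample
W = np.random.rand(k, n)         # column i holds neuron i's learned weights
y = x @ W                        # n outputs: each is a dot product of the
                                 # input with one neuron's weights
print(y.shape)                   # (1, n)
```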
Convolution layer
Using GEMM for the convolution layer is not an obvious choice. The convolution layer treats its input as a two-dimensional image, with multiple channels for each pixel, much like a classic image with width, height, and depth. Unlike the images I’m used to working with, the number of channels can be in the hundreds, not just RGB or RGBA!
The convolution operation produces its output by taking a number of “kernels” of weights and applying them across the image. Here is what an input image and a single kernel look like:
Each kernel is a three-dimensional array, with the same depth as the input image but a much smaller width and height, typically something like 7×7. To produce a result, the kernel is applied to a grid of points across the input image. At each point where it is applied, all the corresponding input values are multiplied by the weights and then summed to produce a single output value at that point. Here is a visualization:
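To make the “multiply and sum at each grid point” idea concrete, here is a naive single-kernel sketch; the shapes and stride are illustrative, and no padding is used:

```python
import numpy as np

H, W, D = 32, 32, 3            # input height, width, depth (channels)
k, stride = 7, 1               # kernel size and stride

image = np.random.rand(H, W, D)
kernel = np.random.rand(k, k, D)   # same depth as the input

out_h = (H - k) // stride + 1
out_w = (W - k) // stride + 1
output = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i * stride:i * stride + k,
                      j * stride:j * stride + k, :]
        output[i, j] = np.sum(patch * kernel)   # one output value per grid point
```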
You can think of this operation as something like an edge detector. The kernel contains a pattern of weights, and it outputs a high value when the part of the input image it is looking at contains a similar pattern. When the input does not match the pattern, the result is a low number at that position. Here are some typical patterns learned by the first layer of a network:
Because the input to the first layer is an RGB image, all of these kernels can also be visualized as RGB, and they show the primitive patterns the network is looking for. Each of these 96 kernels is applied in a grid pattern across the input, and the result is a set of 96 two-dimensional arrays, which is treated as an output image with a channel depth of 96. If you are used to image processing operations like the Sobel operator, you can imagine each of them as an edge detector optimized for a different important pattern in the image, so that each channel is a map of where that pattern occurs in the input.
As you may have noticed, I have been vague about the kind of grid the kernel is applied to. The key controlling factor is a parameter called “stride”, which defines the spacing between kernel applications. For example, with a stride of 1, a 256×256 input image will have the kernel applied at every pixel, and the output will be the same width and height as the input. With a stride of 4, the same input image will have the kernel applied only once every four pixels, so the output will be just 64×64. Typical stride values are smaller than the kernel size, which means that many of the kernel applications actually overlap at the edges in the diagrams that visualize them.
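As a quick sketch of the output-size arithmetic behind those numbers (assuming “same”-style padding, so the kernel can be centred on every stride-th pixel):

```python
def conv_output_size(input_size, kernel_size, stride, pad):
    # Standard output-size formula for a strided convolution.
    return (input_size + 2 * pad - kernel_size) // stride + 1

k = 7
print(conv_output_size(256, k, stride=1, pad=k // 2))   # 256
print(conv_output_size(256, k, stride=4, pad=k // 2))   # 64
```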
How does GEMM work for convolution?
This seems like a fairly specialized operation. It involves lots of multiplications and summations, like the fully connected layer, but it is not obvious how or why it should be turned into the matrix multiplication that GEMM performs. I will talk about the motivation at the end, but first, how can this operation be expressed in terms of a matrix multiplication?
The first step is to turn the input image, which is really a three-dimensional array, into a two-dimensional array that we can treat like a matrix. Each place the kernel is applied to is a small three-dimensional cube within the image, so we take each of those cubes of input values and copy it out as a single column of a matrix. This operation is known as im2col, for image-to-column; I believe the name comes from an original Matlab function. Here is how I visualized it:
If the stride is smaller than the kernel size, you might be surprised at the memory expansion that happens when we do this conversion. Pixels that fall inside overlapping kernel applications are duplicated in the matrix, which seems inefficient, but in practice the trade-off works out in its favor.
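Here is a minimal im2col sketch in NumPy, following the description above; for simplicity it stores one flattened patch per row rather than per column, which is the same idea with the matrix transposed (names and shapes are illustrative, not from any library):

```python
import numpy as np

def im2col(image, k, stride):
    """image: (H, W, D) array; returns a (num_patches, k*k*D) matrix."""
    H, W, D = image.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.zeros((out_h * out_w, k * k * D))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k,
                          j * stride:j * stride + k, :]
            cols[idx] = patch.ravel()   # copy the 3D cube out as one flat row
            idx += 1
    return cols
```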
Now that we have the input image in matrix form, we do the same for each kernel’s weights, serializing the 3D cubes into the rows of the second matrix of the multiplication. Here is what the final GEMM looks like:
The “k” here is the number of values in each patch and in each kernel, so it is kernel width * kernel height * depth. The resulting matrix is “Number of patches” columns high by “Number of kernels” rows wide. This matrix is actually treated by subsequent operations as a three-dimensional array, with the number of kernels as the depth, and the patches then split back into rows and columns based on their original positions in the input image.
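Putting the pieces together, here is a sketch of the whole convolution-as-GEMM pipeline described above, with a compact inline version of the im2col step; all shapes are illustrative and the code is deliberately unoptimized:

```python
import numpy as np

H, W, D = 32, 32, 3                  # input image: height, width, depth
k, stride, num_kernels = 7, 1, 96    # kernel size, stride, number of kernels

image = np.random.rand(H, W, D)
kernels = np.random.rand(num_kernels, k, k, D)

# im2col: one flattened k*k*D patch per row.
patches = np.array([
    image[i:i + k, j:j + k, :].ravel()
    for i in range(0, H - k + 1, stride)
    for j in range(0, W - k + 1, stride)
])                                            # (num_patches, k*k*D)

weights = kernels.reshape(num_kernels, -1).T  # (k*k*D, num_kernels)

out = patches @ weights                       # the GEMM itself

# Reshape back into a 3D output image with num_kernels channels.
out_h = (H - k) // stride + 1
out_w = (W - k) // stride + 1
out_image = out.reshape(out_h, out_w, num_kernels)
print(out_image.shape)                        # (26, 26, 96)
```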
Why does GEMM work for convolution
Hopefully you can now see how the convolution layer can be expressed as a matrix multiplication, but it is still not clear why you would want to do it this way. The short answer is that the Fortran world of scientific programmers has spent decades optimizing code for large matrix-to-matrix multiplications, and the benefits of a very regular memory access pattern outweigh the cost of the wasted storage. This paper from Nvidia (linked at the end) gives a good introduction to several different approaches and also explains why they ended up with a modified version of GEMM as their favored one. There are also many advantages to batching large numbers of input images against the same kernels at once, and this paper on Caffe con Troll (also linked at the end) uses that to very good effect. The main competitor to the GEMM approach is to operate in frequency space using the Fourier transform, but the use of strides in convolution makes it hard to be as efficient.
The good news is that having a single, well-understood function (GEMM) take up most of our time gives a very clear path to optimizing for speed and power, whether through better software implementations or by tailoring hardware to run the operation well. Because deep networks have proven useful for a huge range of applications across speech, NLP, and computer vision, I expect to see massive improvements over the next few years, just as the widespread demand for 3D gaming drove a revolution in GPUs by forcing a revolution in vertex and pixel processing operations.
Paper: cuDNN: Efficient Primitives for Deep Learning
Address: arxiv.org/pdf/1410.07…
Paper: Caffe con Troll: Shallow Ideas to Speed Up Deep Learning
Address: arxiv.org/pdf/1504.04…
Link to original article:
Petewarden.com/2015/04/20/…
This article is from the paper sharing series of the public account CV Technical Guide.