Original author: Tim Dettmers (timdettmers.com/2015/03/26/…). Translator: kwii

Convolution is the most important concept in deep learning. It is convolution and convolutional networks that have brought deep learning to the forefront of almost all machine learning tasks. But why is convolution so powerful? How does it work? In this blog post, I will explain convolution and other related concepts to help you understand convolution in depth.

There are many blog posts about convolution on the Web, but I find that most of them are full of unnecessary mathematical detail that does little to deepen one's understanding of convolution in any meaningful way. While this post will also contain plenty of mathematical detail, I will use images to illustrate the underlying math and make sure everyone can follow it. The first part of this post is a general overview of convolution and of convolutional networks in deep learning. The second part contains some advanced concepts intended to help deep learning researchers and practitioners deepen their understanding of convolution.

Part I: What is convolution?

While the entire post is built around this question, it helps to know the general direction ahead of time. So, roughly speaking, what is convolution?

You can think of convolution as the mixing of information. Imagine two buckets full of information, one of which is poured into the other and mixed together according to a specific rule. Each bucket of information has its own recipe, which describes how the information from one bucket mixes with the information from the other. So convolution is an orderly process in which two sources of information intertwine.

Convolution can also be described mathematically. In fact, it is a mathematical operation, just like addition, subtraction, multiplication and division, and while the operation itself is complex, it is very useful for simplifying even more complex expressions. In physics and engineering, convolution is widely used to simplify complicated equations. In the second part, after a brief mathematical derivation of convolution, we will relate the concept of convolution in science to convolution in deep learning to gain a deeper understanding. But for now, let's look at convolution from a practical point of view.

How do we apply convolution to an image?

When we convolve an image (a two-dimensional convolution), we perform the convolution over the image's width and height. We mix two buckets of information: the first bucket is the input image, which consists of three matrices of pixels, one for each of the RGB color channels. The second bucket is the convolution kernel, a single matrix of floating-point numbers, whose size and values can be regarded as a recipe describing how to blend the input image with the kernel in the convolution operation. In deep learning, the output of the convolution is called a feature map: the image that results from the convolution operation. For an RGB image, there is one feature map per color channel.

We will now demonstrate how two pieces of information intersect through a convolution operation. One way to perform convolution is this: take a patch of the input image that is the same size as the convolution kernel. Say we have a 100×100 input image and a 3×3 kernel; then we take a 3×3 patch of the input image and multiply it element-wise with the kernel. The sum of these products gives one pixel of the feature map. After computing one pixel, we slide the window by one pixel in some direction and repeat the calculation. The convolution is finished once every pixel of the feature map has been computed in this way.
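The sliding-window procedure above can be sketched in a few lines of NumPy. This is a minimal, unoptimized sketch for a single channel (note that deep learning libraries usually implement cross-correlation, i.e. they do not flip the kernel, and that is what we do here); the function and variable names are my own for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding ("valid" mode),
    summing the element-wise products at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # same size as the kernel
            out[i, j] = np.sum(patch * kernel)  # one pixel of the feature map
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0   # a simple averaging (blur) kernel
fmap = conv2d_valid(image, kernel)
print(fmap.shape)  # (3, 3): a 5x5 image and a 3x3 kernel give a 3x3 feature map
```

Each output pixel is just one window-sized multiply-and-sum, which is why the feature map shrinks by `kernel_size - 1` in each dimension when no padding is used.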

Why is image convolution useful in machine learning?

There is a lot of distracting information in images that is irrelevant to the goal we are trying to achieve (such as image classification). A good example is a project I worked on with Jannek Thomas at the Burda Bootcamp. The Burda Bootcamp is a rapid-prototyping lab that creates technical products in a hackathon-style environment in a very short amount of time. Together with nine colleagues, we created eleven products in two months. In one project, I wanted to build a fashion-image search with an autoencoder: you upload an image of a piece of fashion clothing, and the autoencoder finds clothes with a similar style.

Now if you want to distinguish between different styles of clothes, the colors of the clothes are not very useful; details like the brand's logo are also unimportant. What matters most is probably the shape of the clothes. Typically, a woman's blouse has a completely different shape from a man's shirt, a jacket, or a pair of pants. So if we could filter the unnecessary information out of the images, our algorithm would not be distracted by irrelevant details like color and branding. We can easily achieve this filtering with image convolution.

My colleague Jannek Thomas preprocessed the data with a Sobel edge detector (a particular kind of convolution kernel) so that only the outline of each object's shape was retained. This is why applying convolution is often called filtering, and why convolution kernels are often called filters. The feature map produced by the edge detector is very useful if you want to distinguish between types of clothing, since only the relevant shape information remains.
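To make this concrete, here is a small sketch of Sobel filtering on a toy image. The two 3×3 Sobel kernels are standard; the tiny `conv2d_valid` helper and the toy image are illustrative choices of mine, not code from the project described above.

```python
import numpy as np

# Standard Sobel kernels: sobel_x responds to vertical edges
# (horizontal intensity changes), sobel_y to horizontal edges.
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])
sobel_y = sobel_x.T

def conv2d_valid(image, kernel):
    """Unpadded sliding-window convolution (no kernel flip)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 2:] = 1.0

gx = conv2d_valid(img, sobel_x)          # large values near the edge
gy = conv2d_valid(img, sobel_y)          # all zeros: no horizontal edges
magnitude = np.sqrt(gx**2 + gy**2)       # edge strength at each pixel
```

The gradient magnitude keeps only where the intensity changes, i.e. the object's outline, and discards flat regions such as uniform colors, which is exactly the "shape only" filtering described above.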

One step further: there are many different convolution kernels that produce different feature maps, for example kernels that sharpen an image (more detail) or blur it (less detail), and each feature map may help our algorithm do a better job at its particular task.

The process of taking the input, transforming it, and feeding the transformed input to an algorithm is called feature engineering. Feature engineering is very difficult, and there are few resources to help you learn this skill. As a result, very few people can skillfully apply feature engineering across different tasks. Feature engineering is done by hand, and it is the most important skill for scoring well in Kaggle competitions. It is so difficult because good features differ for each kind of data and each kind of problem: features that work for image tasks are of little use for time-series data. Even with two similar image tasks, it is not easy to extract good features, because the objects in the images determine which features will work and which will not. It takes a lot of experience.

So skill in feature engineering is very rare, and for every new kind of task you have to start from scratch. But when it comes to images, is there a way to automatically find the convolution kernels best suited to a new task?

Enter the convolutional network

A convolutional network is a way of finding convolution kernels automatically. Instead of assigning fixed numbers to our kernels, we give the kernels parameters that will be trained on the data. As we train the convolutional network, the kernels get better and better at filtering out the relevant information. This process is fully automatic and is called feature learning. Feature learning automatically produces kernels suited to each new task: we simply need to train our network to find filters that extract the relevant information. This is why convolutional networks are so powerful: none of the difficulties of feature engineering!

Generally, in a convolutional network we do not learn the parameters of just one kernel; we learn the parameters of many kernels at the same time, stacked in a hierarchy. For example, a stack of 32 kernels of size 16×16 (C×H×W = 32×16×16) applied to a 256×256 image generates 32 feature maps of size 241×241 (this is standard convolution, without padding). So we automatically learn 32 new features that hold the information relevant to our task. These feature maps then serve as input to the next layer of kernels. Once we have learned our hierarchical features, we simply pass them to a fully connected layer, a simple neural network that combines them in order to classify the input image. That is nearly all you need to know about convolutional networks at a conceptual level (pooling/subsampling is also important, but that is a topic for another blog post).
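The 241×241 figure above comes from the standard output-size arithmetic for unpadded convolution. A quick sketch (the helper name is mine; the formula itself is the usual one, which also covers padding and stride):

```python
def conv_output_size(in_size, kernel_size, padding=0, stride=1):
    """Spatial size of one feature-map dimension after a convolution."""
    return (in_size + 2 * padding - kernel_size) // stride + 1

# 32 kernels of size 16x16 on a 256x256 image, no padding, stride 1:
n_kernels, k = 32, 16
side = conv_output_size(256, k)    # 256 - 16 + 1 = 241
print(n_kernels, side, side)       # 32 feature maps, each 241 x 241
```

With padding of `(kernel_size - 1) // 2` and stride 1, the same formula shows why "same" convolutions preserve the input size for odd kernels.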

Part II: Advanced Concepts

Translator's note: this part is very cool; it will be translated when time permits.