Articles in this series:

Once inside the marquis's gate, it is "deep" as the sea: just how deep is deep learning? (Introduction to Deep Learning, Part 1)

The quest of artificial "carbon" is not yet done; what intelligent "silicon" will bring is still unknown (Introduction to Deep Learning, Part 2)

Neural networks defy description, but the M-P model points the way (Introduction to Deep Learning, Part 3)

The triple gates of "machine learning", and the "golden mean" that makes it more human (Introduction to Deep Learning, Part 4)

Hello World perceptron: only by understanding you can my heart rest (Introduction to Deep Learning, Part 5)

A loss function for "weight loss": how a neural network adjusts its weights (Introduction to Deep Learning, Part 6)

To descend the fastest way, ask the gradient (Introduction to Deep Learning, Part 7)

The BP algorithm transmits in both directions; chain-rule differentiation ties it all together (Introduction to Deep Learning, Part 8)

————————————————————————————

Where full connection gets stuck, convolutional networks see deeper (Introduction to Deep Learning, Part 9)

Continuing from the previous installment.

Abstract: Lower animals' eyes grow on the two sides of the head, giving them a broad field of vision that makes it easy to avoid danger. Human eyes, by contrast, both face forward, leaving blind spots in the field of vision, so safety is harder to guarantee. Why, then, did humans alone become "advanced" and evolve into masters of the earth? Going further: is it better to be broad and shallow, or deep and focused? And what does any of this have to do with convolutional neural networks? Read on; history will tell you.

Background

In the previous article, we introduced the Back Propagation (BP) algorithm. In essence, a BP network is a fully connected neural network. The BP algorithm has had many successful applications, but it can only be applied to "shallow" networks; this shallowness limits its representational capacity, which in turn limits its range of applications.

Why is it hard for such a network to go "deep"? A large part of the problem lies in its "full connectivity". But is full connectivity really a bad thing? It seems more comprehensive. Can comprehensiveness be a defect?

We will set that question aside for now; by the time you finish this series, the answer will be clear. In this chapter, we discuss a far more widely used network: the Convolutional Neural Network (CNN). It works marvellously on a wide range of tasks, from image recognition to speech recognition (think of GoogLeNet and Microsoft's ResNet). Deep learning has exploded in recent years, with CNNs leading the charge.

But why are CNNs so powerful? The answer lies in history. The famous anthropologist Fei Xiaotong once pointed out [1] that what we call the "present" contains the projection and accumulation of selections drawn from "past" history. History is not an ornament that can be dispensed with, but a practical and indispensable basis for moving forward.

So let us talk a little about the history of convolutional neural networks, and hopefully draw some inspiration from it. Before going back into history, though, consider this "seemingly beside-the-point" question: why do almost all lower animals have eyes on the sides of their heads?

9.1 Where is the eye? Where is the way?

Indeed, if you look closely, lower animals tend to have eyes on both sides of the head. From the perspective of evolution, "natural selection, survival of the fittest": nature has a reason for this arrangement. One explanation is that this "creative arrangement" lets lower animals see up, down, left, right, front and back all at the same time, leaving no blind spot. It is an extremely safe configuration, and with that security they are better able to survive on earth.

But what are the limitations of such a configuration? Compared with the lower animals, human eyes both sit at the front of the face. Is it not a drawback that, configured this way, we cannot see everything around us? Yet the fact remains that only humans evolved into the most "advanced" animals on the planet.

One explanation goes like this (its value lies more in the imagery than in the biology, so there is no need to take it literally): the side-effect of the lower animals' blind-spot-free eye configuration, which lets them observe comprehensively, is that they cannot fix their gaze on one place, and so cannot look at anything carefully or for long, and therefore never evolve the capacity for deep thought. Humans, on the other hand, precisely because of an eye "defect" that accepts blind spots in the field of vision, are able to look straight ahead and gain deep insight into what is in front of them. The "advanced" animal was thus "trained" into being.

In other words, a local but deep insight sometimes beats a comprehensive but superficial observation. Consider the famous "butterfly effect", which takes a holistic view of the weather: the occasional flutter of a butterfly's wings in the Amazon rainforest may cause a tornado in Texas two weeks later. But in real life, who has ever actually used the butterfly effect to change the weather? Whereas all of us have waved a palm-leaf fan, using a purely local flow of air, to cool ourselves in the summer heat.

But what does all this have to do with today's topic, convolutional neural networks? There is indeed a connection, and the connection is methodological, as I shall explain.

9.2 History of convolutional neural networks

We know that the "advanced" characteristics of animals are embodied in their behavior, and at a deeper level in the evolution of the cerebral cortex. In 1968, the neurobiologists David Hunter Hubel and Torsten N. Wiesel, studying visual information processing in animals (first cats, then our close relatives, monkeys), made two important and interesting discoveries [2]: (1) In visual coding, neurons in the animal cerebral cortex have local receptive fields; specifically, they are locally sensitive and orientation-selective (as shown in Figure 9-1). (2) The animal cerebral cortex is hierarchical and layered. The visual cortex contains cells of different types, such as simple cells, complex cells, and hypercomplex cells, and these different cell types perform visual perception at different levels of abstraction.

(Figure 9-1 Hubel and Wiesel's classic paper)

It was for these important physiological discoveries that Hubel and Wiesel won the Nobel Prize in Physiology or Medicine in 1981. The significance of the discoveries is not limited to physiology: they also indirectly contributed, some fifty years later, to breakthrough developments in artificial intelligence.

What Hubel et al.'s work implies for AI is that an artificial neural network need not adopt a "fully connected" pattern among its neurons, which greatly reduces the network's complexity. Inspired by this idea, the Japanese scholar Kunihiko Fukushima proposed the neocognitron model in 1980 [3], a neural network trained by unsupervised learning that is, in effect, the embryonic form of the convolutional neural network (as shown in Figure 9-2). As Figure 9-2 shows, the neocognitron borrows concepts proposed by Hubel et al., such as the layering of visual areas and higher-level association.

(Figure 9-2 Fukushima's neocognitron paper)

After that, many computer scientists studied and improved the neocognitron in depth, but the results were not satisfactory. It was not until 1990 that Yann LeCun et al., then at AT&T Bell Labs, applied the supervised back-propagation (BP) algorithm to the architecture proposed by Kunihiko Fukushima et al., thereby laying down the structure of the modern CNN [4].

Compared with traditional image-processing algorithms, the CNN proposed by LeCun avoids complex image pre-processing (that is, the large amount of hand-crafted feature extraction): it can take the raw image, with very little pre-processing, discover visual patterns directly from it, and then carry out the recognition and classification task. This is what "end-to-end" means. LeCun et al. brought the error rate on handwritten postal codes down to about 5% (see Figure 9-3). With a solid theoretical foundation and successful commercial cases, convolutional neural networks took off in both academia and industry.

(Figure 9-3 LeCun et al.'s convolutional neural network for recognizing handwritten postal codes)

LeCun was not shy about self-promotion, either: he named his work LeNet, which went through several revisions and finally settled as LeNet-5. The LeNet architecture was all the rage at the time, but its core business was character recognition tasks, such as reading zip codes and digits.

The puzzle is this: some 30 years have passed since 1990. Why has a 30-year-old technology suddenly come back into vogue in the form of deep learning?

Andrew Ng has a vivid metaphor for deep learning: the process of deep learning is like launching a rocket. A rocket needs two things to fly: an engine and fuel. For deep learning, the engine is "big computing" and the fuel is "big data".

LeCun et al. proposed the CNN some 30 years ago, but its performance was severely limited by the environment of the time: there was no large-scale training data, and computing power could not keep up, so training a CNN was extremely time-consuming and its recognition performance was limited.

Meanwhile, Vladimir Vapnik, LeCun's old rival in the same lab, developed and extended the Support Vector Machine (SVM). And this SVM was formidable: by 1998 it had brought the recognition error rate on similar tasks down to as low as 0.8%, far better than the convolutional neural networks of the day. And so, as the SVM rose, neural-network research sank into a new low tide.

Someone once rendered Yann LeCun's name in Chinese as "Yang Lechun", in which "Lechun" sounds like "happy spring". Fittingly, once "big computing" and "big data" were no longer problems, LeCun greeted another "spring" after 30 years. Today he attends major conferences as a master of deep learning, delivering one keynote after another; his personal experience is a vivid illustration of the saying that circumstances are stronger than people.

LeNet, proposed by LeCun, played an important role in advancing deep learning. Later convolutional neural networks made many improvements on the basis of LeNet-5, such as the new activation function (ReLU) adopted by Hinton's group in 2012. The structure of today's mainstream convolutional neural networks is shown in Figure 9-4; its essence boils down to three core operations and three core concepts. The three operations are convolution, pooling, and ReLU; the three concepts are the local receptive field, weight sharing, and subsampling.

(Figure 9-4 Basic structure of CNN)

In the following chapters, we will introduce each of these concepts vividly in turn, so stay tuned.

9.3 Summary and Thinking

In this section, we mainly reviewed the development history of convolutional neural networks. From the historical thread we can draw a vivid conclusion: deep, focused insight may be better than broad but superficial observation. This also partly explains, from a methodological point of view, the success of "deep learning".

Going further, consider this: the layout of the human eye is "flawed" in that it creates insecurity. Li Xiaolai has argued that only by giving up part of one's sense of security can one make progress. To compensate for this disadvantage, people can turn it into an advantage through effective social interaction and collaboration.

Social interaction, in turn, gave rise to communities; but once a community grows beyond roughly 150 people, its members simply cannot maintain social ties with everyone. In Sapiens: A Brief History of Humankind, the rising-star historian Yuval Harari argues that humans need to "tell stories" in order to collaborate at large scale. Do you agree with this view? Why or why not?

Leave a comment and share your thoughts. May you gain something new every day.

References

[1] Fei Xiaotong. Rural China. Peking University Press, October 2012.

[2] Hubel D H, Wiesel T N. Receptive fields and functional architecture of monkey striate cortex[J]. The Journal of Physiology, 1968, 195(1): 215-243.

[3] Fukushima K, Miyake S. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition[M]//Competition and cooperation in neural nets. Springer, Berlin, Heidelberg, 1982: 267-285.

[4] LeCun Y, Boser B E, Denker J S, et al. Handwritten digit recognition with a back-propagation network[C]//Advances in neural information processing systems. 1990: 396-404.

Zhang Yuhong is the author of Savoring Big Data. I am "Theme Song Brother".

——————— gorgeous dividing line ———————

What the wind "rolls up" becomes a picture (Introduction to Deep Learning, Part 10)

Abstract: "This feeling may wait to become a memory." But what exactly is "memory"? If I told you that "memory" is a convolution, would you believe me? Convolution is no mystery: it is in your life and mine, and it is in deep learning! This may be the most accessible introduction to convolution ever written, so read on.

In the previous chapter, we briefly introduced the ins and outs of convolutional neural networks. Now let us look at the core elements behind their success. Convolutional neural networks take their name from the convolution operation, so when talking about them, the first concept to pin down is: what is convolution?

10.1 Mathematical definition of convolution

Leaving aside the context of convolutional neural networks, "convolution" is in fact a standard mathematical concept. As early as Section 3.4, we mentioned it: a convolution is nothing more than the weighted "superposition" of one function with another along some dimension [1]. To better understand the mathematical meaning of the convolution operation, let us work through a concrete case [2].

Suppose our mission is to monitor a spacecraft in real time. The spacecraft carries a laser transmitter that outputs a signal f(t) at every moment t, where f(t) represents the spacecraft's position at time t. In practice, the laser signal is inevitably mixed with some noise. To measure the position of the spacecraft more accurately, we need to reduce the influence of this noise, which means smoothing the acquired position signal.

Obviously, among the neighboring measurements, those close to the current moment should have a greater influence on the smoothed result (and be assigned a larger weight), while those further from the current moment should have less influence (and be assigned a smaller weight). If g is such a weighting function, the weighted average position of the spacecraft, s(t), can be expressed as formula (10.1):
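$$ s(t) = \int_{-\infty}^{\infty} f(a)\, g(t-a)\, da \tag{10.1} $$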

An operation of this kind is called a convolution over the continuous domain. It is often written compactly as formula (10.2):
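$$ s(t) = (f * g)(t) \tag{10.2} $$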

In formula (10.2), the function f is usually called the input, g is called the filter or convolution kernel, and the result of superposing the two functions is called the feature map.

In theory, the input function can be continuous, and the convolution is then obtained by integration. In practice, however, almost all computers today are digital and cannot process continuous (analog) signals, so we need to discretize the continuous function.

In general, we do not record data at every instant; instead, we take samples at a certain time interval (that is, at a certain frequency). There is a theoretical basis for doing so: by the Shannon sampling theorem, as long as the sampling frequency is no less than twice the highest frequency in the analog signal's spectrum, the analog signal can be reconstructed without distortion. For discrete signals, the convolution operation can be expressed as formula (10.3):
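$$ s(t) = (f * g)(t) = \sum_{a=-\infty}^{\infty} f(a)\, g(t-a) \tag{10.3} $$

As a minimal illustration of formula (10.3), the sketch below smooths a one-dimensional signal with NumPy's built-in discrete convolution. The "position" signal and the weighting function here are made up purely for the example:

```python
import numpy as np

# A synthetic noisy "position" signal f(t): a smooth path plus noise.
t = np.linspace(0, 10, 200)
f = np.sin(t) + 0.3 * np.random.randn(t.size)

# A short weighting function g; the weights sum to 1 so the
# smoothed signal stays on the same scale as the input.
g = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# Discrete convolution as in formula (10.3); mode="same" keeps
# the output the same length as the input.
s = np.convolve(f, g, mode="same")
print(f.shape, s.shape)  # (200,) (200,)
```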

Naturally, the definition of discrete convolution extends to higher dimensions. In two dimensions, for example, it can be expressed as formula (10.4):
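$$ S(i, j) = (I * K)(i, j) = \sum_{m}\sum_{n} I(m, n)\, K(i-m,\, j-n) \tag{10.4} $$

Here $I$ denotes the two-dimensional input (for example, an image) and $K$ denotes the two-dimensional convolution kernel.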

10.2 Convolution in Life

The concept of convolution may seem abstract. Fortunately, theory arises from the induction and abstraction of reality, so to help our understanding we can run the process in reverse and use real-life examples to illustrate the concept.

As described above, a convolution is the weighted superposition of one function with another; put more plainly, it is the "blending" of two processes. Functions are abstract, but life is concrete, and we can easily find convolutions in daily life that explain the concept more vividly. In this respect, Academician Li Deyi is a master.

At an invited session of the 2015 China Computer Conference, the author had the honor of hearing the keynote speech delivered by Academician Li Deyi, president of the Chinese Association for Artificial Intelligence. In his talk, Academician Li offered a very interesting way of understanding convolution [3].

What is convolution? Suppose, he said, you keep bending a piece of wire at the same spot. If the heat-generation function is f(t) and the heat-dissipation function is g(t), then the temperature at that spot is the convolution of f(t) and g(t). Similarly, in a given environment, if the sound-source function of a sounding body is f(t) and the environment's reflection function is g(t), then the sound heard in that environment is the convolution of f(t) and g(t).

Likewise, memory is the result of a convolution. Suppose the cognition function is f(t), representing the understanding and digestion of what one encounters, and the forgetting function that acts over time is g(t); then the memory function h(t) in the human brain is the convolution of f(t) and g(t), which can be written as follows.
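$$ h(t) = (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau $$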

Finally, Academician Li remarked that if we computing professionals want to understand convolution, we should understand convolutional neural networks. That is exactly where we are heading, so let us now return from convolution in life to convolutional neural networks.

10.3 Convolution in image processing

Image recognition is the home ground of convolutional neural networks, so let us take image processing as the example to illustrate the role of convolution.

Looking at the images shown in Figure 10-1, a person can easily recognize the digit "8" and a cat. But what a computer sees is just a matrix of numbers (each element a pixel value between 0 and 255), and deciding whether that matrix is the digit 8 or a cat is up to the computer's algorithm. That is exactly where artificial intelligence comes in.

(Figure 10-1 An image as seen through the computer's "eye")

In the matrix shown in Figure 10-1, each element represents the brightness of a pixel: 0 is black, 255 is white, and the smaller the number, the closer to black. In a grayscale image, each pixel value represents the intensity of a single color; in other words, the image has only one channel. A color image typically has three channels, namely RGB (red, green, blue), in which case the image is described by stacking three matrices of pixel values, one per channel, as the sketch below shows.
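As a minimal illustration (the pixel values below are arbitrary), here is how the one-channel and three-channel representations look as arrays:

```python
import numpy as np

# A tiny grayscale image: a single channel, each entry a brightness in [0, 255].
gray = np.array([[  0, 128, 255],
                 [ 64, 192,  32]], dtype=np.uint8)
print(gray.shape)  # (2, 3): height x width, one channel

# A color image stacks three such matrices, one per RGB channel.
rgb = np.stack([gray, gray, gray], axis=-1)
print(rgb.shape)   # (2, 3, 3): height x width x channels
```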

The main purpose of the convolution operation in image processing is to extract features from the image. Convolution can learn image features from small patches of the input data matrix (that is, small pieces of the image) while preserving the spatial relationships between pixels. The following example illustrates how convolution is used on a two-dimensional image.

In Figure 10-2, to make it easier to follow, the pixel values of the image matrix are replaced by letters such as a, b, c, d, and the convolution kernel is a small 2×2 matrix. Note that in other contexts this small matrix is also called a "filter" or a "feature detector".

If we apply the convolution kernel to the input image's data matrix, performing the convolution from left to right and from top to bottom, we obtain the feature map of the image. In academic papers, feature maps are also referred to as "convolved features" or "activation maps".

(Figure 10-2 Example of a convolution operation on 2-D image data)

As Figure 10-2 shows, discrete convolution is essentially a linear operation, so such convolution operations are also known as linear filtering. "Linear" means that each pixel is replaced by a linear combination of its neighbors. The convolution operation is also shift-invariant, meaning that the very same operation is performed at every position in the image.

This process is not easy to grasp in the abstract, so let us illustrate it with a simpler, step-by-step example. As mentioned above, every image can be viewed as a matrix of pixel values; for a grayscale image, the values range from 0 to 255. For simplicity, consider a minimal 5×5 image whose pixel values are either 0 or 1, and likewise a minimal 3×3 convolution kernel, as shown in Figure 10-3.

(Figure 10-3 Simplified image matrix and convolution kernel)

Now let us see how the convolution is computed. We slide the orange matrix (the kernel) over the original image (shown in green) from left to right and from top to bottom, one pixel at a time; this step size is called the "stride". At each position, we compute the element-wise products of the two matrices and store their sum in the corresponding cell of the output matrix (shown in pink), thereby obtaining the feature map (or convolved feature) matrix [5]. A minimal code sketch of this computation follows Figure 10-4.


(Figure 10-4 How the convolution is carried out)
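To make the sliding-window computation concrete, here is a minimal NumPy sketch of the process just described. The `conv2d` helper is written only for this example, and the 5×5 image and 3×3 kernel are illustrative stand-ins for the values in Figure 10-3:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image left to right, top to bottom,
    summing the element-wise products at each position (no padding).
    Like most deep learning libraries, we do not flip the kernel."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.array([[1, 1, 1, 0, 0],        # the green 5x5 image
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],             # the orange 3x3 kernel
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d(image, kernel))
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```

With a stride of 1 and no padding, the 5×5 input and 3×3 kernel yield a 3×3 feature map, matching the pink output matrix of Figure 10-4.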

10.4 Application of convolution in image processing

So far we have only performed some simple matrix operations, and the benefit may not yet be obvious. Put simply, the purpose is to "convolve" the pixel values of adjacent sub-regions of the image with the convolution kernel, thereby capturing the statistical relationships among neighboring pixels and mining important features of the image.

That still sounds abstract, so what exactly are these features? Next, we use several image examples to illustrate the concept vividly [6], as shown in Figure 10-5.

(Figure 10-5 The "magic" of convolution kernels)

Here is a brief introduction to some commonly used, "time-tested" convolution kernels; a sketch that applies all of them follows Figure 10-6.

(1) Identity. As Figure 10-5 shows, this filter does nothing: the resulting image is identical to the original. The kernel has the value 1 only at its center, and the weights of all neighboring points are 0, so the neighbors contribute nothing to the filtered value.

(2) Edge detection, using a Laplacian-type operator. Note that the elements of this kernel matrix sum to 0 (the middle element is 8 and the eight surrounding elements sum to -8), so the filtered image is dark almost everywhere, with only the edges appearing bright.

(3) Image sharpening filter. Sharpening is similar to edge detection: find the edges first, then add them back onto the original image. This enhances the image's edges and makes it look sharper.

(4) Box blur. Every element of this kernel matrix is 1, and the result is divided by 9, so the current pixel is averaged with the pixels in its neighborhood. Mean blur is simple, but the result is not always smooth enough; Gaussian blur, widely used for image denoising, can be used instead.

In fact, there are many other interesting convolution kernels. The emboss filter, for example, creates an artistic 3-D shadow effect on an image, as shown in Figure 10-6. The emboss kernel subtracts the pixel values on one side of the center from those on the other side. The convolution may therefore produce negative pixel values; we can treat negative values as shadow and positive values as light, and then add a fixed offset to the resulting image.

(Figure 10-6 Applying the emboss kernel)
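To tie this section together, here is a minimal sketch that applies the kernels described above with SciPy. The kernel values are the commonly used versions of the filters named in Figure 10-5 (the figure itself is not reproduced here), and the gradient image is a stand-in for real image data:

```python
import numpy as np
from scipy.ndimage import convolve

kernels = {
    # (1) identity: 1 at the center, 0 elsewhere -- output equals input
    "identity": np.array([[0, 0, 0],
                          [0, 1, 0],
                          [0, 0, 0]]),
    # (2) edge detection: elements sum to 0, so flat regions go dark
    "edge": np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]]),
    # (3) sharpen: edge response added back onto the original pixel
    "sharpen": np.array([[ 0, -1,  0],
                         [-1,  5, -1],
                         [ 0, -1,  0]]),
    # (4) box blur: average of the 3x3 neighborhood
    "box_blur": np.ones((3, 3)) / 9.0,
    # emboss: subtracts one side of the neighborhood from the other
    "emboss": np.array([[-2, -1, 0],
                        [-1,  1, 1],
                        [ 0,  1, 2]]),
}

# A stand-in grayscale image: a simple horizontal gradient.
image = np.tile(np.arange(0, 256, 4, dtype=float), (64, 1))

for name, k in kernels.items():
    out = convolve(image, k, mode="reflect")
    print(f"{name:8s} -> output range [{out.min():.0f}, {out.max():.0f}]")
```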

10.5 Summary

To summarize this chapter: we first gave the mathematical definition of convolution, then explained the concept in reverse with analogous cases from everyday life, and finally demonstrated the application of convolution in image processing with several famous convolution kernels.

In the following chapters, we will introduce the important structures of the convolutional neural network in detail, including the convolutional layer, the activation layer (ReLU), the pooling layer, and the fully connected layer.

Please stay tuned.

10.6 Questions to Think About

Based on what you have learned so far, please consider the following questions:

(1) How is the "distributed feature representation" we so often speak of reflected in a convolutional neural network?

(2) Besides the common convolution kernels described in this article, what other convolution kernels are commonly used in image processing?

(3) Computer-generated painting is very popular nowadays. Whether it is Google's Inceptionism (also known as "DeepDream") [7] or the "Deep Style" currently used by David Aslan [8] (as shown in Figure 10-7), these artistic painting styles are based on neural networks. Do you know what kinds of convolution kernels they use?

(Figure 10-7 A painting in the Deep Style)

Write down your thoughts; may you gain something new every day!

References

[1] Zhang Yuhong. Yunqi Community. Neural networks defy description, but the M-P model points the way (Introduction to Deep Learning, Part 3)

[2] Anbu Huang. Deep Learning. China Industry and Information Technology Press, June 2017.

[3] Li Deyi. From Brain Cognition to Artificial Intelligence. China Computer Conference, October 2015.

[4] Savan Visalpara. How do computers see an image?

[5] Feature extraction using convolution

[6] Ujjwal Karn. An Intuitive Explanation of Convolutional Neural Networks

[7] Alexander Mordvintsev, Christopher Olah, Mike Tyka. Inceptionism: Going Deeper into Neural Networks.

[8] David Aslan. How Artists Can Use Neural Networks to Make Art

Zhang Yuhong is the author of Savoring Big Data. I am "Theme Song Brother".

(To be continued)

This series is still being written. To continue learning, please follow the Alibaba Cloud Yunqi Community's Zhihu account: Alibaba Cloud Yunqi Community – Zhihu