A brief history of deep learning

Starting with the schools of machine learning

If you want to divide machine learning into schools of thought, the first split is between “inductive learning” and “statistical learning”. Inductive learning resembles the way we ourselves learn by induction and is also called “learning by example”. It splits further into two camps. One camp decomposes knowledge into discrete points and learns them one by one, much as we summarize knowledge points when studying; because the knowledge is ultimately represented as symbols, this is called “symbolist learning”. The other camp takes a different approach: it does not care what the knowledge is, but instead simulates how the human brain learns, imitating whatever the brain does. Because this approach mimics the human nervous system with networks of connected neurons, it is called “connectionist learning”. “Statistical learning” is a newer school that emerged in the 1990s. It applies mathematics and statistics, caring neither about what is being learned nor about simulating the brain, only about statistical probability. It is an approach that discards the subjective and relies almost entirely on the objective.

Connectionist school

The connectionist school’s original idea was to learn like the human brain.

Let’s start with a bit of physiology and look at the most basic building block of the human brain: the neuron.

As shown in the figure above, a neuron is made up of three main parts: the cell body in the middle, the dendrites surrounding it that receive signals, and a long axon that transmits signals to distant cells.

After a nerve cell receives signals through its dendrites, a chemical reaction decides whether to pass a signal on to other cells through the axon. For example, when sensory cells in the skin are stimulated, they send signals to the dendrites of nearby nerve cells; once the stimulus reaches a certain level, each nerve cell passes the signal to the next through its axon, all the way to the brain. After the brain responds, the axons of motor neurons stimulate the muscles to react.

One thing worth mentioning here is Hebbian theory, proposed by the Canadian psychologist Donald Hebb in The Organization of Behavior, published in 1949. It states that if a neuron B is near the axon of another neuron A and is repeatedly activated by A’s signals, then A or B will undergo growth changes that strengthen the connection between them. It was 51 years later, in 2000, that this theory was confirmed by the animal experiments of Nobel laureate Eric Kandel. But long before it was proven, Hebb’s rule was already widely used: many unsupervised machine learning algorithms are variations of it.

The M-P neuron model

In 1943, six years before Hebb’s principle was put forward, electronic computers had not yet been invented, and it was still seven years before our great idol Alan Turing would propose the famous Turing test. That year, two legendary figures, McCulloch and Pitts, published a paper on simulating neural networks with algorithms. Pitts was only 20 years old!

Pitts came from a difficult background. At 15 he ran away from home in a fit of anger when his father told him to quit school. By then he had already finished Russell’s Principia Mathematica, a university text. Russell later recommended Pitts to the famous philosopher Carnap, a leading figure of the Vienna Circle (we will meet Carnap again when we discuss inductive learning and inductive logic). Carnap sent his philosophical work, The Logical Syntax of Language, to Pitts, then a junior high school student, who finished it in a month. The astonished Carnap then invited Pitts to the University of Chicago… to clean toilets! Later McCulloch, a physician and neuroscientist, needed a mathematical collaborator for his work in neuroscience, and he chose Pitts, the 17-year-old janitor. The two later became students of Wiener, the founder of cybernetics. Pitts fell out with Wiener over a rumor and died young at 46. The foundation of neural networks is still the model McCulloch and Pitts proposed, called the M-P model.

The perceptron – the first rise and fall of artificial neural networks

In 1954, IBM introduced the IBM 704, the computer for which the algorithmic language Fortran was developed. Four years later, in 1958, Frank Rosenblatt, an experimental psychologist at Cornell University, built the first artificial neural network model, the perceptron, on the basis of the M-P model. The perceptron gave humanity its first model capable of simulating the neural activity of the brain. It quickly caused a sensation and ushered in the first high tide of artificial neural networks.

The perceptron model is shown in the figure below:



The perceptron consists of three parts:

  1. Input: the input signals and their strengths (weights)
  2. Sum: adds up the weighted inputs
  3. Activation function: determines the output value based on the summed result

The great thing about the perceptron is that it needs no prior knowledge; it only requires that the problem you are trying to solve can be split into two parts by a straight line. Such problems are called linearly separable. For example, if some buildings stand north of Chang’an Avenue and some south of it, the perceptron can separate the two groups of buildings, even though it has no idea what Chang’an Avenue is or which way is east or west.
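The perceptron’s three components and its training rule can be sketched in a few lines of NumPy. This is an illustrative toy, not Rosenblatt’s original implementation; the AND function is chosen here as an example of a linearly separable problem:

```python
# A minimal perceptron sketch in NumPy: inputs are weighted and summed, a
# step activation decides the output, and the classic error-correction rule
# nudges the weights whenever a prediction is wrong.
import numpy as np

def step(z):
    return 1 if z > 0 else 0  # step activation function

def train_perceptron(X, y, lr=0.1, epochs=20):
    w = np.zeros(X.shape[1])  # input weights
    b = 0.0                   # bias
    for _ in range(epochs):
        for xi, target in zip(X, y):
            error = target - step(xi @ w + b)
            w += lr * error * xi  # adjust weights only on error
            b += lr * error
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])  # AND: separable by one straight line
w, b = train_perceptron(X, y_and)
preds = [step(xi @ w + b) for xi in X]
print(preds)  # [0, 0, 0, 1]
```

Because AND can be split by a single line, the simple error-correction rule is guaranteed to converge here.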

As shown above, because the x’s and o’s can be separated by a straight line, the perceptron model can solve this problem.

But if the red and blue dots cannot be separated by a single straight line, as shown here, a perceptron cannot tell them apart.

Unlike Pitts the janitor, Rosenblatt was an elite student. His alma mater, the Bronx High School of Science in New York, has produced eight Nobel laureates and six Pulitzer Prize winners. At that same school, one year ahead of him, was Marvin Minsky, one of the founders of artificial intelligence. Just as the perceptron was at the height of its popularity, Minsky published his famous book Perceptrons, proving that the perceptron could not solve even one of the most basic logical operations, XOR. Because the XOR problem is not linearly separable and requires two straight lines, the perceptron model really cannot solve it. With this fatal blow, the first high tide of artificial neural networks was quickly pushed into a trough.

It is worth mentioning, however, that as deep learning later developed, it had less and less to do with simulating the human brain. Many in academia therefore think it should not be called an “artificial neural network” at all, and might better be called the multilayer perceptron (MLP).

The second rise and fall of artificial neural networks

Could multiple perceptrons be combined to solve problems that a single perceptron cannot? Yes. In 1974, the Harvard student Paul Werbos proposed the back propagation algorithm (BP algorithm for short) in his doctoral thesis, successfully solving the problem that the perceptron could not implement XOR. The idea, roughly, is that if one line is not enough, just add another. But this was the first trough of artificial neural networks, and even a Harvard graduate attracted no interest; this important achievement had little impact at the time.

In 1984, a decade after Werbos’s thesis and the year Jobs famously introduced Apple’s first Mac, Hopfield, a Caltech physicist, implemented the recurrent neural network model he had proposed two years earlier. This important achievement rekindled enthusiasm for artificial neural networks. Two years later, in 1986, the second artificial neural network boom rediscovered the BP algorithm that Werbos had proposed, which further advanced the field. The perceptron had been limited to a tiny two-layer network; the BP algorithm made networks with more layers and more neurons practical. The basic idea of the BP algorithm is: 1. the signal propagates forward; 2. the error propagates backward to every neuron in the layer above.
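The two steps above can be sketched in NumPy. This is an illustrative toy, not Werbos’s original formulation; the hidden-layer size, seed, and learning rate are arbitrary choices. With one hidden layer (“one more line”), the network learns XOR, which a single perceptron cannot:

```python
# A minimal back propagation (BP) sketch in NumPy.
# Step 1: the signal propagates forward through the layers.
# Step 2: the error propagates backward, layer by layer, to update weights.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR truth table

W1 = rng.normal(size=(2, 8))  # input -> hidden (8 hidden units, arbitrary)
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))  # hidden -> output
b2 = np.zeros(1)
lr = 1.0

for _ in range(10000):
    # 1. forward propagation of the signal
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. backward propagation of the error to each earlier layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent weight updates
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

preds = (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int).ravel()
print(preds.tolist())
```

The hidden layer lets the network carve the plane with more than one line, which is exactly what XOR requires.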

The plain fully connected network we built in the first lecture is technology from this era. Let’s review:

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

def model(X, w_h, w_o):
    h = tf.nn.sigmoid(tf.matmul(X, w_h))  # hidden layer with sigmoid activation
    return tf.matmul(h, w_o)

Functions related to artificial neural networks, including activation functions and convolutions, are defined in TensorFlow’s tf.nn module.

With the BP algorithm, neural networks were successfully trained up to five layers. Beyond five layers, however, difficulties arose that puzzled researchers for the next 20 years. The difficulty had two main aspects. First, as the number of layers increases, the error fed back to the earlier layers has less and less effect, so they barely learn. Second, as the number of layers increases, training easily gets stuck at a local optimum and cannot improve further.
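The first difficulty can be illustrated with a quick calculation (the numbers below are hypothetical, chosen only to show the trend): the sigmoid’s derivative never exceeds 0.25, so the error signal fed back through many sigmoid layers shrinks roughly geometrically, and the earliest layers barely learn at all.

```python
# The sigmoid derivative s * (1 - s) peaks at z = 0, where it equals 0.25.
# Back-propagated error is multiplied by one such factor per layer, so even
# in the best case the feedback reaching layer n scales like 0.25 ** n.
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the maximum possible value

for n in (1, 5, 10, 20):
    print(n, 0.25 ** n)  # the feedback shrinks geometrically with depth
```

By ten layers the factor is already below one in a million, which is why early deep networks stalled.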

Faced with this difficulty, most researchers turned to making breakthroughs with fewer layers. As mentioned above, the other major school of machine learning, “statistical learning”, made breakthrough progress in this era; its representative achievement is the support vector machine (SVM).

The age of deep learning

But a few researchers held their ground after the second wave of artificial neural networks receded. Twenty years later, in 2006, Geoffrey Hinton, working in Canada, proposed an effective training method for multilayer neural networks: pre-train each layer as a restricted Boltzmann machine with unsupervised learning to extract features, then fine-tune the whole network with the BP algorithm. In this way, restricted Boltzmann machines can be stacked as high as building blocks. Networks constructed from them are called deep belief networks (DBNs), and this approach built on deep belief networks came to be known as “deep learning”.

Of course, Hinton wasn’t alone. He had a postdoctoral fellow named Yann LeCun. In 1989, three years after the rediscovery of the BP algorithm, LeCun successfully applied it to the convolutional neural network (CNN). After nearly ten years of effort, LeCun’s LeNet matured in 1998. But note the timing: this was before Hinton changed the world in 2006, when the king of machine learning was still the support vector machine (SVM).

However, opportunity favors the prepared. On the one hand, key technical problems for CNNs, such as ReLU and Dropout, were solved one by one. On the other hand, the breakthrough in computing power brought by big data and cloud computing allowed CNNs to take on tasks that were previously unimaginable.

In Lecture 1, we saw ReLU and Dropout applied to a simple fully connected network with hidden layers:

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

def model(X, w_h, w_h2, w_o, p_keep_input, p_keep_hidden):
    X = tf.nn.dropout(X, p_keep_input)     # dropout on the input layer
    h = tf.nn.relu(tf.matmul(X, w_h))      # first hidden layer, ReLU
    h = tf.nn.dropout(h, p_keep_hidden)
    h2 = tf.nn.relu(tf.matmul(h, w_h2))    # second hidden layer, ReLU
    h2 = tf.nn.dropout(h2, p_keep_hidden)
    return tf.matmul(h2, w_o)

TensorFlow wraps ReLU and Dropout for us in the tf.nn module. Just call them.

In 2012 came the miracle: AlexNet, built by Hinton and his student Alex Krizhevsky on the basis of LeNet, won the ImageNet image classification competition and set a new world record. The convolutional neural network became the most powerful weapon in image processing.

There are four main reasons for AlexNet’s great progress:

  1. To prevent overfitting, Dropout and data augmentation techniques are used
  2. The nonlinear activation function ReLU is used
  3. Training on big data (the power of the big data era!)
  4. GPU-accelerated training (hardware advances)

Here is the structure of the AlexNet network:

Let’s take a look at an abridged reference implementation of AlexNet in TensorFlow:

def inference(images):
    parameters = []

    # conv1
    with tf.name_scope('conv1') as scope:
        kernel = tf.Variable(tf.truncated_normal([11, 11, 3, 64], dtype=tf.float32, stddev=1e-1), name='weights')
        conv = tf.nn.conv2d(images, kernel, [1, 4, 4, 1], padding='SAME')
        biases = tf.Variable(tf.constant(0.0, shape=[64], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv1 = tf.nn.relu(bias, name=scope)
        parameters += [kernel, biases]

    # lrn1: local response normalization
    with tf.name_scope('lrn1') as scope:
        lrn1 = tf.nn.local_response_normalization(conv1, alpha=1e-4, beta=0.75, depth_radius=2, bias=2.0)

    # pool1
    pool1 = tf.nn.max_pool(lrn1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='VALID', name='pool1')

    # conv2
    with tf.name_scope('conv2') as scope:
        kernel = tf.Variable(tf.truncated_normal([5, 5, 64, 192], dtype=tf.float32, stddev=1e-1), name='weights')
        conv = tf.nn.conv2d(pool1, kernel, [1, 1, 1, 1], padding='SAME')
        biases = tf.Variable(tf.constant(0.0, shape=[192], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv2 = tf.nn.relu(bias, name=scope)
        parameters += [kernel, biases]

    # lrn2: local response normalization
    with tf.name_scope('lrn2') as scope:
        lrn2 = tf.nn.local_response_normalization(conv2, alpha=1e-4, beta=0.75, depth_radius=2, bias=2.0)

    # pool2
    pool2 = tf.nn.max_pool(lrn2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='VALID', name='pool2')

    # conv3
    with tf.name_scope('conv3') as scope:
        kernel = tf.Variable(tf.truncated_normal([3, 3, 192, 384], dtype=tf.float32, stddev=1e-1), name='weights')
        conv = tf.nn.conv2d(pool2, kernel, [1, 1, 1, 1], padding='SAME')
        biases = tf.Variable(tf.constant(0.0, shape=[384], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv3 = tf.nn.relu(bias, name=scope)
        parameters += [kernel, biases]

    # conv4
    with tf.name_scope('conv4') as scope:
        kernel = tf.Variable(tf.truncated_normal([3, 3, 384, 256], dtype=tf.float32, stddev=1e-1), name='weights')
        conv = tf.nn.conv2d(conv3, kernel, [1, 1, 1, 1], padding='SAME')
        biases = tf.Variable(tf.constant(0.0, shape=[256], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv4 = tf.nn.relu(bias, name=scope)
        parameters += [kernel, biases]

    # conv5
    with tf.name_scope('conv5') as scope:
        kernel = tf.Variable(tf.truncated_normal([3, 3, 256, 256], dtype=tf.float32, stddev=1e-1), name='weights')
        conv = tf.nn.conv2d(conv4, kernel, [1, 1, 1, 1], padding='SAME')
        biases = tf.Variable(tf.constant(0.0, shape=[256], dtype=tf.float32), trainable=True, name='biases')
        bias = tf.nn.bias_add(conv, biases)
        conv5 = tf.nn.relu(bias, name=scope)
        parameters += [kernel, biases]

    # pool5
    pool5 = tf.nn.max_pool(conv5, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='VALID', name='pool5')

    return pool5, parameters

A convolutional neural network is a weight-sharing network, which greatly reduces the complexity of the model. So what is convolution? Convolution is an integral transform from mathematical analysis: it takes two functions and produces a third, which measures the area of overlap of one function with a reversed, shifted copy of the other. Traditional recognition algorithms require feature extraction and data reconstruction on the input, but a convolutional neural network can take the image directly as input and extract features automatically. Its great strength is that it adapts well to translation, scaling, and tilting of the image. The technique is tailor-made for images and speech: whether an image is shifted, rotated, or viewed from nearer or farther away is no longer a problem, which raised recognition rates to a genuinely usable level.
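Weight sharing can be seen directly in a minimal 2-D convolution written in NumPy (an illustration only, far simpler than the AlexNet code above; the 4x4 image and the edge kernel are made-up examples):

```python
# One small kernel slides across the whole image, so the same handful of
# weights is reused at every position; the layer's parameter count depends
# on the kernel size, not on the image size.
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation, the operation deep-learning libraries call convolution."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # the same kernel weights are applied at every (i, j): weight sharing
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)  # a tiny 4x4 "image", values 0..15
edge_kernel = np.array([[1.0, -1.0]])  # responds to horizontal change
result = conv2d(image, edge_kernel)
print(result.shape)  # (4, 3)
print(result)        # every entry is -1.0: brightness rises by 1 per column
```

Here the whole layer has only two weights, no matter how large the image grows; a fully connected layer on the same image would need one weight per input pixel per output unit.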

The combination of DBNs and CNNs set off a revolution in both images and speech: image recognition and speech recognition technologies advanced at a rapid pace. But another problem remained: natural language processing and machine translation, whose difficulty is easy to imagine. When Yann LeCun published his famous paper, the third author was one Yoshua Bengio. In the 1990s, when neural networks were at a low ebb, Hinton studied DBNs and LeCun studied CNNs, while Yoshua Bengio studied recurrent neural networks (RNNs) and pioneered the use of neural networks for natural language processing. Later, LSTM, an improved RNN model, successfully solved the RNN’s vanishing-gradient problem, and it has since become a powerful tool for natural language processing and machine translation.

Hinton, Yann LeCun, and Yoshua Bengio are legendary figures known in China as the “Big Three of deep learning”. Together they held to the direction they believed in through the cold winter of the second neural network trough, and together they finally changed the world.

Summary of deep learning