This article is part of the notes for Andrew Ng's deep learning course [1].

Author: Huang Haiguang [2]

Main Author: Haiguang Huang, Xingmu Lin (All papers 4, Lesson 5 week 1 and 2, ZhuYanSen: (the third class, and the third, three weeks ago) all papers), He Zhiyao (third week lesson five papers), wang xiang, Hu Han, laughing, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, the cao, LuHaoXiang, Qiu Muchen, Tang Tianze, zhang hao, victor chan, endure, jersey, Shen Weichen, Gu Hongshun, when the super, Annie, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian

Editorial staff: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jianyong, Wang Xiang, Xie Shichen, Jiang Peng

Note: Notes and assignments (including data and original assignment files) and videos can be downloaded on Github [3].

I will publish the course notes on the official account "Machine Learning Beginners"; please follow it.

Week 4: Deep Neural Networks

4.1 Deep L-Layer neural network

So far we've learned about forward propagation and backward propagation for a neural network with a single hidden layer, about logistic regression, and about vectorization, and you've seen why it's important to initialize the weights randomly.

All you need to do this week is put these ideas together to implement your own deep neural network.

To review the first three weeks of class:

  1. The structure of logistic regression is shown on the left of the figure below; the structure of a neural network with one hidden layer is shown on the right:


Note how the layers of a neural network are counted: from left to right, starting from 0. As shown in the upper-right figure, the input layer is layer 0, the first hidden layer is layer 1, and so on. In the figure below, the neural network on the left has two hidden layers, and the one on the right has five hidden layers.

Strictly speaking, logistic regression is also a one-layer neural network; compared with the much deeper model on the upper right, "shallow" and "deep" are just matters of degree. Keep the following point in mind:

A neural network with one hidden layer is a two-layer neural network. Remember that when we count the layers of a neural network, we don't count the input layer; we count only the hidden layers and the output layer.

But in the past few years, the DLI (Deep Learning Institute) has realized that there are functions that only very deep neural networks can learn, while shallower models cannot. That said, for any given problem it is hard to predict in advance exactly how deep a network needs to be. So it is reasonable to start with logistic regression, then try one hidden layer, then two, and treat the number of hidden layers as another hyperparameter that you are free to choose, evaluating on cross-validation data or on your development set.

Let’s look at the symbolic definition of deep learning:

The image above shows a four-layer neural network with three hidden layers. The first layer (the second from the left, since the input layer is layer 0) has five neurons, the second layer has five neurons, and the third layer has three neurons.

We use L to denote the number of layers; in the figure above, L = 4. The index of the input layer is "0". For the first hidden layer, n^[1] = 5 means it has 5 hidden units; similarly n^[2] = 5, n^[3] = 3, and n^[4] = n^[L] = 1 (one output unit). For the input layer, n^[0] = n_x = 3.

In general, n^[l] is the number of neurons in layer l. For each layer l, a^[l] denotes the activations of layer l; as we'll see in forward propagation, you eventually compute a^[l] = g^[l](z^[l]).

The activation a^[l] is computed by applying the activation function g^[l], which is also indexed by layer, to z^[l]. We use W^[l] to denote the weights used to compute the value z^[l] at layer l, and similarly b^[l] for the bias in the equation for z^[l].

Finally, the following symbolic conventions are summarized:

The input features are denoted x, but x is also the activation of layer 0, so x = a^[0].

The activation of the last layer is a^[L], so a^[L] = y-hat, the output predicted by the neural network.

4.2 Forward and Backward Propagation

We've seen the basic building blocks that make up a deep neural network: each layer has a forward propagation step and a corresponding backward propagation step, and this video shows how to implement those steps.

For forward propagation, the input is a^[l-1], the output is a^[l], and the cache is z^[l]. From an implementation point of view, we can also cache W^[l] and b^[l], which makes it easier to call the functions in later steps.

So the steps of forward propagation can be written as:

z^[l] = W^[l] · a^[l-1] + b^[l]

a^[l] = g^[l](z^[l])

The vectorized implementation can be written as:

Z^[l] = W^[l] · A^[l-1] + b^[l]

A^[l] = g^[l](Z^[l])

Forward propagation starts by feeding in a^[0], that is, x, to initialize the chain. x corresponds to the input features of one training sample, and X = A^[0] corresponds to the input features of the entire training set; this is the input to the first forward function in the chain, and repeating the step above computes forward propagation from left to right.
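The forward step described above can be sketched in numpy as follows. This is a minimal sketch; the function and variable names are my own, not from the course code:

```python
import numpy as np

def linear_activation_forward(A_prev, W, b, activation):
    """One forward step: Z = W·A_prev + b, then A = g(Z). Caches values for backprop."""
    Z = W @ A_prev + b
    if activation == "relu":
        A = np.maximum(0, Z)
    else:  # "sigmoid"
        A = 1 / (1 + np.exp(-Z))
    cache = (A_prev, W, b, Z)  # cached for the backward pass
    return A, cache

# tiny example: a layer with 3 units fed by 2 features, m = 4 samples
np.random.seed(0)
A_prev = np.random.randn(2, 4)   # A^[l-1], shape (n^[l-1], m)
W = np.random.randn(3, 2)        # W^[l], shape (n^[l], n^[l-1])
b = np.zeros((3, 1))             # b^[l], shape (n^[l], 1)
A, cache = linear_activation_forward(A_prev, W, b, "relu")
```

Returning the cache alongside the activation is what makes it easy to call the matching backward function later.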

Here are the steps for back propagation:

The input is da^[l]; the outputs are da^[l-1], dW^[l], and db^[l].

So the steps of back propagation can be written as:

(1) dz^[l] = da^[l] * g^[l]'(z^[l])

(2) dW^[l] = dz^[l] · a^[l-1]T

(3) db^[l] = dz^[l]

(4) da^[l-1] = W^[l]T · dz^[l]

(5) dz^[l] = W^[l+1]T · dz^[l+1] * g^[l]'(z^[l])

Formula (5) is obtained by substituting formula (4) into formula (1); the first four formulas are enough to implement the backward function.

The vectorized implementation can be written as:

(6) dZ^[l] = dA^[l] * g^[l]'(Z^[l])

(7) dW^[l] = (1/m) dZ^[l] · A^[l-1]T

(8) db^[l] = (1/m) np.sum(dZ^[l], axis=1, keepdims=True)

(9) dA^[l-1] = W^[l]T · dZ^[l]
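Formulas (6) through (9) can be sketched in numpy as follows. This is a minimal illustration with names of my own choosing, assuming a ReLU layer so that g'(Z) is just the 0/1 mask (Z > 0):

```python
import numpy as np

def linear_activation_backward(dA, cache):
    """Backward step for one ReLU layer, implementing formulas (6)-(9)."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                                 # (6): ReLU has g'(Z) = 1 where Z > 0
    dW = (1 / m) * dZ @ A_prev.T                      # (7)
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)  # (8)
    dA_prev = W.T @ dZ                                # (9)
    return dA_prev, dW, db

# tiny example reusing the shapes from the forward step
np.random.seed(1)
A_prev = np.random.randn(2, 4)
W = np.random.randn(3, 2)
b = np.zeros((3, 1))
Z = W @ A_prev + b
dA = np.random.randn(3, 4)
dA_prev, dW, db = linear_activation_backward(dA, (A_prev, W, b, Z))
```

Note that dW and db come out with exactly the shapes of W and b, which is what the gradient descent update requires.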

To sum up:

The first layer might use a ReLU activation function, the second layer another ReLU, and the third layer a sigmoid function (if you are doing binary classification), whose output y-hat is used to calculate the loss. You can then iterate backwards, taking derivatives to find dW^[3], db^[3], dW^[2], db^[2], dW^[1], db^[1]. During the computation, the cache passes z^[1], z^[2], z^[3] to the backward pass, which in turn produces da^[2] and da^[1]. It is possible to compute da^[0] as well, but we don't use it. This describes forward and backward propagation for a three-layer network. One detail not yet mentioned: the forward recursion is initialized with the input data x, and the backward recursion is initialized with the derivative of the loss with respect to the output (using logistic regression for binary classification).

A word of advice: brush up on calculus and linear algebra, do more derivations, and practice more.

4.3 Forward Propagation in a Deep Network

As usual, we will first look at how forward propagation applies to a single training sample, and then discuss the vectorized version.

The first layer computes: z^[1] = W^[1]x + b^[1], a^[1] = g^[1](z^[1]) (note x = a^[0]).

The second layer computes: z^[2] = W^[2]a^[1] + b^[2], a^[2] = g^[2](z^[2]),

And so on,

The fourth layer is: z^[4] = W^[4]a^[3] + b^[4], a^[4] = g^[4](z^[4]).

Forward propagation can be summarized as repeated iterations of: z^[l] = W^[l]a^[l-1] + b^[l], a^[l] = g^[l](z^[l]).

The vectorized implementation can be written as:

Z^[l] = W^[l]A^[l-1] + b^[l], A^[l] = g^[l](Z^[l]) (with A^[0] = X)

There is only one explicit for loop here, with l running from 1 to L, computing layer by layer. The next section covers one of the most important things I do to avoid bugs in my code: checking matrix dimensions.
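The single explicit for loop might look like this in numpy, assuming ReLU hidden layers and a sigmoid output layer. This is a sketch; the parameter-dictionary layout is my own convention:

```python
import numpy as np

def forward_propagation(X, parameters, L):
    """Forward pass through L layers: ReLU for layers 1..L-1, sigmoid for layer L."""
    A = X  # A^[0] = X
    for l in range(1, L + 1):  # the one explicit for loop, l from 1 to L
        Z = parameters["W" + str(l)] @ A + parameters["b" + str(l)]
        A = 1 / (1 + np.exp(-Z)) if l == L else np.maximum(0, Z)
    return A  # A^[L] = Y_hat

# layer sizes matching the lecture example: n = [3, 5, 5, 3, 1], so L = 4
np.random.seed(2)
n = [3, 5, 5, 3, 1]
parameters = {}
for l in range(1, len(n)):
    parameters["W" + str(l)] = np.random.randn(n[l], n[l - 1]) * 0.01
    parameters["b" + str(l)] = np.zeros((n[l], 1))

Y_hat = forward_propagation(np.random.randn(3, 10), parameters, L=4)
```

With m = 10 samples, the output has shape (1, 10), one sigmoid prediction per column.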

4.4 Checking the Dimensions of Your Matrices (Getting your matrix dimensions right)

When implementing deep neural networks, one of the ways I often check my code is to take out a piece of paper and run through the dimensions of the matrix in the algorithm.

The dimension of W^[l] is (number of units in the current layer, number of units in the previous layer), namely:

W^[l]: (n^[l], n^[l-1]);

The dimension of b^[l] is (number of units in the current layer, 1), namely:

b^[l]: (n^[l], 1);

z^[l], a^[l]: (n^[l], 1);

dW^[l] has the same dimension as W^[l], and db^[l] has the same dimension as b^[l]. The dimensions of W and b do not change after vectorization, but the dimensions of z, a, and x do.

After vectorization:

Z^[l]: (n^[l], m); Z^[l] can be viewed as stacking the individual z^[l] vectors of each sample column by column,

where m is the size of the training set, so the dimension is no longer (n^[l], 1) but (n^[l], m).

A^[l]: (n^[l], m), and A^[0] = X: (n^[0], m).

When you implement backward propagation for a deep neural network, checking that all matrix dimensions are consistent will eliminate many bugs. In the next section, we'll talk about why deep networks work better than shallow ones for many tasks.
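The paper-and-pencil check can also be written as assertions. This sketch (my own, using the lecture's layer sizes n = [3, 5, 5, 3, 1]) verifies each shape rule from this section:

```python
import numpy as np

def check_dimensions(parameters, n, m):
    """Assert the shapes from this section: W^[l]:(n[l], n[l-1]), b^[l]:(n[l], 1),
    and Z^[l]/A^[l]:(n[l], m) after vectorization over m examples."""
    L = len(n) - 1
    A = np.zeros((n[0], m))  # A^[0] = X has shape (n^[0], m)
    for l in range(1, L + 1):
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        assert W.shape == (n[l], n[l - 1]), f"W{l} has wrong shape {W.shape}"
        assert b.shape == (n[l], 1), f"b{l} has wrong shape {b.shape}"
        Z = W @ A + b                  # broadcasting expands b to (n[l], m)
        assert Z.shape == (n[l], m)
        A = np.maximum(0, Z)           # A^[l] keeps the same shape as Z^[l]
    return True

n = [3, 5, 5, 3, 1]
params = {f"W{l}": np.zeros((n[l], n[l - 1])) for l in range(1, 5)}
params.update({f"b{l}": np.zeros((n[l], 1)) for l in range(1, 5)})
ok = check_dimensions(params, n, m=10)
```

Running such checks once at startup catches most shape bugs before any training happens.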

4.5 Why Use Deep Representations? (Why deep representations?)

We all know that deep neural networks can solve many problems. The network doesn't need to be huge, but it does need depth, that is, a reasonably large number of hidden layers. Why is that? Let's look at a couple of examples to understand why deep networks are useful.

First, what exactly is a deep network computing? If you're building a face recognition or face detection system, here is what a deep neural network can do: when you feed in a picture of a face, the first layer of the network can be thought of as a feature detector or edge detector. In this example, think of a first layer with about 20 hidden units operating on the image. The hidden units are the small squares in the figure: one small square (first row, first column) might look for vertical edges ("|") in the photo, while another hidden unit (fourth row, fourth column) might look for horizontal edges. Later in the course we'll cover convolutional neural networks, which do this kind of recognition, and we'll say more about why hidden units are represented this way. You can think of the first layer of the network as scanning the picture and finding its edges. The next layer can then group the detected edges into different parts of the face (second large image): one neuron might look for an eye, for example, and another for a nose, so by combining edges the network starts to detect different parts of the face. Finally, by putting these parts together, such as the nose, eyes, and chin, it can recognize or detect different faces (third large image).

You can intuitively regard the first layers of the neural network as detecting simple functions, such as edges, which the later layers then combine to learn more complex functions. We'll see more of what these visualizations mean when we study convolutional neural networks. One more technical detail worth understanding: the edge detectors actually look at relatively small areas of the image, like the very small regions shown here, while the face detectors focus on much larger areas. The main idea is that you typically start with small details, such as edges, and build up step by step to larger and more complex regions, such as an eye or a nose, and then combine the eyes and nose into an even more complex part of the face.

This simple-to-complex, pyramid-like compositional representation also applies to data other than images and face recognition. For example, if you want to build a speech recognition system, you need to work out how to represent audio. Given an input audio clip, the first layer of the neural network might start by trying to detect low-level audio waveform features, such as whether the pitch is rising or falling, distinguishing white noise or hissing sounds, or identifying the tone. Combining these low-level waveform features lets you detect the basic units of sound, which in linguistics are called phonemes. In the word "cat", for instance, the "c" sound is one phoneme, the "a" sound another, and the "t" sound a third. With the basic units of sound in hand, the network can recognize words in the audio, then combine words into phrases, and finally into complete sentences.

So in the hidden layers of a deep neural network, the first few layers learn some low-level simple features, and the later layers combine those simple features to detect more complex things, such as the words, phrases, or sentences in your audio recording, in order to run speech recognition. The first layers compute relatively simple functions of the input, such as where the edges are in an image, and as you go deeper into the network you can do increasingly complicated things, such as detecting faces, or detecting words, phrases, or sentences.

Some people like to draw an analogy between deep neural networks and the human brain. Some neuroscientists believe the brain also starts by detecting simple things, such as the edges your eyes see, and then combines them to detect complex objects, such as faces. This comparison between deep learning and the human brain is sometimes dangerous. What is undeniable, however, is that our understanding of brain mechanisms is valuable: the idea that the brain may start with something simple, such as edges, and assemble them into complete, complex objects has also inspired parts of deep learning, and later in the course we will return to what biology tells us about the human brain.

Small: the number of hidden units is relatively small

Deep: the number of hidden layers is large

In a deep network, the number of hidden units is relatively small while the number of hidden layers is large. For a shallow network to achieve the same computation, the number of hidden units must grow exponentially.

Another theory of why neural networks work comes from circuit theory, which concerns what functions you can compute with circuit components built from basic logic gates: AND gates, OR gates, and NOT gates. Informally, there are functions that can be computed with a relatively small but very deep network, where "small" means the number of hidden units is relatively small; but if you try to compute the same function with a shallow network, that is, without many hidden layers, you will need an exponentially larger number of units to achieve the same result.

Let me give another example, stated in less formal language. Suppose you want to compute the XOR, or parity, of the input features: y = x_1 XOR x_2 XOR x_3 XOR ... XOR x_n. If you build a tree of XOR gates, computing the XOR of pairs and then combining the results, you only need a relatively small circuit. Technically, if you only have AND, OR, and NOT gates, you may need a few layers to compute each XOR, but a relatively small circuit still suffices. Building an XOR tree like this (upper left) gives a circuit whose output is the XOR, or parity, of the input features. The depth of the network is on the order of O(log n), so the number of nodes, circuit components, or gates is not very large; you don't need many gates to compute the XOR.

But suppose you are not allowed a neural network with many hidden layers and are forced to compute this function with a single hidden layer, with all the inputs feeding into the hidden units and from there to the output. Then to compute this parity or XOR function, the number of units in that hidden layer (boxed, upper right) must grow exponentially, because essentially you need to enumerate all the possible configurations of the input bits whose XOR comes out as 1 or 0. So you end up with a hidden layer whose size grows exponentially in the number of input bits; to be precise, you need 2^(n-1) hidden units.
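To make the depth argument concrete, here is a small Python sketch (an illustration of the circuit argument, not a neural network): a balanced tree of pairwise XORs computes the parity of n bits in about log2(n) levels, whereas a single hidden layer would essentially have to enumerate on the order of 2^(n-1) input patterns.

```python
from functools import reduce
import operator

def parity_tree(bits):
    """Compute XOR parity with a balanced tree; depth grows like log2(n)."""
    depth = 0
    layer = list(bits)
    while len(layer) > 1:
        # pair up adjacent values and XOR each pair (one circuit level)
        layer = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)] + \
                ([layer[-1]] if len(layer) % 2 else [])
        depth += 1
    return layer[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 0]   # n = 8 input bits, four of them set
p, depth = parity_tree(bits)      # depth is log2(8) = 3 levels
```

For n = 8 the tree needs only 3 levels of gates, while a single-layer enumeration would need on the order of 2^7 = 128 units.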

I hope this gives you a sense that many mathematical functions are much easier to compute with a deep network than with a shallow one. Personally, I don't find this circuit-theory result all that useful for building training intuition, but it is often cited to explain why deeper networks are needed.

Besides these reasons, to be honest, I think the name "deep learning" is a bit intimidating. These concepts used to be called neural networks with many hidden layers, but "deep learning" sounds fancy and esoteric, right? Once the term spread, it was a rebranding of neural networks, of multi-hidden-layer neural networks, that captured the public imagination. PR and rebranding aside, deep networks do work well, and sometimes people take the name literally and use a great many hidden layers. But when I start on a new problem, I usually begin with logistic regression, then try one or two hidden layers, and treat the number of hidden layers as a hyperparameter to tune in order to find the appropriate depth. In recent years, though, some people have tended to use very, very deep networks, with dozens of layers, which for some problems are the best models.

So that’s what I want to talk about, the intuitive explanation for why deep learning works so well, and now let’s look at how back propagation works in addition to forward propagation.

4.6 Building blocks of deep neural networks

In the earlier videos this week, and in the videos of previous weeks, you've seen the basic building blocks of forward propagation and backward propagation, which are also the key components of a deep neural network. Now let's use them to build one.

This is a neural network with a small number of layers. Pick one layer (the boxed part) and look at the computations for just that layer. For layer l you have the parameters W^[l] and b^[l]. In forward propagation, the input is the previous layer's activations a^[l-1] and the output is a^[l], computed, as we discussed earlier, by z^[l] = W^[l]a^[l-1] + b^[l] and a^[l] = g^[l](z^[l]); that is how you go from the input a^[l-1] to the output a^[l]. You can then cache the value z^[l] (I'll include it in the cache here), because the cache is very useful for the backward propagation step later.

Then there is the backward step, or backward propagation step, again for layer l: you implement a function whose input is da^[l] and whose output is da^[l-1]. One small detail to note is that the input also uses the cached value z^[l], and besides outputting da^[l-1], you also need to output the gradients dW^[l] and db^[l] in order to implement gradient descent learning.

This is the structure of the basic forward step, implemented by what I call the forward function, and similarly the backward function for the backward step. To summarize: at layer l, the forward function has input a^[l-1] and output a^[l]; to compute them you need W^[l] and b^[l], and you also output z^[l] to the cache. The backward function is a separate function: its input is da^[l] and its output is da^[l-1], the derivative with respect to the previous layer's activations, together with the desired gradients dW^[l] and db^[l], computed using z^[l] from the cache. I'll indicate the backward steps with red arrows.

Then if these two functions (forward and reverse) are implemented, then the computation of the neural network will look like this:

Take the input features x, feed them to the first layer, and compute the first layer's activations a^[1]; you need W^[1] and b^[1] to do so, and you also cache the value z^[1]. Then feed a^[1] to the second layer, where you use W^[2] and b^[2] to compute the second layer's activations a^[2]. And so on, until you finally get a^[L], the final output y-hat. We cache all the z values along the way; this is the forward propagation step.

For the backward propagation step, we compute a reverse sequence of gradient calculations: starting from da^[L], each backward block produces da^[l-1], and so on, until we get da^[1]. You could also compute one more value, da^[0], which is the derivative with respect to the input features, but it is not needed, at least not for training the weights in supervised learning, so you can stop there. Each backward step also outputs dW^[l] and db^[l]. At this point you have all the derivatives you need, so let's fill in the flow chart a bit.

One training step of the neural network starts with a^[0], that is, x, then runs a series of forward propagation calculations, uses the output y-hat to compute the loss (last box in the second row), and then carries out backward propagation. You now have all the derivatives, and the parameters at each layer are updated: W^[l] := W^[l] - α dW^[l] and b^[l] := b^[l] - α db^[l]. With backward propagation complete and all the derivatives in hand, this is one gradient descent cycle of the neural network.
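One such gradient descent cycle can be sketched end to end for a tiny two-layer network (my own minimal example, not the course's official code; ReLU hidden layer, sigmoid output, cross-entropy loss, and a toy dataset):

```python
import numpy as np

np.random.seed(3)
m = 16                                          # training-set size
X = np.random.randn(2, m)                       # A^[0] = X
Y = (X[0:1, :] * X[1:2, :] > 0).astype(float)   # toy labels (XOR-like pattern)
W1, b1 = np.random.randn(4, 2) * 0.5, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.5, np.zeros((1, 1))
alpha = 0.1                                     # learning rate

losses = []
for step in range(300):
    # forward propagation (caching Z1, A1 for the backward pass)
    Z1 = W1 @ X + b1
    A1 = np.maximum(0, Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1 / (1 + np.exp(-Z2))                  # Y_hat
    loss = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    losses.append(loss)
    # backward propagation
    dZ2 = A2 - Y                                # sigmoid + cross-entropy shortcut
    dW2 = (1 / m) * dZ2 @ A1.T
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)               # ReLU derivative
    dW1 = (1 / m) * dZ1 @ X.T
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    # gradient-descent parameter update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
```

Each iteration of the loop is exactly one "forward, backward, update" cycle as described above, and the recorded loss should trend downward.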

One more conceptual detail before moving on: the cache stores values for the backward function. When you do the programming exercise, you'll find the cache very handy: it lets you retrieve the values of z^[l] quickly, and it is also convenient to store W^[l] and b^[l] in it, so that when you compute backward propagation the parameters are already copied to where you need them. That's an implementation detail you'll use in the programming exercises.

Now you’ve seen the basic building blocks for implementing a deep neural network, with a forward propagation step in each layer, and a corresponding back propagation step, and a cache for passing information from one step to another. In the next video we’re going to look at how to implement all of these things, so let’s watch the next video.

4.7 Parameters vs. Hyperparameters

For your deep neural network to work well, you also need to plan your parameters and your hyperparameters.

What is a hyperparameter?

For example, the learning rate α, the number of iterations (gradient descent cycles), the number of hidden layers L, the number of hidden units n^[l], and the choice of activation function all need to be set. These numbers control the final values of the parameters W and b, so they are called hyperparameters.

In fact, there are many different hyperparameters in deep learning. Later, we will introduce some others, such as momentum, mini-batch size, regularization parameters, and so on.

How to find the optimal value of the hyperparameter?

Go through the idea-code-experiment-idea cycle, try different parameters, implement the model and see if it works, then iterate again.

Applied deep learning today is still a very empirical process. Often you have an idea, for instance a rough guess at the best learning rate, and you simply try it out in practice and see how it goes. Based on the result, you might decide it would be better to increase the learning rate to 0.05. If you're not sure what the best value is, you can try one learning rate and check whether the loss function J goes down. Then you can try a bigger one and find that the loss function blows up and diverges. You might then try other values and watch whether the loss drops off quickly or converges to a higher level. After trying a set of values and observing how the loss function changes, you pick the value that speeds up learning and converges to a lower loss, and use that.
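The try-and-watch loop described above can be sketched as a simple sweep: train the same tiny model with several candidate learning rates and compare the final loss J. Everything here (the toy data and the candidate values) is illustrative:

```python
import numpy as np

def final_loss(alpha, steps=100):
    """Train a one-layer logistic-regression model on toy data; return final loss J."""
    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 32))
    Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)  # separable toy labels
    W, b = np.zeros((1, 3)), np.zeros((1, 1))
    for _ in range(steps):
        A = 1 / (1 + np.exp(-(W @ X + b)))
        dZ = A - Y
        W -= alpha * (dZ @ X.T) / X.shape[1]
        b -= alpha * dZ.mean(axis=1, keepdims=True)
    A = 1 / (1 + np.exp(-(W @ X + b)))
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

candidates = [0.001, 0.01, 0.1, 1.0]              # example values to try
results = {a: final_loss(a) for a in candidates}  # the idea-code-experiment loop
best = min(results, key=results.get)              # keep the value with lowest J
```

The same pattern extends to sweeping the number of hidden layers or units: run the experiment per candidate, record J, and keep what works.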

As the previous slides showed, there are many more hyperparameters. However, when you start developing a new application, it's hard to know in advance exactly what the optimal hyperparameter values should be. So usually you have to try many different values and go through this loop: try five hidden layers with some number of hidden units, implement the model, see whether it works, then iterate again. The title of this slide is that applied deep learning is a largely empirical process; "empirical process", in layman's terms, means trying things until you find values that work.

Another recent trend is deep learning's use across a wide range of problems, from computer vision to speech recognition to natural language processing to many structured-data applications such as online advertising, web search, and product recommendation. I've seen researchers from one of these fields try to move into another, and sometimes the intuition about hyperparameter settings carries over, and sometimes it doesn't. So I always advise people, especially when just starting on a new problem, to try a range of values and see what happens. In the next course we'll look at a systematic way to try out various hyperparameter values. Second, even if you've worked on one application for a long time, say online advertising, the optimal learning rate and the optimal values of other hyperparameters can still change as you make progress. So even if you tune your system every day with the currently best parameters, you may find the optimal values shifting within a year, perhaps because the computing infrastructure, the CPUs or GPUs, has changed significantly. One rule of thumb is that things may change every few months. If you work on a problem over many years, just try different hyperparameters frequently and check the results to see whether a better value exists, and trust that over time you'll build an intuition for which values suit your problem best.

This may indeed be one of the more frustrating aspects of deep learning: you have to try different possibilities many times. But deep learning research is still in progress, so better ways of determining hyperparameter values may emerge over time; and because CPUs, GPUs, networks, and data keep changing, any such guideline may only hold for a while. As long as you keep trying, and keep using cross-validation or a similar evaluation method, you can pick the values that work best for your problem.

Recently, deep learning has changed many fields, from computer vision to speech recognition to natural language processing to many structured-data applications such as online advertising, web search, and product recommendation. Some intuition about setting hyperparameters generalizes within a field, but sometimes it doesn't, so especially when starting to study a new problem, you should try values over a range and observe the results. Even for a model that has been in use for a long time, the optimal learning rate and other hyperparameter values may change.

In the next lecture we’re going to try to take a systematic approach to various hyperparameter values. As a rule of thumb: Try different hyperparameters often, check the results for better values, and you’ll get an intuition for setting hyperparameters.

4.8 What does this have to do with the brain?

Is there any connection between deep learning and the brain?

Not much.

So why do people say deep learning is about the brain?

When you implement a neural network, what you are actually doing, forward propagation, backward propagation, gradient descent, is hard to convey through the formulas alone. The analogy between deep learning and the brain is an oversimplification of what our brains concretely do, but because it is so concise, it makes ordinary people more willing to discuss the field openly, makes for good news coverage, and captures the public's attention; nevertheless, the analogy is wildly inaccurate.

The logical unit of a neural network can be seen as an oversimplification of a biological neuron, but to this day even neuroscientists find it hard to explain what a neuron can do. A neuron can be extremely complex; some of its functions may indeed resemble logistic regression, but no one can really explain what an individual neuron is doing.

Deep learning really is a great tool for learning all kinds of very flexible, very complex functions, for learning mappings from x to y, that is, input-to-output mappings in supervised learning.

But the analogy is very rough. On one side is the sigmoid activation function of a logistic regression unit; on the other is a neuron in the brain. This biological neuron, a cell in your brain, receives electrical signals from other neurons and performs a simple thresholded computation; if the neuron fires, it sends an electrical pulse down its long axon, a kind of wire, to other neurons.

So this is an oversimplified comparison between the logical unit of a neural network and the biological neuron on the right. To date, even neuroscientists struggle to explain what a single neuron can do. A single neuron is in fact so complicated that, from a neuroscience standpoint, we cannot describe some of its functions clearly. They may really be similar to the operation of logistic regression, but what an individual neuron does exactly, no one can truly explain; how neurons in the brain learn remains a mystery. It is unclear whether the brain uses algorithms like backward propagation or gradient descent, or whether it learns by entirely different principles.

So although deep learning really is a great tool for learning all kinds of very flexible and complex functions, for learning mappings from x to y, input-to-output mappings in supervised learning, the analogy with the human brain was perhaps worth mentioning in the early days of the field. That analogy is now out of date, and I try to use it less myself.

That's the relationship between neural networks and the brain. Computer vision and other areas of deep learning have certainly drawn inspiration from the human brain, but personally I use this analogy to the human brain less and less often.

References

[1] Deep Learning courses: Mooc.study.163.com/university/…
[2] Huang Hai-Guang: github.com/fengdu78
[3] GitHub: Github.com/fengdu78/de…