This article is part of the notes for Andrew Ng's deep learning course [1].

Author: Huang Haiguang [2]

Main authors: Huang Haiguang, Lin Xingmu (all of course 4; course 5 weeks 1 and 2), Zhu Yansen (all of course 3), He Zhiyao (course 5 week 3), Wang Xiang, Hu Han, Xiaoxiao, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, Cao Yue, Lu Haoxiang, Qiu Muchen, Tang Tianze, Zhang Hao, Chen Zhihao, You Ren, Ze Lin, Shen Weichen, Gu Hongshun, Shi Chao, Chen Zhe, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian

Editorial staff: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jianyong, Wang Xiang, Xie Shichen, Jiang Peng

Note: Notes, assignments (including data and the original assignment files), and videos can be downloaded on GitHub [3].

I will publish the course notes on the official account "Machine Learning Beginners"; please follow it.

Course 2: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Week 1: Practical Aspects of Deep Learning

1.1 Train/Dev/Test sets

This week we continue learning how to apply neural networks effectively, covering hyperparameter tuning, how to set up your data, and how to make sure your optimization algorithm runs fast so that your learning algorithm trains in a reasonable amount of time.

In this first week, we start by talking about how to set up a machine learning problem, then we discuss regularization, and then we look at some techniques for making sure a neural network implementation is correct. With that in mind, let's begin.

Making the right decisions in configuring training, validation, and testing data sets goes a long way toward creating efficient neural networks. When we train neural networks, we need to make many decisions, such as:

  1. How many layers should the network have?
  2. How many hidden units should each layer contain?
  3. What learning rate should be used?
  4. Which activation functions should each layer use?

It is almost impossible to correctly guess these settings and the other hyperparameters on the first attempt when starting a new application. In practice, applied machine learning is a highly iterative process: at the start of a project you have an initial idea, such as a network with a certain number of layers and hidden units trained on a certain data set; you code it up, run it, and see how the network performs. Based on the results, you refine your ideas, change your strategy, or keep iterating on your solution to find a better neural network.

Deep learning has achieved great success in natural language processing, computer vision, speech recognition, and applications on structured data. Structured data covers everything from advertising to web search (not only search engines but also shopping sites, any site that returns results based on the terms in a search bar), to computer security, to logistics, such as deciding where a driver should pick up goods; the list goes on.

An NLP expert might want to move into computer vision, an experienced speech recognition expert might want to move into advertising, or someone might want to jump from computer security to logistics. In my view, intuitions from one domain or application area usually do not transfer to other applications; the best choices depend on the amount of data you have, the number of input features, whether you train on GPUs or CPUs, the exact configuration of those GPUs and CPUs, and many other factors.

So far, I believe that for many applications even a seasoned deep learning expert is unlikely to guess the best hyperparameters from the start. Deep learning is therefore a very iterative process, requiring many cycles around the loop before you find a network you are happy with. The efficiency of this loop is a key factor in how quickly a project progresses, and setting up high-quality training, validation, and test sets makes the loop much more efficient.

Suppose this is our training data, represented by a rectangle. We usually split it into parts: one part as the training set, one part as a hold-out cross-validation set, sometimes called the development set (for convenience I'll call it the validation or dev set; they are the same concept), and the last part as the test set.

Next, we train the algorithm on the training set and use the validation set (hold-out cross-validation set) to select the best model. After sufficient validation we pick the final model and then evaluate it on the test set, in order to get an unbiased estimate of how the algorithm performs.

In the earlier era of machine learning with small data sets, it was common practice to split all the data 70/30, which is what people mean by a 70% training set and 30% test set. If you also set aside a validation set, the split might be 60% training, 20% validation, and 20% test. This was accepted best practice in machine learning for years.

With only 100, 1,000, or 10,000 examples, these ratios are perfectly reasonable.

But in the big-data era, where we may now have millions of examples, the validation and test sets tend to become a much smaller fraction of the total. Because the purpose of the validation set is to compare different algorithms and see which works best, it only needs to be large enough to evaluate, say, two or even ten different algorithms and quickly determine which performs better. We may not need 20% of the data for that.

Say we have a million examples: 10,000 of them are plenty to compare algorithms and find the best one or two. Similarly, the main purpose of the test set is to accurately estimate the performance of the final chosen classifier, so with millions of examples, 10,000 may be enough to evaluate a single classifier accurately. So suppose we have 1 million examples and use 10,000 as the validation set and 10,000 as the test set: that is 1% each, making the split 98% training, 1% validation, 1% test. For applications with far more than a million examples, the training set can be 99.5%, with 0.25% each for validation and test, or 0.4% for validation and 0.1% for test.
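To make the arithmetic concrete, here is a minimal NumPy sketch of such a 98/1/1 split (the data and variable names are placeholders, not code from the course):

```python
import numpy as np

m = 1_000_000                       # total number of examples (assumed)
X = np.random.randn(m, 10)          # placeholder features, 10 per example
y = np.random.randint(0, 2, m)      # placeholder binary labels

# Shuffle once so all three sets come from the same distribution
perm = np.random.permutation(m)
X, y = X[perm], y[perm]

n_dev, n_test = 10_000, 10_000      # 1% each; the remaining 98% is training
X_dev, y_dev = X[:n_dev], y[:n_dev]
X_test, y_test = X[n_dev:n_dev + n_test], y[n_dev:n_dev + n_test]
X_train, y_train = X[n_dev + n_test:], y[n_dev + n_test:]
```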

To sum up: in machine learning we usually divide the data into training, validation, and test sets. For relatively small data sets the traditional ratios apply; for much larger data sets the validation and test sets can be well under 20% or even 10% of the total. I'll give more specific guidance on dividing validation and test sets later.

Another trend in modern deep learning is training when the training and test distributions do not match. Suppose you are building an app where users upload many pictures and the goal is to find and show them all the cat pictures (your users love cats). The training set might be cat pictures downloaded from the Internet, while the validation and test sets are pictures users upload to the app. That is, the training set comes from web scraping, while the validation and test sets come from user uploads. Many cat pictures on the web are high-resolution, professional, and well-edited, whereas user uploads may be taken casually on a phone: low resolution and blurry. These are two different distributions. For this situation my rule of thumb is: make sure the validation set and test set come from the same distribution. I'll say more about this later; because you use the validation set to evaluate many different models in the search for the best one, things work best when the validation and test sets come from the same distribution.

However, because deep learning algorithms need huge amounts of training data, teams often use creative strategies such as web crawling to build a larger training set, at the cost that the training data may not come from the same distribution as the validation and test data. As long as you follow this rule of thumb, though, your machine learning algorithm will make progress faster. I'll explain the rule of thumb in more detail later in the course.

Finally, it is fine not to have a test set at all. The purpose of the test set is to give an unbiased estimate of the performance of the final chosen network; if you don't need that unbiased estimate, you can skip the test set. In that case, with only a training set and a validation set, you train on the training set, try different model architectures, evaluate them on the validation set, and iterate to select the best model. Because the validation set is used in this fitting process, it no longer provides an unbiased performance estimate, which is fine if you don't need one.

In machine learning, when there is only a training set and a validation set and no separate test set, people often call the validation set the "test set." In practice, though, they are using it as a hold-out cross-validation set, which doesn't fully match the term, because they end up overfitting to this "test set." If a team tells me they have only a training set and a test set, I'm careful to check whether what they really have is a training set and a validation set, because they are fitting the validation data. Getting a team to rename "train/test" to "train/dev" may not be easy, even though "train/dev" is the more accurate terminology. Again, this is fine if you don't need an unbiased estimate of algorithm performance.

Therefore, setting up training, validation, and test sets speeds up the iteration loop, and it also lets us measure the bias and variance of the algorithm more effectively, helping us choose appropriate ways to improve the algorithm more efficiently.

1.2 Bias/Variance

I've noticed that almost all machine learning practitioners aspire to a deep understanding of bias and variance, two concepts that are easy to learn but hard to master; even if you think you understand the basics, something unexpected always comes up. One trend in the deep learning era is that the bias-variance tradeoff is discussed much less. You have probably heard of the tradeoff, but in deep learning we tend to think about bias and variance separately and rarely talk about trading one off against the other.

Suppose this is the data set. If you fit a straight line to it, say a logistic regression decision boundary, it will not fit the data well. This is a case of high bias, which we call "underfitting."

At the other extreme, if we fit a very complex classifier, such as a deep neural network or a network with many hidden units, it may fit this data set perfectly, yet that does not look like a good fit either. This classifier has high variance: it is overfitting the data.

In between there may be a classifier of moderate complexity, like the one in the figure, that fits the data more reasonably. We call this "just right"; it lies between overfitting and underfitting.

In a two-dimensional data set with just two features, x1 and x2, we can plot the data and visualize bias and variance. With high-dimensional data we cannot plot the data or visualize the decision boundary, but we can still assess bias and variance through a few key metrics.

Take the cat-picture classification example, where the picture on the left is a cat and the one on the right is not. The two key numbers for understanding bias and variance are the training set error and the dev (validation) set error. For the sake of argument, assume that humans can identify the cats in these pictures almost perfectly, making virtually no mistakes by eye.

Suppose the training set error is 1% and, for the sake of argument, the validation set error is 11%. The model does very well on the training set and relatively poorly on the validation set: we are probably overfitting the training set and failing to generalize to the hold-out cross-validation data. A situation like this we call "high variance."

So by looking at the training set error and the validation set error, we can diagnose whether the algorithm has high variance. That is, measuring the two errors lets us draw different conclusions.

Now suppose the training set error is 15% (put that on the first line) and the validation set error is 16%, and assume again that human error on this task is roughly 0%: people can look at these pictures and tell whether they contain a cat. Here the algorithm is not even doing well on the training set; if it cannot fit the training data well, it is underfitting, so we say it has high bias. On the other hand, it generalizes reasonably to the validation set, whose error is only 1% higher than the training error. So this algorithm's problem is high bias, it can't even fit the training set, similar to the leftmost image on the previous slide.

Another example: the training set error is 15%, which is already quite high, but the validation set result is even worse, with a 30% error rate. In this case the algorithm has high bias, because it does badly on the training set, and high variance as well. This is the worst of both worlds.

A last example: the training set error is 0.5% and the validation set error is 1%. Users would be happy to see this, a cat classifier with only 1% error. Here both bias and variance are low.

One caveat, stated briefly here and revisited later in the course: this analysis rests on the assumption that human-level error, the error rate achievable by eye, is close to 0%. In general, the optimal error, also called the Bayes error, is nearly 0% for this task, so the analysis holds. If the optimal (Bayes) error were much higher, say 15%, then looking again at the classifier with 15% training error and 16% validation error, a 15% training error would be entirely reasonable: the bias would not be high and the variance would be quite low.

How do you analyze bias and variance when no classifier does well, for example when the pictures are so blurry that neither a human nor any system can classify them accurately? In that case the optimal error is higher and the analysis has to change; let's set those subtleties aside for now. The key points are: by looking at the training set error you can judge the fit to the data, at least on the training data, and so detect a bias problem; and by looking at how much the error increases going from the training set to the validation set, you can judge whether the variance is too high.

This analysis assumes the Bayes error is small and that the training and validation sets come from the same distribution. Without these assumptions the analysis becomes more complicated, and we will discuss that later in the course.
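As a rough illustration of this diagnosis, here is a small hypothetical helper; the 5% threshold is arbitrary, and in practice you weigh the errors against the Bayes error informally rather than with a fixed cutoff:

```python
def diagnose(train_err, dev_err, bayes_err=0.0, gap=0.05):
    """Rough bias/variance diagnosis from error rates (illustrative only)."""
    high_bias = (train_err - bayes_err) > gap    # underfitting the training set
    high_variance = (dev_err - train_err) > gap  # failing to generalize
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "low bias and low variance"

print(diagnose(0.01, 0.11))    # -> high variance (overfitting)
print(diagnose(0.15, 0.16))    # -> high bias (underfitting)
print(diagnose(0.15, 0.30))    # -> high bias and high variance
print(diagnose(0.005, 0.01))   # -> low bias and low variance
```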

On the last slide we saw high bias and high variance separately, and you should have some idea of what a good classifier looks like. But what does high bias together with high variance look like? That is the worst case for both measures.

As we saw before, a classifier like this one, which is close to linear, has high bias because it underfits the data.

But if we modify the classifier a bit (I'll draw it in purple) so that it contorts itself to fit a couple of outliers, the purple classifier ends up with both high bias and high variance. It has high bias because it is still nearly linear and does not fit the overall shape of the data.

A shape more like a quadratic curve would fit this data well. The purple classifier is mostly linear, which gives it high bias, since it does not capture that curved shape; yet the overly flexible part in the middle, which bends to fit those two outlying samples, gives it high variance at the same time.

This example may seem contrived; in two dimensions a classifier with high bias and high variance does not look very natural. But with high-dimensional data, an algorithm can have high bias in some regions of the input space and high variance in others, so such classifiers are not far-fetched at all.

To summarize: we have seen how to diagnose whether an algorithm has high bias, high variance, both, or neither, by examining the training set error and the validation set error. Based on where the algorithm stands on bias and variance, you can decide what to do next. In the next lesson I'll present a basic machine learning recipe that uses this bias/variance diagnosis to help you improve your algorithm more systematically. See you next time.

1.3 Basic Recipe for Machine Learning

Last time we saw how the training error and the validation set error can tell us whether an algorithm suffers from high bias or high variance. That diagnosis lets us apply the following methods more systematically to improve performance.

Here is the basic recipe I use when training neural networks (try these methods; they may or may not help):

After training an initial model, I first ask whether the algorithm has high bias; to evaluate that, look at its performance on the training set (the training data). If the bias is so high that the model cannot even fit the training set, the main options are: pick a new network, for example one with more hidden layers or more hidden units; train longer; or try a more advanced optimization algorithm, which we'll cover later. You can also try other approaches, which may or may not work.

We will see many different neural network architectures shortly, and perhaps you can find a new architecture better suited to the problem. I put this in parentheses because it is one of those things you simply have to try: it may help or it may not. Usually a bigger network helps, and training longer does not always help but rarely hurts. When training a learning algorithm, I keep trying these things until the bias problem is gone; that is the minimum requirement. I iterate until the model fits at least the training set.

If the network is big enough, you can usually fit the training set well, provided you can keep scaling up. (If the images are extremely blurry, no algorithm may fit them; but if a human can classify them, that is, if you believe the Bayes error is not very high, then a bigger network should let you at least fit, or even overfit, the training set.) Once the bias is down to an acceptable level, check whether there is a variance problem. To assess variance, look at validation set performance: does good training set performance carry over to the validation set? If the variance is high, the best solution is to get more data, if you can; that will help. But sometimes we cannot get more data, and then we can try regularization, discussed in the next lesson, to reduce overfitting. Sometimes it is trial and error, and occasionally finding a better-suited architecture kills two birds with one stone, reducing both bias and variance. How? It is hard to give systematic rules; you keep trying until you find an architecture with low bias and low variance, and then you have succeeded.
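The recipe can be summarized as a decision procedure. The sketch below is illustrative (the threshold and the menus of actions are assumptions, not course code), but it captures the order: fix bias first, then variance:

```python
def basic_recipe(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Sketch of the basic recipe: address high bias before high variance."""
    if train_err - bayes_err > tol:
        # High bias: first get the training set fit right.
        return ["bigger network", "train longer", "try another architecture"]
    if dev_err - train_err > tol:
        # High variance: then get generalization right.
        return ["more data", "regularization", "try another architecture"]
    return ["done: low bias and low variance"]
```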

There are two points to note:

First, high bias and high variance are two different situations, and the remedies can be completely different. I usually use the training and validation sets to diagnose whether the algorithm has a bias problem or a variance problem, and then choose remedies accordingly. For example, if the algorithm has a high bias problem, collecting more training data does not really help; at least it is not the most efficient move. So it matters a great deal to know whether the problem is bias, variance, or both; knowing this guides you to the most effective remedies.

Second, in the early days of machine learning there was much discussion of the so-called bias-variance tradeoff, because most of the things you could try would increase bias while reducing variance, or reduce bias while increasing variance; there were few tools that reduced one without affecting the other. In the modern deep learning and big data era, however, as long as you can keep training a bigger network and keep getting more data (not always true, but let's assume it), then a bigger network, with appropriate regularization, almost always reduces bias without hurting variance, and more data almost always reduces variance without hurting bias much. These two steps give us tools to push down bias or variance individually without much adverse effect on the other. I think this is a big reason deep learning has been so useful for supervised learning, and why there is less need to carefully balance bias against variance: we often have options that reduce one without increasing the other. The end result, provided the network is well regularized, is that training a bigger network almost never hurts; the main cost is simply computation time.

Today we covered the basic recipe for machine learning: diagnose bias and variance, then choose the right actions to address them. I mentioned regularization more than once in this lesson; it is a very practical technique for reducing variance. There is a slight bias-variance tradeoff when you use it, since bias may increase a little, but if the network is big enough the increase is usually negligible. Next time we will dig into regularization and understand how to apply it to neural networks.

1.4 Regularization

If you suspect your neural network is overfitting the data, that is, you have a high variance problem, one of the first things to try is regularization. The other solution is to get more data, which is very reliable, but you cannot always get more training data, or acquiring it may be expensive. Regularization, on the other hand, usually helps avoid overfitting and reduce errors on held-out data. Here is how regularization works.

Let's develop these ideas with logistic regression. We want to minimize the cost function J(w, b) = (1/m) Σ_i L(ŷ^(i), y^(i)), the average loss over the training examples, where w and b are the logistic regression parameters: w is an n_x-dimensional parameter vector and b is a real number. To add regularization to logistic regression, you add a term scaled by λ, the regularization parameter; more on λ in a moment.

Specifically, we add (λ/(2m)) · ||w||_2^2, where the squared Euclidean norm ||w||_2^2 = Σ_j w_j^2 = wᵀw is the sum of squares of the components of the parameter vector w. This is called L2 regularization because it uses the Euclidean (L2) norm of the parameter vector w.

Why regularize only the parameter w and not b? You could add a term for b, but I usually omit it: w is typically a very high-dimensional parameter vector carrying nearly all the parameters, and it alone can express a high variance problem, whereas b is just a single number, one parameter among many. Adding it makes little difference, so I leave it out, but it is fine to include it if you want.
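Here is a minimal NumPy sketch of this regularized cost for logistic regression (shapes and variable names are assumptions; `lambd` stands for λ since `lambda` is reserved in Python, as noted later in this section):

```python
import numpy as np

def cost_l2(w, b, X, Y, lambd):
    """J = (1/m) * sum of losses + (lambd / (2m)) * ||w||_2^2  (sketch).
    X has shape (n_x, m), one column per example; Y has shape (1, m)."""
    m = X.shape[1]
    yhat = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))     # sigmoid predictions
    cross_entropy = -np.mean(Y * np.log(yhat) + (1 - Y) * np.log(1 - yhat))
    l2_term = (lambd / (2 * m)) * np.sum(np.square(w))
    return cross_entropy + l2_term
```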

L2 regularization is the most common type of regularization. You may also have heard of L1 regularization, where instead of the squared L2 norm you add a term proportional to the L1 norm of the parameter vector, ||w||_1 = Σ_j |w_j| (whether the constant is λ/(2m) or λ/m does not matter much; it is just a scaling constant).

If you use L1 regularization, w will end up sparse: the vector will contain many zeros. Some people say this helps compress the model, because zero parameters take less memory to store. In practice, however, although L1 regularization does make the model sparse, it reduces memory only a little, so I don't think model compression is really the goal of L1 regularization. When training networks, people tend to use L2 regularization more and more.

One last detail: λ is the regularization parameter. We usually tune it on the validation set (hold-out cross-validation), trying a range of values and looking for the one that best trades off doing well on the training set against keeping the norm of the parameters small to avoid overfitting. So λ is another hyperparameter to tune. By the way, `lambda` is a reserved keyword in Python, so in code we write `lambd` (dropping the "a") to avoid the conflict. That is how you implement L2 regularization for logistic regression. How do we do it in a neural network?

In a neural network the cost function depends on all the parameters W[1], b[1], ..., W[L], b[L], where L is the number of layers. The regularized cost is the average loss over the training examples plus a regularization term: J = (1/m) Σ_i L(ŷ^(i), y^(i)) + (λ/(2m)) Σ_l ||W[l]||_F^2, where the squared matrix norm ||W[l]||_F^2 is defined as the sum of the squares of all the elements of the matrix.

Looking at the indices in that sum of squares: the first sum runs over i from 1 to n[l] and the second over j from 1 to n[l-1], because W[l] is an n[l] × n[l-1]-dimensional matrix, where n[l] is the number of hidden units in layer l and n[l-1] is the number of units in layer l-1.

This matrix norm is known as the "Frobenius norm," written with the subscript F. For somewhat arcane linear algebra reasons it is not called the "L2 norm of a matrix," even though that would sound more natural; by convention it is called the Frobenius norm, and its square is the sum of squares of all the elements of the matrix.
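A quick NumPy check of this definition (illustrative):

```python
import numpy as np

W = np.random.randn(7, 3)               # an example weight matrix W[l]
frob_sq = np.sum(np.square(W))          # ||W||_F^2: sum of squares of all entries
assert np.isclose(frob_sq, np.linalg.norm(W, 'fro') ** 2)
```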

How do we implement gradient descent with this regularization term?

Previously, backprop gave us dW[l], the partial derivative of the cost with respect to W[l], and we updated W[l] := W[l] − α · dW[l], subtracting the learning rate times the gradient.

That was before the regularization term. Now that the regularization term is included, we simply add (λ/m) · W[l] to dW[l] and apply the same update. With this new definition of dW[l], the derivative of the cost function including the regularization term, the update still works, and this is why L2 regularization is sometimes called "weight decay."

Substituting the new definition of dW[l], the update becomes W[l] := W[l] − α[(term from backprop) + (λ/m) W[l]].

Written out: W[l] := W[l] − α(λ/m)W[l] − α(term from backprop) = (1 − αλ/m) · W[l] − α(term from backprop). Whatever W[l] is, the regularization term makes it a bit smaller: the matrix is multiplied by the coefficient (1 − αλ/m), which is less than 1. That is why L2 regularization is also called "weight decay": it behaves like ordinary gradient descent, except that on every step the weight matrix is additionally multiplied by this coefficient slightly less than 1.

I don't usually use that name myself, but it is called weight decay because the two forms are equivalent: each update multiplies the weight matrix by a coefficient slightly less than one.
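A minimal numerical sketch of that equivalence, with random stand-ins for the weight matrix and the backprop gradient:

```python
import numpy as np

alpha, lambd, m = 0.01, 0.7, 1000          # learning rate, lambda, training set size
W = np.random.randn(7, 7)
dW_backprop = np.random.randn(7, 7)        # stand-in for the gradient from backprop

# W := W - alpha * (dW_backprop + (lambd / m) * W)
#    = (1 - alpha * lambd / m) * W - alpha * dW_backprop
W = (1 - alpha * lambd / m) * W - alpha * dW_backprop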

That is how you apply L2 regularization in a neural network. People sometimes ask why regularization prevents overfitting; we take that up next time, to build intuition for how regularization reduces overfitting.

1.5 Why does regularization reduce overfitting?

Why does regularization help prevent overfitting? Why does it reduce variance? Let's work through a couple of examples to build intuition.

Recall these examples from earlier lessons: high bias on the left, "just right" in the middle, and high variance on the right.

Now consider this large, deep, overfitting neural network. I know the drawing isn't big or deep enough, but imagine it is. This is our cost function J, with parameters W[l] and b[l], and we add the regularization term (λ/(2m)) Σ_l ||W[l]||_F^2, which penalizes the weight matrices for being too large; that is the Frobenius norm. So why does shrinking the Frobenius norm of the weights reduce overfitting?

One intuition: if you make the regularization parameter λ large enough, the weight matrices are pushed close to 0. Intuitively, that sets the weights of many hidden units to (nearly) zero, essentially zeroing out much of their influence. If that happened, the big network would be greatly simplified into a very small one, almost as small as a logistic regression unit but stacked very deep, which would move it from the overfitting regime toward the high-bias regime on the left.

But there will be an intermediate value of λ that gives a state closer to the "just right" fit in the middle.

So the intuition is that as λ grows, W[l] is pushed toward zero. In practice the weights do not literally become zero; rather, the influence of many hidden units is reduced and the network behaves more simply, closer and closer to logistic regression. We picture many hidden units being "eliminated," but in fact all of them are still there; each just has a much smaller effect. The network is simpler and so seems less prone to overfitting. I'm not sure how far this intuition carries, but when you implement regularization in the programming exercises you will indeed see the variance go down.

For a second intuition about why regularization prevents overfitting, suppose we use the tanh (hyperbolic tangent) activation function, g(z) = tanh(z).

Notice that if z is very small, confined to the small region near the origin (the red area near the center of the figure), then tanh is in its roughly linear regime; only when z reaches larger or smaller values does the activation become noticeably nonlinear.

Here is the intuition: if the regularization parameter λ is large, the weights W[l] will be relatively small, because large weights are penalized in the cost function. And since z[l] = W[l] a[l-1] + b[l], small weights mean z will be relatively small too (ignoring the effect of b for now). In particular, if z ends up confined to that small central range, then g(z) is roughly linear, so every layer computes something close to a linear function, much like linear regression.

As we discussed in the first course, if every layer is linear, then the whole network is linear: no matter how deep it is, a network with linear activations can only compute a linear function of the input. It therefore cannot fit the very complicated, highly nonlinear decision boundaries that overfit data sets, as in the high-variance case on the earlier slide.

To summarize: if the regularization parameter λ is large, the weights W are small; then z is relatively small (ignoring the effect of b), the activation function operates in its roughly linear range, and the whole network computes something close to a linear function. That is a simple function, not a wildly nonlinear one, and so it cannot overfit.
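A quick numerical check that tanh is nearly linear for small z:

```python
import numpy as np

z = np.array([-0.1, -0.01, 0.01, 0.1])
print(np.tanh(z))    # ~[-0.0997, -0.01, 0.01, 0.0997]: almost exactly z itself
```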

You will see these effects for yourself when you implement regularization in the programming homework. Before wrapping up, one implementation tip: when you add the regularization term, use the newly defined cost function everywhere.

One standard way to debug gradient descent is to plot the cost function J against the number of gradient descent iterations; you should see J decrease monotonically with every iteration. When you add regularization, remember that J has a new definition. If you plot the old J, without the regularization term, you may not see a monotone decrease. To debug gradient descent, be sure to plot the newly defined J, the one that includes the regularization term; otherwise J may not decrease monotonically over the iterations.

That's L2 regularization, the technique I use most often when training deep learning models. Another commonly used technique in deep learning is dropout regularization, which we take up in the next lesson.

1.6 Dropout Regularization

In addition to L2 regularization, there is another very useful regularization method called "dropout." Let's see how it works.

Suppose you are training the neural network above and it is overfitting; that is what dropout addresses. Make a copy of the network. Dropout goes through each layer of the network and sets a probability of eliminating each node. Say that for every node in every layer we flip a coin, so each node has a 0.5 probability of being kept and 0.5 of being eliminated. After deciding which nodes to eliminate, we remove those nodes along with all their incoming and outgoing connections, leaving a smaller, thinned network, and then train one example on it with backprop.

That was one thinned network for one example. For other training examples we again flip coins, keeping a different set of nodes and dropping the others. So for each training example we train using a different thinned network. It may seem odd, randomly zeroing out nodes as you go, but it works. And intuitively, because each example is trained on a much smaller network, you can already start to see why this should act as a regularizer.

How do you implement dropout? There are several ways; the most common, which I'll describe, is inverted dropout. For concreteness I'll use a three-layer network and show how to implement dropout in a single layer (you'll see a lot of 3s in the code).

First we define a vector d3 that represents the dropout mask for the third layer of the network:

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

Here keep_prob is a specific number: it was 0.5 in the earlier example, and in this example it is 0.8. It is the probability that a given hidden unit is kept. So d3 is a random matrix in which, for each example and each hidden unit, the entry is 1 with probability 0.8 (whenever the random number is less than 0.8) and 0 with probability 0.2.

Next, we multiply the activations element-wise by the mask: a3 = np.multiply(a3, d3) (equivalently, a3 = a3 * d3). Each element of d3 is 0 with probability 20%, so the multiplication zeroes out the corresponding elements of a3: wherever d3 is 0, the matching element of a3 becomes 0.

In Python, d3 will be a boolean array with values True and False rather than 1 and 0, but the multiplication still works: Python converts True and False to 1 and 0. Try it yourself.

Finally, we scale a3 back up, dividing it by 0.8, that is, by the keep_prob parameter: a3 /= keep_prob.

Let me explain why. For convenience, suppose the third hidden layer has 50 units, so a3 is 50-dimensional (50 × 1, or (50, m) when vectorized). Units are kept with probability 80% and removed with probability 20%, so on average 10 units are zeroed out. Now consider the next layer's input, z4 = W4·a3 + b4: the expected value of a3 has been reduced by 20%, since 20% of its elements are zeroed. To keep the expected value of z4 unchanged, we divide a3 by 0.8, which corrects or compensates for the missing 20%. This division is the "inverted" part of inverted dropout.

The effect is that whatever you set keep_prob to, 0.8, 0.9, or even 1 (keep_prob of 1 means no dropout at all, since every node is kept), the inverted dropout technique, by dividing by keep_prob, ensures that the expected value of a3 remains the same.

It turns out that this makes the test phase easier: when you evaluate the neural network at test time, the inverted dropout technique (the part in the green box) leaves less of a scaling problem to handle, as we will discuss in the next lesson.

As far as I know, inverted dropout is by far the most common implementation of dropout, and I suggest you practice it. Early versions of dropout did not include the division by keep_prob, so the averages had to be rescaled at test time, making testing more complicated; those versions are rarely used now.
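Putting the pieces together, here is a self-contained sketch of inverted dropout for one layer (the 50×1000 activation shape is only an example):

```python
import numpy as np

keep_prob = 0.8
a3 = np.random.randn(50, 1000)     # layer-3 activations: 50 units, 1000 examples

# Inverted dropout, as described above:
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean mask, ~80% True
a3 = np.multiply(a3, d3)           # zero out ~20% of the units
a3 /= keep_prob                    # scale up so the expected value is unchanged

# At test time, no mask and no scaling are applied; the division above is what
# keeps the expected activations consistent between training and testing.
```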

Note that because the mask d is drawn randomly, different training examples zero out different hidden units. In fact, if you make multiple passes through the same training set, different passes will zero out different sets of hidden units: for a given example, a hidden unit might be zeroed on the first pass of gradient descent and kept on the second. The vector d (here d3 for layer 3) decides which units to zero out, in both forward propagation and backpropagation; here we have only walked through forward propagation.

What about the test phase, where we make predictions? Given the input features x, write a[0] = x for the test example. At test time we do not use dropout; we simply propagate forward as usual: z[1] = W[1] a[0] + b[1], a[1] = g[1](z[1]), z[2] = W[2] a[1] + b[2], and so on,

continuing down to the last layer, which produces the prediction ŷ.

Clearly, at test time we do not use dropout: we do not flip coins to decide which hidden units to eliminate, because we do not want the output of our predictions to be random. Using dropout at test time would only add noise to the predictions. (In theory you could run the prediction many times, each time randomly zeroing different hidden units, and average the results, but that is computationally inefficient and yields nearly the same answer.)

Remember that during training we divided by keep_prob; the purpose was to ensure that, even without any rescaling at test time, the expected value of the activations does not change. So no extra scaling parameter is needed at test time, unlike during training.

That is dropout; you can practice it in this week's programming exercise.

Why does dropout work? Next time we'll build more direct intuition about what dropout is doing.

1.7 Understanding Dropout

Dropout randomly knocks out units in the network. Why does such a drastic procedure work as a regularizer?

A first intuition: a unit cannot rely on any one feature, because any of its inputs may be removed at any time. The unit is therefore reluctant to put too much weight on any single input and instead spreads the weight across its inputs. Spreading the weights has the effect of shrinking their squared norm, similar to the L2 regularization we saw earlier: dropout shrinks the weights and performs an outer-layer regularization that helps prevent overfitting, and the amount of shrinkage differs across weights, depending on the size of the activations being multiplied.

To summarize, dropout is similar to L2 regularization, except that the way it is applied differs slightly, and it can adapt even better to different scales of the inputs.

A second intuition comes from the perspective of a single neuron, say the one circled in purple: its job is to take its inputs and produce some meaningful output. With dropout, its inputs can be randomly eliminated; sometimes these two units are dropped, sometimes others. So the unit cannot rely on any one feature, because each of its inputs could go away at any time. It is unwilling to bet everything on one input or put too much weight on it, since that input may be deleted, so the unit is pushed to spread its weights across all of its inputs. Spreading the weights shrinks their squared norm, and, as with the L2 regularization we discussed earlier, implementing dropout compresses the weights and provides an outer-layer regularization against overfitting.

It turns out that dropout can formally be shown to act as an adaptive form of L2 regularization, in which the penalty on each weight differs depending on the size of the activations it multiplies.

Another implementation detail: suppose this network has three input features. One of the parameters to choose is keep_prob, the probability of keeping a unit in each layer, and keep_prob can vary from layer to layer. The first layer's weight matrix W[1] is 7×3, the second, W[2], is 7×7, the third, W[3], is 3×7, and so on. W[2] is the largest weight matrix, with the most parameters (7×7), so to reduce overfitting of that matrix, its layer (layer 2) should have a relatively low keep_prob, say 0.5. Layers less prone to overfitting can have a higher keep_prob, say 0.7. And for layers where overfitting is no concern, keep_prob can be 1 (I'll mark these in purple). So the value of keep_prob may differ for each layer.

Note that keep_prob = 1 means all units are kept, that is, no dropout in that layer. For layers with many parameters where overfitting is likely, we can set keep_prob lower to apply stronger dropout, a bit like how, with L2 regularization, we may regularize some layers more than others. Technically, dropout can also be applied to the input layer, removing one or more input features, although in practice we rarely do; keep_prob = 1 is the common choice for the input layer, and if you do apply dropout there, use a value close to 1, perhaps 0.9. Eliminating half the input features is very unlikely to be a good idea.
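As an illustration, a per-layer configuration for the example network above might look like this (the values are the ones discussed in the text):

```python
# One keep_prob per layer (illustrative values):
keep_probs = {
    "input":  1.0,   # rarely drop input features; 0.9 at most if you must
    "layer1": 0.7,   # less prone to overfitting
    "layer2": 0.5,   # the big 7x7 weight matrix: strongest dropout
    "output": 1.0,   # no dropout at the output
}
```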

To summarize: if you worry that some layers are more prone to overfitting than others, you can give those layers a lower keep_prob. The drawback is more hyperparameters to search over with cross-validation. An alternative is to apply dropout to some layers and not others, so there is just one keep_prob hyperparameter shared by all the layers that use dropout.

Before closing, two practical notes. Many of dropout's early successes came from computer vision: vision inputs are huge (all those pixels) and there is almost never enough data, so overfitting is common, and dropout is used very frequently in computer vision; for some researchers it is almost a default. Keep in mind, though, that dropout is a regularization method whose job is to prevent overfitting. So unless my algorithm is overfitting, I would not bother using dropout. It is therefore used less in other domains; in computer vision it is common simply because data is usually scarce relative to model size, and intuitively I wouldn't assume that habit generalizes to other disciplines.

One big downside of dropout is that the cost function J is no longer well defined: random nodes are dropped on every iteration, so it becomes hard to double-check the performance of gradient descent. The quantity you would normally plot, a well-defined cost that falls on every iteration, is now hard to compute, so we lose that debugging tool. My usual practice is to turn dropout off (set keep_prob to 1), run the code, and confirm that J decreases monotonically; then turn dropout back on, trusting that no bugs were introduced by the dropout code. There are other checks you can try too, though their effectiveness is less well characterized; you can use them alongside this one.

1.8 Other Regularization Methods

Besides L2 regularization and dropout regularization, there are a few other ways to reduce overfitting in neural networks:

1. Data augmentation

Suppose you are fitting a cat-picture classifier. Getting more training data can help with overfitting, but it is expensive and sometimes impossible. Instead, we can augment the training set with modified copies of existing images: for example, flip each image horizontally and add it to the training set, so the set now contains the original image and its mirror image. Because the flipped images are redundant, doubling the training set this way is not as good as collecting an equal number of new independent images, but it saves the cost of finding more cat pictures.

Besides flipping images horizontally, you can also crop them randomly: this example was rotated and zoomed in, and it still clearly shows a cat.

Through flips and random crops we can augment the data set and generate extra, synthetic training examples. These synthetic examples do not add as much information as brand-new, independent cat pictures, but they come almost for free; the only cost is some computation. Data augmentation is an inexpensive way to give your algorithm more data, regularize it, and reduce overfitting.

When using synthetic data like this, make sure the transformation preserves the label: a horizontally flipped cat is still a cat. Notice I did not flip vertically, because we don't want upside-down training images; but a randomly chosen crop of a larger image is fine, since the cat is likely still in it.

For optical character recognition (OCR), we can augment data by taking digits and applying random rotations and distortions; the distorted digits are still the same digits and can be added to the training set. For illustration I applied a strong distortion, so the number 4 looks wavy; in practice you would distort it only subtly, since these 4s look a bit too twisted. So data augmentation can be used as a regularization technique; in effect it works like regularization.
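A minimal sketch of these two augmentations on a NumPy array (a random array stands in for a real image):

```python
import numpy as np

img = np.random.rand(64, 64, 3)        # stand-in for one 64x64 RGB training image

flipped = img[:, ::-1, :]              # horizontal flip: mirror the width axis

# Random crop (and implicit zoom): a 48x48 window, the cat likely still inside
top = np.random.randint(0, 64 - 48)
left = np.random.randint(0, 64 - 48)
crop = img[top:top + 48, left:left + 48, :]
```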

2. Early stopping

Another commonly used technique is early stopping. While running gradient descent, plot the training error, either the 0-1 classification error on the training set or simply the cost function J being optimized; it should decrease monotonically, as shown in the figure.

With early stopping, we plot not only the training curve but also the validation set error: the classification error on the validation set, or its cost (the logistic or log loss). You will often find that the validation error decreases for a while and then begins to rise at some point. Early stopping says: the network was doing best around this iteration, so stop training here and take the point of lowest validation error. Why does this work?

When you have run only a few iterations, the parameters w are close to zero: with small random initialization, w starts out small, and before you have trained for long it remains small. As you keep iterating, w grows larger and larger; late in training, the parameters of the network may be very large. What early stopping does is halt at a midpoint of this process, leaving you with a mid-sized Frobenius norm ||w||_F^2. So, similar to L2 regularization, by picking a network whose parameter norm is smaller, you hope the network overfits less.

The term "early stopping" refers to stopping the training of the neural network early. I sometimes use it when training networks, but it has a downside, which we will look at next.
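For concreteness, here is what early stopping might look like in code. The model API (`train_one_epoch`, `dev_error`, `get_weights`, `set_weights`) is hypothetical, not from the course:

```python
def train_with_early_stopping(model, patience=5):
    """Stop when the dev error has not improved for `patience` epochs (sketch)."""
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    while bad_epochs < patience:
        model.train_one_epoch()
        err = model.dev_error()
        if err < best_err:          # dev error still falling: keep training
            best_err, best_weights, bad_epochs = err, model.get_weights(), 0
        else:                       # dev error rising: count toward stopping
            bad_epochs += 1
    model.set_weights(best_weights) # roll back to the best point seen
    return model
```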

I think of the machine learning process as comprising several distinct tasks. One is choosing an algorithm to optimize the cost function J, for which we have tools such as gradient descent (and later Momentum, RMSprop, Adam, and so on). But after optimizing the cost function, we also want to avoid overfitting, and we have a separate set of tools for that, such as regularization and getting more data.

In machine learning, hyperparameters proliferate and choosing among viable algorithms gets complicated. I find machine learning much easier when there is one set of tools dedicated to optimizing the cost function J: while focused on that task, your only goal is to make J as small as possible, and you needn't think about anything else. Then preventing overfitting, in other words reducing variance, is a separate task with its own set of tools. This principle is sometimes called orthogonalization: the idea of doing one task at a time. I'll say more about orthogonalization later in the course, so don't worry if the concept is new.

The main downside of early stopping, for me, is that it couples these two tasks, so you can no longer work on them independently: stopping gradient descent early interrupts the optimization of the cost function J (the cost may not yet be small enough) while simultaneously serving as your tool against overfitting. Instead of using separate tools for the two problems, one technique is doing both jobs, and that makes the set of things to think about more complicated.

The alternative to early stopping is to use L2 regularization and train the network as long as possible. I find this makes the hyperparameter search space easier to decompose and easier to search, but the downside is that you have to try many values of the regularization parameter λ, which makes the search computationally expensive.

The advantage of early stopping is that a single run of gradient descent lets you try out small, medium, and large values of w, without ever trying lots of values of the L2 regularization hyperparameter λ.

If you don’t fully understand this concept, that’s okay, we’ll talk about orthogonalization in more detail next time, so it’ll make sense.

Despite that drawback of L2 regularization, many people are willing to use it. Ng personally prefers L2 regularization, trying many different values of λ, assuming you can afford the computation; early stopping gives a similar result from a single training run, without trying lots of λ values.

In this lesson we looked at how to use data augmentation and early stopping to reduce variance and prevent overfitting in neural networks.

1.9 Normalizing inputs

When training a neural network, one way to speed up training is to normalize the inputs. Suppose the training set has two input features, so x is 2-dimensional; normalizing the inputs takes two steps:

  1. Subtract the mean (zero-centering).

  2. Normalize the variances.

    We want both the training set and the test set to be transformed using the same μ and σ², which are computed from the training data.

The first step is to subtract the mean: compute the vector μ = (1/m) Σ_i x^(i) and set x := x − μ for every training example. This translates the training set until it has zero mean.

The second step is to normalize the variances. Note in the figure that feature x1 has much larger variance than x2. Compute σ² = (1/m) Σ_i x^(i) ∘ x^(i) (element-wise squares, taken after the zero-mean step), so σ² is a vector whose elements are the variances of each feature. Then divide every (zero-mean) example element-wise by σ, giving the data shown in the final plot.

Now the variances of x1 and x2 are both equal to 1. One tip: use the same μ and σ² to normalize the test set. In particular, do not estimate μ and σ² separately on the training set and the test set; however they are obtained, use the same values in both transformations, so the test set is scaled in exactly the same way. We want both the training data and the test data to go through the same transformation defined by μ and σ², both computed from the training data.
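A minimal sketch of the two steps, reusing the training-set statistics on the test set (synthetic data; features in rows and examples in columns, following the course convention):

```python
import numpy as np

X_train = np.random.randn(2, 1000) * 5 + 3   # 2 features, 1000 examples
X_test = np.random.randn(2, 200) * 5 + 3

mu = X_train.mean(axis=1, keepdims=True)         # per-feature mean (training set)
sigma2 = X_train.var(axis=1, keepdims=True)      # per-feature variance (training set)

X_train = (X_train - mu) / np.sqrt(sigma2)
X_test = (X_test - mu) / np.sqrt(sigma2)         # same mu, sigma2: never re-estimate
```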

Why do this? To see why we want to normalize the input features, recall the cost function $J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)$.

If you use unnormalized input features, the cost function becomes a very elongated, narrow bowl, and the minimum you're looking for sits at the bottom of it. When the features live on very different ranges, say $x_1$ ranges from 1 to 1000 while $x_2$ ranges from 0 to 1, the resulting ranges of the parameters $w_1$ and $w_2$ end up very different as well. If you drew the contour lines of such a function, they would be long, narrow ellipses.

Whereas if you normalize the features, the cost function looks, on average, much more symmetric. If you run gradient descent on the elongated cost function, you have to use a very small learning rate, because gradient descent may oscillate back and forth and need many iterations before it finds its way to the minimum. But if the contours are more nearly round and spherical, gradient descent can head fairly directly toward the minimum no matter where it starts, and you can take much larger steps instead of zigzagging the way it does in the unnormalized case.

Of course, $w$ is actually a high-dimensional vector, so a 2-D drawing can't convey all of the intuition correctly. But the rough intuition holds: the cost function is more round and easier to optimize when the features are all on similar scales, not ranging from 1 to 1000 for one feature and 0 to 1 for another, but, say, all from $-1$ to $1$ or thereabouts. That makes the cost function easier and faster to optimize.

In fact, if $x_1$ is in the range 0 to 1, $x_2$ is in the range $-1$ to 1, and $x_3$ is in the range 1 to 2, these ranges are similar enough that things will work well.

It's when the features are on very different ranges, like one from 1 to 1000 and another from 0 to 1, that optimization suffers. Setting them all to zero mean with variance one, as we did on the last slide, guarantees that all the features are on a similar scale and usually helps the learning algorithm run faster.

So if the input features come in very different ranges, maybe some from 0 to 1 and some from 1 to 1000, normalizing them is important. If they're already in similar ranges, normalization matters less. Since this kind of normalization never hurts, I usually do it anyway, even when I'm not sure it will improve training or optimization speed.

That's input normalization. Next we'll continue with more ways to speed up the training of neural networks.

1.10 Vanishing/Exploding gradients

One of the problems with training neural networks, especially very deep ones, is vanishing and exploding gradients: when you train a deep network, the derivatives or gradients can sometimes become very large or very small, even exponentially small, and that makes training difficult.

In this lesson, you will see what vanishing and exploding gradients really mean, and how to choose random weight initializations more sensibly to mitigate the problem. Suppose you are training a very deep network like this one; to save space on the slide I've drawn only two hidden units per layer, but each layer could contain more. The network has parameters $W^{[1]}, W^{[2]}, W^{[3]}$, and so on, up to $W^{[L]}$. For simplicity, suppose the activation function is the linear function $g(z) = z$, and ignore the biases, i.e., $b^{[l]} = 0$. In that case the output is $\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x$. If you want to check the math: $a^{[1]} = g(z^{[1]}) = z^{[1]} = W^{[1]}x$, because we're using a linear activation function; then $z^{[2]} = W^{[2]} a^{[1]}$, so $a^{[2]} = g(z^{[2]}) = W^{[2]} W^{[1]} x$. Reasoning on in the same way, you conclude that $\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[1]} x$.

What matters here is how these weight matrices multiply together, not their specific values.

Suppose every weight matrix is $W^{[l]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$, i.e., 1.5 times the identity matrix. Technically, the last one, $W^{[L]}$, has different dimensions, so assume it is the remaining weight matrices that all equal this value. Then $\hat{y} = W^{[L]} \left(1.5\,I\right)^{L-1} x$, so the output grows like $1.5^{L}$. If $L$ is large, as it is for a deep network, $\hat{y}$ becomes very large; in fact it grows exponentially, at a rate of roughly $1.5^{L}$, so for a deep network the value of $\hat{y}$ explodes.

Conversely, if we replace 1.5 with 0.5, a value less than 1, every matrix becomes $W^{[l]} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$ and the product term becomes $0.5^{L-1}$. Again ignoring $W^{[L]}$, if the inputs are $x_1 = x_2 = 1$, the activation values become $\frac{1}{2}$, then $\frac{1}{4}$, $\frac{1}{8}$, $\frac{1}{16}$, and so on, until the last ones are around $\frac{1}{2^{L}}$. So the activation values decrease exponentially as a function of the number of layers $L$: in a deep network, the activations shrink exponentially.

The intuition I want you to take away is this: if the weights are even a little larger than 1, or a little larger than the identity matrix, the activations of a deep network will explode; and if they are a little smaller than 1, say 0.9 instead of 0.5, the activations of a deep network will vanish.
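Here is a quick sketch you can run to see both regimes; the depth $L = 50$ and the scalings are arbitrary choices of mine, not from the course:

```python
import numpy as np

# Toy deep "network": linear activations, b = 0, every W a multiple of I.
L, n = 50, 2
x = np.ones((n, 1))

for scale in (1.5, 0.5):
    W = scale * np.eye(n)
    a = x
    for _ in range(L):
        a = W @ a                    # a^[l] = W a^[l-1]
    print(scale, float(a[0, 0]))     # 1.5 -> ~6.4e8 (explodes), 0.5 -> ~8.9e-16 (vanishes)
```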

Although I have only discussed the activations growing or decaying exponentially with $L$, the same reasoning applies to the derivatives: the gradients, too, increase or decrease exponentially with the number of layers.

For today's neural networks, $L$ can be very large; Microsoft recently made great progress with a 152-layer network. In a network of that depth, if the activations or gradients grow or shrink exponentially with $L$, their values become extremely large or extremely small, making training much harder. In particular, when the gradients are exponentially small, gradient descent takes very, very small steps and needs a long time to learn anything.

To summarize, we talked about the depth of the neural network is how to produce gradient disappear or explosion problem, in fact, in a very long period of time, it was the depth of the training of the neural network resistance, while one cannot completely solve the solution to this problem, but has been on the question of how to choose weights initialization provides a lot of help.

1.11 Weight Initialization for Deep Networks

In the last lesson, we saw the vanishing and exploding gradient problems of deep networks. Now, aimed at that problem, here is a partial solution: it doesn't solve the problem completely, but it's very useful, because it helps us choose the random initialization of the parameters more carefully. To understand it, let's start with the initialization of a single neuron and then extend the idea to the whole deep network.

Let’s look at just one neuron, and then the deep network.

A single neuron might have four input features $x_1, \dots, x_4$, which are processed through $a = g(z)$ to produce $\hat{y}$. Later, when we talk about the deep network, these inputs will be activations $a^{[l]}$, but for the moment call them $x_1, \dots, x_n$.

Ignoring $b$, we have $z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$. To keep $z$ from becoming too large or too small, notice that the larger $n$ is, the smaller you want each $w_i$ to be, because $z$ is a sum of many such terms. The most reasonable approach is to set $\mathrm{Var}(w_i) = \frac{1}{n}$, where $n$ is the number of input features to the neuron. In practice, you set the weight matrix of layer $l$ to `W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1])`, where $n^{[l-1]}$ is the number of units feeding into layer $l$.

It turns out that if you use the ReLU activation function, setting the variance to $\frac{2}{n}$ instead of $\frac{1}{n}$ works better. So when you initialize, especially with ReLU activations, you draw a Gaussian random variable and multiply it by $\sqrt{\frac{2}{n^{[l-1]}}}$, where the quantity under the square root is this variance. Here $n^{[l-1]}$ appears because each neuron in layer $l$ has $n^{[l-1]}$ inputs. If the inputs to the activation function have roughly zero mean and variance 1, then $z$ will also be on a similar scale. This doesn't solve the vanishing and exploding gradient problems outright, but it definitely reduces them, because it sets the weight matrices to reasonable values: not much bigger than 1 and not much smaller than 1, so the gradients neither explode nor vanish too quickly.

There are other variants. The $\frac{2}{n}$ version I just described is for the ReLU activation function and was introduced in a paper by He et al. For other activation functions, such as tanh, there is a paper showing that the constant 1 works better than 2: you use $\sqrt{\frac{1}{n^{[l-1]}}}$, where the square root plays the same role as in the ReLU formula. This version, for the tanh activation function, is called Xavier initialization. Yoshua Bengio and his colleagues also proposed another formula you may have seen in some papers, $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$. In short: if you're using ReLU, the most common activation function, use the $\frac{2}{n^{[l-1]}}$ formula; if you're using tanh, you can use the Xavier formula; and some authors use the Bengio variant.
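A small NumPy sketch of the three schemes just mentioned; the function name and signature are mine, not from the course:

```python
import numpy as np

def init_weights(n_out, n_in, method="he"):
    """Initialize an (n_out, n_in) weight matrix; n_in is n^[l-1], n_out is n^[l]."""
    if method == "he":          # ReLU:  Var(w) = 2 / n^[l-1]
        scale = np.sqrt(2.0 / n_in)
    elif method == "xavier":    # tanh:  Var(w) = 1 / n^[l-1]
        scale = np.sqrt(1.0 / n_in)
    elif method == "bengio":    # Var(w) = 2 / (n^[l-1] + n^[l])
        scale = np.sqrt(2.0 / (n_in + n_out))
    else:
        raise ValueError(method)
    return np.random.randn(n_out, n_in) * scale

W1 = init_weights(4, 3, "he")   # e.g. a layer with 3 inputs and 4 units
```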

In practice, I think all these formulas just give you a starting point: they provide a default variance for the initial weight matrices. If you want, the variance can be one more hyperparameter to tune: you can add a multiplier in front of the formula and tune that multiplier as part of your hyperparameter search. It's not the first hyperparameter I would tune, but I've found that tuning it can help a little; I usually give it a lower priority, because the other hyperparameters matter more.

Hopefully you now have an intuitive understanding of the vanishing and exploding gradient problems and of how to pick reasonable initial values for the weights, so that the weight matrices neither grow too fast nor decay too quickly toward zero, letting you train a deep network whose weights and gradients don't explode or vanish too fast. Choosing the initialization well is another technique for speeding up the training of deep networks.

1.12 Numerical approximation of Gradients

When you implement backprop, there is a test called gradient checking that helps ensure the implementation is correct, because you can write down all these equations without being 100% sure you got every detail of backprop right. To build up to gradient checking, we'll first look at how to compute a numerical approximation of a gradient; in the next lesson we'll discuss how to use it in backprop to verify that the implementation is correct.

Take the function $f(\theta) = \theta^3$ and consider the point $\theta = 1$. Instead of nudging $\theta$ only to the right, put one point at $\theta + \varepsilon$ on the right and one at $\theta - \varepsilon$ on the left, with $\varepsilon = 0.01$ as before. Rather than computing the height-to-width ratio of the small triangle between $\theta$ and $\theta + \varepsilon$, you get a more accurate gradient estimate by taking the height-to-width ratio of the larger triangle spanning $(\theta - \varepsilon, \theta + \varepsilon)$. For technical reasons I won't detail, the larger triangle's height-to-width ratio is much closer to the true derivative at $\theta$. You can think of the big green triangle as containing two smaller ones, one in the top right and one in the bottom left, and by using the big triangle you account for both. So instead of a one-sided difference, we use a two-sided difference.

Writing this out: the upper point of the big green triangle has value $f(\theta + \varepsilon)$ and the lower point has value $f(\theta - \varepsilon)$, so the triangle's height is $f(\theta + \varepsilon) - f(\theta - \varepsilon)$ and its width is $2\varepsilon$. The height-to-width ratio, our estimate of $g(\theta)$, is $\frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$. Plugging in $\theta = 1$ and $\varepsilon = 0.01$ (you can pause the video and check with a calculator), the result is $\frac{1.01^3 - 0.99^3}{0.02} = 3.0001$. Since $g(\theta) = 3\theta^2 = 3$ at $\theta = 1$, these two values are very close, with an approximation error of $0.0001$. On the previous slide we considered only the one-sided difference, $\frac{f(\theta + \varepsilon) - f(\theta)}{\varepsilon} = 3.0301$, whose approximation error is $0.03$, not $0.0001$. So the two-sided difference gets much closer to the true derivative of 3, and we can be much more confident that $g(\theta)$ is probably a correct implementation of the derivative. When you use this method in gradient checking with backprop, it runs about twice as slow as the one-sided difference, but I think it's well worth it because it gives much more accurate results.
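Here is the same computation as a few lines of Python, so you can verify the numbers yourself:

```python
def f(theta):
    return theta ** 3

theta, eps = 1.0, 0.01

two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)
one_sided = (f(theta + eps) - f(theta)) / eps

print(two_sided)   # 3.0001...  (true derivative: g(1) = 3 * 1^2 = 3)
print(one_sided)   # 3.0301
```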

Some of this is calculus you may find familiar: for small values of $\varepsilon$, the formal definition of the derivative used here is $f'(\theta) = \lim_{\varepsilon \to 0} \frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$. If you've taken calculus you've seen the definition of a limit, so I won't go into it here.

For nonzero $\varepsilon$, the approximation error of the two-sided difference is $O(\varepsilon^2)$, and since $\varepsilon$ is very small ($\varepsilon = 0.01$, so $\varepsilon^2 = 0.0001$), this is a very accurate approximation; the big-$O$ notation means the error is some constant times $\varepsilon^2$, and here that constant happens to be about 1. For the one-sided formula, in contrast, the approximation error is $O(\varepsilon)$, and when $\varepsilon < 1$, $\varepsilon$ is actually much larger than $\varepsilon^2$, so the one-sided formula is far less accurate. So when we do gradient checking, we use the two-sided difference $\frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$ rather than the one-sided difference, because the latter is not accurate enough.
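For the record, here is the one-line Taylor-expansion argument behind those error orders (this derivation isn't in the lecture, but it is standard):

$$
\frac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon} = f'(\theta) + \frac{\varepsilon^2}{6}f'''(\theta) + \cdots
\qquad
\frac{f(\theta+\varepsilon)-f(\theta)}{\varepsilon} = f'(\theta) + \frac{\varepsilon}{2}f''(\theta) + \cdots
$$

In the two-sided difference, the even-order terms of the expansions of $f(\theta+\varepsilon)$ and $f(\theta-\varepsilon)$ cancel, leaving an error of order $\varepsilon^2$; in the one-sided difference, the $f''$ term survives, leaving an error of order $\varepsilon$.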

If these two conclusions aren't clear, don't worry; the formulas are all here, and if you're familiar with calculus and numerical approximation there's plenty of material on this. The important thing to remember is that the two-sided difference formula gives more accurate results, and that's what we'll use when we do gradient checking next.

Today we saw how to use a two-sided difference to judge whether a function $g(\theta)$ someone hands you is a correct implementation of the derivative of $f$. Next we can use this method to verify that backpropagation has been implemented correctly; if it hasn't, it might have bugs you need to track down.

1.13 Gradient checking

Gradient checking has saved me a lot of time and has helped me find bugs in backprop implementations several times. Next, let's see how you can use it to debug, or verify, that a backprop implementation is correct.

Suppose your network has parameters $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$. To perform gradient checking, the first thing to do is reshape all of these parameters into one big vector: reshape each matrix $W$ into a vector, then concatenate all of them, producing a giant vector called $\theta$. The cost function $J$, which was a function of all the $W$'s and $b$'s, now becomes a function $J(\theta)$.

Similarly, take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$, which you get in the same order as the parameters; note that $dW^{[l]}$ has the same dimensions as $W^{[l]}$ and $db^{[l]}$ has the same dimensions as $b^{[l]}$. After the same reshape-and-concatenate operation, you can turn all the derivatives into one big vector $d\theta$ with the same dimension as $\theta$. The question now is: how does $d\theta$ relate to the gradient, i.e., the slope, of the cost function $J(\theta)$?
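A minimal sketch of the reshape-and-concatenate step, assuming the parameters live in a dict keyed like "W1", "b1", and so on (the helper names are mine):

```python
import numpy as np

def params_to_vector(params, keys):
    """Flatten each parameter array and concatenate into one giant theta."""
    return np.concatenate([params[k].reshape(-1) for k in keys])

def vector_to_params(theta, shapes, keys):
    """Inverse operation: carve theta back into arrays of the original shapes."""
    out, i = {}, 0
    for k in keys:
        n = int(np.prod(shapes[k]))
        out[k] = theta[i:i + n].reshape(shapes[k])
        i += n
    return out

# Usage sketch:
# keys = ["W1", "b1", "W2", "b2"]
# theta = params_to_vector(params, keys)
# params_back = vector_to_params(theta, {k: params[k].shape for k in keys}, keys)
```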

Here is the gradient checking procedure, commonly referred to as "grad check". First, remember that $J$ is now a function of the giant parameter vector $\theta$; you can expand it as $J(\theta) = J(\theta_1, \theta_2, \theta_3, \dots)$, whatever the dimension of $\theta$ is. To implement grad check, write a loop over each component $i$ and compute a two-sided difference for that component:

$$d\theta_{approx}[i] = \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}$$

Only $\theta_i$ gets the increment; every other component stays the same. And because we're using a two-sided difference, we do the same thing on the other side, subtracting $\varepsilon$ from $\theta_i$ alone while all the other components stay the same.

From the last lesson, $d\theta_{approx}[i]$ should be close to $d\theta[i] = \frac{\partial J}{\partial \theta_i}$, the partial derivative of the cost function. Carry out this computation for every value of $i$, and you end up with two vectors: the approximation $d\theta_{approx}$ and the backprop result $d\theta$, which have the same dimension as each other and as $\theta$. What you then have to do is verify that these two vectors are close to each other.

Specifically, how do you define whether two vectors are really close to each other? I usually compute the following: the distance between the two vectors, $\lVert d\theta_{approx} - d\theta \rVert_2$ (the Euclidean norm: the sum of the squared element-wise differences, followed by a square root, with no additional squaring), normalized by the lengths of the vectors, i.e., divided by $\lVert d\theta_{approx} \rVert_2 + \lVert d\theta \rVert_2$. The denominator turns the formula into a ratio and guards against the vectors being very small or very large overall. In practice I use $\varepsilon = 10^{-7}$; if this ratio comes out around $10^{-7}$ or smaller, that's great: it means the derivative approximation very likely matches your implementation.

If the ratio is around $10^{-5}$, I'd take a careful look. It may be fine, but I would double-check the components of the difference vector, making sure no single component is too large; if one is, there may be a bug.

If the formula on the left comes out around $10^{-3}$, then I'd worry there's a bug; the ratio should be much smaller than that. In that case, check the individual components carefully, looking for specific values of $i$ where $d\theta_{approx}[i]$ and $d\theta[i]$ differ greatly, and use that to trace whether some derivative is being computed incorrectly. After some debugging, when you finally get down to a very small value like $10^{-7}$, your implementation is probably correct.
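Putting the loop and the ratio together, here is a compact sketch; it assumes `J` evaluates the cost on a flattened parameter vector (e.g. via the `vector_to_params` helper sketched earlier) and that `dtheta` is the flattened gradient produced by backprop:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare a numerical gradient of J at theta against backprop's dtheta."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):                 # two-sided difference per component
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    diff = np.linalg.norm(dtheta_approx - dtheta) / (
        np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))
    return diff   # ~1e-7: great; ~1e-5: inspect components; ~1e-3: likely a bug
```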

When I implement a neural network, I often go through this loop: implement forward prop and backprop, find that grad check gives a relatively large value, suspect a bug, debug for a while, and eventually get a very small grad check value; at that point I can say with some confidence that the implementation is correct.

Now you know how gradient checking works. It has helped me find many bugs in my neural network implementations, and hopefully it will help you too.

1.14 Gradient Checking Implementation Notes

In this lesson, I'll share some practical tips and caveats for implementing gradient checking in a neural network.

First, don't run gradient checking during training; use it only for debugging. Computing $d\theta_{approx}[i]$ for every value of $i$ is a very slow computation. To run gradient descent, you use backprop to compute $d\theta$, and only while debugging do you also compute $d\theta_{approx}$ to make sure it's close to $d\theta$. Once you're done, turn gradient checking off; don't execute it on every iteration of gradient descent, because it's far too slow.

Second, if the algorithm fails grad check, look at the individual components to try to locate the bug. That is, if $d\theta_{approx}[i]$ is very far from $d\theta[i]$, look at which values of $i$ produce the largest disagreements. For example, if you find that the disagreeing components all correspond to $db^{[l]}$ for some layer or layers, while the $dW^{[l]}$ components are close to each other (remembering that the components of $\theta$ correspond one-to-one to the components of the $W$'s and $b$'s), you would suspect a bug in how you compute the derivatives of $b$. The reverse is also true: if the disagreeing components all come from $dW^{[l]}$ at some layer, that can help you localize the bug. This won't always pinpoint the bug exactly, but it helps you estimate where you need to look.

Third, when performing grad check, remember the regularization term if you use regularization. If your cost function is $J = \frac{1}{m}\sum_{i} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l} \lVert W^{[l]} \rVert_F^2$, then $d\theta$ is the gradient of $J$ including the regularization term, so remember to include that term in the $J$ you use for the numerical approximation as well.

Fourth, grad check doesn't work together with dropout, because on every iteration dropout randomly eliminates a different subset of hidden units, and there is no easy-to-compute cost function $J$ that dropout is performing gradient descent on. Dropout can be viewed as optimizing some cost function $J$, but that $J$ is defined as a sum over an exponentially large number of subsets of nodes, any of which might be eliminated on a given iteration, so the cost function is very difficult to compute; each dropout pass only samples it by randomly eliminating a different subset. It's therefore hard to use grad check to double-check a computation that involves dropout, and I don't usually run them together. My recommendation is to turn dropout off (set `keep_prob` in dropout to 1.0), use grad check to verify that your algorithm is at least correct without dropout, and then turn dropout back on. You could also do something fancier, such as fixing the pattern of dropped nodes and checking the gradient against that fixed pattern, but in practice I don't usually do that.

Finally, and this is subtler and rarely happens in practice: it's possible for your backprop implementation to be correct only when $w$ and $b$ are close to 0, as they are at random initialization, and to become less and less accurate as gradient descent runs and $w$ and $b$ grow larger. One thing you could do, though I don't do it very often, is run grad check at random initialization, then train the network for a while so that $w$ and $b$ have time to wander away from 0, and then run grad check again.

That's gradient checking. Congratulations: this is the last lecture of the week. Reviewing the week, we talked about how to set up the train/dev/test sets, how to analyze bias and variance and what to do when bias or variance is high, or when both are high at once, how to apply different forms of regularization to a neural network, such as $L2$ regularization and dropout, techniques for speeding up training, and finally gradient checking. We've covered a lot this week, and you can practice these concepts in this week's programming assignment. Good luck, and I look forward to seeing you next week.

References

[1] Deep Learning Courses: Mooc.study.163.com/university/…

[2] Huang Hai-Guang: github.com/fengdu78

[3] GitHub: Github.com/fengdu78/de…