This article is part of the notes for Andrew Ng's deep learning course [1].

Author: Huang Haiguang [2]

Main authors: Huang Haiguang, Lin Xingmu (all of Course 4; Course 5 weeks 1 and 2), Zhu Yansen (all of Course 3), He Zhiyao (Course 5 week 3), Wang Xiang, Hu Han, laughing, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, the cao, Lu Haoxiang, Qiu Muchen, Tang Tianze, Zhang Hao, Victor Chan, endure, jersey, Shen Weichen, Gu Hongshun, when the super, Annie, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian

Editorial staff: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jianyong, Wang Xiang, Xie Shichen, Jiang Peng

Note: The notes, assignments (including data and original assignment files), and videos can all be downloaded from GitHub [3].

I publish the course notes on the official account "Machine Learning Beginners"; you're welcome to follow it.

Week 3: Hyperparameter Tuning, Batch Normalization and Programming Frameworks (Hyperparameter Tuning)

3.1 Tuning Process

Hi, welcome back. By now you've seen that training a neural network involves setting a lot of different hyperparameters. How do you find a good setting for them? In this video, I want to share with you some guidelines and tips for how to systematically organize your hyperparameter tuning process, which I hope will help you converge more efficiently on a good setting of the hyperparameters.

One of the hardest things about training deep networks is the sheer number of hyperparameters you have to deal with, from the learning rate $\alpha$ to the momentum term $\beta$, or, if you use the Adam optimizer, its parameters $\beta_1$, $\beta_2$, and $\varepsilon$. Maybe you also have to pick the number of layers, maybe you have to pick the number of hidden units in the different layers, and maybe you also want to use learning rate decay, so you're not using a single fixed learning rate $\alpha$. And then, of course, you may need to choose the mini-batch size.

It turns out that some of these hyperparameters are more important than others. In my opinion, for the broadest range of learning applications, the learning rate $\alpha$ is the most important hyperparameter to tune.

Next in importance are a few others I'd tune: the momentum term $\beta$, for which 0.9 is a good default; the mini-batch size, to make sure the optimization algorithm runs efficiently; and the number of hidden units. Those are the three I've circled in orange — the ones I'd say are second in importance after $\alpha$. Third in importance are the remaining factors: the number of layers can sometimes make a big difference, and so can learning rate decay. And when using the Adam algorithm, I actually pretty much never tune $\beta_1$, $\beta_2$, and $\varepsilon$ — I almost always use 0.9, 0.999, and $10^{-8}$ — though you can tune them if you want.

But I hope this gives you a rough sense of which hyperparameters matter most: $\alpha$ is undoubtedly the most important, next are the ones I circled in orange, and then the ones I circled in purple. This isn't a strict and fast rule, though, and other deep learning researchers may well disagree with me or have different intuitions.

Now, if you're trying to tune some set of hyperparameters, how do you choose values to try? In the earlier era of machine learning algorithms, if you had two hyperparameters — call them hyperparameter 1 and hyperparameter 2 — the common practice was to sample the points in a grid, like this, and systematically explore these values. Here I've placed a 5×5 grid; in practice the grid could be bigger or smaller than 5×5, but with this example you can try all 25 points and pick whichever hyperparameter setting works best. This practice works OK when the number of hyperparameters is relatively small.

In deep learning, what we tend to do — and what I recommend you do — is choose the points at random. So you could pick the same number of points, say 25, and then try out the hyperparameters on these randomly chosen points. The reason you do this is that it's difficult to know in advance which hyperparameters will be the most important for the problem you're working on, and as you saw previously, some hyperparameters are more important than others.

To take an example, suppose hyperparameter 1 turns out to be $\alpha$, the learning rate, and, to take an extreme example, suppose hyperparameter 2 is $\varepsilon$ from the denominator of the Adam algorithm. In this case the value of $\alpha$ matters a lot, and the value of $\varepsilon$ hardly matters at all. So if you sample the points on a grid and try out the five values of $\alpha$, you'll find that all the points sharing the same $\alpha$ give you essentially the same answer no matter what $\varepsilon$ is. So even though you trained 25 models, you only really tried out five distinct values of $\alpha$, and I think that matters a lot.

In contrast, if you sample at random, you'll have tried out 25 distinct values of the learning rate $\alpha$, so you're more likely to find a value that works really well.
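To make this concrete, here's a minimal sketch — my own illustration, not code from the course — contrasting grid sampling with random sampling for two hyperparameters:

```python
import numpy as np

np.random.seed(0)

# Grid search: 5 x 5 = 25 points, but only 5 distinct values per hyperparameter.
grid_alpha = np.linspace(0.0001, 1, 5)        # hyperparameter 1 (e.g. learning rate)
grid_eps = np.linspace(1e-8, 1e-6, 5)         # hyperparameter 2 (e.g. Adam's epsilon)
grid_points = [(a, e) for a in grid_alpha for e in grid_eps]

# Random search: still 25 points, but 25 distinct values of EACH hyperparameter.
random_points = [(np.random.uniform(0.0001, 1), np.random.uniform(1e-8, 1e-6))
                 for _ in range(25)]

print(len({a for a, _ in grid_points}))       # -> 5 distinct learning rates
print(len({a for a, _ in random_points}))     # -> 25 distinct learning rates
```

If $\varepsilon$ barely affects the result, the grid wastes 20 of its 25 trials, while the random search still explores 25 distinct learning rates.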

I've explained the case of two hyperparameters. In practice, you may be searching over many more. Say you have three hyperparameters: instead of searching over a square, you're now searching over a cube, with hyperparameter 3 on the third dimension, and by sampling within this three-dimensional cube you get to try out many more distinct values of each of your three hyperparameters.

In practice, you might be searching over even more than three hyperparameters, and it's sometimes hard to predict in advance which will turn out to be the most important one for your particular application. Sampling at random rather than on a grid means that you explore more distinct values of the important hyperparameters, whichever they turn out to be.

Another common practice when searching for hyperparameters is to go from coarse to fine.

In the two-dimensional case, for example, after sampling the points you may find that a certain point gave the best result, and maybe some other points around it also did well. In the coarse-to-fine scheme, what you do next is zoom in to a smaller region (the little blue box) and sample more densely — again at random — within that region, concentrating more of your resources there. That is, if you suspect the best hyperparameter setting lies in this region, then after a coarse search over the whole grid, you know you should focus your attention on this smaller square next, and within this smaller square you can sample points more densely. This kind of coarse-to-fine search is used quite often.

By trying out different values of the hyperparameters, you can then pick whichever value does best on your training set objective, or does best on the development set, or whatever it is you're trying to optimize in your hyperparameter search.

Hopefully, this gives you a way to organize your hyperparameter search process systematically. The key takeaways are: sample at random rather than on a grid, and consider using a coarse-to-fine search process. But there's more to hyperparameter search than this: in the next video, I'll talk more about how to choose a reasonable range of values for a hyperparameter.

3.2 Using an appropriate scale to pick Hyperparameters

In the last video, you saw how sampling at random over the range of hyperparameters can let you search more efficiently. But sampling at random doesn't mean sampling uniformly at random over the range of valid values; instead, it's important to pick an appropriate scale on which to explore the hyperparameters. In this video, I'll show you how to do that.

Let's say you're trying to choose the number of hidden units, and you think a value somewhere between 50 and 100 is right. In that case, if you look at the number line from 50 to 100, sampling points uniformly at random along it is a perfectly reasonable way to search this hyperparameter. Or if you're choosing the number of layers $L$ in the network, maybe you think it should be a value between 2 and 4; then sampling 2, 3, or 4 uniformly at random is reasonable, and a grid search over 2, 3, 4 would also be fine. So for a couple of examples like these, sampling uniformly at random over the range you're considering is a reasonable thing to do. But this isn't true for all hyperparameters.

Take a look at this example. Suppose you're searching for the hyperparameter $\alpha$, the learning rate, and you suspect its value should be somewhere between 0.0001 at the low end and 1 at the high end. If you draw the number line from 0.0001 to 1 and sample values uniformly at random along it, about 90% of the values you sample will fall between 0.1 and 1. So you're using 90% of your resources to search between 0.1 and 1, and only 10% of your resources to search between 0.0001 and 0.1. That doesn't look right.

Instead, it makes more sense to search for the hyperparameter on a logarithmic scale: instead of using a linear axis, place 0.0001, 0.001, 0.01, 0.1, and 1 at equal spacings and sample uniformly at random on this log axis. That way, you devote as many resources to searching between 0.0001 and 0.001 as you do between 0.001 and 0.01, and so on.

In Python, you can implement this as r = -4 * np.random.rand(), followed by $\alpha = 10^{r}$. The first line gives a random $r \in [-4, 0]$, so $\alpha \in [10^{-4}, 10^{0}]$: the leftmost value is $10^{-4}$ and the rightmost is $1$.

More generally, if you're sampling on a log scale between $10^{a}$ and $10^{b}$, you compute $a$ as the log base 10 of the smallest value — in this example $a = \log_{10} 0.0001 = -4$ — and $b$ as the log base 10 of the value on the right, here $b = \log_{10} 1 = 0$. Then all you have to do is sample $r$ uniformly at random from the interval $[a, b]$ — here $[-4, 0]$ — and set the hyperparameter to $10^{r}$.

So, to summarize sampling on the log scale: take the log of the minimum value to get $a$, take the log of the maximum value to get $b$, sample $r$ uniformly at random from the interval $[a, b]$ on the log axis, and set the hyperparameter to $10^{r}$. That's how you sample on the logarithmic axis.
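As a sketch, this recipe might look like the following in NumPy (the helper name `sample_log_scale` is my own, not from the course):

```python
import numpy as np

def sample_log_scale(low, high):
    """Sample a hyperparameter uniformly at random on a log scale in [low, high]."""
    a, b = np.log10(low), np.log10(high)   # e.g. low=1e-4, high=1 gives a=-4, b=0
    r = np.random.uniform(a, b)            # r is uniform in [a, b]
    return 10 ** r                         # the hyperparameter is 10^r

alpha = sample_log_scale(0.0001, 1)        # learning rate between 1e-4 and 1
```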

Finally, one other tricky case is sampling the hyperparameter $\beta$, used for computing exponentially weighted averages. Say you suspect $\beta$ should be somewhere between 0.9 and 0.999; that's the range you want to search over. Keep in mind that, when computing exponentially weighted averages, using $\beta = 0.9$ is like averaging over the last 10 values — somewhat like averaging the temperature over the last 10 days — whereas $\beta = 0.999$ is like averaging over the last 1000 values.

So, similar to what we saw on the last slide, if you want to search between 0.9 and 0.999, a linear axis is not the way to go, right? The best way to think about this is to explore the range of $1 - \beta$, which goes from 0.1 down to 0.001. So we'll sample $1 - \beta$ between 0.1 and 0.001, using the method from the previous slide: this end is $10^{-1}$ and this end is $10^{-3}$. One thing worth noting: on the previous slide we wrote the smallest value on the left and the largest on the right, but here the order is reversed — the largest value, 0.1, is on the left and the smallest, 0.001, is on the right. So what you do is sample $r$ uniformly at random in $[-3, -1]$, set $1 - \beta = 10^{r}$, and therefore $\beta = 1 - 10^{r}$; this becomes your randomly sampled value of the hyperparameter, on the appropriate scale. Done this way, you spend as many resources exploring $\beta$ between 0.9 and 0.99 as you do exploring it between 0.99 and 0.999.
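In code, sampling $\beta$ this way might look like the following sketch (again my own illustration):

```python
import numpy as np

# Search beta in [0.9, 0.999] by sampling 1 - beta on a log scale in [0.001, 0.1].
r = np.random.uniform(-3, -1)   # r = log10(1 - beta), uniform in [-3, -1]
beta = 1 - 10 ** r              # e.g. r = -2 gives beta = 0.99
```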

If you want a more formal mathematical justification for why we do this — for why it's a bad idea to use a linear axis — it's that as $\beta$ approaches 1, the results become very sensitive to even small changes in $\beta$. If $\beta$ goes from 0.9 to 0.9005, it hardly matters; your results are pretty much the same.

In both of those cases, you're averaging over roughly 10 values. But if $\beta$ goes from 0.999 to 0.9995, that makes a huge difference to your algorithm: the exponentially weighted average goes from averaging over roughly the last 1000 values to the last 2000 values, because the formula $\frac{1}{1-\beta}$ becomes very sensitive to small changes in $\beta$ as $\beta$ approaches 1. So across the whole process, you want to sample more densely in the regime where $\beta$ is close to 1 — equivalently, where $1-\beta$ is close to 0 — so that you distribute your sample points more efficiently and explore the space of possible outcomes more efficiently.

Hopefully, this helps you choose an appropriate scale on which to sample your hyperparameters. And if you don't end up making the right scaling decision, don't worry too much about it: even if you sample uniformly on the wrong scale, you'll still get decent results provided you sample enough values, especially if you use a coarse-to-fine search, since in later iterations you'll home in on the useful range of hyperparameter values anyway.

I hope this helps you with your hyperparameter search. In the next video, we'll share some thoughts on how to organize your search process so that it works more efficiently.

3.3 Hyperparameters Tuning in Practice: Pandas vs. Caviar

So far, you've heard a lot about how to search for good hyperparameters. Before wrapping up our discussion of hyperparameter search, I want to share with you just a few final tips and tricks for how to organize your hyperparameter search process.

Nowadays, deep learning is applied in many different domains, and the hyperparameter settings from one application area may well transfer to another; the different application areas are cross-pollinating. For example, I've seen clever ideas developed in computer vision, such as ConvNets or ResNets — which we'll cover in a later course — successfully applied to speech recognition, and I've seen ideas that originated in speech recognition successfully applied to NLP, and so on.

One thing the deep learning community does really well is that people working in different application areas read more and more papers from other research areas, looking across fields for inspiration.

When it comes to hyperparameter settings, though, I've seen intuitions go stale. So even if you work on just one problem — say, logistics — you may have found a good set of hyperparameter settings and kept developing the algorithm, and then over the course of several months your data may gradually change, or maybe the servers in your data center get upgraded, and because of those changes your old hyperparameter settings may no longer work well. So my advice is to retest or re-evaluate your hyperparameters at least once every several months, to make sure you're still happy with the values you have.

Finally, on the question of how to search for hyperparameters, I've seen roughly two major schools of thought — two important but different approaches that people take.

One is to babysit a single model. Usually you have a very large data set but not a lot of computational resources — not many CPUs and GPUs — so you can basically afford to train only one model, or a very small number of models, at a time. In that case you might gradually improve the model even while it's training. For example, on day 0 you initialize the parameters at random and start training, and you gradually watch your learning curve — maybe the cost function $J$, or your data set error, or something else — decrease over the first day. At the end of day one you might say: gee, it's learning pretty well, let me try nudging the learning rate up a little and see how it does; and maybe it turns out to do better, and that's your day-two performance. Two days later you say: it's still doing well, maybe I'll tweak the momentum term a little now, or ease the learning rate down a little. And so on into day three — every day you watch it and keep nudging your parameters up or down. Maybe one day you find your learning rate was too big, so you go back to the previous day's model, like this. But the point is you're babysitting the model one day at a time, even as it trains over the course of many days or weeks. So that's one approach: people watching one model, minding its performance, patiently nudging the learning rate up or down — and usually that happens because you don't have enough computational capacity to train a lot of models at the same time.

The other approach is to train many models in parallel. You pick some setting of the hyperparameters and just let the model run on its own, for a day or even several days, and you get a learning curve like this — it could be the cost function $J$, or your training error, or your dev-set error, but in any case some metric you're tracking. And at the same time you can start up a second model with a different setting of the hyperparameters, which generates a different learning curve, maybe one like this (the purple curve) — that one, I'd say, looks better. Meanwhile you can train a third model, which might produce a learning curve like this (the red curve), and another (the green curve), which maybe diverges, like so, and so on. So you're training many different models in parallel — the orange lines are all different models — and this way you can try out a lot of different hyperparameter settings and then just quickly pick whichever one works best. In this example, maybe this one (the green curve at the bottom) looks best.

To make an analogy, I call the approach on the left the panda approach. When pandas have children, they have very few — usually only one at a time — and then they put a lot of effort into raising their baby panda to make sure it survives. So that's really caretaking: one model, like one baby panda. The approach on the right, by contrast, is more like what fish do; I call it the caviar strategy. Some fish lay a hundred million eggs in one mating season, but the way fish reproduce is to produce a lot of eggs and not pay too much attention to any single one, just hoping that one of them — or a group of them — does well. I guess that's the difference between how mammals reproduce and how fish and many reptiles reproduce. I'll call it the panda approach versus the caviar approach, since it's fun and easier to remember.

The choice between these two approaches really comes down to how much computational resource you have. If you have enough computers to train a lot of models in parallel, then by all means take the caviar approach: try a lot of different hyperparameter settings and see what works. But in some application domains — I've seen this in online advertising settings and in computer vision applications — there's so much data, and the models you want to train are so big, that it's difficult to train a lot of models at once; it really depends on the application. In those settings, I see organizations use the panda approach more, where you babysit one model, minding it and tweaking its parameters, trying to make it work. Although, of course, even with the panda approach, having trained one model and seen whether it works or not, maybe in the second or third week you say: maybe I should start up a different model (the green curve) and babysit that one instead — the way pandas, I suppose, can raise several children over a lifetime, even if they only have one, or very few, at a time.

So I hope this helps you learn how to carry out the hyperparameter search process. Now, there's one more technique that can make your neural network much more robust. It doesn't apply to every neural network, but when it does, it can make hyperparameter search much easier and also speed up training. Let's talk about this technique in the next video.

3.4 Normalizing Activations in a network

After the rise of deep learning, one of the most important ideas has been an algorithm called Batch normalization, created by two researchers, Sergey Ioffe and Christian Szegedy. Batch normalization makes your hyperparameter search problem much easier, makes your neural network much more robust to the choice of hyperparameters — the range of hyperparameters that work well is much bigger — and it will also enable you to much more easily train even very deep networks. Let's take a look at how Batch normalization works.

When training a model such as logistic regression, you may remember that normalizing the input features can speed up learning: you compute the means, subtract the means from your training set, compute the variances, and then normalize your data set by the variances. We saw in an earlier video how this can turn the contours of your learning problem from something elongated into something more round, which is easier for an algorithm like gradient descent to optimize. So normalizing the input feature values works for logistic regression, and it works for neural networks too.

What about a deeper model? You don't just have input features $x$; you have activations $a^{[1]}$ in this layer, activations $a^{[2]}$ in that layer, and so on. If you want to train the parameters of, say, $w^{[3]}$ and $b^{[3]}$, then wouldn't it be nice if you could normalize the mean and variance of $a^{[2]}$, to make the training of $w^{[3]}$ and $b^{[3]}$ more efficient? In the logistic regression case, we saw how normalizing $x_1$, $x_2$, $x_3$ helps you train $w$ and $b$ more effectively.

So the question is: for any hidden layer, can we normalize the values of $a$ — in this example the values of $a^{[2]}$, though it could be any hidden layer — so as to train $w^{[3]}$, $b^{[3]}$ faster? Since $a^{[2]}$ is the input to the next layer, it affects the training of $w^{[3]}$ and $b^{[3]}$. In a nutshell, this is what Batch normalization does. Although, strictly speaking, what we actually normalize is not $a^{[2]}$ but $z^{[2]}$: there is some debate in the deep learning literature about whether you should normalize the value before the activation function, $z^{[2]}$, or the value after the activation function is applied, $a^{[2]}$. In practice, normalizing $z^{[2]}$ is done much more often, so that's the version I'll present and the one I'd recommend you use as a default. Here, then, is how Batch normalization works.

In your neural network, you have some intermediate values — say some hidden unit values $z^{(1)}$ through $z^{(m)}$, which come from some hidden layer $l$, so it would be more accurate to write them as $z^{[l](i)}$ for $i$ from 1 to $m$; but I'm going to omit the $[l]$ to simplify the notation on this line. Given these values, you first compute the mean (emphasizing again that all of this is for a specific layer $l$, with the brackets omitted):

$$\mu = \frac{1}{m}\sum_{i} z^{(i)}$$

Then you compute the variance using the formula you'd expect:

$$\sigma^{2} = \frac{1}{m}\sum_{i} \left(z^{(i)} - \mu\right)^{2}$$

Then you take each value $z^{(i)}$ and normalize it: subtract the mean and divide by the standard deviation. For numerical stability, we usually add a small $\varepsilon$ inside the square root in the denominator, just in case $\sigma^{2}$ turns out to be zero:

$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^{2} + \varepsilon}}$$

So now we've taken these $z$ values and standardized them to have mean 0 and standard unit variance — every component of $z$ has mean 0 and variance 1. But we don't want the hidden units to always have mean 0 and variance 1; maybe it makes sense for hidden units to have a different distribution. So what we do instead is compute what I'll call $\tilde{z}$:

$$\tilde{z}^{(i)} = \gamma\, z^{(i)}_{\text{norm}} + \beta$$

Here $\gamma$ and $\beta$ are learnable parameters of your model, so when you use gradient descent, or some other algorithm like gradient descent with momentum, or RMSprop, or Adam, you update $\gamma$ and $\beta$ just as you would update the weights of your neural network.

Notice that the effect of $\gamma$ and $\beta$ is to allow you to set the mean of $\tilde{z}$ to whatever you want. In fact, if $\gamma = \sqrt{\sigma^{2} + \varepsilon}$ — that is, if $\gamma$ equals the denominator above — and $\beta = \mu$, the mean from above, then $\tilde{z}^{(i)} = z^{(i)}$ exactly: the scaling by $\gamma$ and shift by $\beta$ precisely invert the normalization.

So with an appropriate setting of $\gamma$ and $\beta$, the normalization process — these four equations — is essentially just computing the identity function. But by choosing other values of $\gamma$ and $\beta$, you can make the hidden unit values have any other mean and variance you like.
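Putting the four equations together, a minimal NumPy sketch of the Batch norm forward step for one layer's $z$ values might look like this (my own illustration; `Z` holds one layer's $z$ values for a mini-batch, one column per example):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-normalize the pre-activation values Z of one layer.

    Z:     shape (n_units, m), z values for a mini-batch of m examples
    gamma: shape (n_units, 1), learned scale
    beta:  shape (n_units, 1), learned shift
    """
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # mean 0, variance 1
    return gamma * Z_norm + beta              # mean and variance set by beta, gamma

# Sanity check of the identity case: gamma = sqrt(var + eps), beta = mu gives back Z.
Z = np.random.randn(4, 64)
gamma = np.sqrt(Z.var(axis=1, keepdims=True) + 1e-8)
beta = Z.mean(axis=1, keepdims=True)
print(np.allclose(batch_norm_forward(Z, gamma, beta), Z))  # -> True
```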

So the way you fit this into your network is: wherever you previously used these values $z^{(1)}$, $z^{(2)}$, and so on, you now use $\tilde{z}^{(i)}$ instead for the subsequent computations in your neural network. And if you want to make explicit which layer each value belongs to, you can write it as $\tilde{z}^{[l](i)}$.

So the intuition I hope you take away is this: you've seen how normalizing the input features can help learning in a neural network, and what Batch normalization does is apply that same normalization process not just to the input layer, but to values deep inside a hidden layer of the network. You apply Batch norm to the mean and variance of some hidden units' $z$ values. One difference between the training inputs and these hidden unit values, though, is that you might not want your hidden unit values to be forced to have mean 0 and variance 1.

For example, if you have a sigmoid activation function, you may not want all your values concentrated in the middle region; you may want them to have a larger variance, or a mean other than 0, so that you make better use of the nonlinearity of the sigmoid function instead of having all your values land in this roughly linear regime. That's why, with the two parameters $\gamma$ and $\beta$, you can make sure your $z^{(i)}$ values take on whatever range you want. What Batch norm really does is standardize the mean and variance of the hidden unit values, where the mean and variance are controlled by two explicit parameters, $\gamma$ and $\beta$, which the learning algorithm can set to any value. So the hidden unit values have a fixed mean and variance, where that mean and variance can be 0 and 1, or some other values, as controlled by $\gamma$ and $\beta$.

What I hope you'll take away is how to use Batch normalization, at least for a single layer of a neural network. In the next video, I'll show you how to fit Batch norm into a neural network, even a deep neural network — how to make it work for the many different layers of a network. And after that, I'll explain why Batch norm can help you train your network. So if it still seems a bit mysterious why Batch norm works, stick with me; we'll figure it out in the next two videos.

3.5 Fitting Batch Norm into a neural network

You've seen the equations for implementing Batch norm on a single hidden layer; now let's see how it fits into the training of a deep network.

Say you have a neural network like this. As I've said before, you can think of each unit as computing two things: first it computes $z$, then it applies the activation function to compute $a$; so each circle represents a two-step computation. Similarly, the next layer computes $z^{[2]}_{1}$, $a^{[2]}_{1}$, and so on. So if you were not applying Batch norm, you would take the input $x$ and fit it to the first hidden layer, computing $z^{[1]}$ first, governed by the parameters $w^{[1]}$ and $b^{[1]}$; and then, ordinarily, you'd feed $z^{[1]}$ into the activation function to compute $a^{[1]} = g^{[1]}(z^{[1]})$. But with Batch norm, you batch-normalize the value $z^{[1]}$ — abbreviated BN — in a step governed by the parameters $\beta^{[1]}$ and $\gamma^{[1]}$. This gives you the new normalized value $\tilde{z}^{[1]}$, which you then feed into the activation function to get $a^{[1]} = g^{[1]}(\tilde{z}^{[1]})$.

So you've now done the computation for the first layer, with Batch norm happening in between the computation of $z^{[1]}$ and $a^{[1]}$. Next, you take this value $a^{[1]}$ and use it to compute $z^{[2]}$, governed by the parameters $w^{[2]}$ and $b^{[2]}$. Similar to what you did in the first layer, you batch-normalize $z^{[2]}$ — call this BN again — governed by this layer's Batch norm parameters, $\beta^{[2]}$ and $\gamma^{[2]}$. Now you get $\tilde{z}^{[2]}$, use it to compute $a^{[2]}$ via the activation function, and so on.

So the thing to emphasize is that Batch norm happens between computing $z$ and computing $a$. The intuition is that, instead of using the unnormalized value $z^{[1]}$, you use the normalized value $\tilde{z}^{[1]}$ — that's the first layer. Similarly, instead of using the unnormalized value $z^{[2]}$, you use the value $\tilde{z}^{[2]}$, normalized with its own mean and variance. So the parameters of your network are going to be $w^{[1]}, b^{[1]}$ through $w^{[L]}, b^{[L]}$, and in addition, for every layer on which you apply Batch norm, the new parameters $\beta^{[1]}, \gamma^{[1]}$ through $\beta^{[L]}, \gamma^{[L]}$. To be clear: these $\beta^{[l]}$ have nothing to do with the hyperparameter $\beta$ explained in earlier videos, the one used for momentum or for computing exponentially weighted averages. The authors of the Adam paper used $\beta$ for those hyperparameters, and the authors of the Batch norm paper used $\beta$ to denote this parameter ($\beta^{[1]}$, $\beta^{[2]}$, and so on), but these are two completely different $\beta$s. I decided to use $\beta$ in both cases so that you can read the original papers; just remember that the learnable Batch norm parameters $\beta^{[l]}$ are different from the $\beta$ used in the momentum, Adam, and RMSprop algorithms.

So these are now the new parameters of your algorithm, and you can then use whatever optimization algorithm you want — gradient descent, for example — to learn them.

For example, for a given layer $l$, you would compute $d\beta^{[l]}$ and then update the parameter as $\beta^{[l]} := \beta^{[l]} - \alpha\, d\beta^{[l]}$. You can also use Adam or RMSprop or momentum to update the parameters $\beta^{[l]}$ and $\gamma^{[l]}$, not just plain gradient descent.

Even though I've explained in previous videos how Batch norm works — computing the means and variances, subtracting the means and dividing by the standard deviations — if you're using a deep learning programming framework, you usually won't have to implement the Batch norm step yourself. In these frameworks, Batch norm is often a single line of code: for example, in the TensorFlow framework you can use the function tf.nn.batch_normalization to implement it — we'll talk about frameworks later. In practice you won't have to implement all of these details yourself, but knowing how it works lets you better understand what your code is doing. In deep learning frameworks, Batch norm really is often just one line of code.
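For instance, here's a hedged sketch of how that one-liner might be used (assuming TensorFlow 2.x; the variable names are mine, and in practice you'd more likely use a prebuilt layer that also handles the running statistics):

```python
import tensorflow as tf

z = tf.random.normal([128, 64])              # a mini-batch of 128 examples, 64 units
mean, variance = tf.nn.moments(z, axes=[0])  # per-unit mean and variance
gamma = tf.Variable(tf.ones([64]))           # learned scale
beta = tf.Variable(tf.zeros([64]))           # learned shift
z_tilde = tf.nn.batch_normalization(z, mean, variance,
                                    offset=beta, scale=gamma,
                                    variance_epsilon=1e-8)
```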

Now, so far, we've talked about Batch norm as if you were training on your entire training set at once, as if you were using batch gradient descent.

In practice, Batch norm is usually applied with mini-batches of your training set. The way you apply it is: you take your first mini-batch $X^{\{1\}}$ and compute $z^{[1]}$, same as we did on the previous slide, using the parameters $w^{[1]}, b^{[1]}$ on this mini-batch. Then Batch norm subtracts the mean, divides by the standard deviation, and rescales with $\beta^{[1]}, \gamma^{[1]}$, giving you $\tilde{z}^{[1]}$ — all of this on the first mini-batch. Then you apply the activation function to get $a^{[1]}$, then compute $z^{[2]}$ with $w^{[2]}, b^{[2]}$, and so on. You do all of this in order to take one step of gradient descent on the first mini-batch $X^{\{1\}}$.

Then you do similar work on the second mini-batch $X^{\{2\}}$: you compute $z^{[1]}$, then use Batch norm to compute $\tilde{z}^{[1]}$, and in this Batch norm step you normalize using the data of the second mini-batch only — the mean and variance are computed on this mini-batch — and then rescale with $\beta$ and $\gamma$ to get $\tilde{z}^{[1]}$, and so on.

Then you do the same on the third mini-batch $X^{\{3\}}$ and keep training.

Now, I want to clarify one detail about the parameterization. I said earlier that the parameters of each layer are $w^{[l]}$ and $b^{[l]}$, plus $\beta^{[l]}$ and $\gamma^{[l]}$. Note that $z^{[l]}$ is ordinarily computed as $z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$, but Batch norm, done per mini-batch, first normalizes $z^{[l]}$ to mean 0 and standard variance and then rescales it with $\beta$ and $\gamma$. This means that, whatever the value of $b^{[l]}$, it gets subtracted out: in the Batch norm step, you compute the mean of the $z^{[l]}$ values and subtract that mean, so adding any constant to all the examples in the mini-batch changes nothing — any constant you add gets cancelled out by the mean subtraction.

So, if you're using Batch norm, you can actually eliminate the parameter $b^{[l]}$, or, if you prefer, think of it as permanently set to 0. The computation becomes $z^{[l]} = w^{[l]} a^{[l-1]}$; then you compute the normalized $z^{[l]}_{\text{norm}}$, and $\tilde{z}^{[l]} = \gamma^{[l]} z^{[l]}_{\text{norm}} + \beta^{[l]}$. You end up using the parameter $\beta^{[l]}$ to decide the value of the shift — that's why.

So, to summarize: because Batch norm zeros out the mean of the $z^{[l]}$ values in this layer, the bias parameter $b^{[l]}$ is pointless, so you should remove it. Its role is taken over by $\beta^{[l]}$, the parameter that controls the shift or bias term.

And finally, remember the dimensions. Since $z^{[l]}$ has dimension $(n^{[l]}, 1)$ in this example, $\beta^{[l]}$ also has dimension $(n^{[l]}, 1)$, where $n^{[l]}$ is the number of hidden units in layer $l$; the dimension of $\gamma^{[l]}$ is likewise $(n^{[l]}, 1)$, because that's the number of hidden units you have, and $\beta^{[l]}$ and $\gamma^{[l]}$ are used to scale the mean and variance of each hidden unit to whatever the network wants.

Let's put it all together and describe how you implement gradient descent with Batch norm. Assuming you're using mini-batch gradient descent: for $t = 1$ up to the number of mini-batches, you apply forward prop on mini-batch $X^{\{t\}}$, and in each hidden layer, you use Batch norm to replace $z^{[l]}$ with $\tilde{z}^{[l]}$ — that ensures that within this mini-batch, the $z$ values have a normalized mean and variance. Then you use backprop to compute $dw^{[l]}$, $d\beta^{[l]}$, and $d\gamma^{[l]}$ for all the layers $l$ (technically also $db^{[l]}$, but since we've eliminated $b^{[l]}$, that part goes away). Finally, you update the parameters: $w^{[l]} := w^{[l]} - \alpha\, dw^{[l]}$ as before, and likewise for $\beta^{[l]}$ and $\gamma^{[l]}$.
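As a compact sketch of what one such forward step could look like for a single batch-normalized layer (my own illustration, with backprop elided as comments):

```python
import numpy as np

np.random.seed(1)
n_x, n_h, m = 3, 4, 64
X_t = np.random.randn(n_x, m)            # one mini-batch X^{t}
W1 = np.random.randn(n_h, n_x) * 0.01    # note: no b1 -- batch norm removes it
gamma1 = np.ones((n_h, 1))
beta1 = np.zeros((n_h, 1))
eps = 1e-8

# Forward prop on this mini-batch, with batch norm replacing z1 by z1_tilde:
Z1 = W1 @ X_t                            # z = W a (bias omitted)
mu = Z1.mean(axis=1, keepdims=True)
var = Z1.var(axis=1, keepdims=True)
Z1_tilde = gamma1 * (Z1 - mu) / np.sqrt(var + eps) + beta1
A1 = np.maximum(0, Z1_tilde)             # ReLU activation

# Backprop would then give dW1, dgamma1, dbeta1 (no db1), and you would update:
#   W1 -= alpha * dW1; gamma1 -= alpha * dgamma1; beta1 -= alpha * dbeta1
```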

Having computed the gradients this way, you can use plain gradient descent — that's what I've written here — but this also works with gradient descent with momentum, RMSprop, or Adam: rather than the plain mini-batch gradient descent update, you can use any of the other optimization algorithms we discussed in earlier weeks to update the parameters $\beta$ and $\gamma$ that Batch norm added to your algorithm.

Hopefully, that gives you a sense of how you could implement Batch norm from scratch, if you wanted to. If you're using one of the deep learning programming frameworks, which we'll talk about later, hopefully you can just invoke someone else's implementation directly, which makes Batch norm very easy to use.

Now, in case Batch norm still seems a little mysterious — especially if you're not yet sure why it speeds up training so dramatically — let's go on to the next video, where we'll talk in detail about why Batch norm works so well and what it's really doing.

3.6 Why Does Batch Norm Work?

Why does Batch normalization work?

One reason is one you've already seen: you know how normalizing the input features $x$ — making their mean 0 and variance 1 — speeds up learning. Instead of having some features that range from 0 to 1 and some from 1 to 1000, normalizing all the input features to take on a similar range of values speeds up learning. So one intuition for why Batch norm works is that it's doing a similar thing, but for the values in the hidden units and not just for the input features. This is only the tip of the iceberg of what Batch norm is doing, though; there are some deeper principles at work that will give you an even better understanding of its effect. Let's take a look at those together.

A second reason Batch norm works is that it makes weights deeper in your network — say the weights in layer 10 — more robust to changes in the weights in earlier layers of the network, say in layer 1. To explain what I mean, let's look at a vivid example.

Say you've trained a network — maybe a shallow one, like logistic regression, or a deep one — on our famous cat-detection task. But suppose you've trained it on a data set containing only images of black cats. If you now try to apply this network to colored cats — where the positive examples are not just the black cats on the left, but also the colored cats on the right — then your classifier might not do very well.

Stated in a graph: if your training set looks like this, with your positive examples here and your negative examples there (left picture), but you're trying to generalize it to a data set where maybe the positive examples are here and the negative examples are there (right picture), then you might not expect a module trained well on the data on the left to also do well on the data on the right, even if there is some function that does well on both. You just couldn't expect your learning algorithm to discover that green decision boundary by looking only at the data on the left.

This phenomenon — the distribution of your data changing — goes by the somewhat fancy name "covariate shift." The idea is this: if you've learned some $x \to y$ mapping, and the distribution of $x$ changes, then you might need to retrain your learning algorithm. This is true even if the ground-truth function mapping $x$ to $y$ remains unchanged — as it does in this example, because the ground-truth function is "is this picture a cat or not." And the need to retrain your function becomes even more acute, or worse, if the ground-truth function changes as well.

So how does the "covariate shift" problem apply to a neural network? Picture a deep network like this one, and let's look at the learning process from the perspective of this layer, the third hidden layer. The network has learned the parameters $w^{[3]}$ and $b^{[3]}$. From the perspective of the third hidden layer, it gets some values from the earlier layers, and then it has to do some things with those values in the hope of making the output $\hat{y}$ close to the ground-truth value $y$.

Let me cover up the left side for a moment. From the perspective of the third hidden layer, it gets some values, call them $a^{[2]}_{1}$, $a^{[2]}_{2}$, $a^{[2]}_{3}$, $a^{[2]}_{4}$ — but these values might as well be features $x_{1}$, $x_{2}$, $x_{3}$, $x_{4}$, and the job of the third hidden layer is to take these values and find a way to map them to $\hat{y}$. You can imagine the network doing gradient descent so that the parameters $w^{[3]}, b^{[3]}$, as well as $w^{[4]}, b^{[4]}$ and $w^{[5]}, b^{[5]}$, learn to do a good job mapping from the values I've written in black on the left to the output values $\hat{y}$.

Now let's uncover the left side of the network again. The network also has parameters $w^{[2]}, b^{[2]}$ and $w^{[1]}, b^{[1]}$, and if these parameters change, then the values $a^{[2]}$ will change too. So from the perspective of the third hidden layer, these hidden unit values keep changing all the time, and so it suffers from exactly the "covariate shift" problem we talked about on the previous slide.

What Batch norm does is reduce the amount by which the distribution of these hidden unit values shifts around. If I plot the distribution of these hidden unit values — these are really the values $z^{[2]}_{1}$ and $z^{[2]}_{2}$, and I'm plotting two values instead of four so we can visualize it in 2D — Batch norm is saying: the values of $z^{[2]}_{1}$ and $z^{[2]}_{2}$ can still change, and they will change as the network updates the parameters in the earlier layers, but Batch norm ensures that however they change, their mean and variance stay the same. So even as the values of $z^{[2]}_{1}$, $z^{[2]}_{2}$ change, their mean and variance stay at, say, mean 0 and variance 1 — or not necessarily 0 and 1, but whatever values are dictated by $\beta^{[2]}$ and $\gamma^{[2]}$. If the neural network so chooses, it can force them to be mean 0 and variance 1, or any other mean and variance. But what Batch norm does is limit the extent to which updating the parameters in the earlier layers can shift the distribution of values that the third layer sees, and therefore has to learn on.

Batch norm reduces the problem of the input values changing; it really does make these values more stable, so that the later layers of the network have firmer ground to stand on. And even if the input distribution changes a bit, it changes less. What this does is weaken the coupling between what the earlier layers' parameters do and what the later layers' parameters have to do: the earlier layers keep learning, but when they change, the later layers are forced to adapt less. So you can think of it as allowing each layer of the network to learn by itself, a little bit more independently of the other layers, and this has the effect of speeding up learning in the whole network.

So hopefully that gives you better intuition, but the takeaway is that Batch norm means that, especially from the perspective of one of the later layers of the neural network, the earlier layers don't get to shift around as much, because they're constrained to have the same mean and variance, and this makes the job of learning in the later layers easier.

It turns out Batch norm also has a slight regularization effect. One non-intuitive thing about it is this: each mini-batch — say mini-batch $X^{\{t\}}$ with its values $z^{[l]}$ — has its $z$ values scaled by the mean and variance computed on just that one mini-batch. Because the mean and variance are computed on the mini-batch and not on the entire data set, they contain some noise: they're estimated from only, say, 64 or 128 or 256 or however many training examples. And because the mean and variance are a little noisy — estimated from a relatively small sample of data points — the mapping from $z^{[l]}$ to $\tilde{z}^{[l]}$ is also a little noisy, since it's computed using that slightly noisy mean and variance.

So, similar to dropout, this adds some noise to each hidden layer's activations. Dropout has its own way of adding noise: it takes a hidden unit and multiplies it by 0 with some probability and by 1 with some probability, so dropout adds multiplicative noise — multiply by 0 or multiply by 1.

By contrast, Batch norm adds multiplicative noise, because of the scaling by the (noisy) standard deviation, and also additive noise, because of the subtraction of the (noisy) mean — the estimates of both the mean and the standard deviation are noisy. So, similar to dropout, Batch norm has a slight regularization effect: by adding noise to the hidden units, it forces the downstream units not to rely too heavily on any one hidden unit. Like dropout, it adds noise to the hidden layers, and so it has a slight regularization effect. Because the noise added is quite small, it's not a huge regularization effect, and you can use Batch norm together with dropout if you want the more powerful regularization effect of dropout.

Maybe another slightly non-intuitive effect is this: if you use a bigger mini-batch size — say 512 instead of 64 — then by using the bigger mini-batch you reduce the noise, and therefore you reduce the regularization effect. That's one strange property of this dropout-like effect: using a bigger mini-batch size reduces the regularization.

Having said all that, I really wouldn't use Batch norm as a regularizer; that's really not its intent. It just sometimes has this additional intended or unintended effect on your learning algorithm. So don't turn to Batch norm for regularization; think of it instead as a way to normalize your hidden unit activations and thereby speed up learning — the regularization is almost an unintended side effect.

So hopefully this gives you a better understanding of Batch norm. Before we wrap up the discussion of Batch norm, there's one more detail I want to make sure you know. Batch norm handles data one mini-batch at a time, computing the mean and variance on that mini-batch. At test time, when you're trying to make predictions and evaluate the neural network, you might not have a mini-batch of examples — you might be processing one single example at a time — so at test time you need to do something slightly different to make sure your predictions make sense.

In the next and final video on Batch norm, let's go over the details of what you need to do in order for your neural network to apply Batch norm when making predictions.

3.7 Batch Norm at Test Time

Batch norm processes your data one mini-batch at a time, but at test time you might need to process examples one at a time. Let's see how you can adapt your network to do that.

Recall that during training, these are the equations you use to implement Batch norm. Within a single mini-batch, you sum over the $z^{(i)}$ values of that mini-batch to compute the mean — so here $m$ is the number of examples in the mini-batch, not in the whole training set:

$$\mu = \frac{1}{m}\sum_{i} z^{(i)}$$

Then you compute the variance:

$$\sigma^{2} = \frac{1}{m}\sum_{i} \left(z^{(i)} - \mu\right)^{2}$$

Then you compute $z^{(i)}_{\text{norm}}$ by scaling with the mean and standard deviation, with $\varepsilon$ added for numerical stability, and finally $\tilde{z}^{(i)}$ is obtained from $z^{(i)}_{\text{norm}}$ by rescaling with $\gamma$ and $\beta$:

$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^{2} + \varepsilon}}, \qquad \tilde{z}^{(i)} = \gamma\, z^{(i)}_{\text{norm}} + \beta$$

Notice that $\mu$ and $\sigma^{2}$, which you need for the scaling, are computed on the entire mini-batch. But at test time you might not have a mini-batch of 64, 128, or 256 examples to process at the same time, so you need some different way of coming up with $\mu$ and $\sigma^{2}$ — and if you have just one example, the mean and variance of that one example don't make sense. So what's actually done, in order to apply your neural network at test time, is to come up with separate estimates of $\mu$ and $\sigma^{2}$: in typical implementations of Batch norm, you estimate them using an exponentially weighted average, where the average is across the mini-batches. Let me explain in more detail.

Let's pick some layer $l$, and suppose we're going through mini-batches $X^{\{1\}}, X^{\{2\}}, X^{\{3\}}, \ldots$ with their corresponding labels $Y^{\{1\}}, Y^{\{2\}}, \ldots$ and so on. When training on $X^{\{1\}}$ for layer $l$, you get some mean — I'll write it as $\mu^{\{1\}[l]}$, the $\mu$ for the first mini-batch and this layer. When you train on the second mini-batch, for this layer and this mini-batch you get a second value, $\mu^{\{2\}[l]}$. And on the third mini-batch of this hidden layer you get a third value, $\mu^{\{3\}[l]}$. Just as we used exponentially weighted averages before — to compute the running mean of $\theta_{1}, \theta_{2}, \theta_{3}$ when we were estimating the current temperature — you keep a running, exponentially weighted average of the latest mean values you've seen, and that becomes your estimate of the mean for this hidden layer. Similarly, you use an exponentially weighted average to track the values $\sigma^{2\{1\}[l]}, \sigma^{2\{2\}[l]}$, and so on, that you see on the first mini-batch of this layer, the second mini-batch, and so on. So while training the neural network across different mini-batches, you keep a real-time running estimate of the $\mu$ and $\sigma^{2}$ of each layer you're looking at.

And then finally, at test time, for the equation $z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^{2} + \varepsilon}}$, you just take your $z$ value and compute $z_{\text{norm}}$ using your exponentially weighted averages of $\mu$ and $\sigma^{2}$ — whatever their latest values were during training — and then use the parameters $\gamma$ and $\beta$ that you learned during neural network training to compute $\tilde{z} = \gamma\, z_{\text{norm}} + \beta$ for your test example.
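Here's a minimal sketch of tracking $\mu$ and $\sigma^{2}$ with an exponentially weighted average during training and reusing them at test time (my own illustration; the decay value 0.9 is an arbitrary choice):

```python
import numpy as np

np.random.seed(0)
n_units, m = 4, 64
mini_batches = [np.random.randn(n_units, m) for _ in range(10)]  # z values per batch
gamma, beta = np.ones((n_units, 1)), np.zeros((n_units, 1))      # learned in training

decay = 0.9                                  # decay rate of the running averages
running_mu = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))

# During training: update the running estimates on every mini-batch.
for Z in mini_batches:
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = decay * running_mu + (1 - decay) * mu
    running_var = decay * running_var + (1 - decay) * var

# At test time: normalize a single example with the running estimates.
z_test = np.random.randn(n_units, 1)
z_norm = (z_test - running_mu) / np.sqrt(running_var + 1e-8)
z_tilde = gamma * z_norm + beta
```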

So, just to summarize: during training, $\mu$ and $\sigma^{2}$ are computed on an entire mini-batch of, say, 64 or 128 or some number of examples, but at test time you might need to process a single example at a time, so you estimate $\mu$ and $\sigma^{2}$ from your training set. There are many ways to do this — in theory, you could run your whole training set through your final network to get $\mu$ and $\sigma^{2}$ — but in practice, what people usually do is implement an exponentially weighted average, sometimes also called a running average, that tracks the values of $\mu$ and $\sigma^{2}$ you see during training, and then use those values of $\mu$ and $\sigma^{2}$ at test time to do the scaling of the hidden unit $z$ values you need. In practice, this process is pretty robust to exactly how you estimate $\mu$ and $\sigma^{2}$, so I wouldn't worry too much about how you do it; and if you're using a deep learning framework, it will usually have some default way of estimating them that should work just as well. Any reasonable way of estimating the mean and variance of your hidden unit $z$ values should work fine at test time.

So that's it. Using Batch norm, you'll be able to train much deeper networks and get your learning algorithm to run much more quickly. Before we wrap up this week, I want to share with you some thoughts on deep learning frameworks, which we'll discuss together in the next video.

3.8 Softmax Regression

So far, all the classification examples we've talked about have used binary classification, where the label is one of two possibilities, 0 or 1: is this a cat or is this not a cat? What if we have multiple possible classes? There's a generalization of logistic regression called Softmax regression that lets you make predictions where you're trying to recognize one category out of multiple categories, not just two. Let's take a look.

Let's say that instead of just recognizing cats you want to recognize cats, dogs, and baby chicks. I'm going to call cats class 1, dogs class 2, and baby chicks class 3; and if an image belongs to none of the above, then there's an "other" or "none of the above" class, which I'll call class 0. The pictures shown here, with their corresponding classes, are an example: this picture is a baby chick, so it's class 3; cats are class 1; dogs are class 2; this one, I guess, is a koala, so that's none of the above, class 0; the next one is class 3; and so on. As for notation, I'll use capital $C$ to denote the number of classes your inputs can be classified into. In this case, we have four possible classes, including "other," or "none of the above." When you have four classes, the numbers indexing the classes run from 0 to $C - 1$; in other words, 0, 1, 2, 3.

So in this case, we're going to build a neural network whose output layer has four — or, stated generally, $C$ — output units; so $n^{[L]}$, the number of units in the output layer $L$, equals 4, or in general $C$. What we want is for the units of the output layer to tell us the probability of each of these four classes given the input $x$: the first node here (the first circle in the final output) should output, or we want it to output, the probability of the "other" class given $x$. This one (the second circle in the final output) outputs the probability of a cat given $x$. This one outputs the probability of a dog given $x$ (the third circle). And this one outputs the probability of a baby chick — which I'll abbreviate BC — given $x$ (the fourth circle). So the output $\hat{y}$ is going to be a $4 \times 1$-dimensional vector, because it has to output four numbers giving you these four probabilities; and because they're probabilities, the four numbers in the output should sum to 1.

The standard way to get your network to do this is to use a Softmax layer as the output layer to generate these outputs. Let me write down the math, and then I'll come back and try to give you some intuition about what Softmax is doing.

In the final layer of the neural network, you compute the linear part of that layer as usual: $z^{[L]} = w^{[L]} a^{[L-1]} + b^{[L]}$; this is the $z$ variable of the final layer $L$. Having computed $z^{[L]}$, you then apply the Softmax activation function, which is a little unusual. First, you compute a temporary variable, which I'll call $t$: $t = e^{z^{[L]}}$, applied element-wise. Since $z^{[L]}$ is $4 \times 1$ in our case — a four-dimensional vector — $t$, the element-wise exponentiation of $z^{[L]}$, is also a $4 \times 1$-dimensional vector. Then the output $a^{[L]}$ is basically the vector $t$ normalized to sum to 1: $a^{[L]} = \frac{t}{\sum_{j=1}^{4} t_{j}}$, so $a^{[L]}$ is also $4 \times 1$, and its $i$-th element is $a^{[L]}_{i} = \frac{t_{i}}{\sum_{j=1}^{4} t_{j}}$ — let me write that down in case what we're doing here isn't entirely clear; we'll go through an example in a second.

To explain in detail, let's look at an example. Suppose you've computed $z^{[L]} = [5, 2, -1, 3]^{T}$, a four-dimensional vector. What we do is exponentiate element-wise to get $t = [e^{5}, e^{2}, e^{-1}, e^{3}]$; if you punch that into your calculator, you get $t \approx [148.4, 7.4, 0.4, 20.1]$. Then we just need to normalize the vector $t$ so that its entries sum to 1: if you add up these four numbers, you get 176.3.

So, for example, the first node here will output $\frac{e^{5}}{176.3} = 0.842$; for this picture, if these are the $z$ values you got, the probability of it being class 0 is 84.2%. The next node outputs $\frac{e^{2}}{176.3} = 0.042$, a 4.2% chance. The next one is $\frac{e^{-1}}{176.3} = 0.002$. And the last one is $\frac{e^{3}}{176.3} = 0.114$ — an 11.4% chance of being class 3, the baby chick class, right? So those are the probabilities of it being class 0, class 1, class 2, and class 3.

The output of the neural network, $\hat{y}$ — that is, $a^{[L]}$ — is a $4 \times 1$ vector whose elements are the four numbers we just computed (0.842, 0.042, 0.002, 0.114); so the algorithm takes the $4 \times 1$ vector $z^{[L]}$ and maps it to four probabilities that sum to 1.

If we summarize the computation from $z^{[L]}$ to $a^{[L]}$ — the whole process of exponentiating, forming the temporary variable $t$, and normalizing — we can wrap it all up into a Softmax activation function: $a^{[L]} = g^{[L]}(z^{[L]})$. What's unusual about this activation function is that it takes a $4 \times 1$ vector as input and outputs a $4 \times 1$ vector. Previously, our activation functions took a single number as input: the sigmoid and ReLU activation functions each take a real number in and output a real number. The Softmax activation function is special because it needs to normalize across all the possible outputs, so it takes a vector as input and outputs a vector.
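Here's a minimal sketch of this activation function that reproduces the worked example above (my own illustration):

```python
import numpy as np

def softmax(z):
    """Softmax activation: exponentiate element-wise, then normalize to sum to 1."""
    t = np.exp(z)            # temporary variable t = e^z
    return t / t.sum()       # for very large z, subtract z.max() first for stability

z = np.array([5.0, 2.0, -1.0, 3.0])
print(softmax(z))            # -> approximately [0.842, 0.042, 0.002, 0.114]
print(softmax(z).sum())      # -> 1.0
```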

What else can a Softmax classifier represent? Let me show you a few examples. Suppose you have two inputs $x_{1}, x_{2}$, and they feed directly into a Softmax layer with three or four or more output nodes, outputting $\hat{y}$. So I'm showing you a neural network with no hidden layer: all it does is compute $z^{[1]} = w^{[1]} x + b^{[1]}$, and then the output $a^{[1]} = \hat{y}$ is the Softmax activation function applied to $z^{[1]}$. This neural network with no hidden layer should give you a sense of the kinds of things a Softmax function can represent.

Here's one example (the left picture) with just the raw inputs $x_{1}$ and $x_{2}$; a Softmax layer with $C = 3$ output classes can represent this type of decision boundary. Notice these are several linear decision boundaries, but they nonetheless allow the data to be partitioned into three classes. In this figure, what we did is take the training set shown and train a Softmax classifier on the data's three output labels. The colors in the figure show the thresholds of the Softmax classifier's outputs: each input is colored according to whichever of the three outputs has the highest probability. So we can see that this is something like a generalization of logistic regression, with linear-like decision boundaries, but with more than two classes: rather than the class labels being just 0 and 1, they can be 0, 1, or 2.

Here's (the middle picture) another example of the decision boundaries a Softmax classifier can represent when trained on a data set with three classes, and here's one more (the right picture). Right — but one intuition to take away is that the decision boundary between any two classes is linear. That's why you see, for instance, that the boundary between the yellow and red classes is a linear boundary; the boundary between purple and red is also linear; and the decision boundary between purple and yellow is linear as well. But the classifier is able to use these different linear functions to partition the space into three classes.

Let's look at some examples with more classes. In this example (the left picture) there are more classes — now there's a green class as well — and Softmax can still represent these types of linear decision boundaries between the multiple classes. Here's another example (the middle picture) with yet more classes, and a final example (the right picture) with even more. These show the kinds of things a Softmax classifier with no hidden layers can do; a much deeper neural network, of course, with some hidden units, and then more hidden units, and so on, can learn even more complex non-linear decision boundaries to separate the multiple classes.

So I hope this gives you a sense of what the Softmax layer, or the Softmax activation function, in a neural network can do. In the next video, let's take a look at how you can train a neural network that uses a Softmax layer.

3.9 Training a Softmax Classifier

In the last video you learned about the Softmax layer and the Softmax activation function. In this video, you'll deepen your understanding of Softmax classification and learn how to train a model that uses a Softmax layer.

Recall our earlier example, in which the output layer computes $z^{[L]}$ as follows. With four classes, $z^{[L]}$ is a $4 \times 1$ vector — say $z^{[L]} = [5, 2, -1, 3]^{T}$. We compute the temporary variable $t$ by exponentiating the elements, and finally, if the activation function of your output layer is the Softmax function, the output looks like this:

The simple way to see it is that you normalize by the sum of the temporary variable $t$, so that the entries sum to 1: this becomes $a^{[L]} = \hat{y} \approx [0.842, 0.042, 0.002, 0.114]^{T}$. Notice that the largest element of $z^{[L]}$ is 5, and the largest probability ends up being the first one.

The name Softmax comes from contrasting it with what's called a hardmax. A hardmax would map the vector $z$ to the vector $[1, 0, 0, 0]^{T}$: the hardmax function looks at the elements of $z$ and puts a 1 in the position of the largest element and 0s everywhere else — a very "hard" max, with all the other outputs 0. Softmax, by contrast, does a gentler mapping from $z$ to these probabilities. I don't know if it's a great name, but at least that's the intuition behind it: Softmax as opposed to hardmax.

One thing I didn't really go into detail on, but mentioned before, is that Softmax regression, or the Softmax activation function, generalizes the logistic activation function to $C$ classes rather than just two. It turns out that if $C = 2$, Softmax essentially reduces to logistic regression. I'm not going to prove this in this video, but the rough outline is that if $C = 2$ and you apply Softmax, the output layer will output two numbers — maybe 0.842 and 0.158, say — and these two numbers must sum to 1. Because they must sum to 1, they're redundant: you don't really need to compute both, only one of them, and the way you then compute that one number reduces to the way logistic regression computes its single output. That's not much of a proof, but the conclusion we can draw from it is that Softmax regression generalizes logistic regression to more than two classes.

Next, let's look at how to train a neural network with a Softmax output layer. Specifically, let's define the loss function you'd use to train the network. As an example, take a sample from the training set whose target output — its ground-truth label — is $y = [0, 1, 0, 0]^{T}$. Using the example from the previous video, this means it's a picture of a cat, since it belongs to class 1. Now say your neural network outputs $\hat{y} = a^{[L]} = [0.3, 0.2, 0.1, 0.4]^{T}$, a vector of probabilities summing to 1 — you can check that they sum to 1. The neural network is doing badly on this sample: this is actually a cat, yet it assigned only a 20% probability to it being a cat, so it performed poorly here.

So what loss function do you use to train this neural network? The loss typically used in Softmax classification is $L(\hat{y}, y) = -\sum_{j=1}^{4} y_{j} \log \hat{y}_{j}$. Let’s look at the single sample above to better understand what happens. In this sample $y_{1} = y_{3} = y_{4} = 0$ and only $y_{2} = 1$, so when you sum over $j$, all the terms with $y_{j} = 0$ drop out and you are left with $-y_{2} \log \hat{y}_{2} = -\log \hat{y}_{2}$.
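A quick numerical check of that reduction, as a minimal sketch using the sample above:

import numpy as np

y    = np.array([0., 1., 0., 0.])      # true label: class 1 (cat)
yhat = np.array([0.3, 0.2, 0.1, 0.4])  # the network's poor prediction

# With a one-hot y, the full sum reduces to -log(yhat[1]).
loss = -np.sum(y * np.log(yhat))
print(loss)                            # -log(0.2), approximately 1.609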

What this means for the learning algorithm is the following: gradient descent tries to reduce the loss on the training set, and the only way to make $-\log \hat{y}_{2}$ smaller is to make $\hat{y}_{2}$ larger. Since these are probabilities, $\hat{y}_{2}$ cannot be greater than 1, but that makes sense: in this case the picture is a cat, so you want the probability the network outputs for that class (the second element) to be as large as possible.

In a nutshell, what the loss function does is look at whatever the true category is in your training set and try to make the corresponding probability of that category as high as possible. If you’re familiar with maximum likelihood estimation in statistics, this is in fact a form of maximum likelihood estimation. But if you don’t know what that means, don’t worry; the algorithmic intuition we just went through is enough.

That’s the loss of a single training sample. What about the cost of the entire training set, the cost as a function of the parameters $W^{[1]}, b^{[1]}$ and so on? Its definition is what you can probably guess: the average of the losses over the entire training set, summed over the predictions your training algorithm makes on all $m$ training samples, $J = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$.

So what you want to do is use gradient descent to minimize this cost.

And then there’s one last implementation detail. Notice that since each $y$ is a 4×1 vector and each $\hat{y}$ is also a 4×1 vector, if you vectorize, the matrix $Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}]$; for example, if the sample above is your first training sample, the first column of $Y$ is $[0, 1, 0, 0]^{T}$, and $Y$ ends up being a 4×m matrix. Similarly, $\hat{Y} = [\hat{y}^{(1)}, \hat{y}^{(2)}, \ldots, \hat{y}^{(m)}]$ stacks the outputs column-wise, so $\hat{Y}$ is itself a 4×m matrix.
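As a sketch of the vectorized cost over such 4×m matrices (the second column here is made up purely for illustration):

import numpy as np

# Columns are samples: column 0 is the cat example above, column 1 is made up.
Y    = np.array([[0., 1.],
                 [1., 0.],
                 [0., 0.],
                 [0., 0.]])
Yhat = np.array([[0.3, 0.7],
                 [0.2, 0.1],
                 [0.1, 0.1],
                 [0.4, 0.1]])

# J = (1/m) * sum_i L(yhat^(i), y^(i)), vectorized over the columns.
J = -np.mean(np.sum(Y * np.log(Yhat), axis=0))
print(J)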

And then finally, let’s see how to do gradient descent with a Softmax output layer. The output layer computes $z^{[L]}$, which is C×1, in this case 4×1; you then apply the Softmax activation function to get $a^{[L]} = \hat{y}$, and from that you can compute the loss. We’ve already discussed how to implement the forward-propagation steps of the neural network to get these outputs and compute the loss, so what about the back-propagation step, or gradient descent? The key step, the key equation you need to initialize back propagation, is $dz^{[L]} = \hat{y} - y$: a 4×1 vector minus a 4×1 vector, so $dz^{[L]}$ is itself 4×1 when you have four classes, and C×1 in general. As in our usual convention, $dz^{[L]}$ is the partial derivative of the loss with respect to $z^{[L]}$. If you’re good at calculus you can derive this yourself, but using the expression directly works just as well if you ever need to build this from scratch.

And with that, you can compute $dz^{[L]}$ and then start the back-propagation process, computing all the derivatives you need in the whole neural network.
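For concreteness, the initialization step on the sample above, as a minimal sketch:

import numpy as np

y    = np.array([[0.], [1.], [0.], [0.]])
yhat = np.array([[0.3], [0.2], [0.1], [0.4]])

dz = yhat - y                  # dz^[L] = yhat - y, the start of back propagation
print(dz.ravel())              # [ 0.3 -0.8  0.1  0.4]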

But in this week’s programming exercise we will start to use a deep learning programming framework. With these frameworks you usually only need to focus on getting the forward propagation right: once you specify it to the framework, the framework figures out how to do back propagation and implements it for you. The expression $dz^{[L]} = \hat{y} - y$ is still worth remembering in case you ever need to implement Softmax regression or Softmax classification from scratch, but you won’t need it in this week’s exercise, because the programming framework will take care of the derivative calculations for you.

So much for Softmax classification, which lets you use learning algorithms to divide your input into not just two categories but $C$ different categories. What I want to do next is show you some deep learning programming frameworks that can make you much more efficient at implementing deep learning algorithms; let’s talk about that in the next video.

3.10 Deep Learning Frameworks

You’ve learned to implement deep learning algorithms pretty much from scratch using Python and NumPy, and I’m glad you did, because I want you to understand what these deep learning algorithms actually do. But you’ll find that unless you move to more complex models, such as convolutional neural networks or recurrent neural networks, or start working with very large models, it becomes less and less practical, at least for most people, to implement everything yourself from scratch.

Fortunately, there are many good deep learning software frameworks that can help you implement these models. An analogy: I assume you know how to multiply two matrices, and you should also know how to program two-matrix multiplication yourself, but when you’re building a large application you probably don’t want to write your own matrix multiplication function; instead you call a numerical linear algebra library, which is more efficient. Still, it’s useful to know what multiplying two matrices actually does. I think deep learning is now mature enough that a few frameworks are more practical and more efficient, so let’s take a look at some of them.

There are a number of deep learning frameworks that make it easier to implement neural networks, so let’s talk about the main ones. Each framework is aimed at a particular kind of user or development group, and I think each of the frameworks here is a reliable choice for some class of applications. Many people write articles comparing these frameworks and how actively they’re developed, and the frameworks themselves often improve, advancing every month, so if you want to read about their pros and cons I’ll leave that to your own research online. I think a lot of them are getting better very quickly, so rather than give you a strong recommendation, I’ll share with you the criteria I would use for choosing a framework.

One of the key criteria is ease of programming, both in the development and iteration of neural networks, but also in configuring the product for actual use by tens of millions or even hundreds of millions of users, depending on what you want to do.

The second important criterion is speed, especially when training large data sets, and there are frameworks that allow you to run and train neural networks more efficiently.

Another criterion that people don’t talk about very often, but that I think is important, is whether the framework is truly open: to be truly open, it must not only be open source but also be well governed. Unfortunately, in the software industry some companies have a history of open-sourcing software while keeping full control of it, and as the years pass and people come to depend on their software, they gradually close off resources that were once open or move functionality into their proprietary cloud services. So one thing I would look at is whether you can trust that the framework will remain open source for a long time, rather than being under the control of a single company that might, for whatever reason, choose to stop open-sourcing it in the future, even if the software is currently released as open source. At least in the short term, depending on your language preference, whether you prefer Python, Java, C++ or something else, and depending on the application you’re developing, whether it’s computer vision, natural language processing or online advertising, I think several of these frameworks are good choices.

So much for frameworks. By providing a higher level of abstraction than a numerical linear algebra library, each of these frameworks can make you more efficient when developing deep learning applications.

3.11 TensorFlow

Welcome to the last video of the week. There are many great deep learning programming frameworks; one of them is TensorFlow, and I’m really looking forward to helping you get started with it. What I want to do in this video is show you the basic structure of a TensorFlow program, and then let you practice and learn more details on your own and apply them to this week’s programming exercise, which will take some time to do, so be sure to set aside some free time.

First, a motivating problem. Suppose you have a loss function $J(w)$ to minimize; in this example I’ll use the highly simplified loss function $J(w) = w^{2} - 10w + 25$. You may have noticed that this function is actually $(w-5)^{2}$: if you expand that quadratic you get the expression above, so the minimum is at $w = 5$. But suppose we didn’t know that and only had the function itself. Let’s see how to minimize it with TensorFlow, because a very similar program structure can be used to train a neural network, where there can be some complicated loss function $J(W, b)$ depending on all the parameters of your network; similarly, you can have TensorFlow automatically find the values of $W$ and $b$ that minimize the loss function. But let’s start with the simpler example.

I’m running Python in my Jupyter Notebook,

import numpy as np
import tensorflow as tf
# import TensorFlow


w = tf.Variable(0, dtype=tf.float32)
# Next, let's define the parameter w. In TensorFlow, you use tf.Variable() to define a parameter.


# Then we define the loss function:


cost = tf.add(tf.add(w**2,tf.multiply(- 10.,w)),25)
# That defines the loss function J. Then we write:
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
# (we use a 0.01 learning rate; the goal is to minimize the cost).


# The following lines are idiomatic expressions:


init = tf.global_variables_initializer()
session = tf.Session()  # this opens a TensorFlow session
session.run(init)       # to initialize the global variables


# Then let TensorFlow evaluate a variable that we will use:


session.run(w)
# The lines above initialize w to 0 and define the cost function; we defined "train" as our learning algorithm, which uses the gradient descent optimizer to minimize the cost. But we haven't actually run the learning algorithm yet: session.run(w) just evaluates w, and lets me print its value:


print(session.run(w))

So if we run this, it evaluates w to 0, because we haven’t run the learning algorithm yet.

Now let’s enter session.run(train). All it does is run one step of gradient descent. Then let’s evaluate w again with print(session.run(w)) and see what it is.

Now we run gradient descent 1000 iterations:
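The loop itself appears to have been dropped from these notes; presumably it is simply the following, reusing the session and train objects defined above:

for i in range(1000):
    session.run(train)   # each call runs one step of gradient descent
print(session.run(w))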

After those 1000 iterations of gradient descent, w comes out as 4.99999. Remember we said we want to minimize $J(w) = (w-5)^{2}$, so the optimal value of w is 5, and this is pretty close.

So hopefully this gives you a sense of the general structure of a TensorFlow program, and some of the functions I’m using here will become more familiar as you do the programming exercise and use more TensorFlow code. One thing to notice is that w is the parameter we want to optimize, so we declare it as a variable. All we had to do was define a cost function using functions like tf.add and tf.multiply; TensorFlow knows how to take derivatives of add and multiply, as well as of other functions, which is why you basically only need to implement forward propagation: it figures out how to do the back propagation and gradient computation, because those are already built into add, multiply, squaring and the other functions.

And by the way, if you think that’s an ugly way to write the cost: TensorFlow actually overloads the usual arithmetic operators, so you can rewrite the cost in a more elegant form, comment out the old line, run it again, and get the same result.

Once w is declared as a TensorFlow variable, squaring, multiplication, addition and subtraction are all overloaded, so you don’t have to use the ugly syntax above.
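A sketch of that more elegant form, replacing the tf.add/tf.multiply line above:

cost = w**2 - 10*w + 25   # operator overloading on the TensorFlow variable w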

One other feature of TensorFlow I want to show you: this example minimizes a fixed function of w. What if the function you want to minimize depends on your training set? Whatever training data you have, the data will change as you train the neural network, so how do you get training data into a TensorFlow program?

I’m going to define a variable x; think of it as playing the role of the training data. (Real training data has both x and y, but in this simplified example there is only x.) Define it as:

x = tf.placeholder(tf.float32, [3,1]), which makes x a 3×1 array. What I want to do is take the three fixed coefficients in front of the terms of this quadratic function, the numbers 1, -10 and 25, and turn them into data. So I’ll replace the cost line with:

cost = x[0][0]*w**2 + x[1][0]*w + x[2][0], so that x becomes the data controlling the coefficients of this quadratic function. The placeholder function tells TensorFlow that you will supply the values for x later.

coefficients = np.array([[1.],[-10.],[25.]]). This is the data we’re going to feed in. Finally, we need some way to plug this array of coefficients into the variable x, and the syntax for that is to supply the value in the training step, which I’ll write here:

session.run(train, feed_dict={x: coefficients})

Okay, hopefully there are no syntax errors, so let’s rerun it and hopefully get the same result as before.
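Putting the pieces together, the placeholder version of the program looks roughly like this (a sketch consolidating the fragments above):

import numpy as np
import tensorflow as tf

coefficients = np.array([[1.], [-10.], [25.]])   # the data we feed in later

w = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])           # value supplied at run time
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]        # coefficients come from the data
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)

for i in range(1000):
    session.run(train, feed_dict={x: coefficients})
print(session.run(w))                            # approximately 5, as before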

Now if you want to change the coefficients of this quadratic function, suppose you replace

coefficients = np.array([[1.],[-10.],[25.]])

with

coefficients = np.array([[1.],[-20.],[100.]])

Now the function becomes $(w-10)^{2}$, so if I rerun it, hopefully the minimizing value comes out as 10. Let’s see: good, after 1000 iterations of gradient descent we get a value close to 10.

A placeholder in TensorFlow is a variable whose value you assign later. It’s a convenient way to get your training data into the cost function: you feed the data in with this feed_dict syntax when you run a training iteration, setting x = coefficients. And if you’re doing mini-batch gradient descent, where you need to feed in a different mini-batch on each iteration, then on each iteration you use feed_dict to feed in a different subset of your training set, the different mini-batches, wherever the cost function needs data.
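A sketch of that mini-batch pattern, reusing x, train and session from above; the two coefficient vectors here merely stand in for hypothetical mini-batches:

batches = [np.array([[1.], [-10.], [25.]]),
           np.array([[1.], [-20.], [100.]])]
for i in range(1000):
    batch = batches[i % len(batches)]        # feed a different "mini-batch" each iteration
    session.run(train, feed_dict={x: batch})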

Hopefully this gives you an idea of what TensorFlow can do. What makes it so powerful is that you only have to specify how to compute the loss function and it takes the derivatives for you; with one or two lines of code you can use the gradient descent optimizer, the Adam optimizer, or others.

This is the code that I just wrote, but I’ve cleaned it up a little bit, and although these functions or variables may seem a little mysterious, they’ll become familiar if you practice them a few times as you do your programming exercises.

One last point I’d like to mention: these three lines (init = tf.global_variables_initializer(), session = tf.Session(), session.run(init); shown in blue curly braces in the video) are idiomatic in TensorFlow, and some programmers use the following form as an alternative, which does essentially the same thing.
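A sketch of that alternative form:

with tf.Session() as session:
    session.run(init)
    print(session.run(w))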

The with structure is used in many TensorFlow programs. It means essentially the same thing as the version above, but Python’s with statement cleans up more gracefully if an error or exception occurs while the inner block executes, so you’ll also see it in the programming exercises. Now, what does this code actually do? Let’s look at this line:

cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]   # i.e. (w-5)**2

The heart of a TensorFlow program is something that computes a cost function; TensorFlow then automatically works out the derivatives and how to minimize that cost. What this equation, this line of code, is doing is asking TensorFlow to build a computation graph. The graph takes x[0][0] and w, squares w, multiplies them together, and so on through the other terms, until eventually the whole expression is assembled and you get the cost function.

The advantage of TensorFlow is that by computing the cost through such a computation graph, you have basically specified forward propagation, and TensorFlow already has all the necessary backward functions built in. Recall the set of forward functions and the set of backward functions used when training deep neural networks: programming frameworks like TensorFlow have the backward functions built in, which is why, if you compute the forward function with built-in operations, TensorFlow can automatically use the backward functions to compute the derivatives for you, even when the function is very complicated. This is why you don’t have to explicitly implement back propagation, and it’s one of the reasons programming frameworks can make you productive.

If you look at the TensorFlow documentation, I’ll just point out that it draws the computation graph with a different notation than I use. Instead of writing out the intermediate values, as I do here, the documentation’s graphs label the nodes with operators: there is a squaring operation, the two inputs pointing into a multiplication, and so on, and then a final node, I believe an addition, that produces the final value.

For the purposes of this course, I think the first way of drawing the graph is easier to understand, but if you look at the TensorFlow documentation you’ll see graphs in the other representation, where the nodes are labeled with operations rather than values; the two representations describe the same graph.

You can do a lot with one line of code in a programming framework: for example, if you don’t want to use gradient descent and want the Adam optimizer instead, you only need to change a single line of code, and you’ve quickly swapped in a better optimization algorithm. All modern deep learning frameworks support this kind of capability, which makes it easy to write even complex neural networks.
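For example, a sketch of that one-line change (the 0.01 learning rate here is a hypothetical choice, not from the lecture):

train = tf.train.AdamOptimizer(0.01).minimize(cost)   # swap in Adam for gradient descent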

I hope this helped you understand the structure of a typical TensorFlow program. To summarize the week: you learned how to organize the hyperparameter search process systematically; we also talked about Batch normalization and how to use it to speed up the training of neural networks; and finally we discussed deep learning programming frameworks. There are many great frameworks, and in this last video we focused on TensorFlow. I hope you enjoy this week’s programming exercise, and that it helps you become more familiar with these concepts.

References

[1] Deep Learning Courses: mooc.study.163.com/university/…
[2] Huang Haiguang: github.com/fengdu78
[3] GitHub: github.com/fengdu78/de…