Some people say that deep learning is essentially optimization, but how is it different?

Compiled by: McGL

The most common way to train neural networks today is gradient descent or one of its variants, such as Adam. Gradient descent is an iterative optimization algorithm for finding the minimum of a function. In simple terms, in an optimization problem we are interested in some metric P and want to find a function (or the parameters of a function) that maximizes (or minimizes) that metric on some data (or distribution) D. This sounds just like machine learning or deep learning: we have metrics such as accuracy, or better yet precision/recall or the F1 score; we have a model with learnable parameters (our network); and we have data (training and test sets). Using gradient descent, we “search” for, or “optimize”, the parameters of the model so as to maximize the metric (e.g. accuracy) on the training and test sets.
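As a deliberately tiny sketch of this framing, here is plain-NumPy gradient descent fitting a linear model: the data, the metric (mean squared error), and the learning rate are all invented for illustration, not taken from the article.

```python
import numpy as np

# Toy data D: inputs x and targets y generated from y = 2x + 1 plus noise (made up for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

# Model: y_hat = w * x + b, with learnable parameters w and b.
w, b = 0.0, 0.0
lr = 0.1  # learning rate

for step in range(200):
    y_hat = w * x + b
    loss = np.mean((y_hat - y) ** 2)       # the metric P we minimize on the data D
    grad_w = np.mean(2 * (y_hat - y) * x)  # d(loss)/dw
    grad_b = np.mean(2 * (y_hat - y))      # d(loss)/db
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(f"w = {w:.2f}, b = {b:.2f}, loss = {loss:.4f}")  # w ≈ 2, b ≈ 1
```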

Image source: The General Inefficiency of Batch Training for Gradient Descent Learning

There are at least two major differences between optimization and deep learning that are important for getting better results in deep learning.

The first difference is the metric/loss function. In optimization, we have a single, well-defined metric that we want to minimize (or maximize). Unfortunately, in deep learning we often care about metrics that are impossible or difficult to optimize directly. For example, in a classification problem we might be interested in the model’s accuracy or F1 score. The problem with accuracy and F1 is that they are not differentiable functions: we cannot compute gradients, so we cannot use gradient descent. Therefore, we use proxy metrics/losses such as negative log-likelihood (cross-entropy), hoping that minimizing the proxy will also maximize the metric we originally cared about. These proxy losses are not necessarily bad, and may even have some advantages, but we need to remember what we really care about, which is not the proxy.
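To make the proxy idea concrete, here is a minimal PyTorch sketch (the data and model are invented for illustration): gradient descent minimizes the differentiable cross-entropy, while the non-differentiable accuracy is only computed for monitoring.

```python
import torch
import torch.nn as nn

# Invented toy classification data: 200 points, 2 features, 2 classes.
torch.manual_seed(0)
X = torch.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).long()

model = nn.Linear(2, 2)                                  # produces logits for 2 classes
criterion = nn.CrossEntropyLoss()                        # differentiable proxy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    logits = model(X)
    loss = criterion(logits, y)  # what gradient descent actually minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The metric we really care about: accuracy. Not differentiable, only monitored.
    accuracy = (logits.argmax(dim=1) == y).float().mean().item()

print(f"proxy loss = {loss.item():.3f}, accuracy = {accuracy:.2%}")
```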

One way to keep the raw metric in view is to use early stopping. After each epoch, we evaluate the model with the raw metric (accuracy or F1 score) on a validation set and stop training when overfitting begins. It also helps to print the accuracy (or any other metric) every epoch to better understand how the model is performing.
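A possible early-stopping loop, continuing the toy PyTorch setup above; the synthetic data, the small network, and the patience of 5 epochs are illustrative choices, not prescriptions from the article.

```python
import torch
import torch.nn as nn

# Invented toy data split into train and validation sets.
torch.manual_seed(0)
X = torch.randn(300, 2)
y = (X[:, 0] * X[:, 1] > 0).long()
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

best_val_acc, best_state, patience, bad_epochs = -1.0, None, 5, 0
for epoch in range(200):
    # Optimize the proxy loss on the training set.
    loss = criterion(model(X_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Monitor the raw metric (accuracy) on the validation set.
    with torch.no_grad():
        val_acc = (model(X_val).argmax(dim=1) == y_val).float().mean().item()
    print(f"epoch {epoch}: val accuracy = {val_acc:.2%}")

    if val_acc > best_val_acc:
        best_val_acc, bad_epochs = val_acc, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation accuracy stopped improving: stop early

model.load_state_dict(best_state)  # keep the best checkpoint, not the last one
```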

The second important difference is the data. In optimization, we only care about the data we have: the optimum on that data is, by definition, the best solution to our problem. In deep learning, we mostly care about generalization, i.e. about data we don’t have. This means that even if we find the maximum (or minimum) on the data we do have (the training set), we may still get poor results on data we haven’t seen yet. It is therefore important to split our data into separate parts and to treat the test set as “data we don’t have”. We must not make any decisions based on the test set; to choose hyperparameters, model structure, or early stopping criteria, we use a validation set rather than the test set.
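One simple way to enforce that separation is a fixed three-way split of the example indices; the 70/15/15 ratio below is just a common illustrative choice.

```python
import numpy as np

# Hypothetical dataset of N examples; shuffle the indices once and split them.
N = 1000
rng = np.random.default_rng(0)
indices = rng.permutation(N)

train_idx = indices[: int(0.70 * N)]             # fit the model parameters here
val_idx = indices[int(0.70 * N): int(0.85 * N)]  # choose hyperparameters / early stopping here
test_idx = indices[int(0.85 * N):]               # touch only once, as a stand-in for "data we don't have"
```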

And we’re not done yet. We train the model with gradient descent to push the parameters in the “right” direction. But what is “right”? Right for all data, or only for our training set? This matters, for example, when we choose the batch size. One could argue that by using the entire training set at once (so-called batch gradient descent) we obtain the “true” gradient. But that gradient is only true for the data we have. To push the model in a direction that is also right for data we don’t have, we need gradients that approximate that data, and this is what smaller batch sizes (mini-batch or stochastic gradient descent) give us. According to The General Inefficiency of Batch Training for Gradient Descent Learning, a batch size of 1 (also known as on-line training) achieves the best results. Using smaller batches introduces noise into the gradient, which can improve generalization and reduce overfitting. The following table shows the performance of batch and on-line training on more than 20 datasets; on average, on-line training is better.

Table: batch vs. on-line training results on more than 20 datasets, from The General Inefficiency of Batch Training for Gradient Descent Learning.
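In practice the three regimes differ only in the batch size passed to the data loader; here is a sketch using PyTorch’s DataLoader (the dataset itself is invented for illustration).

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Invented dataset wrapped for batching.
X = torch.randn(1000, 2)
y = (X.sum(dim=1) > 0).long()
dataset = TensorDataset(X, y)

# Batch gradient descent: one gradient per epoch, "true" only for the data we have.
full_batch_loader = DataLoader(dataset, batch_size=len(dataset))

# Mini-batch (stochastic) gradient descent: noisier gradients, often better generalization.
mini_batch_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# On-line training: batch size 1, the setting the cited paper found best on average.
online_loader = DataLoader(dataset, batch_size=1, shuffle=True)
```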

Although machine learning problems are sometimes framed as optimization problems, it is important to understand the differences between the two and to account for them.

Source: towardsdatascience.com/what-is-the…