1. Background

Those familiar with machine learning and deep learning have probably heard the saying: machine learning = model + strategy + algorithm. I took this framing as my starting point when I began studying, but it took me some time to really understand what these three things are.

  • Model: the function, expression, or network structure that you want to learn.
  • Strategy: in essence, how to measure the gap between the model's inferred values and the actual values, across training data, test data, and even unseen data.
  • Algorithm: in essence, how to reduce that gap through an optimization procedure, usually an iterative method that approaches the optimum asymptotically (a minimal sketch of all three pieces follows this list).
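
To make the three pieces concrete, here is a minimal sketch of my own (an illustration, not from any particular library), using simple linear regression: the model is a linear function, the strategy is the mean-squared-error loss, and the algorithm is gradient descent on that loss.

import numpy as np

# Model: the function we want to learn, here f(x) = w * x + b
def model(x, w, b):
  return w * x + b

# Strategy: mean squared error, i.e. the gap between predictions and actual values
def mse_loss(y_pred, y_true):
  return np.mean((y_pred - y_true) ** 2)

# Algorithm: one gradient descent step that iteratively shrinks that gap
def gd_step(w, b, x, y, lr=0.01):
  y_pred = model(x, w, b)
  grad_w = np.mean(2 * (y_pred - y) * x)  # d(loss)/dw
  grad_b = np.mean(2 * (y_pred - y))      # d(loss)/db
  return w - lr * grad_w, b - lr * grad_b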

Gradient descent is the preferred method for optimizing neural networks and many other machine learning algorithms, but it is often used as a black box, with its underlying mechanics left unexplained. This article explores some of the most popular gradient-based optimization algorithms, such as Momentum, Adagrad, and Adam.

Gradient descent is one of the most popular optimization algorithms and the most common way to optimize neural networks. At the same time, every state-of-the-art deep learning library (e.g. TensorFlow, Caffe, MXNet) contains implementations of various algorithms that improve on gradient descent. However, these algorithms are often used as black-box optimizers, because practical explanations of their strengths and weaknesses are hard to come by. Let's start by looking at the different variants of gradient descent. Then we will briefly summarize the challenges during training. Finally, we will introduce the most common optimization algorithms, show what motivates them to resolve these challenges, and how this leads to their update rules.

2. Gradient descent and its variants

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we trade off the accuracy of the parameter update against the time required to perform it.

Batch gradient descent

Batch gradient descent loads the entire data set, computes the gradient of the objective function over all the data, and then updates the parameters. Because every example must be loaded, the memory requirement is high, which is hard to satisfy when the data set is large. Since we must compute the gradient over the whole data set to perform a single update, batch gradient descent can be very slow and is intractable for data sets that do not fit in memory. It also does not allow updating the model online, i.e. with new examples on the fly, so it is unsuitable for online learning.

The pseudocode of batch gradient descent is as follows:

for i in range(nb_epochs):
  # Gradient of the loss w.r.t. the parameters, over the full data set
  params_grad = evaluate_gradient(loss_function, data, params)
  # Step in the direction of steepest descent, scaled by the learning rate
  params = params - learning_rate * params_grad

Stochastic gradient descent

Where batch gradient descent computes the gradients over all samples before each parameter update, SGD performs a parameter update for each individual training example. It is therefore usually much faster and can also be used for online learning.

Because SGD performs frequent updates with high variance, the objective function fluctuates heavily during training.

While batch gradient descent converges to a local minimum near where the parameters start, SGD's fluctuations, on the one hand, enable it to jump to new and potentially better local minima. On the other hand, they ultimately complicate convergence to the exact minimum, so SGD needs further tuning. When we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent, almost certainly converging to a local minimum for non-convex optimization and to the global minimum for convex optimization. The pseudocode of stochastic gradient descent is as follows:

for i in range(nb_epochs):
  # Shuffle the training data at the start of every epoch
  np.random.shuffle(data)
  for example in data:
    # Gradient of the loss w.r.t. the parameters, for a single example
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad

Mini-batch gradient descent

Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: it computes the gradient and updates the parameters once per mini-batch of training examples.

Chinese culture values the "golden mean", and the mini-batch update is itself a kind of mean. So what characteristics does it have?

  1. First, it reduces the variance of the parameter updates, leading to more stable convergence than updating on a single random example.
  2. Second, it can exploit the highly optimized matrix operations in state-of-the-art deep learning libraries, making the gradient computation very efficient.
Common mini-batch sizes are 128, 256, or 512, depending on the application and the size of the distributed cluster. Mini-batch gradient descent is the typical algorithm of choice when training a neural network.
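
Following the same conventions as the pseudocode above, mini-batch gradient descent might look like the sketch below; get_batches is a hypothetical helper that yields mini-batches of 50 examples from the shuffled data.

for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    # Gradient of the loss w.r.t. the parameters, over one mini-batch
    params_grad = evaluate_gradient(loss_function, batch, params)
    params = params - learning_rate * params_grad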

3. Challenges for optimization algorithms

However, plain mini-batch gradient descent does not guarantee good convergence, and it raises a few challenges that still need to be addressed:

  • Choosing an appropriate learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while one that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge.
  • Learning rate schedules try to adjust the learning rate during training, e.g. via annealing, which reduces the learning rate according to a predefined schedule or when the change in the objective falls below a threshold. These schedules and thresholds, however, must be defined in advance and thus cannot adapt to the characteristics of the data set (a minimal annealing sketch follows this list).
  • In addition, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but rather perform larger updates for rarely occurring features.
  • Another key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. The difficulty in fact arises not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, since the gradient is close to zero in all dimensions.
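
To make the annealing point concrete, here is a minimal sketch of a predefined step-decay schedule. It is my own illustration, not taken from the referenced paper; the initial rate, decay factor, and step interval are assumed values.

def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
  # Halve the learning rate every 10 epochs according to a predefined rule;
  # note the schedule cannot react to the characteristics of the data set.
  return initial_lr * (drop ** (epoch // epochs_per_drop))

for i in range(nb_epochs):
  learning_rate = step_decay(i)
  # ... perform the mini-batch updates above with the annealed learning rate ...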

4. References

  • An overview of gradient descent optimization algorithms: arxiv.org/pdf/1609.04…

5. About the author

About the author: Du Baokun, builder of JD's federated learning from 0 to 1. I led the team that built JD's federated learning solution and brought a very-large-scale federated learning solution for e-commerce marketing into production, supporting very-large-scale sample PSI privacy alignment as well as secure tree models and neural network models, with deployment on the advertising business side. This created a new point of business growth and produced significant economic benefits. Personally, I enjoy studying technology; driven by full-link thinking and technology planning, my research interests range from architecture and data to algorithms and algorithm frameworks. Students who like technology are welcome to get in touch: [email protected]

To be continued, stay tuned!