When faced with a problem in real life, we naturally look for the best solution. The same is true for software: the best programs make the best products.

Optimization means getting the best possible output. It is not only an important branch of mathematics but also plays a major role in everyday life, and it is considered a core field of modern computer science and artificial intelligence. Many artificial intelligence algorithms can be seen as simulating the process by which humans seek optimal solutions to practical problems. For example, software that uses AI algorithms together with external devices such as cameras to recognize faces, or data mining and neural network techniques that search for the best time to invest, all rely on the principles of optimization.

Optimization in machine learning differs slightly from its applications in other disciplines. In general, when optimizing, we know exactly what the data looks like and what we want to improve. But in machine learning we don’t know what “new data” will look like, let alone how to optimize for it. To address this, we perform the optimization on training data and then evaluate the result on validation data held out from it.

Optimization is widely used across disciplines:

  • Mechanics: designing surfaces for aerospace products;
  • Economics: cost minimization;
  • Physics: optimizing time in quantum computing;
  • Operations: determining the best shipping routes, optimizing shelf space, etc.

Many popular machine learning algorithms, such as linear regression, k-nearest neighbors, and neural networks, rely on optimization techniques. The applications of optimization are countless, so it has become the subject of extensive research in both academia and industry. In this article, we will introduce an optimization technique called gradient descent, the optimization technique most commonly used in machine learning.


1. What is gradient descent?

Let’s look at a classic mountaineering example: Suppose you’re at the top of a mountain and you have to reach a lake at the lowest point of a valley. But you are blindfolded and cannot see your goal. So, what method will you take to get to the lake?

The easiest way to do this is to check the ground near you and find out where the ground is sloping downward. This is where you should take the first step. Follow the downhill route and there is a good chance you will reach the lake. The picture below shows the path you traveled:

Now let’s describe this scenario in mathematical terms.

Suppose we want to find the best parameters θ0 and θ1 for our learning algorithm. As in the mountaineering example above, we find similar mountains and valleys when we plot the cost function J(θ) in 3D. When the z-axis represents the cost J(θ) and the x-axis and y-axis correspond to the parameters θ0 and θ1 respectively, hilly areas in red represent high cost, while valley areas in blue represent low cost. The cost surface simply describes how well the algorithm performs for a particular choice of parameters.

There are two main types of gradient descent algorithms:

1.1 Classification by data ingestion

  1. Full Batch Gradient descent algorithm
  2. Stochastic gradient descent algorithm

The full batch gradient descent algorithm uses all of the data at once to calculate the gradient, while the stochastic gradient descent algorithm uses a sample (or mini-batch) of the data each time it calculates the gradient.
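The difference can be sketched in plain Python for a one-parameter least-squares fit. The data and function names here are my own illustrative choices, not from the article:

```python
import random

# Toy data generated from y = 3x, so the best slope is 3.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

def batch_gradient(theta, samples):
    """Full batch: average the squared-error gradient over ALL samples."""
    return sum(2 * (theta * x - y) * x for x, y in samples) / len(samples)

def stochastic_gradient(theta, sample):
    """Stochastic: gradient from a single randomly chosen sample."""
    x, y = sample
    return 2 * (theta * x - y) * x

# Full batch gradient descent: one gradient per pass over all the data.
theta = 0.0
for _ in range(100):
    theta -= 0.01 * batch_gradient(theta, data)

# Stochastic gradient descent: one cheap, noisy gradient per sample.
random.seed(0)
theta_sgd = 0.0
for _ in range(1000):
    theta_sgd -= 0.01 * stochastic_gradient(theta_sgd, random.choice(data))

# Both estimates end up close to the true slope of 3.
```

The batch version takes fewer but more expensive steps; the stochastic version trades gradient accuracy for per-step cost, which is what makes it practical on large datasets.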

1.2 Classification by differentiation technique

  1. First order differential
  2. Second order differential

Gradient descent requires the derivative of the cost function J(θ) to calculate the gradient. We can use the first derivative or the second derivative.
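To make the distinction concrete, here is a toy sketch (not from the article) comparing a first-order method with a second-order (Newton) step on the quadratic cost J(θ) = (θ − 5)². For a quadratic, a single Newton step lands exactly on the minimum:

```python
def J_prime(theta):
    """First derivative of J(theta) = (theta - 5)**2."""
    return 2 * (theta - 5)

def J_double_prime(theta):
    """Second derivative: constant for a quadratic."""
    return 2.0

# First-order: many small steps along the negative gradient.
theta = 0.0
for _ in range(50):
    theta -= 0.1 * J_prime(theta)

# Second-order (Newton): a single step scaled by the curvature.
theta_newton = 0.0 - J_prime(0.0) / J_double_prime(0.0)

# theta is close to 5 after 50 steps; theta_newton is exactly 5 after one.
```

Second-order information buys faster convergence at the price of computing (and, in higher dimensions, inverting) the second derivative.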



2. Challenges in implementing the gradient descent method

Gradient descent is a technique that works well for most situations. But there are times when it works poorly, or does not work at all. There are three main reasons why this happens:

2.1 Challenges from data

  • If the arrangement of the data causes a non-convex optimization problem, it can be very difficult to perform the optimization using gradient descent.
  • Even when the optimization problem can be posed, the cost surface may have many minima. The lowest point is called the global minimum, and the remaining points are called local minima. Our goal is to reach the global minimum while avoiding the local minima.
  • Then there is the saddle point problem. This is a point on the cost surface where the gradient is zero but which is not an optimum. There is no specific way to avoid saddle points, and they remain a very active area of research.
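A tiny illustration (my own, not from the article) of the local-minimum problem: on the non-convex function f(x) = (x² − 1)² + 0.2x, gradient descent started from different points settles in different valleys:

```python
def f(x):
    """Non-convex cost: global minimum near x = -1, local minimum near x = +1."""
    return (x**2 - 1)**2 + 0.2 * x

def grad(x):
    """Derivative of f."""
    return 4 * x**3 - 4 * x + 0.2

def descend(x, lr=0.01, steps=500):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)    # settles near the global minimum (x close to -1)
right = descend(+2.0)   # gets stuck in the local minimum (x close to +1)
# f(left) < f(right): the starting point decides which minimum you reach.
```

Gradient descent has no way of knowing, from local slope information alone, that a deeper valley exists elsewhere.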

2.2 Challenges from gradients

  • If gradient descent is not executed correctly, it can lead to problems like the vanishing gradient or the exploding gradient. These problems occur when the gradient is too small or too large, causing the algorithm to fail to converge.

2.3 Challenges from practical application difficulty

  • Most neural network practitioners pay little attention to practical deployment concerns, but details such as resource utilization matter. Knowing exactly how much memory and compute are required is important when implementing gradient descent; if the application needs more memory than is available, it will fail.
  • In addition, gradient descent algorithms place heavy demands on floating-point precision and on the underlying hardware and software.



3. Variations of gradient descent algorithm

Below are the most common variations of the gradient descent algorithm and their implementations.

3.1 Vanilla gradient descent

This is the simplest form of gradient descent. Vanilla means pure, without any adulteration. Its main characteristic is to take small steps toward the minimum by following the gradient of the cost function. Its pseudocode is as follows:

update = learning_rate * gradient_of_parameters
parameters = parameters - update

We repeatedly update the parameters using the gradient at the old parameter values, multiplied by a learning rate (learning_rate, a constant) that controls how fast we move toward the minimum. The learning rate is a hyper-parameter, and you should choose its value carefully.
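The two pseudocode lines translate directly into runnable Python. Here is a minimal sketch minimizing the toy cost J(θ) = θ² (the cost function is my own illustrative choice):

```python
def gradient_of_parameters(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2 * theta

learning_rate = 0.1      # hyper-parameter: how big each step is
parameters = 10.0        # arbitrary starting point

for _ in range(100):
    update = learning_rate * gradient_of_parameters(parameters)
    parameters = parameters - update

# parameters is now very close to the minimum at theta = 0.
```

Try changing learning_rate: a much larger value (say 1.5) makes the updates overshoot and diverge, which is why the learning rate must be chosen carefully.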


3.2 Gradient Descent with Momentum

By tweaking the vanilla algorithm, we can take the previous step into account before taking the next one.

update = learning_rate * gradient
velocity = previous_update * momentum
parameter = parameter + velocity - update

In this case, update is the same as in vanilla gradient descent, but a new term called velocity is introduced, which takes into account the previous update and a constant called momentum.
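As a runnable sketch, here is the standard form of the momentum update (accumulating velocity across steps, which is equivalent in spirit to the pseudocode above) applied to the toy cost J(θ) = θ²; the cost and constants are my own illustrative choices:

```python
def grad(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2 * theta

learning_rate, momentum = 0.1, 0.9
parameter, velocity = 10.0, 0.0

for _ in range(300):
    update = learning_rate * grad(parameter)
    velocity = velocity * momentum - update   # remember the previous direction
    parameter = parameter + velocity          # velocity already includes -update

# The accumulated velocity carries the parameter down to the minimum at 0.
```

Because successive gradients in a consistent direction add up in velocity, momentum speeds up travel along long shallow slopes and damps oscillation across steep narrow valleys.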

3.3 Adagrad

Adagrad uses an adaptive technique to update the learning rate. The algorithm changes the learning rate based on the gradients of all previous iterations. The pseudocode is as follows:

grad_component = previous_grad_component + (gradient * gradient)
rate_change = square_root(grad_component) + epsilon
adapted_learning_rate = learning_rate / rate_change
update = adapted_learning_rate * gradient
parameter = parameter - update

Here epsilon is a small constant that keeps the adapted learning rate well defined by preventing division by zero.
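A runnable sketch of the Adagrad idea, using the usual division form adapted_learning_rate = learning_rate / (sqrt(grad_component) + epsilon), again on the toy cost J(θ) = θ² (cost and constants are my own illustrative choices):

```python
import math

def grad(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2 * theta

learning_rate, epsilon = 1.0, 1e-8
parameter, grad_component = 10.0, 0.0

for _ in range(500):
    g = grad(parameter)
    grad_component += g * g          # accumulate squared gradients over all iterations
    adapted_learning_rate = learning_rate / (math.sqrt(grad_component) + epsilon)
    parameter -= adapted_learning_rate * g   # step size shrinks as history grows

# The steps shrink automatically; parameter ends close to the minimum at 0.
```

The accumulated squared-gradient history means parameters that have seen large gradients get smaller steps, which is what makes Adagrad "adaptive" per parameter.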

3.4 Adam

Adam is an adaptive technique built on Adagrad that further mitigates its shortcomings; you can think of it as momentum + Adagrad. The pseudocode is as follows:

adapted_gradient = previous_gradient + ((gradient - previous_gradient) * (1 - beta1))
gradient_component = (gradient_change - previous_learning_rate)
adapted_learning_rate = previous_learning_rate + (gradient_component * (1 - beta2))
update = adapted_learning_rate * adapted_gradient
parameter = parameter - update

Here beta1 and beta2 are constants that control how quickly the gradient estimate and the learning rate are allowed to change.
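The pseudocode above follows the source article; in practice Adam is usually written as two exponential moving averages with bias correction, as in Kingma and Ba's formulation. Here is a sketch of that standard form on the toy cost J(θ) = θ², using the common default constants (the cost function is my own illustrative choice):

```python
import math

def grad(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2 * theta

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
parameter, m, v = 10.0, 0.0, 0.0

for t in range(1, 501):
    g = grad(parameter)
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients (momentum part)
    v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients (Adagrad part)
    m_hat = m / (1 - beta1 ** t)           # bias corrections for zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    parameter -= lr * m_hat / (math.sqrt(v_hat) + eps)

# parameter moves from 10.0 to near the minimum at 0.
```

The m term provides the momentum behavior, while dividing by sqrt(v_hat) provides the per-parameter adaptive step size, combining the two previous ideas in one update.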



4. Practical application of gradient descent algorithm

Let’s look at a basic application of gradient descent using Python.

Next we use gradient descent to find the best parameters of a deep learning model and apply it to an image recognition problem. Our task is to identify digits from given 28×28 images. A portion of the images is used for training and the rest for testing the model.

Here is the main code defining Vanilla gradient descent:

params = [weights_hidden, weights_output, bias_hidden, bias_output]

def sgd(cost, params, lr=0.05):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        updates.append([p, p - g * lr])
    return updates

updates = sgd(cost, params)

Let’s break down this code. The function sgd takes cost, params, and lr as arguments: cost corresponds to the J(θ) from earlier, params to the parameters θ0 and θ1 (here, the parameters of the deep learning model), and lr to the learning rate. We set the default learning rate to 0.05, but this value can be changed at any time.

def sgd(cost, params, lr=0.05):


Then we define the gradients of the cost function J(θ) with respect to the parameters. Here we use the Theano library to compute them, having imported theano.tensor as T:

grads = T.grad(cost=cost, wrt=params)

Finally, we iterate over all the parameters and build the vanilla gradient descent update for each one:

for p, g in zip(params, grads):
    updates.append([p, p - g * lr])

We can use this function to find the best parameters for the neural network. When using this function, the neural network does an excellent job, as shown below:

Prediction is: 8

In this application, the gradient descent method finds the optimal parameters for the deep learning algorithm.


5. Apply practical techniques of gradient descent

Each of the gradient descent algorithms mentioned above has its advantages and disadvantages. The following tips may help you choose the right algorithm.

  • For rapid prototyping, use adaptive techniques such as Adam or Adagrad. They converge quickly and do not require much hyperparameter tuning.
  • For the best results, use vanilla gradient descent or momentum. Although they can be extremely slow, their results are usually more accurate than those obtained with adaptive techniques.
  • If the data is small enough to fit in a single iteration, a second-order technique such as L-BFGS can be used, because second-order methods are very fast and accurate on sufficiently small data.

There are many reasons why neural networks fail to learn successfully. But if you can figure out where the algorithm went wrong, it will be very helpful for future work.

The following are common considerations when using gradient descent:

  • Error rate – check the training and test error rates after a given number of iterations and make sure both tend to decrease. If they do not, there is a good chance something is wrong with the algorithm.
  • Gradient flow in hidden layers – check for vanishing or exploding gradients.
  • Learning rate – should be monitored when using adaptive techniques.

Hopefully, after reading this article, you are familiar with the basics of gradient descent and its variants. I hope you found the walkthrough of these algorithms in action helpful too!