This is the 17th day of my participation in the November More Text Challenge, the final writing challenge of 2021.

title:
date: 2020-12-16 16:47:46
tags: [Machine_Learning]
categories: [Machine_Learning]
mathjax: true

This post introduces the steepest descent method, a common iterative method for optimizing quadratic forms.

Problem description

To recap the problem we need to optimize:


$$f({\bf{x}}) = \frac{1}{2}{\bf{x}}^{\bf{T}}{\bf{A}}{\bf{x}} - {\bf{b}}^{\bf{T}}{\bf{x}} + {\bf{c}} \tag{1} \label{1}$$
  • The matrix $\bf{A}$ is symmetric positive-definite
  • The objective is to find the $\bf{x}$ that minimizes $f(\bf{x})$
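To make the setup concrete, here is a minimal sketch of the quadratic form in Eq. (1); the particular `A`, `b`, and `c` are illustrative values, not from the text:

```python
import numpy as np

# A hypothetical small instance of f(x) = 1/2 x^T A x - b^T x + c,
# with A symmetric positive-definite as required above.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric, both eigenvalues > 0
b = np.array([1.0, 1.0])
c = 0.0

def f(x):
    """Evaluate the quadratic form at x."""
    return 0.5 * x @ A @ x - b @ x + c

# The unique minimizer solves A x = b, where the gradient A x - b vanishes.
x_star = np.linalg.solve(A, b)
```

Because $\bf{A}$ is positive-definite, `x_star` is the single minimum point that the iterative method below will approach.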

Steepest descent method

One day, when the gods grew tired of solving optimization problems algebraically, they proposed an iterative method that gradually approaches the optimal solution. The idea is simple and intuitive.

Core idea

  • Our goal is to find the minimum of the function, and we already know the function has a single minimum point, which is equivalent to finding the lowest point of a surface. The flow of the steepest descent method (node labels reconstructed from the steps below):

```mermaid
graph LR
A[Initial position x0] --> B[Compute gradient g]
B --> C[Solve for step size α]
C --> D[Update position x]
D --> E{Gradient small enough?}
E -- No --> B
E -- Yes --> F[Done]
```

Calculation process

Initial position $\bf{x}_0$

Computing the gradient $\bf{g}$
  • The earlier derivation gives:

$$\bf{g} = f'(\bf{x}) = \bf{A}\bf{x} - \bf{b} \tag{2}$$
  • So starting from $\bf{x}_0$, to find the minimum we must advance in the direction opposite to the gradient; to keep the derivation tidy, write the step length as a coefficient $\alpha_i$, so that for the $i$-th position $\bf{x}_i$:

$${\bf{x}}_{i + 1} = {\bf{x}}_i + \alpha_i {\bf{g}}_i \tag{3} \label{3}$$
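Eq. (2) can be sanity-checked numerically: the analytic gradient $\bf{A}\bf{x} - \bf{b}$ should match a finite-difference gradient of $f$. A small sketch, with illustrative `A`, `b`, and `x` not taken from the text:

```python
import numpy as np

# Check Eq. (2), g = A x - b, against central finite differences of
# f(x) = 1/2 x^T A x - b^T x.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad(x):
    return A @ x - b          # Eq. (2)

x = np.array([1.0, -2.0])
eps = 1e-6
# Central difference along each coordinate axis.
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(2)])
```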
Solving for the coefficient $\alpha_i$

Method 1:

The gradient directions at two successive positions should be orthogonal; otherwise the current step could have advanced to a lower value:

That is:

{% raw %}


$${\bf{g}}_i^T {\bf{g}}_{i + 1} = 0 \tag{4}$$

where:


$${\bf{g}}_i = {\bf{A}}{\bf{x}}_i - {\bf{b}} \tag{5}$$

$${\bf{x}}_{i + 1} = {\bf{x}}_i + \alpha_i {\bf{g}}_i = {\bf{x}}_i + \alpha_i({\bf{A}}{\bf{x}}_i - {\bf{b}}) \tag{6}$$

$$\begin{array}{rl} {\bf{g}}_{i + 1} &= {\bf{A}}{\bf{x}}_{i + 1} - {\bf{b}}\\ &= {\bf{A}}({\bf{x}}_i + \alpha_i {\bf{g}}_i) - {\bf{b}} \end{array} \tag{7}$$

Therefore:


$${\bf{g}}_i^T{\bf{g}}_{i + 1} = {\bf{g}}_i^T{\bf{A}}({\bf{x}}_i + \alpha_i{\bf{g}}_i) - {\bf{g}}_i^T{\bf{b}} = 0$$

$${\bf{g}}_i^T{\bf{A}}{\bf{x}}_i + \alpha_i{\bf{g}}_i^T{\bf{A}}{\bf{g}}_i - {\bf{g}}_i^T{\bf{b}} = 0$$

$$\alpha_i{\bf{g}}_i^T{\bf{A}}{\bf{g}}_i + {\bf{g}}_i^T{\bf{g}}_i = 0$$

$$\alpha_i = -\frac{{\bf{g}}_i^T{\bf{g}}_i}{{\bf{g}}_i^T{\bf{A}}{\bf{g}}_i} \tag{8}$$

{% endraw %}
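The derivation above can be checked numerically: one step with the $\alpha_i$ of Eq. (8) should make consecutive gradients orthogonal, as Eq. (4) demanded. A small sketch, with illustrative `A`, `b`, and `x_i` not taken from the text:

```python
import numpy as np

# One steepest-descent step with the step size of Eq. (8), verifying
# the orthogonality condition (4) that motivated it.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_i = np.array([2.0, -1.0])

g_i = A @ x_i - b                          # gradient, Eq. (5)
alpha_i = -(g_i @ g_i) / (g_i @ A @ g_i)   # step size, Eq. (8)
x_next = x_i + alpha_i * g_i               # update, Eq. (3)
g_next = A @ x_next - b                    # next gradient, Eq. (7)
```

Note that `alpha_i` is negative here, consistent with moving against the gradient direction $\bf{g}_i = \bf{A}\bf{x}_i - \bf{b}$.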

Method 2

After the gradient direction $\bf{g}$ is determined, solve for $\alpha$. Since the steepest descent method finds the locally optimal solution along each step's direction, i.e. the position along that direction where the minimum is attained, we can differentiate with respect to $\alpha$ and set the derivative to 0:

  • From \eqref{1} and \eqref{3}:

$$f({\bf{x}}) = \frac{1}{2}({\bf{x}} + \alpha {\bf{g}})^T{\bf{A}}({\bf{x}} + \alpha {\bf{g}}) - {\bf{b}}^{\bf{T}}({\bf{x}} + \alpha {\bf{g}}) + {\bf{c}} \tag{9}$$
  • For a fixed current position $\bf{x}_i$, $f(\bf{x}_{i+1})$ can be regarded as a function of $\bf{x}_i$ and $\alpha_i$:

$$f({\bf{x}}_{i+1}) = f({\bf{x}}_i, \alpha_i) = \frac{1}{2}({\bf{x}}_i + \alpha_i{\bf{g}}_i)^T{\bf{A}}({\bf{x}}_i + \alpha_i{\bf{g}}_i) - {\bf{b}}^{\bf{T}}({\bf{x}}_i + \alpha_i{\bf{g}}_i) + {\bf{c}} \tag{10}$$
  • Taking the partial derivative with respect to $\alpha_i$:

{% raw %}


$$\begin{array}{rl} \frac{\partial f({\bf{x}}_{i + 1})}{\partial \alpha_i} &= \frac{\partial f({\bf{x}}_{i + 1})}{\partial {\bf{x}}_{i + 1}}\frac{\partial {\bf{x}}_{i + 1}}{\partial \alpha_i}\\ &= ({\bf{A}}{\bf{x}}_{i + 1} - {\bf{b}})^T{\bf{g}}_i\\ &= ({\bf{A}}({\bf{x}}_i + \alpha_i{\bf{g}}_i) - {\bf{b}})^T{\bf{g}}_i\\ &= ({\bf{A}}{\bf{x}}_i - {\bf{b}} + \alpha_i{\bf{A}}{\bf{g}}_i)^T{\bf{g}}_i\\ &= ({\bf{g}}_i^T + \alpha_i{\bf{g}}_i^T{\bf{A}}){\bf{g}}_i\\ &= {\bf{g}}_i^T{\bf{g}}_i + \alpha_i{\bf{g}}_i^T{\bf{A}}{\bf{g}}_i = 0 \end{array} \tag{11}$$

{% endraw %}

  • This gives:

$$\alpha_i = -\frac{{\bf{g}}_i^T{\bf{g}}_i}{{\bf{g}}_i^T{\bf{A}}{\bf{g}}_i} \tag{12}$$
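Method 2 claims that Eq. (12) is an exact line search: along the ray $\bf{x}_i + \alpha\bf{g}_i$, the value of $f$ is smallest exactly at $\alpha_i$. This can be checked with a small sketch; `A`, `b`, and `x_i` are illustrative values, not from the text:

```python
import numpy as np

# Verify that alpha_i from Eq. (12) minimizes f restricted to the
# search direction, i.e. the one-dimensional function of Eq. (10).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

x_i = np.array([2.0, -1.0])
g_i = A @ x_i - b
alpha_i = -(g_i @ g_i) / (g_i @ A @ g_i)   # Eq. (12)

def line(alpha):
    """f along the search direction, as in Eq. (10)."""
    return f(x_i + alpha * g_i)
```

Since ${\bf{g}}_i^T{\bf{A}}{\bf{g}}_i > 0$ for positive-definite $\bf{A}$, `line` is a strictly convex parabola in $\alpha$, so the stationary point of Eq. (11) really is its minimum.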
Next position
  • According to Eq. \eqref{3}, obtain the next position $\bf{x}_{i+1}$
  • Check the magnitude of the current gradient; if it is below a threshold $\varepsilon$, the minimum is considered reached:

$$\varepsilon \ge \left\| {\bf{g}}_{i+1} \right\| \tag{13}$$
  • Otherwise, take the next step forward
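The whole calculation process above can be sketched as a single loop; the particular `A`, `b`, starting point, and threshold are illustrative assumptions:

```python
import numpy as np

# A minimal steepest-descent loop following Eqs. (2), (3), (12), (13).
def steepest_descent(A, b, x0, eps=1e-8, max_iter=10_000):
    """Minimize 1/2 x^T A x - b^T x for symmetric positive-definite A."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = A @ x - b                      # gradient, Eq. (2)
        if np.linalg.norm(g) <= eps:       # stopping rule, Eq. (13)
            break
        alpha = -(g @ g) / (g @ A @ g)     # exact line search, Eq. (12)
        x = x + alpha * g                  # update, Eq. (3)
    return x

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = steepest_descent(A, b, np.zeros(2))
```

At convergence `x_min` satisfies $\bf{A}\bf{x} \approx \bf{b}$, the point where the gradient of Eq. (1) vanishes.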

Analysis

  • The steepest descent method is a greedy optimization method: it models the current problem crudely and approaches the global optimum by computing local optima step by step. Its essence is as follows:

    • Fit the current quadratic form with a higher-dimensional sphere
    • Move to the center of the fitted sphere
    • Compute the error and measure the accuracy
    • If the accuracy is insufficient, fit the next high-dimensional sphere based on the error
  • First, it should be stated that the method converges: under the problem as we defined it, it gradually approaches the optimal solution. The proof is involved; refer to the tedious derivation of the steepest descent method in The Extras (2)

  • The method also has its problems. Because the model is too simple, the deviation between the computed optimum and the true minimum is discarded at every step, so the optimum can only be approached through repeated iteration

Origin of the conjugate gradient method

  • The fitting scheme of the steepest descent method is too simple, so it is doomed not to reach the global optimum easily
  • If an iterative optimization method correctly fitted the true quadratic form, then finding the optimum under that model would yield the global optimum iteratively
  • With this in mind, the gods proposed the conjugate gradient method