
This article introduces the conjugate gradient method, an excellent iterative method among quadratic optimization methods.

Problem description

To recap the problem we need to optimize:


$$f({\bf{x}}) = \frac{1}{2}{\bf{x}}^T{\bf{A}}{\bf{x}} - {\bf{b}}^T{\bf{x}} + c \tag{1} \label{1}$$
  • The matrix ${\bf{A}}$ is symmetric positive definite
  • The goal is to find ${\bf{x}}$ that minimizes $f({\bf{x}})$ (a small numeric sketch of this objective follows this list)
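Below is a minimal numeric sketch of this setup, assuming numpy; the 2×2 symmetric positive definite matrix ${\bf{A}}$, the vector ${\bf{b}}$ and the constant $c$ are hand-picked illustrations, not values from the text.

```python
import numpy as np

# Illustrative SPD system for the quadratic objective (1).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
c = 0.0

def f(x):
    """f(x) = 1/2 x^T A x - b^T x + c."""
    return 0.5 * x @ A @ x - b @ x + c

def grad_f(x):
    """Gradient A x - b; it vanishes at the minimizer x* = A^{-1} b."""
    return A @ x - b

x_star = np.linalg.solve(A, b)       # closed-form minimizer, used only as a check
print(f(x_star), grad_f(x_star))     # gradient at x* is ~0
```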

Problems with the steepest descent method

  • It greedily follows the local descent direction at each step
  • It has no global view and does not model the true structure of the problem
  • As a result, the optimization process needs many repeated iterations to gradually approach the optimum

Yoke (轭)

轭 (yoke) is a Chinese character, pronounced è, meaning the curved piece of wood placed across the neck of a draft animal. It is recorded in the Yili · Jixi Li and the Xunzi · Zhenglun. — Baidu Encyclopedia

  • In mathematics, many concepts involve conjugation ("sharing a yoke"): two objects are conjugate when they are mutually constrained and interact with each other under certain conditions.

Origin of the conjugate gradient method

  • To avoid the back-and-forth zig-zagging of the steepest descent method, people began to ask whether the optimization could be carried out directly on the definition of the quadratic function itself, and whether the true optimal solution could be obtained in a finite number of steps
  • If we use an exact model of the problem instead of an approximate local model, this goal can be achieved as long as we can compute, in some $n$-dimensional space, the coordinate of the optimal solution along each dimension
  • How to design this space, and how to compute the coordinates step by step and assemble them into the final result, is exactly what the conjugate gradient method does
  • The core idea of the method is to establish a set of linearly independent basis vectors in $n$-dimensional space; a linear combination of the basis can represent any point in the space. Through a sequence of calculations, the conjugate gradient method solves exactly for each coefficient of the target position in this basis, and thereby obtains the optimal value
  • The difference between this method and steepest descent lies in modeling precision: with a "god's-eye view", each iteration adjusts the component along the current direction exactly to its extremal value, so that after the calculation movement along that direction never needs to be considered again. It is an exact solution process, not a process of gradually approaching the optimum with a simple local model

The definition of the conjugate basis

Let ${\bf{A}}$ be an $n$-th order real symmetric positive definite matrix. If two $n$-dimensional column vectors ${\bf{s}}_1$ and ${\bf{s}}_2$ satisfy

{% raw %}


$${\bf{s}}_1^T{\bf{A}}{{\bf{s}}_2} = 0 \tag{2} \label{2}$$

{% endraw %}

then the vectors ${\bf{s}}_1$ and ${\bf{s}}_2$ are said to be conjugate with respect to the matrix ${\bf{A}}$. If ${\bf{A}}$ is the identity matrix, then \eqref{2} becomes ${\bf{s}}_1^T{\bf{s}}_2 = 0$: the dot product of the two vectors is zero and the vectors are geometrically orthogonal, so orthogonality is a special case of conjugacy.

Let ${\bf{A}}$ be a symmetric positive definite matrix. If a set of non-zero vectors ${\bf{s}}_1, {\bf{s}}_2, \ldots, {\bf{s}}_n$ satisfies {% raw %}


$${\bf{s}}_i^T{\bf{A}}{{\bf{s}}_j} = 0 \quad (i \ne j) \tag{3}$$

then the vectors ${{\bf{s}}_i}\ (1 \le i \le n)$ are said to be conjugate with respect to the matrix ${\bf{A}}$.

If the vectors ${{\bf{s}}_i}\ (1 \le i \le n)$ are also linearly independent, then we call this set of vectors a conjugate basis of the matrix ${\bf{A}}$ in $n$-dimensional space (a small numeric check follows). {% endraw %}
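As a small sanity check of this definition (my own illustration, not part of the original derivation): the eigenvectors of a symmetric positive definite matrix are pairwise conjugate with respect to it and linearly independent, so they form one possible conjugate basis. The matrix below is an assumed example.

```python
import numpy as np

# Check s_i^T A s_j = 0 for i != j when the s_i are eigenvectors of a SPD matrix.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

_, S = np.linalg.eigh(A)                 # columns of S are eigenvectors of A
n = S.shape[1]
for i in range(n):
    for j in range(i + 1, n):
        print(i, j, S[:, i] @ A @ S[:, j])   # ~0 up to rounding error
```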

Solving with a conjugate basis

  • Suppose we have a conjugate basis ${\bf{D}}$ of the matrix ${\bf{A}}$:

$${\bf{D}}=\{ {{\bf{d}}_1},{{\bf{d}}_2},\ldots,{{\bf{d}}_n}\} \tag{4}$$
  • The optimal solution ${{\bf{x}}^*}$ can then be written as a linear combination of the basis ${\bf{D}}$:

{% raw %}


$${{\bf{x}}^*} = {\lambda _1}{{\bf{d}}_1} + {\lambda _2}{{\bf{d}}_2} + \cdots + {\lambda _n}{{\bf{d}}_n} \tag{5}$$

{% endraw %}

  • Since at the extremum the derivative of the function is 0 in every direction, we have:

{% raw %}


$$\begin{array}{l} f'({{\bf{x}}^*}) = {\bf{A}}{{\bf{x}}^*} - {\bf{b}} = 0\\ \Rightarrow {\bf{A}}{{\bf{x}}^*} = {\bf{b}} \end{array} \tag{6}$$

{% endraw %}

  • We now compute ${\bf{d}}_i^T{\bf{A}}{{\bf{x}}^*}$; by the conjugacy between different basis vectors:

{% raw %}


$$\begin{array}{l} {\bf{d}}_i^T{\bf{A}}{{\bf{x}}^*} &= {\bf{d}}_i^T{\bf{A}}\left( {{\lambda _1}{{\bf{d}}_1} + \ldots + {\lambda _n}{{\bf{d}}_n}} \right)\\ &= {\lambda _i}{\bf{d}}_i^T{\bf{A}}{{\bf{d}}_i} + 0 \end{array} \tag{7}$$

{% endraw %}

  • We obtain:

{% raw %}


$${\lambda _i} = \frac{{{\bf{d}}_i^T{\bf{A}}{{\bf{x}}^*}}}{{{\bf{d}}_i^T{\bf{A}}{{\bf{d}}_i}}} = \frac{{{\bf{d}}_i^T{\bf{b}}}}{{{\bf{d}}_i^T{\bf{A}}{{\bf{d}}_i}}} \tag{8} \label{8}$$

{% endraw %}

  • To solve for ${\lambda _i}$, the quantities ${\bf{b}}$ and ${\bf{A}}$ are already known, so as long as we have obtained a conjugate basis of ${\bf{A}}$ (any one will do) we can compute each ${\lambda _i}$ directly
  • This is an exciting conclusion, so our focus now shifts to how to quickly find a conjugate basis of ${\bf{A}}$ (a short computational sketch follows this list)
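A short computational sketch of \eqref{8}, assuming numpy and using the eigenvectors of an assumed test matrix as the conjugate basis: the coefficients ${\lambda _i}$, computed from ${\bf{b}}$ and ${\bf{A}}$ alone, reconstruct the minimizer.

```python
import numpy as np

# lambda_i = d_i^T b / (d_i^T A d_i) for each conjugate basis vector d_i (eq. 8),
# then x* = sum_i lambda_i d_i. A, b and the eigenvector basis are assumptions.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

_, D = np.linalg.eigh(A)                      # eigenvectors: one conjugate basis
lambdas = [(d @ b) / (d @ A @ d) for d in D.T]
x_star = sum(lam * d for lam, d in zip(lambdas, D.T))

print(x_star, np.linalg.solve(A, b))          # the two results agree
```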

Obtaining a conjugate basis by definition

  • Given this definition, it is not hard to think of a brute-force way to construct a conjugate basis (a sketch is given after this list):
(Flowchart: randomly generate $n$ $n$-dimensional vectors; loop $i = 1$ to $n$; remove from vector $i$ its conjugate components along the already collected vectors $j = 1,\ldots,i-1$; collect vector $i$; when the loop ends, output the conjugate basis.)
  • With this method we can obtain a conjugate basis computed directly from the definition, substitute it into \eqref{8} to calculate the extremum, and everything works
  • In practice, however, the computational cost is equivalent to solving ${\bf{A}}{\bf{x}} = {\bf{b}}$ algebraically, so it brings no performance benefit to the optimization problem
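A sketch of this brute-force construction, assuming numpy; the function name `conjugate_basis_by_definition` and the test matrix are my own illustrations. It is simply Gram-Schmidt carried out in the ${\bf{A}}$-inner product, which is why its cost is comparable to solving ${\bf{A}}{\bf{x}} = {\bf{b}}$ directly.

```python
import numpy as np

# Start from n random (almost surely linearly independent) vectors and remove from
# each one its conjugate components along the vectors already collected.
def conjugate_basis_by_definition(A, rng=np.random.default_rng(0)):
    n = A.shape[0]
    raw = rng.standard_normal((n, n))        # n random n-dimensional vectors
    basis = []
    for i in range(n):
        d = raw[i].copy()
        for prev in basis:                   # subtract the conjugate components
            d -= (prev @ A @ raw[i]) / (prev @ A @ prev) * prev
        basis.append(d)
    return basis

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
basis = conjugate_basis_by_definition(A)
for i in range(len(basis)):
    for j in range(i + 1, len(basis)):
        print(basis[i] @ A @ basis[j])       # ~0: the vectors are pairwise conjugate
```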

Conjugate gradient method

Like the steepest descent method, each iteration of this algorithm consists of two core steps: finding the (conjugate) search direction and computing the step length.

Finding the conjugate directions

Because gradients are simple to compute, the construction of the conjugate directions is based on the gradient direction at each step.

  • The optimization target is ${\bf{x}}^*$, the initial position is ${\bf{x}}_1$, and we need to build the conjugate basis ${\bf{D}}=\{ {{\bf{d}}_1},{{\bf{d}}_2},\ldots,{{\bf{d}}_n}\}$
  • Compute the gradient at the initial position ${\bf{x}}_1$; the first conjugate direction is the negative gradient:

{% raw %}


$${{\bf{d}}_1} = -{{\bf{g}}_1} = -({\bf{A}}{{\bf{x}}_1} - {\bf{b}}) = {\bf{b}} - {\bf{A}}{{\bf{x}}_1} \tag{9}$$

{% endraw %}

  • To eliminate from the $i$-th gradient its component along the $j$-th conjugate direction ($j < i$), we need to subtract $\beta_j$ times that direction:

{% raw %}


$$\begin{array}{l} {\bf{d}}_j^T{\bf{A}}({{\bf{g}}_i} - \beta_j {{\bf{d}}_j}) = 0\\ {\bf{d}}_j^T{\bf{A}}{{\bf{g}}_i} = \beta_j {\bf{d}}_j^T{\bf{A}}{{\bf{d}}_j}\\ \beta_j = \frac{{{\bf{d}}_j^T{\bf{A}}{{\bf{g}}_i}}}{{{\bf{d}}_j^T{\bf{A}}{{\bf{d}}_j}}} \end{array} \tag{10} \label{10}$$

{% endraw %}

  • Therefore, the $k$-th conjugate direction is:

{% raw %}


$$\begin{array}{l} {{\bf{d}}_k} = {{\bf{g}}_k} - \sum\limits_{i = 1}^{k - 1} {\frac{{{\bf{d}}_i^T{\bf{A}}{{\bf{g}}_k}}}{{{\bf{d}}_i^T{\bf{A}}{{\bf{d}}_i}}}{{\bf{d}}_i}} \\ {{\bf{d}}_k}={{\bf{g}}_k} -\sum\limits_{i = 1}^{k - 1}\beta_i{{\bf{d}}_i} \end{array} \tag{11} \label{11}$$

{% endraw %}

  • So far: if we can determine the position ${\bf{x}}_i$ reached at each iteration and obtain ${\bf{g}}_i$, then the $i$-th conjugate direction is determined by the 1st through $(i-1)$-th conjugate directions
  • So our next goal is: given the conjugate direction, how do we determine how far to move along it?

Determining the step length

{% raw %} Suppose we have moved to ${{\bf{x}}_k}$, the next search direction is ${{\bf{d}}_k}$, the step length is ${\alpha _k}$, and the error is ${{\bf{e}}_k} = {{\bf{x}}^*} - {{\bf{x}}_k}$; that is:


$${{\bf{x}}_{k + 1}}={{\bf{x}}_k} + {\alpha _k}{{\bf{d}}_k} \tag{12}$$

{% endraw %}

Here we introduce two ways of finding the step length ${\alpha _k}$.

Method 1

Determine the step size ${\alpha _k}$ of the $k$-th step, i.e. the coefficient of the conjugate direction. The condition constraining this coefficient is as follows:

  • {% raw %} The current conjugate direction ${{\bf{d}}_k}$ is conjugate to the error vector ${{\bf{e}}_{k+1}} = {{\bf{x}}^*} - {{\bf{x}}_{k+1}}$:

$$\begin{aligned} {\bf{d}}_{k}^{T} {\bf{A}} {\bf{e}}_{k+1} &={\bf{d}}_{k}^{T} {\bf{A}}\left({\bf{x}}^{*}-{\bf{x}}_{k+1}\right) \\ &={\bf{d}}_{k}^{T} {\bf{A}}\left({\bf{x}}^{*}-{\bf{x}}_{k}+{\bf{x}}_{k}-{\bf{x}}_{k+1}\right) \\ &={\bf{d}}_{k}^{T} {\bf{A}}\left({\bf{e}}_{k}-\alpha_{k} {\bf{d}}_{k}\right) \\ &={\bf{d}}_{k}^{T} {\bf{A}} {\bf{e}}_{k}-\alpha_{k} {\bf{d}}_{k}^{T} {\bf{A}} {\bf{d}}_{k}=0 \end{aligned} \tag{13} \label{13}$$

{% endraw %}

  • Hence:

{% raw %}


$$\begin{aligned} \alpha_{k} &=\frac{{\bf{d}}_{k}^{T} {\bf{A}} {\bf{e}}_{k}}{{\bf{d}}_{k}^{T} {\bf{A}} {\bf{d}}_{k}} \\ &=\frac{{\bf{d}}_{k}^{T} {\bf{A}}\left({\bf{x}}^{*}-{\bf{x}}_{k}\right)}{{\bf{d}}_{k}^{T} {\bf{A}} {\bf{d}}_{k}} \\ &=\frac{{\bf{d}}_{k}^{T}\left({\bf{A}} {\bf{x}}^{*}-{\bf{A}} {\bf{x}}_{k}\right)}{{\bf{d}}_{k}^{T} {\bf{A}} {\bf{d}}_{k}} \\ &=\frac{{\bf{d}}_{k}^{T}\left({\bf{b}}-{\bf{A}} {\bf{x}}_{k}\right)}{{\bf{d}}_{k}^{T} {\bf{A}} {\bf{d}}_{k}} \\ &=-\frac{{\bf{d}}_{k}^{T} {\bf{g}}_{k}}{{\bf{d}}_{k}^{T} {\bf{A}} {\bf{d}}_{k}} \end{aligned} \tag{14}$$

{% endraw %}

Method 2

Differentiate $f({{\bf{x}}_{k+1}})$ with respect to ${\alpha _k}$, set the derivative to zero, and solve for ${\alpha _k}$:

  • {% raw %} Express ${{\bf{x}}_{k+1}}$ in terms of ${{\bf{x}}_k}$:

$$\begin{array}{l} f({{\bf{x}}_{k + 1}}) &= f({{\bf{x}}_k} + {\alpha _k}{{\bf{d}}_k})\\ &= \frac{1}{2}{\bf{x}}_{k + 1}^T{\bf{A}}{{\bf{x}}_{k + 1}} - {{\bf{b}}^T}({{\bf{x}}_k} + {\alpha _k}{{\bf{d}}_k}) + c \end{array} \tag{15}$$

{% endraw %}

  • {% raw %} Differentiate $f({{\bf{x}}_{k+1}})$ with respect to ${\alpha _k}$:

$$\begin{array}{l} f'({{\bf{x}}_{k + 1}}|{\alpha _k}) &= \frac{{\partial f({{\bf{x}}_{k + 1}})}}{{\partial {{\bf{x}}_{k + 1}}}}\frac{{\partial {{\bf{x}}_{k + 1}}}}{{\partial {\alpha _k}}}\\ &= {({\bf{A}}{{\bf{x}}_{k + 1}} - {\bf{b}})^T}{{\bf{d}}_k}\\ &= {({\bf{A}}{{\bf{x}}_k} + {\alpha _k}{\bf{A}}{{\bf{d}}_k} - {\bf{b}})^T}{{\bf{d}}_k}\\ &= {({\alpha _k}{\bf{A}}{{\bf{d}}_k} + {{\bf{g}}_k})^T}{{\bf{d}}_k}\\ &= {\alpha _k}{\bf{d}}_k^T{\bf{A}}{{\bf{d}}_k} + {\bf{g}}_k^T{{\bf{d}}_k} \end{array} \tag{16}$$

{% endraw %}

  • {% raw %} Setting the derivative to 0 gives:

$$\begin{array}{l} {\alpha _k}{\bf{d}}_k^T{\bf{A}}{{\bf{d}}_k} + {\bf{g}}_k^T{{\bf{d}}_k} = 0\\ {\alpha _k} = -\frac{{{\bf{g}}_k^T{{\bf{d}}_k}}}{{{\bf{d}}_k^T{\bf{A}}{{\bf{d}}_k}}} = -\frac{{{\bf{d}}_k^T{{\bf{g}}_k}}}{{{\bf{d}}_k^T{\bf{A}}{{\bf{d}}_k}}} \end{array} \tag{17}$$

{% endraw %}

At this point we have derived everything needed to compute the conjugate directions and step lengths, so we can already construct a conjugate basis and solve the problem, though some of these steps can still be simplified (see the next section). A sketch of the unsimplified procedure derived so far follows.
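This is a minimal sketch, assuming numpy; the function name `conjugate_directions` and the test problem are illustrative assumptions. The first direction is the negative gradient (9), each later direction subtracts the conjugate components of the current gradient along all previous directions (11), and the step length comes from (17).

```python
import numpy as np

def conjugate_directions(A, b, x1):
    n = A.shape[0]
    x = x1.astype(float)
    dirs = []
    for k in range(n):
        g = A @ x - b                                       # current gradient
        if k == 0:
            d = -g                                          # d_1 = -g_1, eq. (9)
        else:                                               # eq. (11)
            d = g - sum((di @ A @ g) / (di @ A @ di) * di for di in dirs)
        alpha = -(d @ g) / (d @ A @ d)                      # eq. (17)
        x = x + alpha * d                                   # eq. (12)
        dirs.append(d)
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_directions(A, b, np.zeros(2)), np.linalg.solve(A, b))  # agree
```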

Simplified calculations and some inferences

Corollary 1
  • {% raw %} The gradient ${\bf{g}}_k$ computed at the $k$-th step is orthogonal to the conjugate vectors ${\bf{d}}_1,{\bf{d}}_2,\ldots,{\bf{d}}_{k-1}$ {% endraw %}
  • {% raw %} Proof: for $i < j$:

$$\begin{array}{l} {\bf{d}}_i^T{{\bf{g}}_j} &= {\bf{d}}_i^T({\bf{A}}{{\bf{x}}_j} - {\bf{b}})\\ &= {\bf{d}}_i^T({\bf{A}}{{\bf{x}}_j} - {\bf{A}}{{\bf{x}}^*})\\ &= -{\bf{d}}_i^T{\bf{A}}({{\bf{x}}^*} - {{\bf{x}}_{i + 1}} + {{\bf{x}}_{i + 1}} - {{\bf{x}}_j})\\ &= -{\bf{d}}_i^T{\bf{A}}({{\bf{e}}_{i + 1}} - \sum\limits_{k = i + 1}^{j - 1} {{\alpha _k}{{\bf{d}}_k}}) \\ &= -{\bf{d}}_i^T{\bf{A}}{{\bf{e}}_{i + 1}} + \sum\limits_{k = i + 1}^{j - 1} {{\alpha _k}{\bf{d}}_i^T{\bf{A}}{{\bf{d}}_k}} \end{array} \tag{18} \label{18}$$

{% endraw %}

  • In the last line of \eqref{18}, the first term is 0 because of the step-length condition \eqref{13}, and the second term is 0 because different conjugate vectors are conjugate to one another
  • This proves that the gradient ${\bf{g}}_k$ computed at the $k$-th step is orthogonal to the conjugate vectors ${\bf{d}}_1,{\bf{d}}_2,\ldots,{\bf{d}}_{k-1}$
Corollary 2
  • The gradient ${\bf{g}}_k$ computed at the $k$-th step is orthogonal to the gradients ${\bf{g}}_1,{\bf{g}}_2,\ldots,{\bf{g}}_{k-1}$

  • Proof: for $i < j$:

  • From \eqref{11} we have:

{% raw %}


$${{\bf{g}}_i}={{\bf{d}}_i} +\sum\limits_{k = 1}^{i - 1}\beta_k{{\bf{d}}_k} \tag{19}$$

{% endraw %}

  • Hence:

{% raw %}


$$\begin{array}{l} {\bf{g}}_i^T{{\bf{g}}_j} &= {({{\bf{d}}_i} + \sum\limits_{k = 1}^{i - 1} {{\beta _k}} {{\bf{d}}_k})^T}{{\bf{g}}_j}\\ &= \sum\limits_{k = 1}^i {{\beta _k}} {{\bf{d}}_k}^T{{\bf{g}}_j} \qquad ({\beta _i} = 1) \end{array} \tag{20} \label{20}$$

{% endraw %}

  • According to Corollary 1, every term of \eqref{20} is 0 (since $k \le i < j$), so the whole expression is 0
  • Conclusion: the gradient ${\bf{g}}_k$ computed at the $k$-th step is orthogonal to the gradients ${\bf{g}}_1,{\bf{g}}_2,\ldots,{\bf{g}}_{k-1}$
  • So any two different gradients ${\bf{g}}_i, {\bf{g}}_j\ (i \ne j)$ are orthogonal to each other, regardless of which comes first; hence ${\bf{G}} = \{ {{\bf{g}}_1},{{\bf{g}}_2},\ldots,{{\bf{g}}_n}\}$ constitutes an orthogonal basis of the $n$-dimensional space (a small numeric check of Corollaries 1 and 2 follows)
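Below is a small numeric check of Corollaries 1 and 2 (my own verification, not part of the original article), reusing the unsimplified construction sketched earlier on an assumed 3×3 positive definite problem.

```python
import numpy as np

# With exact line searches, each gradient is orthogonal to all earlier conjugate
# directions (Corollary 1) and the gradients are mutually orthogonal (Corollary 2).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x = np.zeros(3)
dirs, grads = [], []
for k in range(3):
    g = A @ x - b
    grads.append(g)
    d = -g if k == 0 else g - sum((di @ A @ g) / (di @ A @ di) * di for di in dirs)
    dirs.append(d)
    x = x + (-(d @ g) / (d @ A @ d)) * d           # exact line search step (eq. 17)

print([float(dirs[i] @ grads[j]) for j in range(3) for i in range(j)])   # ~0 (Corollary 1)
print([float(grads[i] @ grads[j]) for j in range(3) for i in range(j)])  # ~0 (Corollary 2)
```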
Corollary 3
  • {% raw %} Compute ${\bf{g}}_{j+1}^T{{\bf{g}}_i}$:

$$\begin{array}{l} {\bf{g}}_{j + 1}^T{{\bf{g}}_i} &= {({\bf{A}}{{\bf{x}}_{j + 1}} - {\bf{b}})^T}{{\bf{g}}_i}\\ &= {({\bf{A}}({{\bf{x}}_j} + {\alpha _j}{{\bf{d}}_j}) - {\bf{b}})^T}{{\bf{g}}_i}\\ &= {({\bf{A}}{{\bf{x}}_j} - {\bf{b}} + {\alpha _j}{\bf{A}}{{\bf{d}}_j})^T}{{\bf{g}}_i}\\ &= {({{\bf{g}}_j} + {\alpha _j}{\bf{A}}{{\bf{d}}_j})^T}{{\bf{g}}_i}\\ &= {\bf{g}}_j^T{{\bf{g}}_i} + {\alpha _j}{\bf{d}}_j^T{\bf{A}}{{\bf{g}}_i}\\ {\bf{d}}_j^T{\bf{A}}{{\bf{g}}_i} = \frac{1}{{{\alpha _j}}}({\bf{g}}_{j + 1}^T{{\bf{g}}_i} - {\bf{g}}_j^T{{\bf{g}}_i}) \end{array} \tag{21} \label{21}$$

{% endraw %}

  • According to \eqref{21} and Corollary 2: since ${\alpha _j}$ is in general nonzero, whenever $i \ne j$ and $i \ne j+1$ both ${\bf{g}}_{j+1}^T{{\bf{g}}_i}$ and ${\bf{g}}_j^T{{\bf{g}}_i}$ are 0, and therefore ${\bf{d}}_j^T{\bf{A}}{{\bf{g}}_i} = 0$
  • This means that the current gradient is conjugate to all earlier conjugate directions except the most recent one
Simplified calculation
  • Recall the coefficients $\beta_i$ from \eqref{10} that appear in \eqref{11} while solving for ${\bf{d}}_k$:

{% raw %}


$${\beta _i} = \frac{{{\bf{d}}_i^T{\bf{A}}{{\bf{g}}_k}}}{{{\bf{d}}_i^T{\bf{A}}{{\bf{d}}_i}}}$$

{% endraw %}

  • By Corollary 3, in \eqref{10}, whenever $i \ne k$ and $i \ne k-1$ the value of ${\bf{d}}_i^T{\bf{A}}{{\bf{g}}_k}$ is 0
  • Therefore, formula \eqref{11} can be simplified to:

{% raw %}
$$\begin{array}{l} {{\bf{d}}_k} &= {{\bf{g}}_k} - \sum\limits_{i = 1}^{k - 1} {{\beta _i}} {{\bf{d}}_i}\\ &= {{\bf{g}}_k} - {\beta _{k-1}} {{\bf{d}}_{k-1}} \end{array} \tag{22} \label{22}$$

{% endraw %}

  • That is, when computing the $k$-th conjugate direction, we only need to subtract from the current gradient ${\bf{g}}_k$ its conjugate component along the $(k-1)$-th conjugate direction (a small numeric check of this simplification follows)
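A small numeric check of this simplification (my own, on an assumed 3×3 positive definite problem): building ${\bf{d}}_k$ with only the $\beta_{k-1}$ term gives the same direction, up to rounding, as the full sum in \eqref{11}.

```python
import numpy as np

# Compare the full sum of eq. (11) against the single-term update of eq. (22).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x = np.zeros(3)
dirs = []
for k in range(3):
    g = A @ x - b
    if k == 0:
        d = -g
    else:
        d_full = g - sum((di @ A @ g) / (di @ A @ di) * di for di in dirs)      # eq. (11)
        d_last = g - (dirs[-1] @ A @ g) / (dirs[-1] @ A @ dirs[-1]) * dirs[-1]  # eq. (22)
        print(k + 1, np.max(np.abs(d_full - d_last)))    # ~0: the two formulas agree
        d = d_last
    dirs.append(d)
    x = x + (-(d @ g) / (d @ A @ d)) * d
```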
Corollary 4
  • According to the simplified formula \eqref{22}, we have:

{% raw %}


$$\begin{array}{l} {{\bf{d}}_k} &= {{\bf{g}}_k} - {\beta _{k - 1}}{{\bf{d}}_{k - 1}}\\ &= {{\bf{g}}_k} - {\beta _{k - 1}}({{\bf{g}}_{k - 1}} - {\beta _{k - 2}}{{\bf{d}}_{k - 2}})\\ &= {{\bf{g}}_k} + {\gamma _{k - 1}}{{\bf{g}}_{k - 1}} + {\gamma _{k - 2}}{{\bf{g}}_{k - 2}} + \cdots + {\gamma _1}{{\bf{g}}_1}\\ &= \sum\limits_{i = 1}^k {{\gamma _i}{{\bf{g}}_i}} \end{array} \tag{23}$$

{% endraw %}

  • The resulting fixed constant coefficients are denoted by $\gamma$
  • Hence:

{% raw %}


$$\begin{array}{l} {\bf{g}}_i^T{{\bf{d}}_j} &= {\bf{g}}_i^T\sum\limits_{k = 1}^j {{\gamma _k}{{\bf{g}}_k}} \\ &= \sum\limits_{k = 1}^j {{\gamma _k}{\bf{g}}_i^T{{\bf{g}}_k}} \end{array} \tag{24} \label{24}$$

{% endraw %}

  • According to Corollary 2, every term in \eqref{24} with $k \ne i$ vanishes, so the value is 0 when $i > j$; when $i \le j$ only the $k = i$ term survives, giving ${\gamma _i}\,{\bf{g}}_i^T{{\bf{g}}_i}$ (in particular ${\bf{g}}_k^T{{\bf{d}}_k} = {\bf{g}}_k^T{{\bf{g}}_k}$, since the coefficient of ${{\bf{g}}_k}$ in \eqref{23} is 1)

Practical steps of the conjugate gradient method
  • Initial conditions: given ${\bf{A}}$, ${\bf{b}}$ and the initial position ${\bf{x}}_1$ (a runnable sketch of these steps follows the list)
  • ${\bf{g}}_1 = {\bf{A}}{{\bf{x}}_1} - {\bf{b}}$
  • ${{\bf{d}}_1} = -{{\bf{g}}_1}$

  • $k = 1$

  • while $k \le n$:

    • ${\alpha _k} = -\frac{{{\bf{d}}_k^T{{\bf{g}}_k}}}{{{\bf{d}}_k^T{\bf{A}}{{\bf{d}}_k}}}$
    • ${{\bf{x}}_{k+1}} = {{\bf{x}}_k} + {\alpha _k}{{\bf{d}}_k}$
    • ${{\bf{g}}_{k+1}} = {\bf{A}}{{\bf{x}}_{k+1}} - {\bf{b}}$
    • ${\beta _k} = \frac{{{\bf{d}}_k^T{\bf{A}}{{\bf{g}}_{k+1}}}}{{{\bf{d}}_k^T{\bf{A}}{{\bf{d}}_k}}}$
    • ${{\bf{d}}_{k+1}} = {{\bf{g}}_{k+1}} - {\beta _k}{{\bf{d}}_k}$

    • $k = k + 1$

  • return ${\bf{x}}_{n+1}$
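A runnable sketch of these steps, assuming numpy; the function name `conjugate_gradient` and the 3×3 test problem are illustrative assumptions, and the closed-form solution is printed only for comparison.

```python
import numpy as np

def conjugate_gradient(A, b, x1):
    n = A.shape[0]
    x = x1.astype(float)
    g = A @ x - b                          # g_1 = A x_1 - b
    d = -g                                 # d_1 = -g_1
    for _ in range(n):                     # k = 1 .. n
        alpha = -(d @ g) / (d @ A @ d)     # alpha_k = -(d_k^T g_k)/(d_k^T A d_k)
        x = x + alpha * d                  # x_{k+1} = x_k + alpha_k d_k
        g = A @ x - b                      # g_{k+1} = A x_{k+1} - b
        beta = (d @ A @ g) / (d @ A @ d)   # beta_k = (d_k^T A g_{k+1})/(d_k^T A d_k)
        d = g - beta * d                   # d_{k+1} = g_{k+1} - beta_k d_k
    return x                               # x_{n+1}

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
print(conjugate_gradient(A, b, np.zeros(3)), np.linalg.solve(A, b))   # agree
```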

References

  • Baike.baidu.com/item/%E5%85…
  • Blog.csdn.net/lusongno1/a…
  • Blog.csdn.net/weixin_3789…
  • zhuanlan.zhihu.com/p/64227658