Vernacular machine learning – Optimization methods – Newton method


@[toc]

Introduction

The Newton method, together with quasi-Newton methods such as BFGS, is one of the most effective methods for solving nonlinear optimization problems.

Characteristics

  • Fast convergence rate;

  • The Newton method is an iterative algorithm in which each step requires computing the inverse of the Hessian matrix of the objective function, which is computationally expensive; quasi-Newton methods simplify this step by approximating the Hessian matrix or its inverse with a positive definite matrix (a rough sketch of this idea follows the list).
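To make the quasi-Newton idea above a little more concrete, here is a minimal sketch of the BFGS update of an approximate inverse Hessian; the function name and the NumPy-based implementation are my own assumptions, not part of the original post:

```python
import numpy as np

def bfgs_inverse_update(B_inv, s, y):
    """One BFGS update of the approximate inverse Hessian B_inv.

    s = x_{k+1} - x_k  (the step taken)
    y = g_{k+1} - g_k  (the change in the gradient)
    The update keeps B_inv symmetric positive definite whenever y^T s > 0,
    so the true inverse Hessian never has to be computed explicitly.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ B_inv @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
```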

Analysis

Consider the unconstrained optimization problem


$$\min_{x \in \mathbb{R}^n} f(x)$$

where $x^*$ is the minimizer of the objective function. Assume that $f(x)$ has continuous second-order partial derivatives. If the value at the $k$-th iteration is $x^{(k)}$, then $f(x)$ can be expanded in a second-order Taylor series around $x^{(k)}$:


$$f(x) = f(x^{(k)}) + g_k^T (x - x^{(k)}) + \frac{1}{2} (x - x^{(k)})^T H(x^{(k)}) (x - x^{(k)})$$

  • $g_k = g(x^{(k)}) = \nabla f(x^{(k)})$ is the gradient vector of $f(x)$ evaluated at $x^{(k)}$;
  • $H(x^{(k)})$ is the Hessian matrix of $f(x)$, $\left[ \dfrac{\partial^2 f}{\partial x_i \partial x_j} \right]_{n \times n}$, evaluated at $x^{(k)}$.

The Hessian matrix appearing in this Taylor expansion is not explained in detail here; for now, the second-order Taylor expansion of a function of two variables conveys the idea.
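To see the second-order expansion in action, the sketch below compares $f(x)$ with its second-order Taylor approximation around a point $x^{(k)}$ for a simple two-variable function, using finite-difference estimates of the gradient and the Hessian; the test function and step sizes are my own assumptions, not from the original post:

```python
import numpy as np

def f(x):
    # a simple two-variable test function, chosen only for illustration
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2 + np.sin(x[0])

def gradient(f, x, eps=1e-5):
    # central-difference estimate of the gradient vector g(x)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def hessian(f, x, eps=1e-4):
    # central-difference estimate of the Hessian matrix H(x)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[:, i] = (gradient(f, x + e) - gradient(f, x - e)) / (2 * eps)
    return H

xk = np.array([1.0, -0.5])                    # expansion point x^{(k)}
x = xk + np.array([0.05, 0.02])               # a nearby point x
gk, Hk = gradient(f, xk), hessian(f, xk)
taylor = f(xk) + gk @ (x - xk) + 0.5 * (x - xk) @ Hk @ (x - xk)
print(f(x), taylor)                           # the two values agree closely
```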

Moving on: a necessary condition for $f(x)$ to attain an extremum is that the first derivative vanishes at the extreme point, i.e., the gradient vector is zero. In particular, when $H(x^{(k)})$ is a positive definite matrix, the extremum of $f(x)$ is a minimum. Therefore:


$$\nabla f(x) = 0$$

Taking the gradient of the second-order Taylor expansion of $f(x)$ above gives:

$$\nabla f(x) = \nabla\!\left( f(x^{(k)}) + g_k^T (x - x^{(k)}) + \frac{1}{2} (x - x^{(k)})^T H(x^{(k)}) (x - x^{(k)}) \right) = g_k + H(x^{(k)}) (x - x^{(k)})$$

Setting this gradient to zero at $x = x^{(k+1)}$:

$$g_k + H(x^{(k)}) (x^{(k+1)} - x^{(k)}) = 0$$

$$x^{(k+1)} = x^{(k)} - H(x^{(k)})^{-1} g_k$$

or, equivalently,

$$x^{(k+1)} = x^{(k)} + p_k, \qquad H(x^{(k)}) p_k = -g_k$$

This gives the Newton update formula.
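As a quick numerical check of the update $x^{(k+1)} = x^{(k)} - H(x^{(k)})^{-1} g_k$, the sketch below applies a single Newton step to a quadratic function, where one step already lands exactly on the minimizer; the specific quadratic is my own example:

```python
import numpy as np

# f(x) = 1/2 x^T A x - b^T x with A positive definite; its minimizer solves A x = b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

grad = lambda x: A @ x - b     # gradient g(x)
hess = lambda x: A             # Hessian H(x) is the constant matrix A

xk = np.array([5.0, 5.0])                    # arbitrary starting point x^{(k)}
pk = np.linalg.solve(hess(xk), -grad(xk))    # solve H(x^{(k)}) p_k = -g_k
x_next = xk + pk                             # x^{(k+1)} = x^{(k)} + p_k

print(x_next, np.linalg.solve(A, b))         # both print the same point
```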

Algorithm

Input: objective function $f(x)$, gradient $g(x) = \nabla f(x)$, Hessian matrix $H(x)$, precision requirement $\varepsilon$;
Output: the minimizer $x^*$ of $f(x)$.

  1. Take an initial point $x^{(0)}$ and set $k = 0$;
  2. Compute $g_k = g(x^{(k)})$;
  3. If $\|g_k\| < \varepsilon$, stop the computation and return the solution $x^* = x^{(k)}$;
  4. Compute $H_k = H(x^{(k)})$ and solve for $p_k$ from
  $$H(x^{(k)}) p_k = -g_k$$
  5. Update $x^{(k+1)} = x^{(k)} + p_k$, set $k = k + 1$, and go to step 2 (a runnable sketch of this loop follows the list).
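The following is a minimal Python sketch of the algorithm above, assuming NumPy and user-supplied gradient and Hessian functions; the quadratic test problem at the end is my own example, not part of the original post:

```python
import numpy as np

def newton_method(grad, hess, x0, eps=1e-6, max_iter=100):
    """Newton's method for unconstrained minimization.

    grad     : function returning the gradient g(x) = ∇f(x)
    hess     : function returning the Hessian matrix H(x)
    x0       : initial point x^{(0)}
    eps      : precision; stop when ||g_k|| < eps
    max_iter : safety cap on the number of iterations
    """
    x = np.asarray(x0, dtype=float)          # step 1: initial point, k = 0
    for k in range(max_iter):
        g = grad(x)                          # step 2: g_k = g(x^{(k)})
        if np.linalg.norm(g) < eps:          # step 3: stop if ||g_k|| < ε
            break
        p = np.linalg.solve(hess(x), -g)     # step 4: solve H_k p_k = -g_k
        x = x + p                            # step 5: x^{(k+1)} = x^{(k)} + p_k
    return x

# Example: minimize f(x1, x2) = (x1 - 1)^2 + 10 (x2 + 2)^2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_method(grad, hess, x0=[0.0, 0.0]))   # prints approximately [1, -2]
```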