Perceptron

@(Machine learning algorithm)


@[toc]

An overview of the perceptron

The perceptron was proposed by Rosenblatt in 1957. It is the basis of neural networks and support vector machines, and is of great significance for neural network learning and deep learning. The perceptron is a linear model for binary classification: its input is the feature vector of an instance, and its output is the category of the instance, taking the values +1 and -1. The perceptron corresponds to a separating hyperplane that divides instances in the input space (feature space) into positive and negative classes, so it is a discriminative model. The goal of perceptron learning is to find a hyperplane that linearly separates the training data. To this end, a loss function based on misclassification is introduced, and gradient descent is used to minimize this loss function to obtain the perceptron model. The perceptron algorithm is simple and easy to implement, and comes in two forms: the original form and the dual form.

Perceptron model

Definition: Suppose the input space (feature space) is $X \subseteq R^n$ and the output space is $Y = \{+1, -1\}$. The input $x \in X$ represents the feature vector of an instance, a point in the input space (feature space); the output $y \in Y$ represents the category of the instance. The following function maps the input space to the output space:


$$y = \mathrm{sign}(w \cdot x + b)$$

This function is called the perceptron, where $w$ and $b$ are the model parameters: $w \in R^n$ is called the weight or weight vector, $b \in R$ is called the bias, $w \cdot x$ denotes the inner product of $w$ and $x$, and $\mathrm{sign}$ is the sign function, that is,

$$\mathrm{sign}(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases}$$
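
As a concrete illustration, here is a minimal NumPy sketch of this decision function; the function name `perceptron_predict` and the toy values of `w`, `b`, and `x` are made up for the example, not taken from the text.

```python
import numpy as np

def perceptron_predict(w, b, x):
    """Perceptron decision function y = sign(w·x + b), with sign(0) = +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Toy usage with hypothetical parameters and input.
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([3.0, 1.0])
print(perceptron_predict(w, b, x))  # +1, since 1*3 - 2*1 + 0.5 = 1.5 >= 0
```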

Perceptron learning strategies

Assuming that the training data set is linearly separable, the goal of perceptron learning is to find a separating hyperplane that completely separates the positive and negative instance points of the training data set. To find such a hyperplane we need to determine the model parameters $w$ and $b$, which requires a learning strategy, that is, defining an empirical loss function and minimizing it.

  1. Choice of the loss function:
  • A natural choice is the total number of misclassified points. Disadvantage: such a loss function is not a continuously differentiable function of the parameters $w$ and $b$, so it is difficult to optimize;
  • The total distance from the misclassified points to the hyperplane: this is the approach taken by the perceptron.
  2. Derivation of the loss function:
  • The distance from any point $x_0$ in the input space $R^n$ to the hyperplane $S$ is:

$$\frac{1}{\|w\|}\,|w \cdot x_0 + b|$$

Here, $\|w\|$ is the $L_2$ norm of $w$;

  • The misclassification condition: for a misclassified point $(x_i, y_i)$,


$$-y_i(w \cdot x_i + b) > 0$$

  • Therefore the distance from a misclassified point $(x_i, y_i)$ to the hyperplane is $-\frac{1}{\|w\|} y_i (w \cdot x_i + b)$. Assuming $M$ is the set of misclassified points, the total distance from the misclassified points to the hyperplane is:


$$-\frac{1}{\|w\|} \sum_{x_i \in M} y_i (w \cdot x_i + b)$$

  • Dropping the factor $\frac{1}{\|w\|}$, we obtain the perceptron loss function $L(w, b)$ used in the learning algorithm below.
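
The quantities in this derivation are easy to check numerically. The sketch below, using a made-up hyperplane and a single labelled point, computes the point-to-hyperplane distance and the misclassification test described above.

```python
import numpy as np

def distance_to_hyperplane(w, b, x0):
    """Distance from x0 to the hyperplane w·x + b = 0: |w·x0 + b| / ||w||_2."""
    return abs(np.dot(w, x0) + b) / np.linalg.norm(w)

def is_misclassified(w, b, x, y):
    """A labelled point (x, y) is misclassified exactly when -y * (w·x + b) > 0."""
    return -y * (np.dot(w, x) + b) > 0

# Hypothetical hyperplane x1 + x2 - 1 = 0 and a labelled point.
w, b = np.array([1.0, 1.0]), -1.0
x, y = np.array([0.0, 0.0]), +1          # the origin, labelled +1
print(is_misclassified(w, b, x, y))      # True: w·x + b = -1 while y = +1
print(distance_to_hyperplane(w, b, x))   # 0.707..., i.e. 1/sqrt(2)
```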

Perceptron learning algorithm


Perceptron learning thus reduces to the following optimization problem:

$$\min_{w,b} L(w,b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$
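
As a sanity check, the loss can be evaluated directly from this definition. The sketch below uses a tiny made-up data set; only misclassified points contribute, and dividing by $\|w\|$ recovers the total distance from the previous section.

```python
import numpy as np

def perceptron_loss(w, b, X, Y):
    """Empirical loss L(w, b) = -sum_{x_i in M} y_i * (w·x_i + b).

    X: (n_samples, n_features) array; Y: labels in {+1, -1}.
    Points with y_i * (w·x_i + b) <= 0 are treated as misclassified
    (boundary points contribute 0), so the loss is always >= 0.
    """
    margins = Y * (X @ w + b)
    return -np.sum(margins[margins <= 0])

# Made-up data: the second point is misclassified by this (w, b).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
Y = np.array([+1, +1])
w, b = np.array([1.0, 1.0]), 0.0
print(perceptron_loss(w, b, X, Y))                      # 2.0
print(perceptron_loss(w, b, X, Y) / np.linalg.norm(w))  # total distance, with 1/||w|| kept
```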

Original form

The perceptron learning algorithm is misclassification-driven, and the specific optimization method is stochastic gradient descent. First a hyperplane $w_0, b_0$ is chosen at random; the objective function above is then minimized step by step by gradient descent. During minimization, instead of performing gradient descent on all misclassified points in $M$ at once, one misclassified point is randomly selected at a time for the gradient step. The gradient of the objective function (its partial derivatives) is:

$$\nabla_w L(w,b) = -\sum_{x_i \in M} y_i x_i$$

$$\nabla_b L(w,b) = -\sum_{x_i \in M} y_i$$
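
For illustration, these batch gradients can be computed directly over the misclassified set; the helper below is a hypothetical sketch, not part of the original text.

```python
import numpy as np

def perceptron_gradients(w, b, X, Y):
    """Gradients of L(w, b) over the misclassified set M:
    grad_w = -sum_{x_i in M} y_i * x_i,  grad_b = -sum_{x_i in M} y_i."""
    mis = Y * (X @ w + b) <= 0          # mask of misclassified points
    grad_w = -X[mis].T @ Y[mis]
    grad_b = -np.sum(Y[mis])
    return grad_w, grad_b

# At w = 0, b = 0 every point has margin 0, so all points count as misclassified.
X = np.array([[3.0, 3.0], [1.0, 1.0]])
Y = np.array([+1, -1])
print(perceptron_gradients(np.zeros(2), 0.0, X, Y))  # grad_w = [-2. -2.], grad_b = 0
```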

Then a misclassified point $(x_i, y_i)$ is randomly selected and $w$ and $b$ are updated: $w \leftarrow w + \eta y_i x_i$, $b \leftarrow b + \eta y_i$, where $\eta$ ($0 < \eta \leq 1$) is the learning rate. Gradient descent has already been introduced, so I won't go into it again here.
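
To make the procedure concrete, here is a minimal NumPy sketch of the original form. The toy data set, the learning rate $\eta = 1$, and the starting point $w = 0$, $b = 0$ are illustrative choices rather than part of the text above (which starts from a randomly chosen hyperplane).

```python
import numpy as np

def train_perceptron(X, Y, eta=1.0, max_iter=1000):
    """Original-form perceptron trained with stochastic gradient descent (a sketch).

    X: (n_samples, n_features) array; Y: labels in {+1, -1}; eta: learning rate.
    Starts from w = 0, b = 0 for simplicity and assumes the data are linearly
    separable; otherwise it stops after max_iter updates.
    """
    w, b = np.zeros(X.shape[1]), 0.0
    rng = np.random.default_rng(0)
    for _ in range(max_iter):
        # Indices of currently misclassified points: y_i * (w·x_i + b) <= 0.
        mis = np.where(Y * (X @ w + b) <= 0)[0]
        if len(mis) == 0:       # nothing misclassified: the training data are separated
            break
        i = rng.choice(mis)     # randomly pick one misclassified point
        # One stochastic gradient step: w <- w + eta*y_i*x_i, b <- b + eta*y_i.
        w += eta * Y[i] * X[i]
        b += eta * Y[i]
    return w, b

# A tiny linearly separable toy data set (made up for illustration).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
Y = np.array([+1, +1, -1])
w, b = train_perceptron(X, Y)
print(w, b)  # one separating hyperplane, e.g. w = [1. 1.], b = -3.0 (depends on update order)
```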

Dual form

More on that later.