Theoretical basis

Logistic regression is a classification model in machine learning. This section will introduce the theoretical basis of Logistic regression.

Relevant knowledge

The idea of maximum likelihood: suppose there is a jar of black and white balls; the number of balls and the ratio of the colors are both unknown. We draw 10 balls from the jar as follows: take out 1 ball, record its color, put it back into the jar, and shake well; repeat this operation 10 times. Suppose the record shows 7 white balls and 3 black balls. What is the most likely ratio of black to white balls in the jar? Most people answer 3:7 without hesitation. The theory behind this is the idea of maximum likelihood.

Likelihood refers to the observed outcome happening; maximum likelihood means choosing whatever makes the probability of the observed outcome largest. Now let's look at how the idea of maximum likelihood arrives at this result.

Suppose the probability of drawing a black ball is $p$; then the probability of drawing a white ball is $1 - p$. The probability of drawing 7 white balls and 3 black balls in 10 draws is then $p^{3}(1-p)^{7}$. To maximize the probability of this outcome, we solve:

$$\max_{p} \; L(p) = p^{3}(1-p)^{7}$$

Setting the derivative to 0, we obtain:

$$\frac{dL}{dp} = p^{2}(1-p)^{6}\left(3 - 10p\right) = 0$$

So the proportion of black balls drawn is $p = 0.3$, and the ratio of black to white balls is 3:7.
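As a quick numerical check of this result, a minimal sketch (not part of the original text) can evaluate the likelihood on a grid of candidate values of $p$ and confirm that it peaks at 0.3:

import numpy as np

# Likelihood of observing 3 black and 7 white balls when P(black) = p
def likelihood(p):
    return p**3 * (1 - p)**7

# Evaluate the likelihood on a fine grid over (0, 1)
grid = np.linspace(0.001, 0.999, 999)
print(grid[np.argmax(likelihood(grid))])  # -> ~0.3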

Model features

  • Model capability of Logistic regression: classification

    Given the features of a sample $x$, predict the label of the sample.
  • Learning method of Logistic regression: supervised learning; labeled training samples are required to train the model.
  • Model type of Logistic regression: discriminative model

    From the training features $x$ and labels $y$, it directly learns the conditional probability $P(y \mid x)$ as the model, and computes a sample's category directly from its features.

Model derivation

The idea of maximum likelihood can be summed up as "seeing the essence through the appearance". In the black-and-white-ball example above, the appearance is that of 10 balls drawn, 3 are black and 7 are white; the essence is that the probability of drawing a black ball is 0.3. In logistic regression, the appearance is the features and labels of the training samples, and the essence is the parameters of the logistic regression (LR) model. Our task is to find the set of LR model parameters that maximizes the probability of observing the features and labels of the training samples. Suppose we have the following $m$ samples, each with $n$-dimensional features:

$$x^{(1)}, x^{(2)}, \ldots, x^{(m)}, \qquad x^{(i)} \in \mathbb{R}^{n}$$

The labels are $y^{(i)} \in \{0, 1\}$, and the model parameters of the LR model are $\theta$. Following the idea of maximum likelihood, we regard these $m$ samples as the outcome that actually occurred, so the problem is transformed into:

$$\max_{\theta} \; \prod_{i=1}^{m} P\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$$

where $P(y \mid x; \theta)$ is the LR model.

We can see that the logistic regression model is actually a classification model, not a regression model. So why the name? The author believes that logistic regression can be decomposed into logistic + linear regression. The linear regression model is a regression model, which can be expressed as:

$$z = \theta^{T} x$$

The value of this model ranges over $(-\infty, +\infty)$. To turn it into a classification problem, we need a function that maps $(-\infty, +\infty)$ into $(0, 1)$ (a probability representation) and is continuous, smooth, strictly monotone, and symmetric about its midpoint. The logistic function (sigmoid) satisfies all these requirements:

$$g(z) = \frac{1}{1 + e^{-z}}$$

The code for drawing the function image is as follows:

import matplotlib.pyplot as plt
import numpy as np
import math

# Naive version (overflows for large negative x):
# def sigmoid(x):
#     return 1/(1+math.exp(-x))

# Numerically stable version: sigmoid(x) = exp(-log(1 + exp(-x)))
def sigmoid(x):
    return np.exp(-np.logaddexp(0, -x))

x = np.arange(-10., 10., 0.2)
y = [sigmoid(x_i) for x_i in x]
plt.plot(x, y)
plt.show()

By composing the logistic function with the linear regression model function, we obtain the model function of the logistic regression model:

$$h_{\theta}(x) = g\!\left(\theta^{T} x\right) = \frac{1}{1 + e^{-\theta^{T} x}}$$

We let $h_{\theta}(x)$ represent the model function of the LR model and interpret it as the probability that the sample is positive: the sample is a positive sample with probability $h_{\theta}(x)$ and a negative sample with probability $1 - h_{\theta}(x)$, i.e. $P(y=1 \mid x; \theta) = h_{\theta}(x)$ and $P(y=0 \mid x; \theta) = 1 - h_{\theta}(x)$, which can be written compactly as $P(y \mid x; \theta) = h_{\theta}(x)^{y}\left(1 - h_{\theta}(x)\right)^{1-y}$. Substituting the LR model function back into the maximum likelihood problem, we get:

$$L(\theta) = \prod_{i=1}^{m} h_{\theta}\!\left(x^{(i)}\right)^{y^{(i)}} \left(1 - h_{\theta}\!\left(x^{(i)}\right)\right)^{1 - y^{(i)}}$$
Since this is a product of many factors, we apply the monotone $\log$ operator to turn the product into a sum without affecting the final result:

$$\log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_{\theta}\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_{\theta}\!\left(x^{(i)}\right)\right) \right]$$
We define:

$$\ell(\theta) = \log L(\theta)$$
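To make the formula concrete, here is a minimal sketch that evaluates $\ell(\theta)$; the helper names are our own illustrations, not the project code below, and `X` is assumed to already carry a bias column:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(theta, X, y):
    # h_theta(x) for every sample
    h = sigmoid(X @ theta)
    # sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))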
To maximize the above, we need $\frac{\partial \ell(\theta)}{\partial \theta_j}$, and therefore $\frac{\partial h_{\theta}(x)}{\partial \theta_j}$. From the LR model function we have:

$$g'(z) = g(z)\left(1 - g(z)\right), \qquad \frac{\partial h_{\theta}(x)}{\partial \theta_j} = h_{\theta}(x)\left(1 - h_{\theta}(x)\right) x_j$$
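The identity $g'(z) = g(z)(1 - g(z))$ is easy to verify numerically; a minimal sketch (not from the original text) using a central finite difference:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 0.7, 1e-6
# Central finite-difference approximation of g'(z)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)  # both ~0.2217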
Since $\ell(\theta)$ is a function of multiple variables, gradient ascent is used to find its maximum. Finding the maximum of $\ell(\theta)$ is the same as finding the minimum of $-\ell(\theta)$; at this point $-\ell(\theta)$ can be understood as the loss function of the logistic regression model, so the final problem is equivalent to finding the minimum of $-\ell(\theta)$ by gradient descent. The parameter update of gradient descent proceeds as follows:

$$\theta_j := \theta_j + \alpha \, \frac{\partial \ell(\theta)}{\partial \theta_j}$$
where $\alpha$ is the learning rate. The partial derivative of $\ell(\theta)$ with respect to $\theta_j$ is:

$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( y^{(i)} - h_{\theta}\!\left(x^{(i)}\right) \right) x_j^{(i)}$$
The above formula expresses the following: the model parameter $\theta_j$ on the $j$-th feature dimension is updated in the direction given by the learning rate (learning step) $\alpha$ times the sum, over all samples, of the difference between each sample's true label $y^{(i)}$ and the model's prediction $h_{\theta}(x^{(i)})$, weighted by that sample's value $x_j^{(i)}$ on the $j$-th dimension.

Simply put, the direction of parameter update is adjusted by multiplying the learning step by the model error.
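A minimal vectorized sketch of this update rule (assuming an augmented feature matrix `X` of shape (m, n+1) and labels `y`; the names are illustrative, and the project code below implements the same idea sample by sample):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step(theta, X, y, alpha=0.001):
    # Model predictions h_theta(x) for all samples
    pred = sigmoid(X @ theta)
    # theta_j += alpha * sum_i (y_i - pred_i) * x_ij  (gradient ascent on l(theta))
    return theta + alpha * (X.T @ (y - pred))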

Advanced optimization

Feature discretization converts continuous features into a series of 0/1 features. Its advantages are listed below (a small sketch follows the list).

1. Features are easy to add and remove, and they are highly interpretable. Take the age dimension as an example: the first version of the model might use a feature such as age ≥ 18, which can be read as "adult"; a second version can directly add a feature such as age ≤ 30, read as "young".

2. Discretized features take only the values 0 and 1, so the sparse inner products in LR are fast to compute, and the results are easy to store and extend.

3. Discretized features are highly robust. Take age with the feature age ≥ 18: if there is an entry error in the age data, say a height value of 175 is entered, the discretized feature is still just a 1, so the impact is limited. With a continuous feature, the corresponding term becomes very large because x = 175, and the result deviates badly.

4. The LR model is still in essence a linear model. Without discretization, the age feature has only a single coefficient; after discretization, age may become several features (e.g. age < 18, 18 ≤ age < 30, age ≥ 30), each with its own parameter, so the model is piecewise linear in age, which improves its expressive power.

5. Feature discretization facilitates feature crossing. Without it, you might only be able to cross age and gender as whole features; after discretization you can form crosses such as "female minor" or "male youth". If feature A is discretized into M values and feature B into N values, crossing them yields M × N variables, further introducing nonlinearity and improving expressive power.

6. After feature discretization, the model is more stable. Suppose the most important feature is age, discretized into intervals such as 18–30: all age values inside the same interval map to the same feature value, so the final result does not change much just because someone is one year older.

7. Feature discretization spreads a feature's influence across several parameters, preventing any one dimension from dominating, which can lead to over-fitting. If the age feature had a very large weight, the model would depend excessively on that dimension, and a small fluctuation in age would have a huge impact on the model's output.
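As an illustration of items 1 and 4 (a minimal sketch; the interval boundaries are our own illustrative choices), continuous ages can be bucketed into 0/1 indicator features with NumPy:

import numpy as np

ages = np.array([15, 22, 41, 67])

# Illustrative interval boundaries: <18, 18-30, 30-50, >=50
bins = [18, 30, 50]
bucket = np.digitize(ages, bins)       # interval index for each age

# One-hot encode the interval index into 0/1 features
one_hot = np.eye(len(bins) + 1)[bucket]
print(one_hot)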

Project implementation

In the code below, LR_classify is the LR classifier model; the full code is as follows.

Code

#! /usr/bin/env python3
# -*- coding: utf-8 -*-
""" Created on Thu Aug 23 19:52:02 2018 @author: huangzhaolong """

from sklearn.datasets import load_iris
import numpy as np
import random
import math

def load_data():
    
    iris = load_iris()
    X = iris.data
    Y = iris.target
    # Feature discretization (round) and reduce multi-class to binary
    X = X[Y <= 1].round()
    Y = Y[Y <= 1]
    data_num = X.shape[0]
    # Randomly assign roughly 80% of the samples to the training set
    index_list = [random.random() >= 0.2 for _ in range(data_num)]
    train_index_list = []
    test_index_list = []
    for i in range(len(index_list)):
        if index_list[i]:
            train_index_list.append(i)
        else:
            test_index_list.append(i)

    X_train = X[train_index_list]
    X_test = X[test_index_list]
    Y_train = Y[train_index_list]
    Y_test = Y[test_index_list]

    return X_train, X_test, Y_train, Y_test

class LR_classify(object):
    
    def sigmoid(self, num_list):
        # Scalar input: return a single sigmoid value
        if isinstance(num_list, (int, float)):
            return 1/(1+math.exp(-num_list))
        else:
            # Iterable input: apply the sigmoid element-wise
            result_list = []
            for num in num_list:
                if isinstance(num, (int, float)):
                    result_list.append(1/(1+math.exp(-num)))
                else:
                    print("error type!")
                    return -1
            return result_list
    
    def LR_train(self, X_train, Y_train, train_step = 1000):
        # Number of training steps
        train_step = 10
        # Initialize the weights randomly (one extra weight for the bias)
        weights = [random.random() for _ in range(X_train.shape[1]+1)]
        # Augment the training set with a column of ones (the bias term)
        X_train = np.hstack((X_train, np.ones(X_train.shape[0]).reshape(X_train.shape[0],1)))
        # Step size for gradient descent
        learning_rate = 0.001
        # Number of samples
        sample_num = X_train.shape[0]
        
        for step in range(train_step):
            # prediction
            pred = self.sigmoid(np.dot(X_train,weights))
            for i in range(sample_num):
                error = Y_train[i] - pred[i]
                weights = weights + learning_rate * error * X_train[i]
            # Report the training accuracy at every step
            if step % 1 == 0:
                acc = cal_accuracy([round(z) for z in pred], Y_train)
                print("Training step", step, "accuracy:", acc)

        self.weights = weights
        
    def LR_test(self, X_test):
        X_test = np.hstack((X_test, np.ones(X_test.shape[0]).reshape(X_test.shape[0],1)))
        weights = self.weights
        pred = self.sigmoid(np.dot(X_test,weights))

        return pred
        
def cal_accuracy(pred, label):
    
    # Fraction of predictions that exactly match the labels
    acc = list(pred - label).count(0)/len(pred - label)
    
    return acc
        
if __name__=="__main__":
    acc = 0
    
    # Average the test accuracy over the random splits (a single run here)
    for i in range(1):
        X_train, X_test, Y_train, Y_test = load_data()
        
        lr = LR_classify()
        
        lr.LR_train(X_train, Y_train)
        
        pred = lr.LR_test(X_test)
        
        acc += cal_accuracy([round(z) for z in pred],Y_test)
        
    acc = acc/1
    
    print("Test set Accuracy :",acc)

Results

Training step 0 accuracy: 0.46153846153846156
Training step 1 accuracy: 0.46153846153846156
Training step 2 accuracy: 0.6153846153846154
Training step 3 accuracy: 0.6153846153846154
Training step 4 accuracy: 0.8717948717948718
Training step 5 accuracy: 0.8717948717948718
Training step 6 accuracy: 0.9871794871794872
Training step 7 accuracy: 0.9871794871794872
Training step 8 accuracy: 1.0
Training step 9 accuracy: 1.0
Test set accuracy: 1.0