The basic workflow of logistic regression is: (A) set up the regression or classification model -> (B) define the cost function -> (C) iteratively find the optimal model parameters with an optimization method -> (D) validate the quality of the fitted model.

1. Logistic regression model:


Logistic regression is a classification algorithm built on top of linear regression, and is usually used to solve binary classification problems.

The linear regression model is:

z = θ^T·x = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n


Logistic regression builds on linear regression (it is a generalized linear model); its hypothesis maps the linear combination through the logistic function:

h_θ(x) = g(θ^T·x) = 1 / (1 + e^(−θ^T·x))

where g(z) = 1 / (1 + e^(−z)) is called the sigmoid (logistic) function.

The sigmoid function maps any real input into the interval (0, 1), with g(0) = 0.5, so h_θ(x) can be interpreted as the probability that a sample belongs to the positive class:

h_θ(x) ≥ 0.5: predict class y = 1

h_θ(x) < 0.5: predict class y = 0
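As a quick illustration, here is a minimal NumPy sketch of the sigmoid and the 0.5 decision threshold (function names are illustrative, not part of the original article):

import numpy as np

def sigmoid(z):
    # maps any real value into (0, 1); sigmoid(0) == 0.5
    return 1 / (1 + np.exp(-z))

def predict_class(theta, x):
    # h_theta(x) = sigmoid(theta^T x); predict class 1 when the probability >= 0.5
    h = sigmoid(np.dot(x, theta))
    return (h >= 0.5).astype(int)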

2. Cost function

The cost function is derived from maximum likelihood. Since h_θ(x) is the probability that y = 1, the two class probabilities can be combined into a single expression:

P(y | x; θ) = h_θ(x)^y · (1 − h_θ(x))^(1 − y)    (1)

Taking the likelihood of equation (1) over all m training samples and then the logarithm gives:

log L(θ) = Σ_{i=1}^{m} [ y^(i)·log h_θ(x^(i)) + (1 − y^(i))·log(1 − h_θ(x^(i))) ]

Multiplying this by −1/m gives the cost function, whose optimal θ parameters are then found by gradient descent:

J(θ) = −(1/m)·Σ_{i=1}^{m} [ y^(i)·log h_θ(x^(i)) + (1 − y^(i))·log(1 − h_θ(x^(i))) ]


When the samples are stacked into a matrix X (one row per sample, with a leading column of ones) and y is the label vector, the same quantities can be written in matrix form; in particular the gradient is ∇J(θ) = (1/m)·X^T·(g(Xθ) − y).
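A minimal NumPy sketch of this vectorized cost (illustrative names; X is assumed to carry a leading column of ones for the intercept):

import numpy as np

def cost(theta, X, y, eps=1e-12):
    # J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) ) over all m samples
    m = len(y)
    h = 1 / (1 + np.exp(-X.dot(theta)))     # h_theta(x) for every sample
    h = np.clip(h, eps, 1 - eps)            # guard against log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m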

2.1 Gradient descent method to solve the minimum value

To find the optimum of a function (a maximum or a minimum), the usual mathematical approach is to take the derivative, set it equal to 0, and solve the resulting equation directly. In machine learning, however, the functions involved are often high-dimensional and high-order, so the equation "derivative equals zero" is difficult (sometimes impossible) to solve directly. Other methods are needed to reach the result, and gradient descent is one of them.

The θ parameters are updated as:

θ_j := θ_j − α·(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_j^(i)

α is the learning rate
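Putting the update rule in a loop, a minimal batch-gradient-descent sketch (illustrative names; X again includes the intercept column):

import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    # batch gradient descent for the logistic regression cost
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = 1 / (1 + np.exp(-X.dot(theta)))   # current predictions h_theta(x)
        grad = X.T.dot(h - y) / m             # (1/m) * X^T (h - y)
        theta -= alpha * grad                 # theta := theta - alpha * gradient
    return theta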

3. Regularization:

(1) Overfitting problem

Overfitting means the model fits the training data too closely: its complexity increases, but its ability to generalize to new data deteriorates.


(2) Regularization method: regularization implements the structural risk minimization strategy by adding a regularization (penalty) term to the empirical risk.

The regularization term can take different forms; common choices are the (squared) L2 norm or the L1 norm of the parameters. With the L2 penalty, the cost function of the model becomes:

J(θ) = −(1/m)·Σ_{i=1}^{m} [ y^(i)·log h_θ(x^(i)) + (1 − y^(i))·log(1 − h_θ(x^(i))) ] + (λ/2m)·Σ_{j=1}^{n} θ_j²



λ is the regularization coefficient:

• If λ is very large, model complexity is penalized heavily, which may lead to underfitting (minimizing J(θ) with a large λ pushes the θ_j toward 0).

For example, for the curve z = θ_0 + θ_1·x_1 + θ_2·x_1² + θ_3·x_2³, adding a large penalty λ on θ_3 to the cost drives θ_3 → 0 and suppresses the high-order term.

• If λ is very small, the emphasis is on fitting the training data, so the error on the training data will be small, but this may lead to overfitting (more of the θ_j are kept large).

With regularization, the gradient descent update of θ becomes (the intercept θ_0 is not regularized):

θ_0 := θ_0 − α·(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_0^(i)

θ_j := θ_j·(1 − α·λ/m) − α·(1/m)·Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))·x_j^(i),   j = 1, …, n
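A minimal sketch of one regularized update step (θ_0 is the intercept and is left unpenalized; names are illustrative):

import numpy as np

def regularized_step(theta, X, y, alpha=0.1, lam=1.0):
    # one gradient-descent step with an L2 penalty; theta[0] (intercept) is not regularized
    m = X.shape[0]
    h = 1 / (1 + np.exp(-X.dot(theta)))
    grad = X.T.dot(h - y) / m
    grad[1:] += (lam / m) * theta[1:]   # add the penalty gradient for j >= 1
    return theta - alpha * grad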





4. Advantages and disadvantages of logistic regression:

Advantages:

(1) Fast; well suited to binary classification problems.

(2) Simple and easy to interpret: the weight of each feature can be read off directly.

(3) New data can be absorbed easily.

Disadvantages: the range of data and scenarios it handles is limited, and it is not as adaptable as decision tree algorithms.


Example: building a logistic regression model on a bank's data to help reduce the loan default rate.

Randomized logistic regression is used first to screen the features, and a logistic regression model is then fitted on the selected features.

#coding=gbk
# Randomized logistic regression (a stability-selection method) is used to screen the features;
# a logistic regression model is then fitted on the selected features
import pandas as pd 
import numpy as np
filename = r'D:\datasets\bankloan.xls'
data = pd.read_excel(filename)
print(data.head())
# age education seniority address income debt ratio Credit card debt other liabilities default
# 0 41 3 17 12 176 9.3 11.359392 5.008608 1
# 1 27 1 10 6 31 17.3 1.362202 4.000798 0
# 2 40 1 15 14 55 5.5 0.856075 2.168925 0
# 3 41 1 15 14 120 2.9 2.658720 0.821280 0
# 4 24 2 2 0 28 17.3 1.787436 3.056564 1
x = data.iloc[:,:8].values
y = data.iloc[:,8].values      # convert to plain arrays (no index); the label column is 'default' (1/0)

from sklearn.linear_model import RandomizedLogisticRegression as RLR
rlr = RLR()
rlr.fit(x, y)
rlr.get_support()   # Obtain feature screening results,
print(rlr.get_support())    #[False False True True False True True False]
print(rlr.scores_)  #[0.085 0.045 0.995 0.395 0. 0.995 0.595 0.04] get the score of the feature result
print('End of feature screening by random logistic regression model')
# print(u'Valid characteristics are: %s' % ', '.join(data.columns[rlr.get_support()]))
# x = data[data.columns[rlr.get_support()]].as_matrix()
print(u'Valid characteristics are: %s' % ', '.join(np.array(data.iloc[:,:8].columns)[rlr.get_support()]))

x = data[np.array(data.iloc[:,:8].columns)[rlr.get_support()]].values # keep only the selected features

from sklearn.linear_model import LogisticRegression as LR
lr = LR()
lr.fit(x,y)
print('End of logistic regression model training')
print('The average accuracy of the model is: %s' % lr.score(x,y))
# End of feature screening by random logistic regression model
# Effective characteristics are: length of service, address, debt ratio, credit card debt
# End the training by logistic regression model
# The average accuracy of the model is: 0.8142857142857143
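Note that RandomizedLogisticRegression was deprecated and later removed from scikit-learn, so the block above only runs on older versions. On newer versions, a roughly comparable screening step can be sketched with L1-based selection (an illustrative alternative, not the method used above; parameter values such as C=0.1 are arbitrary):

# Hypothetical alternative for newer scikit-learn versions,
# where RandomizedLogisticRegression is no longer available.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

x = data.iloc[:, :8].values
y = data.iloc[:, 8].values
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
selector.fit(x, y)
print(selector.get_support())          # boolean mask of selected features
x_selected = selector.transform(x)     # keep only the selected columns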

Gradient descent in matrix form: θ := θ − α·X^T·(g(Xθ) − y), which is the update used in the implementation below.

An implementation of logistic regression in Python

#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""__title__ = '__author__ =' mike_jun __mtime__ = '2019-6-19' # purpose: the realization of the logistic regression"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, [0, 1, -1]]) # just use 2 features
    return data[:, :2], data[:, -1]

X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(X_train.shape)    # (70, 2)

class LogisticRegression():
    def __init__(self, learning_rate=0.05, max_iter=500, random_state=666):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.random_state = random_state
        self.theta = None
        self.inter_ = None
        self.weights = None

    def sigmoid(self, x, theta):
        # return the sigmoid of x.dot(theta)
        return 1 / (1 + np.exp(-np.dot(x, theta)))

    def leftAppend(self, x):
        # Add a column of all ones to the left of the array to compute the intercept
        allOnes = np.ones(shape=(x.shape[0],  1))
        x = np.c_[allOnes, x]
        return x

    def fit(self, x, y):
        y = y.reshape(-1, 1)    # this step must be performed (column vector)
        x = self.leftAppend(x)
        # Initialize Theta
        # self.theta = np.zeros(shape=(x.shape[1], 1))
        # Normal distribution initializes Theta
        np.random.seed(self.random_state)
        self.theta = np.random.randn(x.shape[1], 1)
        print(self.theta.shape)

        # Gradient descent
        errors = []
        for iter_ in range(self.max_iter):
            h_x = self.sigmoid(x, self.theta)
            # print(h_x.shape)
            self.theta = self.theta - self.learning_rate * np.dot(x.T, (h_x - y))
            # Calculate the loss
            error = np.sum(h_x - y)
            errors.append(error)
        print(errors)
        return self

    def predict(self, X_test):
        # Predict the input vector
        X_test = self.leftAppend(X_test)
        y_pred = self.sigmoid(X_test, self.theta)
        return y_pred

    def score(self, X_test, y_test):
        # Get the accuracy score on the test set
        X_test = self.leftAppend(X_test)
        y_test = y_test.reshape(-1, 1)
        # predict values with sigmoid >= 0.5 as 1 and 0 otherwise
        results = self.sigmoid(X_test, self.theta)
        results = np.where(results >= 0.5, 1, 0)
        counts = len(y_test)
        trueNum = np.sum(y_test == results)
        scores = trueNum / counts
        print(scores)
        return scores

    def printTheta(self):
        print(self.theta)
        self.inter_ = self.theta[0]
        self.weights = self.theta[1:]
        print('Intercept is:', self.inter_)
        print('The weight value is:', self.weights)

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.printTheta()
lr.score(X_test, y_test)
# x = lr.leftAppend(X_train)
# print(x.shape)

# Draw all the data
plt.scatter(X[:50, 0], X[:50, 1])
plt.scatter(X[50:, 0], X[50:, 1])
x_points = np.arange(4, 8)
# theta_1 * x + theta_2 * y + theta_0 = 0
# Draw the decision boundary according to the formula
y_ = -(lr.theta[1] * x_points + lr.theta[0]) / lr.theta[2]
plt.plot(x_points, y_)
plt.show()




Meaning of the parameters of sklearn's LogisticRegression class:

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True,
          intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100,
          multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)

penalty='l2': string, 'l1' or 'l2', default 'l2'.

  • Specifies the norm used for the penalty (the regularization term). Only 'l2' is supported by the 'newton-cg', 'sag' and 'lbfgs' solvers.
  • If 'l2' is chosen, the solver can be any of 'liblinear', 'newton-cg', 'sag' and 'lbfgs'; if 'l1' is chosen, only 'liblinear' can be used.
  • If the main goal of tuning is to control overfitting, the default L2 regularization is usually sufficient. If the model still overfits with L2 (i.e. prediction remains poor), L1 regularization can be considered. L1 is also useful when the model has many features and unimportant feature coefficients should be driven to zero, making the coefficient vector sparse (see the sketch below).
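As a minimal sketch of how the two penalties are selected (the classifiers are constructed only; data and fitting are assumed to happen elsewhere):

from sklearn.linear_model import LogisticRegression

# default L2 penalty: any of the four solvers can be used
clf_l2 = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0)

# L1 penalty (drives unimportant coefficients to zero): requires the liblinear solver here
clf_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)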

dual=False: dual or primal formulation. The dual formulation only applies to 'liblinear' with the L2 penalty. When the number of samples is larger than the number of features, dual is usually left at its default of False.

C=1.0: C is the inverse of the regularization coefficient λ; it must be positive and defaults to 1. As with C in SVMs, the smaller the value, the stronger the regularization.

fit_intercept=True: whether an intercept is fitted.

intercept_scaling=1: only useful when the solver is 'liblinear' and fit_intercept is set to True.

solver='liblinear': the solver parameter decides which optimization method is used for the logistic regression loss function. There are four algorithms to choose from:

  • a) liblinear: implemented with the open-source liblinear library; internally it uses coordinate descent to iteratively optimize the loss function.
  • b) lbfgs: a quasi-Newton method that uses the second-derivative (Hessian) matrix of the loss function to iteratively optimize it.
  • c) newton-cg: also from the Newton family; it likewise uses the Hessian matrix of the loss function for iterative optimization.
  • d) sag: stochastic average gradient descent, a variant of gradient descent. Unlike ordinary gradient descent, each iteration uses only a subset of the samples to compute the gradient, which suits large sample sizes.

sag uses only part of the samples for each gradient step, so it should not be chosen when the sample size is small; but when the sample size is very large, say more than 100,000, sag is the first choice. However, sag cannot be used with L1 regularization, so with a large sample that also needs L1 regularization a trade-off has to be made: either reduce the sample size by subsampling, or fall back to L2 regularization.


Regularization, solver, and applicable scenario:

• L1 + liblinear: liblinear is suitable for small data sets. If L2 regularization still overfits (prediction remains poor), L1 can be considered; L1 is also useful when the model has many features and unimportant coefficients should be driven to zero to make the model sparse.
• L2 + liblinear: liblinear only supports one-vs-rest (OvR) for multiclass logistic regression, not MvM, even though MvM is generally more accurate.
• L2 + lbfgs/newton-cg/sag: large data sets; these solvers support both one-vs-rest (OvR) and many-vs-many (MvM) multiclass logistic regression.
• L2 + sag: if the sample size is very large, say more than 100,000, sag is the first choice; but it cannot be used with L1 regularization.

multi_class='ovr': classification mode, either 'ovr' or 'multinomial'. The scikit-learn website has an example comparing the two modes.

  • 'ovr' stands for one-vs-rest (OvR) and 'multinomial' stands for many-vs-many (MvM). For binary logistic regression the two are equivalent; the distinction only matters for multiclass problems.
  • OvR always reduces the problem to binary regressions: each class is fitted against all the others, no matter how many classes there are. MvM instead picks two classes at a time and fits a binary regression for that pair; with T classes this requires T(T−1)/2 classifiers.
  • OvR is relatively simple but can be slightly less accurate (for most sample distributions), while MvM is usually more accurate but slower than OvR.
  • With ovr, all four optimization methods (liblinear, newton-cg, lbfgs and sag) can be used; with multinomial, only newton-cg, lbfgs and sag are available.

class_weight=None: class weight parameter, used to set the weight of each class in the model. By default it is not set, i.e. all classes have the same weight.

  • Pass 'balanced' to compute class weights automatically from the y values.
  • Or set the weights yourself in the format {class_label: weight}. For example, for a binary 0/1 model, setting class_weight={0: 0.9, 1: 0.1} gives class 0 a weight of 90% and class 1 a weight of 10% (see the sketch below).
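A minimal sketch of both ways of setting class_weight (constructors only; fitting is assumed to happen elsewhere):

from sklearn.linear_model import LogisticRegression

# explicit weights: class 0 gets 90% of the weight, class 1 gets 10%
clf_manual = LogisticRegression(class_weight={0: 0.9, 1: 0.1})

# let sklearn derive weights automatically from the class frequencies in y
clf_balanced = LogisticRegression(class_weight='balanced')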

random_state=None: random seed, default None. It only has an effect when the solver is 'sag' or 'liblinear'.

max_iter=100: maximum number of iterations for the solver to converge.

tol=0.0001: tolerance for the stopping criterion of the iteration.

verbose=0: verbosity level (int). 0: do not print the training process; 1: print occasionally; >1: print for every sub-model.

warm_start=False: whether to warm-start, i.e. reuse the solution of the previous call as the initialization for the next fit. Boolean, False by default.

n_jobs=1: number of parallel jobs (int); -1 means use as many jobs as CPU cores; the default is 1.


Common methods of the LogisticRegression class

  • fit(X, y, sample_weight=None)
    • Fits the model, i.e. trains the LR classifier; X is the training data and y the corresponding label vector.
    • Returns the object itself (self).
  • fit_transform(X, y=None, **fit_params)
    • A combination of fit and transform: fit first, then transform. Returns X_new as a numpy array.
  • predict(X)
    • Predicts the class of each sample; X is the test set. Returns an array of labels.
  • predict_proba(X)
    • Outputs classification probabilities: the probability of each class, in class order. For a multiclass problem with multi_class='multinomial', the probability of the sample for every class is given.
    • Returns an array-like.
  • score(X, y, sample_weight=None)
    • Returns the mean accuracy on the given test set, as a single floating-point value.
    • For multiclass problems this is still a single value: the accuracy averaged over all test samples (see the usage sketch below).
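A short usage sketch of these methods (X_train, X_test, y_train, y_test are assumed to exist, e.g. from train_test_split):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)                 # train the classifier
labels = clf.predict(X_test)              # predicted class labels
probas = clf.predict_proba(X_test)        # per-class probabilities, one row per sample
accuracy = clf.score(X_test, y_test)      # mean accuracy, a single float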

Using logistic regression to solve multiple classification problems:

Using LogisticRegression directly, and using the wrappers in sklearn.multiclass:

#coding=gbk
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 
from sklearn.datasets import load_iris 
iris = load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
print(iris.target.shape)    # (150,): 3 classes
print(iris.data[:5, :])

X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=666)

lr_iris_ovr = LogisticRegression()      # multiclass is handled as 'ovr' by default
lr_iris_ovr.fit(X_train, y_train)
print(lr_iris_ovr.score(X_test, y_test))    # 0.9111111111111111

lr_iris_ovo = LogisticRegression(multi_class='multinomial', solver='newton-cg') # use the 'multinomial' (MvM) mode for multiclass
lr_iris_ovo.fit(X_train, y_train)
print(lr_iris_ovo.score(X_test, y_test)) # 1.0

# Use another method to solve the multi-classification problem
from sklearn.multiclass import OneVsOneClassifier # 'ovo' classification
from sklearn.multiclass import OneVsRestClassifier
ovr = OneVsRestClassifier(lr_iris_ovr)
ovr.fit(X_train, y_train)
print(ovr.score(X_test, y_test))    # 0.9111111111111111 the data obtained is the same as the above results using OVR

ovo = OneVsOneClassifier(lr_iris_ovo)   # the base estimator is passed into the wrapper
ovo.fit(X_train, y_train)
print(ovo.score(X_test, y_test))    # 1.0

Reference: Logistic regression for machine learning

Liu Jianping’s blog