Logistic regression is a machine learning method for binary (0 or 1) classification problems that estimates the probability of an event: for example, the probability that a user will buy a product, that a patient has a disease, or that an ad will be clicked.

An overview of this article is as follows:

  1. The prediction function.
  2. Constructing the loss function.
  3. Minimizing the loss function with gradient descent.
  4. Vectorization.
  5. Implementing logistic regression in Python.
  6. Polynomial features.
  7. Multi-class classification.
  8. Regularization.

Prediction function

Although its name contains the word "regression", logistic regression is actually a classification algorithm. It is used for binary classification problems (and, with an extra transformation, for multi-class problems, as described later). The prediction function is the sigmoid function:

\sigma(t) = \frac{1}{1 + e^{-t}}
We can draw it using Python code:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

# Plot the sigmoid over the interval [-10, 10].
x = np.linspace(-10, 10, 500)

plt.plot(x, sigmoid(x))
plt.show()

For a linear decision boundary, the boundary can be written as:

\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \theta^T x

The prediction function is then defined as:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

h_θ(x) is interpreted as the probability that the label is 1, so:

P(y=1 \mid x; \theta) = h_\theta(x), \qquad P(y=0 \mid x; \theta) = 1 - h_\theta(x)
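As a quick numeric sketch (the parameter vector and the sample below are made up for illustration), the prediction function can be evaluated directly:

import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

# Hypothetical parameters theta = [theta_0, theta_1, theta_2] and one sample x.
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.8, 2.0])   # leading 1 corresponds to the intercept term

z = theta.dot(x)                # theta^T x = -1 + 1.6 + 1.0 = 1.6
print(sigmoid(z))               # estimated P(y=1 | x), roughly 0.83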


Constructing the loss function

The per-sample loss function is defined as:

\mathrm{cost}(h_\theta(x), y) =
\begin{cases}
-\log(h_\theta(x)) & \text{if } y = 1 \\
-\log(1 - h_\theta(x)) & \text{if } y = 0
\end{cases}

Combining the two cases and averaging over the m training samples gives:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]
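As a small sketch (the labels and predicted probabilities below are made up), this loss can be evaluated directly:

import numpy as np

def cross_entropy(y, y_hat):
    # J(theta) = -(1/m) * sum(y*log(y_hat) + (1-y)*log(1-y_hat))
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 0])              # true labels
y_hat = np.array([0.9, 0.2, 0.7, 0.4])  # hypothetical predicted probabilities

print(cross_entropy(y, y_hat))      # small: predictions mostly agree with the labels
print(cross_entropy(y, 1 - y_hat))  # larger: predictions mostly disagree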


Minimizing the loss function with gradient descent

To find the minimum of the loss function, gradient descent is used, where α is the learning rate (step size):

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Taking the partial derivative gives:

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

The derivation uses the sigmoid identity g'(z) = g(z)(1 - g(z)); substituting it into the derivative of J(θ) and simplifying yields the expression above.

So the update rule is:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Because 1/m is a constant and α is also a constant, the two can be folded into a single learning rate, so the final expression is:

\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
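To make the update rule concrete, here is a minimal sketch of a single gradient-descent step written element by element (the data and learning rate are made up, and the 1/m factor is kept explicitly rather than folded into the learning rate):

import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

# Made-up training data; X_b already contains the leading column of ones.
X_b = np.array([[1., 0.5], [1., 1.5], [1., 3.0]])
y = np.array([0, 0, 1])

theta = np.zeros(X_b.shape[1])
eta = 0.1  # learning rate (alpha)

# One simultaneous update of every theta_j, following the rule above.
errors = sigmoid(X_b.dot(theta)) - y          # h_theta(x_i) - y_i for each sample
for j in range(len(theta)):
    theta[j] -= eta * np.sum(errors * X_b[:, j]) / len(y)

print(theta)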


Vectorization

Written in matrix form, the update after vectorization is:

\theta := \theta - \alpha \, \frac{1}{m} \, X_b^T \left( \sigma(X_b \theta) - y \right)

where X_b is the training matrix with a leading column of ones. (TODO: add the detailed derivation of the vectorized form.)
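As a small sanity check (with made-up numbers), the vectorized expression can be compared against the element-wise sum; both give the same gradient:

import numpy as np

def sigmoid(t):
    return 1. / (1. + np.exp(-t))

# Made-up data; X_b includes the leading column of ones.
X_b = np.array([[1., 0.5], [1., 1.5], [1., 3.0]])
y = np.array([0, 0, 1])
theta = np.array([0.1, -0.2])

# Element-wise gradient: (1/m) * sum_i (h(x_i) - y_i) * x_ij for each j.
errors = sigmoid(X_b.dot(theta)) - y
grad_loop = np.array([np.sum(errors * X_b[:, j]) for j in range(X_b.shape[1])]) / len(y)

# Vectorized gradient: (1/m) * X_b^T (sigmoid(X_b theta) - y).
grad_vec = X_b.T.dot(errors) / len(y)

print(np.allclose(grad_loop, grad_vec))  # True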

Implementing logistic regression in Python

The following is a logistic regression algorithm implemented in Python, using gradient descent to minimize the loss function.

import numpy as np

class LogisticRegression:
    """A hand-rolled logistic regression trained with gradient descent."""

    def __init__(self):
        self.coef_ = None        # feature weights (theta_1 ... theta_n)
        self.intercept_ = None   # bias term (theta_0)
        self._theta = None

    def _sigmoid(self, t):
        return 1. / (1. + np.exp(-t))

    def fit(self, X_train, y_train, eta=0.01, n_iters=1e4):
        # Prepend a column of ones so that theta_0 acts as the intercept.
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1])
        # gradient_descent (shown in the next snippet) minimizes the loss.
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

The gradient-descent code is as follows (in the full implementation, J, dJ and gradient_descent are nested inside fit, which is why they can call self._sigmoid):

def J(theta, X_b, y):
    # Cross-entropy loss for the current parameters.
    y_hat = self._sigmoid(X_b.dot(theta))
    try:
        return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / len(y)
    except:
        return float('inf')

def dJ(theta, X_b, y):
    # Vectorized gradient: (1/m) * X_b^T (sigmoid(X_b theta) - y).
    return X_b.T.dot(self._sigmoid(X_b.dot(theta)) - y) / len(y)

def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    cur_iter = 0

    while cur_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        # Stop early once the loss barely changes between iterations.
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break

        cur_iter += 1

    return theta

The predict and score methods (also part of the LogisticRegression class) are as follows:

def predict(self, X_predict):
    assert self.intercept_ is not None and self.coef_ is not None, \
        "must fit before predict!"
    assert X_predict.shape[1] == len(self.coef_), \
        "the feature number of X_predict must be equal to X_train"

    X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
    proba = self._sigmoid(X_b.dot(self._theta))
    # Classify as 1 when the estimated probability is at least 0.5.
    return np.array(proba >= 0.5, dtype='int')

def score(self, X_test, y_test):
    y_predict = self.predict(X_test)
    assert len(y_test) == len(y_predict), \
               "the size of y_test must be equal to the size of y_predict"

    return np.sum(y_test == y_predict) / len(y_test)
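Assuming the snippets above are assembled into a working LogisticRegression class, a quick sanity check (a sketch, not part of the original code) is to train it on two classes of the iris dataset:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Use two classes of the iris dataset as a simple binary problem.
iris = datasets.load_iris()
X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

log_reg = LogisticRegression()   # the hand-rolled class above, not sklearn's
log_reg.fit(X_train, y_train)
print(log_reg.score(X_test, y_test))
print(log_reg.intercept_, log_reg.coef_)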

Polynomial features

Everything above assumes that the decision boundary is a straight line, but in many cases the sample points are not linearly separable. We can handle such data by introducing polynomial features.

First, we simulate a data set with a nonlinear (circular) boundary:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
X = np.random.normal(0, 1, size=(200, 2))
# Label is 1 inside the circle x1^2 + x2^2 < 1.5, and 0 outside it.
y = np.array(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5, dtype='int')

plt.scatter(X[y == 0, 0], X[y == 0, 1])
plt.scatter(X[y == 1, 0], X[y == 1, 1])
plt.show()
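Before adding polynomial features, it is worth seeing how a plain linear-boundary logistic regression behaves on this data. A minimal sketch, reusing the X and y generated above (the exact score will vary, but it should stay close to chance level, since no straight line separates the inside of a circle from the outside):

from sklearn.linear_model import LogisticRegression

# Fit a plain (linear-boundary) logistic regression on the circular data above.
log_reg = LogisticRegression()
log_reg.fit(X, y)
print(log_reg.score(X, y))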

For such a data set, a good fit can only be obtained by adding polynomial features, as follows:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomialLogisticRegression(degree):
    return Pipeline([
        # Add polynomial features to the samples;
        ('poly', PolynomialFeatures(degree=degree)),
        # Standardize the data;
        ('std_scaler', StandardScaler()),
        ('log_reg', LogisticRegression())
    ])

poly_log_reg = PolynomialLogisticRegression(degree=2)
poly_log_reg.fit(X, y)

plot_decision_boundary(poly_log_reg, axis=[-4, 4, -4, 4])
plt.scatter(X[y == 0, 0], X[y == 0, 1])
plt.scatter(X[y == 1, 0], X[y == 1, 1])
plt.show()

The resulting decision boundary is roughly circular and matches the data. The plot_decision_boundary helper used above is defined as follows:

def plot_decision_boundary(model, axis):
    # Build a dense grid over the plotting area.
    x0, x1 = np.meshgrid(
        np.linspace(axis[0], axis[1], int((axis[1] - axis[0]) * 100)).reshape(-1, 1),
        np.linspace(axis[2], axis[3], int((axis[3] - axis[2]) * 100)).reshape(-1, 1)
    )
    X_new = np.c_[x0.ravel(), x1.ravel()]

    # Predict the class of every grid point and color the regions accordingly.
    y_predict = model.predict(X_new)
    zz = y_predict.reshape(x0.shape)

    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])

    plt.contourf(x0, x1, zz, cmap=custom_cmap)

Multi-class classification

As mentioned above, logistic regression by itself only solves binary problems; handling multi-class problems requires an extra transformation. There are two common approaches:

  1. OvR (One vs Rest).
  2. OvO (One vs One).

Both methods apply not only to logistic regression but to any binary classifier, turning a binary algorithm into a multi-class one.

OVR

With n classes of samples, one class is treated as the positive class and the remaining n-1 classes as the negative class, which turns the problem into n binary classifications and yields n models (with 4 classes, for example, there are 4 models in the end). A sample to be predicted is fed into each of the n models, and the class whose model reports the highest probability is the prediction result.
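To make the procedure concrete, here is a minimal hand-rolled sketch of OvR (not the article's code), using sklearn's LogisticRegression as the binary base model; ovr_fit and ovr_predict are hypothetical helper names:

import numpy as np
from sklearn.linear_model import LogisticRegression

def ovr_fit(X, y):
    # Train one binary classifier per class: "this class" vs "all the rest".
    models = {}
    for c in np.unique(y):
        clf = LogisticRegression()
        clf.fit(X, (y == c).astype(int))
        models[c] = clf
    return models

def ovr_predict(models, X):
    classes = list(models)
    # Probability of the "positive" class reported by each binary model.
    probas = np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
    # Pick the class whose model is most confident.
    return np.array(classes)[np.argmax(probas, axis=1)]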

For multi-class problems, sklearn's logistic regression uses the OvR approach by default. sklearn also provides a generic wrapper that applies OvR to any binary classifier:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
ovr = OneVsRestClassifier(log_reg)
ovr.fit(X_train, y_train)
ovr.score(X_test, y_test)

OVO

With n classes of samples, two classes are selected at a time, giving C(n, 2) = n(n-1)/2 binary classification problems and therefore n(n-1)/2 models. A sample to be predicted is fed into all of them, producing n(n-1)/2 results, and the class that appears most often among these results is the final prediction.

In sklearn's logistic regression implementation, OvR is used by default for multi-class problems. To use something other than OvR you need to set the multi_class parameter (sklearn offers a multinomial/softmax formulation rather than literal pairwise OvO) and change the solver accordingly (P.S.: sklearn does not minimize the loss with the plain gradient descent described above, but with solvers such as newton-cg):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

# Multinomial formulation; newton-cg is one of the solvers that supports it.
log_reg2 = LogisticRegression(multi_class="multinomial", solver="newton-cg")
log_reg2.fit(X_train, y_train)
log_reg2.score(X_test, y_test)
from sklearn.multiclass import OneVsRestClassifier

# log_reg is the LogisticRegression instance created in the OvR section above.
ovr = OneVsRestClassifier(log_reg)
ovr.fit(X_train, y_train)
ovr.score(X_test, y_test)

from sklearn.multiclass import OneVsOneClassifier

ovo = OneVsOneClassifier(log_reg)
ovo.fit(X_train, y_train)
ovo.score(X_test, y_test)

Regularization

The usual form of a regularized objective is:

J(\theta) + \alpha L_2 \qquad \text{or} \qquad J(\theta) + \alpha L_1

The hyperparameter can also be moved onto the loss term, which is the form sklearn generally uses:

C \cdot J(\theta) + L_2 \qquad \text{or} \qquad C \cdot J(\theta) + L_1

With a larger hyperparameter C, the original loss function J(θ) carries relatively more weight; with a very small C, the regularization term dominates. So if you want the regularization term to matter less, increase C.
sklearn's logistic regression has regularization built in; you only need to tune C and penalty (which selects L1 or L2 regularization).
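As a sketch of how these two parameters are used (the dataset, the C value and the solver choice below are only for illustration):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=666)

# Smaller C means stronger regularization; penalty selects the regularization term.
log_reg_l2 = LogisticRegression(C=0.1, penalty='l2')
log_reg_l2.fit(X_train, y_train)
print(log_reg_l2.score(X_test, y_test))

# L1 regularization requires a solver that supports it, such as 'liblinear'.
log_reg_l1 = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
log_reg_l1.fit(X_train, y_train)
print(log_reg_l1.score(X_test, y_test))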