
1. Introduction to logistic regression

1.1 Application Scenarios

  • Click through rate
  • Whether it is spam
  • Whether sick
  • Financial fraud determination
  • False account determination

1.2 Principle of logistic regression

What are the input values, and how is the output determined?

1.2.1 Input

$$h(\theta) = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n + b = \sum_{i=1}^{n} \theta_i x_i + b = \theta^T x$$

The input of logistic regression is the output of a linear regression. (In the last form, the bias b is absorbed into \theta by adding a constant feature equal to 1.)

1.2.2 Activation function

  • The sigmoid function:

    $$g(\theta^T x)=\frac{1}{1+e^{-\theta^T x}}$$

  • The decision rule

    • The result of the linear regression is fed into the sigmoid function
    • Output: a probability value between 0 and 1. The default threshold is 0.5: a probability greater than the threshold is classified as 1 (positive example), and a probability less than the threshold as 0 (negative example)

Logistic regression makes its final classification from the probability of belonging to one particular category. By default that category is marked as 1 (positive example) and the other category as 0 (negative example), which is convenient for computing the loss.

Interpretation of the output (important): suppose there are two categories, A and B, and the probability value is the probability of belonging to category A (1). If a sample fed into logistic regression produces an output of 0.6, the probability is greater than 0.5, so the training or prediction result is category A (1). If instead the output is 0.3, the result is category B (0).
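The path from linear output to class label can be sketched in a few lines of plain Python. This is only an illustration: the weights `theta`, the sample `x`, and the helper name `predict_label` are made up for the example and do not come from any library.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(theta, x, b=0.0, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = sum(t * xi for t, xi in zip(theta, x)) + b  # linear regression output
    p = sigmoid(z)                                  # probability of class 1
    return p, 1 if p > threshold else 0

# Hypothetical weights and sample, chosen only for illustration
theta = [0.8, -0.4]
x = [1.0, 1.0]
p, label = predict_label(theta, x)
print(round(p, 3), label)
```

Raising the threshold makes the model more reluctant to predict class 1: with `threshold=0.7` the same sample would be labelled 0.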

Recall that a linear regression prediction is measured by the mean squared error. How, then, do we measure the loss when logistic regression gets a prediction wrong? In other words, how do we measure the difference between the predicted result of logistic regression and the real result?

1.3 Loss and optimization

1.3.1 loss

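The loss of logistic regression is called the log-likelihood loss. It is defined as follows:

  • By category (the standard piecewise form, where h_\theta(x) = g(\theta^T x) is the predicted probability of class 1):

    $$cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

How should the piecewise form be understood? It follows from the graph of the log function: when y = 1, the loss shrinks toward 0 as the predicted probability approaches 1 and grows without bound as it approaches 0; the y = 0 case is the mirror image.

  • Combined into the complete loss function:

    $$cost(h_\theta(x), y) = \sum_{i=1}^{m} \left[ -y_i \log(h_\theta(x_i)) - (1 - y_i) \log(1 - h_\theta(x_i)) \right]$$

Plugging an example into this formula shows what it means.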
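As a rough numerical check, the sketch below evaluates the log-likelihood loss both per sample and in its combined form. The labels and probabilities are invented for illustration, and the function names are hypothetical helpers, not library calls.

```python
import math

def log_loss_single(y_true, p):
    """Piecewise form: -log(p) if y = 1, -log(1 - p) if y = 0."""
    return -math.log(p) if y_true == 1 else -math.log(1.0 - p)

def log_loss_total(y_true, probs):
    """Combined form: sum of -y*log(p) - (1 - y)*log(1 - p) over all samples."""
    return sum(-y * math.log(p) - (1 - y) * math.log(1.0 - p)
               for y, p in zip(y_true, probs))

# Hypothetical true labels and predicted probabilities of class 1
y_true = [1, 0, 1, 0]
probs = [0.6, 0.3, 0.9, 0.8]

total = log_loss_total(y_true, probs)
per_sample = [log_loss_single(y, p) for y, p in zip(y_true, probs)]
print(per_sample, total)
```

The last sample (true class 0, predicted 0.8) contributes the largest loss, which is the behaviour the log curve is meant to produce: confident wrong predictions are penalized most.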

1.3.2 optimization

Gradient descent is again used to reduce the value of the loss function. It updates the weight parameters of the linear model that feeds logistic regression, raising the predicted probability for samples that truly belong to category 1 and lowering it for samples that truly belong to category 0.
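A minimal sketch of this optimization, assuming plain batch gradient descent on the mean log-likelihood loss (scikit-learn's solvers are more sophisticated). The data set, learning rate, and helper names are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(theta, b, X, y):
    """Average log-likelihood loss over the data set."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)) + b)
        total += -yi * math.log(p) - (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def gradient_step(theta, b, X, y, lr=0.1):
    """One batch gradient descent update of the weights and the bias."""
    n = len(X)
    grad_theta = [0.0] * len(theta)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)) + b)
        err = p - yi  # derivative of the loss with respect to the linear output
        for j, v in enumerate(xi):
            grad_theta[j] += err * v / n
        grad_b += err / n
    new_theta = [t - lr * g for t, g in zip(theta, grad_theta)]
    return new_theta, b - lr * grad_b

# Hypothetical one-feature data set: larger x tends to mean class 1
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

theta, b = [0.0], 0.0
loss_before = mean_log_loss(theta, b, X, y)
for _ in range(200):
    theta, b = gradient_step(theta, b, X, y)
loss_after = mean_log_loss(theta, b, X, y)
print(loss_before, loss_after)
```

With all parameters at zero every prediction is 0.5, so the starting loss is log 2; each update then moves the weights in the direction that lowers the loss.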

2. Introduction to logistic regression API

  • sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
    • solver: one of {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}
      • Default: 'liblinear' in older scikit-learn releases (since version 0.22 the default is 'lbfgs'). 'liblinear' uses coordinate descent internally to iteratively minimize the loss function.
      • 'liblinear' is a good choice for small data sets, while 'sag' and 'saga' are faster for large data sets.
      • For multi-class problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' can handle the multinomial loss; 'liblinear' is limited to one-versus-rest schemes.
    • penalty: the type of regularization
    • C: the inverse of the regularization strength; smaller values mean stronger regularization

By convention, the category with fewer samples is treated as the positive example.

LogisticRegression is comparable to SGDClassifier(loss="log", penalty=" "), which implements ordinary stochastic gradient descent learning (the loss is named "log_loss" in recent scikit-learn releases). LogisticRegression can instead use stochastic average gradient descent (SAG) through solver='sag'.

3. Case: Classification and prediction of cancer

  • Data introduction

Data description

(1) The data set contains 699 samples and 11 columns. The first column is the sample id, the next 9 columns are tumor-related medical features, and the last column is the tumor type.

(2) It contains 16 missing values, marked with "?".

3.1 Analysis

1. Data acquisition
2. Basic data processing
  2.1 Missing value handling
  2.2 Determining the feature values and target values
  2.3 Splitting the data
3. Feature engineering (standardization)
4. Machine learning (logistic regression)
5. Model evaluation

3.2 Code

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

import ssl

ssl._create_default_https_context = ssl._create_unverified_context

# Fetch data
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
         'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size',
         'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
    names=names)
data.head()

# Basic data processing
# Missing value handling
data = data.replace(to_replace="?", value=np.nan)  # np.NaN was removed in NumPy 2.0
data = data.dropna()
# Determine the eigenvalue, the target value
x = data.iloc[:, 1: 10]
x.head()
y = data["Class"]
y.head()
# Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

# Feature Engineering (standardization)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# Machine learning: Logistic regression
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

# Model evaluation
y_predict = estimator.predict(x_test)
print("y_predict:")
print(y_predict)
print(estimator.score(x_test, y_test))

In many classification scenarios, accuracy is not the only thing we care about.

Take this cancer example: what matters is not the overall accuracy of the predictions but whether all the cancer patients in the sample were detected.

4. Classification assessment methods

4.1 Classification assessment method

4.1.1 Precision and recall

  1. Confusion matrix

    In a classification task, the four combinations of Predicted Condition and True Condition form the confusion matrix (the idea also extends to multi-class problems).

  • TP (True Positive): the number of positive samples correctly predicted as positive
  • FN (False Negative): the number of positive samples incorrectly predicted as negative
  • FP (False Positive): the number of negative samples incorrectly predicted as positive
  • TN (True Negative): the number of negative samples correctly predicted as negative

  2. Precision and recall

    • Accuracy: the percentage of all samples that are predicted correctly

      $$Accuracy=\frac{TP+TN}{TP+TN+FN+FP}$$

    • Precision: the percentage of samples predicted as positive that are truly positive

      $$Precision=\frac{TP}{TP+FP}$$

    • Recall: the percentage of truly positive samples that are predicted as positive (completeness, the ability to find the positive samples)

      $$Recall=\frac{TP}{TP+FN}$$

4.1.2 F1-score

Another evaluation criterion is the F1-score, which reflects the robustness of the model:

$$F1=\frac{2TP}{2TP+FN+FP}=\frac{2\cdot Precision\cdot Recall}{Precision+Recall}$$
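To make the four counts and the derived metrics concrete, here is a small sketch in plain Python. The label vectors are invented, and `confusion_counts` and `precision_recall_f1` are hypothetical helpers written for this example rather than scikit-learn functions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary problem."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def precision_recall_f1(tp, fn, fp):
    """Apply the formulas above, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fn + fp) if tp else 0.0
    return precision, recall, f1

# Hypothetical labels: 4 positive and 6 negative samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fn + fp)
precision, recall, f1 = precision_recall_f1(tp, fn, fp)
print(tp, fn, fp, tn, accuracy, precision, recall, f1)
```

Here one positive sample is missed and one negative sample is flagged, so precision and recall are both 3/4 while accuracy is 8/10.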

4.1.3 Classification evaluation Report API

  • sklearn.metrics.classification_report(y_true, y_pred, labels=[], target_names=None)
    • y_true: the true target values
    • y_pred: the target values predicted by the estimator
    • labels: the numeric labels of the categories to report
    • target_names: the display names of the target categories
    • return: the precision, recall and F1-score of each category
from sklearn.metrics import classification_report

# In this data set, class 2 is benign and class 4 is malignant
ret = classification_report(y_test, y_predict, labels=(2, 4), target_names=("Benign", "Malignant"))
print(ret)

Suppose 99 of the samples are cancer and 1 is non-cancer. If the model simply predicts every sample as positive (cancer), the accuracy is 99%, yet the model is useless because it cannot tell the two classes apart. This is the problem of sample imbalance.

Question: how should a model be evaluated under sample imbalance?

4.2 ROC curve and AUC indicators

4.2.1 TPR and FPR

  • TPR=TP/(TP+FN)
    • The percentage of all samples of true category 1 that are predicted to be category 1
  • FPR=FP/(FP+TN)
    • The percentage of all samples with a true category of 0 that are predicted to be category 1
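Applying these two definitions to the imbalanced example (99 cancer samples, 1 non-cancer sample, everything predicted as positive) shows what accuracy hides. The sketch below uses plain Python; `tpr_fpr` is a hypothetical helper and assumes both classes are present in `y_true`.

```python
def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR); assumes y_true contains both 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# 99 positive (cancer) samples and 1 negative sample, all predicted positive
y_true = [1] * 99 + [0]
y_pred = [1] * 100

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
tpr, fpr = tpr_fpr(y_true, y_pred)
print(accuracy, tpr, fpr)
```

Accuracy is 0.99, but FPR is 1.0: every negative sample is misclassified, which is exactly the failure that TPR and FPR together expose.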

4.2.2 ROC curve

  • The horizontal axis of the ROC curve is the FPR and the vertical axis is the TPR. When the two are equal (the diagonal), the classifier predicts 1 with the same probability whether the true category of a sample is 1 or 0; in this case the AUC is 0.5.

4.2.3 AUC indicators

  • Probabilistic meaning: AUC is the probability that, for a randomly selected pair of one positive and one negative sample, the score of the positive sample is higher than that of the negative sample.
  • Geometric meaning: AUC is the integral of the ROC curve, that is, the area under the ROC curve.
  • AUC takes values in [0, 1]. The closer it is to 1, the better; the closer it is to 0.5, the closer the model is to random guessing.
  • AUC = 1: a perfect classifier. With this model, a perfect prediction is obtained no matter which threshold is set. In the vast majority of prediction tasks there is no perfect classifier.
  • 0.5 < AUC < 1: better than random guessing. The classifier (model) has predictive value if the threshold is set properly.

In practice the AUC of a useful classifier falls in [0.5, 1] (a classifier with an AUC below 0.5 can be improved by inverting its predictions), and the closer it is to 1, the better.
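The probabilistic meaning can be computed directly by enumerating every positive-negative pair. The sketch below is a plain Python illustration with invented scores; counting a tie as half a correct pair is a common convention assumed here, and `pairwise_auc` is a hypothetical helper, not a library function.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]

auc = pairwise_auc(y_true, scores)
print(auc)
```

Of the 2 × 3 = 6 pairs, only the pair (0.7 positive, 0.8 negative) is ranked the wrong way round, so the result is 5/6.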

4.2.4 AUC computing API

  • from sklearn.metrics import roc_auc_score
    • sklearn.metrics.roc_auc_score(y_true, y_score)
      • Computes the area under the ROC curve, i.e. the AUC value
      • y_true: the true category of each sample, which must be marked 0 (negative example) or 1 (positive example)
      • y_score: the predicted score, which can be the estimated probability of the positive class, a confidence value, or the return value of a classifier method
from sklearn.metrics import roc_auc_score

# Map the class labels to 0/1: 4 (malignant) becomes 1, 2 (benign) becomes 0
y_test = np.where(y_test > 2.5, 1, 0)

# The AUC lies between 0.5 and 1; the closer it is to 1, the better
print("AUC indicator:", roc_auc_score(y_test, y_predict))

4.3 Summary

  • AUC can only be used to evaluate binary classifiers
  • AUC is well suited to evaluating classifier performance under sample imbalance

5. ROC curve drawing

The process of drawing an ROC curve is illustrated by the following examples.

Suppose there are six display records, two of which were clicked, giving the sequence (1:1, 2:0, 3:1, 4:0, 5:0, 6:0). The number before the colon is the serial number and the number after it indicates a click (1) or no click (0).

A model then computes the click probability for each of the six displays.

Let's look at three scenarios.

5.1 Curve Drawing

  1. Suppose the probability sequence is (1:0.9, 2:0.7, 3:0.8, 4:0.6, 5:0.5, 6:0.4)

    Combined with the original sequence, this gives the following (ranked from high to low probability):

    Serial number           1    3    2    4    5    6
    Clicked                 1    1    0    0    0    0
    Predicted probability   0.9  0.8  0.7  0.6  0.5  0.4

    The steps for drawing are:

    1) Sort the probability sequence from high to low, giving (1:0.9, 3:0.8, 2:0.7, 4:0.6, 5:0.5, 6:0.4);

    2) Starting from the highest probability, take one point as the positive class. Taking point 1 gives TPR=0.5, FPR=0.0;

    3) Take the next point as the positive class as well. Taking point 3 gives TPR=1.0, FPR=0.0;

    4) Take the next point as the positive class as well. Taking point 2 gives TPR=1.0, FPR=0.25;

    5) Continuing in the same way, 6 pairs of TPR and FPR are obtained.

    These 6 pairs give six points (FPR, TPR): (0, 0.5), (0, 1.0), (0.25, 1), (0.5, 1), (0.75, 1), (1.0, 1.0).

    These six points can be plotted in two-dimensional coordinates.

  2. Suppose the probability sequence is (1:0.9, 2:0.8, 3:0.7, 4:0.6, 5:0.5, 6:0.4)

    Combined with the original sequence, this gives the following (ranked from high to low probability):

    Serial number           1    2    3    4    5    6
    Clicked                 1    0    1    0    0    0
    Predicted probability   0.9  0.8  0.7  0.6  0.5  0.4

    The steps for drawing are:

    1) Sort the probability sequence from high to low, giving (1:0.9, 2:0.8, 3:0.7, 4:0.6, 5:0.5, 6:0.4);

    2) Starting from the highest probability, take one point as the positive class. Taking point 1 gives TPR=0.5, FPR=0.0;

    3) Take the next point as the positive class as well. Taking point 2 gives TPR=0.5, FPR=0.25;

    4) Take the next point as the positive class as well. Taking point 3 gives TPR=1.0, FPR=0.25;

    5) Continuing in the same way, 6 pairs of TPR and FPR are obtained.

    These 6 pairs give six points (FPR, TPR): (0, 0.5), (0.25, 0.5), (0.25, 1), (0.5, 1), (0.75, 1), (1.0, 1.0).

    These six points can be plotted in two-dimensional coordinates.

  3. Suppose the probability sequence is (1:0.4, 2:0.6, 3:0.5, 4:0.7, 5:0.8, 6:0.9)

    Combined with the original sequence, this gives the following (ranked from high to low probability):

    Serial number           6    5    4    2    3    1
    Clicked                 0    0    0    0    1    1
    Predicted probability   0.9  0.8  0.7  0.6  0.5  0.4

    The steps for drawing are:

    1) Sort the probability sequence from high to low, giving (6:0.9, 5:0.8, 4:0.7, 2:0.6, 3:0.5, 1:0.4);

    2) Starting from the highest probability, take one point as the positive class. Taking point 6 gives TPR=0.0, FPR=0.25;

    3) Take the next point as the positive class as well. Taking point 5 gives TPR=0.0, FPR=0.5;

    4) Take the next point as the positive class as well. Taking point 4 gives TPR=0.0, FPR=0.75;

    5) Continuing in the same way, 6 pairs of TPR and FPR are obtained.

    These 6 pairs give six points (FPR, TPR): (0.25, 0.0), (0.5, 0.0), (0.75, 0.0), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0).

    These six points can be plotted in two-dimensional coordinates.
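The drawing steps can be reproduced with a short plain Python sketch. It walks down the samples from the highest to the lowest predicted probability and records one (FPR, TPR) point per step; `roc_points` and the dictionary names are hypothetical, chosen for this illustration.

```python
def roc_points(clicked, prob):
    """Return one (FPR, TPR) point per sample, taken from highest to lowest probability."""
    n_pos = sum(clicked.values())
    n_neg = len(clicked) - n_pos
    order = sorted(prob, key=prob.get, reverse=True)  # serial numbers, best first
    tp = fp = 0
    points = []
    for idx in order:
        if clicked[idx] == 1:
            tp += 1   # a clicked record is now counted as a true positive
        else:
            fp += 1   # an unclicked record is now counted as a false positive
        points.append((fp / n_neg, tp / n_pos))
    return points

# Serial number -> clicked (1) or not clicked (0), as in the example
clicked = {1: 1, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0}

# Serial number -> predicted click probability for the three scenarios
scenario_1 = {1: 0.9, 2: 0.7, 3: 0.8, 4: 0.6, 5: 0.5, 6: 0.4}
scenario_2 = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.5, 6: 0.4}
scenario_3 = {1: 0.4, 2: 0.6, 3: 0.5, 4: 0.7, 5: 0.8, 6: 0.9}

points_1 = roc_points(clicked, scenario_1)
points_2 = roc_points(clicked, scenario_2)
points_3 = roc_points(clicked, scenario_3)
for points in (points_1, points_2, points_3):
    print(points)
```

The sketch assumes there are no tied probabilities; with ties, the tied samples would normally be added together before a point is recorded.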

5.2 Explanation of Meaning

In the example above there are 6 points in total, 2 positive samples and 4 negative samples, so there are 2 × 4 = 8 ways to pick one positive and one negative sample.

In the first scenario, every positive sample has a higher probability than every negative sample, so the probability that a pair is ranked correctly is 1 and AUC = 1. The integral of that ROC curve is also 1, so the integral of the ROC curve equals the AUC.

In the second scenario, only the pair formed by samples 2 and 3 is ranked wrongly; the other 7 pairs are ranked correctly. The probability that a pair is ranked correctly is therefore 7/8 = 0.875, and AUC = 0.875. The integral of the ROC curve is also 0.875, so again the integral of the ROC curve equals the AUC.

In the third scenario, every pair is ranked wrongly no matter which one is picked, so the probability of a correct ranking is 0 and AUC = 0.0. The integral of the ROC curve is also 0.0, so the integral of the ROC curve equals the AUC.
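The three integrals can be checked numerically from the (FPR, TPR) points listed for each scenario. This is a minimal sketch using the trapezoid rule in plain Python; `area_under_curve` is a hypothetical helper, and the origin (0, 0) is prepended as the starting point of every curve.

```python
def area_under_curve(points):
    """Trapezoid-rule area under a curve given as (FPR, TPR) points sorted by FPR."""
    pts = [(0.0, 0.0)] + list(points)  # every ROC curve starts at the origin
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# (FPR, TPR) points of the three scenarios, copied from the drawing steps
scenario_1 = [(0, 0.5), (0, 1.0), (0.25, 1), (0.5, 1), (0.75, 1), (1.0, 1.0)]
scenario_2 = [(0, 0.5), (0.25, 0.5), (0.25, 1), (0.5, 1), (0.75, 1), (1.0, 1.0)]
scenario_3 = [(0.25, 0.0), (0.5, 0.0), (0.75, 0.0), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]

auc_1 = area_under_curve(scenario_1)
auc_2 = area_under_curve(scenario_2)
auc_3 = area_under_curve(scenario_3)
print(auc_1, auc_2, auc_3)
```

The areas come out as 1.0, 0.875 and 0.0, matching the pair-counting argument for each scenario.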

AUC stands for Area Under the ROC Curve.

The meaning of the drawing process is now clear: it progressively deducts the possibility of misranking. Moving down from the point with the highest probability, each negative sample is ranked above all the positive samples that come after it, so those pairs are misranked and the proportion of remaining positive samples (1 - TPR) must be deducted. Once the whole ROC curve is drawn, the AUC is determined, and with it the probability that a positive-negative pair is ranked correctly.