Logistic regression, despite having "regression" in its name, is actually a classification method. It mainly solves binary classification problems; it can also be extended to multi-class problems, but the binary case is the more common one.

Application Scenarios:

  • Click-through rate prediction
  • Spam detection
  • Disease diagnosis
  • Financial fraud detection
  • Fake account detection

Principle

Logistic regression relies on the logistic function, also known as the sigmoid function, which is itself one of the most frequently used functions in deep learning. Its formula is:

g(z) = 1 / (1 + e^(-z))

Plotted, the function has a characteristic S shape.
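As a quick sketch (plain NumPy/matplotlib, nothing specific to this article), the sigmoid can be defined and plotted in a few lines:

import numpy as np
from matplotlib import pyplot as plt

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.xlabel('z')
plt.ylabel('g(z)')
plt.title('Sigmoid function')
plt.show()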

In sklearn, we build a logistic regression classifier with the LogisticRegression() constructor, which has several commonly used parameters:

penalty: the regularization term, either 'l1' or 'l2'; the default is 'l2'. Use L2 when the model parameters are assumed to follow a Gaussian distribution, and L1 when they are assumed to follow a Laplace distribution.

solver: the optimization method for the logistic regression loss function. There are five options: liblinear, lbfgs, newton-cg, sag and saga. The default is liblinear, which suits small data sets; sag or saga work better on large data sets.

max_iter: the maximum number of iterations allowed for the algorithm to converge. The default is 100.

n_jobs: the number of CPU cores used during fitting and prediction. The default is 1; it can be any positive integer, and -1 means use all available cores.
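A minimal sketch of how these parameters are passed; the specific values below are illustrative, not tuned recommendations:

from sklearn.linear_model import LogisticRegression

# Small data set: liblinear (the default here) with the L2 penalty
clf_small = LogisticRegression(penalty='l2', solver='liblinear', max_iter=100)

# Large data set: saga scales better and also supports the L1 penalty
clf_large = LogisticRegression(penalty='l1', solver='saga', max_iter=1000, n_jobs=-1)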

Model Evaluation Metrics

Accuracy, for example, is a poor metric for assessing a terrorist-detection model at airport security. The percentage of terrorists is extremely low, so if a model reports 99.999% accuracy, is it necessarily a good model?

In fact, because the proportion of terrorists in real life is so low, a model can achieve very high accuracy without identifying a single terrorist. Accuracy measures the proportion of correctly classified samples among all samples; since non-terrorists overwhelmingly dominate the data, classifying everyone as a non-terrorist already yields high accuracy. What we should really focus on is whether the terrorists are identified.
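A quick numeric sketch of this trap, with made-up numbers: a classifier that labels every passenger "not a terrorist" scores near-perfect accuracy but zero recall.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 100,000 passengers, 10 of them terrorists
y_true = np.zeros(100_000, dtype=int)
y_true[:10] = 1
y_pred = np.zeros(100_000, dtype=int)  # always predicts "not a terrorist"

print(accuracy_score(y_true, y_pred))  # 0.9999 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0 -- not a single terrorist caught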

Data predictions fall into four cases: TP, FP, TN and FN. The second letter, P or N, indicates whether the prediction is positive or negative. The first letter, T or F, indicates whether the prediction is correct: T for true (correct), F for false (wrong).

So the four cases are:

  1. TP: predicted positive, and the prediction is correct (actually positive);
  2. FP: predicted positive, but the prediction is wrong (actually negative);
  3. TN: predicted negative, and the prediction is correct (actually negative);
  4. FN: predicted negative, but the prediction is wrong (actually positive).
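In sklearn, these four counts can be read off confusion_matrix; for binary labels 0/1 it returns [[TN, FP], [FN, TP]]. A tiny illustrative example with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2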

Precision P = TP / (TP + FP). In the terrorist example above, this is the proportion of people judged to be terrorists who actually are terrorists.

Recall R = TP / (TP + FN), also known as sensitivity. It is the ratio of the number of terrorists correctly identified to the total number of terrorists.

There is a metric that combines precision and recall to give a better overall assessment of model quality. It is called the F1 score and is given by the formula:

F1 = 2 * P * R / (P + R)

F1 is the harmonic mean of precision P and recall R; the larger its value, the better the model.
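A minimal arithmetic sketch of the formula, using made-up counts:

tp, fp, fn = 8, 2, 4        # hypothetical confusion-matrix counts
p = tp / (tp + fp)          # precision = 0.8
r = tp / (tp + fn)          # recall = 8 / 12, about 0.667
f1 = 2 * p * r / (p + r)    # harmonic mean, about 0.727
print(f1)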

Example

Credit Card Fraud Analysis

import itertools

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_curve


class CreditFraud:
    # Plot the confusion matrix
    def plot_confusion_matrix(self, cm, classes, normalize=False,
                              title='Confusion matrix', cmap=plt.cm.Blues):
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.figure()
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=0)
        plt.yticks(tick_marks, classes)
        thresh = cm.max() / 2
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j], horizontalalignment='center',
                     color='white' if cm[i, j] > thresh else 'black')
        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
        plt.show()

    # Print precision, recall and F1 computed from the confusion matrix
    def show_metrics(self, cm):
        tp = cm[1, 1]
        fn = cm[1, 0]
        fp = cm[0, 1]
        tn = cm[0, 0]
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        print('Precision: {:.3f}'.format(precision))
        print('Recall: {:.3f}'.format(recall))
        print('F1: {:.3f}'.format(2 * precision * recall / (precision + recall)))

    # Plot the precision-recall curve
    def plot_precision_recall(self, recall, precision):
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.step(recall, precision, color='b', alpha=0.2, where='post')
        plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
        plt.plot(recall, precision, linewidth=2)
        plt.xlim([0.0, 1])
        plt.ylim([0.0, 1.05])
        plt.xlabel('Recall')
        plt.ylabel('Precision')
        plt.title('Precision-recall curve')
        plt.show()

    # Explore the data: class distribution and transaction times
    def show(self, data):
        plt.rcParams['font.sans-serif'] = ['SimHei']
        sns.countplot(x='Class', data=data)
        plt.title('Class distribution')
        plt.show()
        num = len(data)
        num_fraud = len(data[data['Class'] == 1])
        print('Total transactions: ', num)
        print('Fraudulent transactions: ', num_fraud)
        print('Fraud ratio: {:.6f}'.format(num_fraud / num))
        f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(15, 8))
        bins = 50
        ax1.hist(data.Time[data.Class == 1], bins=bins, color='deeppink')
        ax1.set_title('Fraudulent transactions')
        ax2.hist(data.Time[data.Class == 0], bins=bins, color='deepskyblue')
        ax2.set_title('Normal transactions')
        plt.xlabel('Time')
        plt.ylabel('Number of transactions')
        plt.show()

    # Train and evaluate the logistic regression model
    def logic_regress(self, data):
        # Standardize the Amount column
        data['Amount_Norm'] = StandardScaler().fit_transform(
            data['Amount'].values.reshape(-1, 1))
        y = np.array(data.Class.tolist())
        data_new = data.drop(['Time', 'Amount', 'Class'], axis=1)
        X = np.array(data_new.values)
        train_x, test_x, train_y, test_y = train_test_split(
            X, y, test_size=0.1, stratify=y, random_state=33)
        clf = LogisticRegression(n_jobs=-1)
        clf.fit(train_x, train_y)
        predict_y = clf.predict(test_x)
        score_y = clf.decision_function(test_x)
        # Compute and plot the confusion matrix
        cm = confusion_matrix(test_y, predict_y)
        class_names = [0, 1]
        self.plot_confusion_matrix(cm, classes=class_names,
                                   title='Logistic regression confusion matrix')
        # Show the model evaluation scores
        self.show_metrics(cm)
        # Compute precision, recall and thresholds for the PR curve
        precision, recall, thresholds = precision_recall_curve(test_y, score_y)
        self.plot_precision_recall(recall, precision)


if __name__ == '__main__':
    data_ori = pd.read_csv(r'C:\My_data\Study\data analysis\credit_fraud\creditcard.csv')
    print(data_ori.describe())
    credit = CreditFraud()
    credit.logic_regress(data_ori)
