Related articles:

  1. Linear regression for machine learning
  2. Machine learning logistic regression and Python implementation
  3. Machine learning project actual trading data anomaly detection
  4. Decision Tree for Machine Learning
  5. Python implementation of Decision Tree for machine learning
  6. PCA for Machine Learning
  7. Feature engineering for machine learning

We now have a batch of processed transaction data from credit card users, and we need to learn a model from this data that can be used to predict whether a new transaction is suspected of credit card fraud.

First, of course, you need to import the necessary Python libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Let’s take a look at the raw data

data = pd.read_csv("creditcard.csv")
print(data.shape)
data.head()  # print the first 5 rows
(284807, 31)
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 . V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 1.359807 0.072781 2.536347 1.378155 0.338321 0.462388 0.239599 0.098698 0.363787 . 0.018307 0.277838 0.110474 0.066928 0.128539 0.189115 0.133558 0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 0.082361 0.078803 0.085102 0.255425 . 0.225775 0.638672 0.101288 0.339846 0.167170 0.125895 0.008983 0.014724 2.69 0
2 1.0 1.358354 1.340163 1.773209 0.379780 0.503198 1.800499 0.791461 0.247676 1.514654 . 0.247998 0.771679 0.909412 0.689281 0.327642 0.139097 0.055353 0.059752 378.66 0
3 1.0 0.966272 0.185226 1.792993 0.863291 0.010309 1.247203 0.237609 0.377436 1.387024 . 0.108300 0.005274 0.190321 1.175575 0.647376 0.221929 0.062723 0.061458 123.50 0
4 2.0 1.158233 0.877737 1.548718 0.403034 0.407193 0.095921 0.592941 0.270533 0.817739 . 0.009431 0.798278 0.137458 0.141267 0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

It can be seen that there are 284,807 samples in total, and each sample has 31 features. The 28 features V1 to V28 are clean values that have already been processed and anonymized; although we don't know what they mean, they can be used directly. The remaining features are:

Time indicates the transaction time

Amount indicates the total amount of the transaction

Class is the output, indicating whether the transaction involves credit card fraud: 0 is normal and 1 is fraudulent. We call the Class 0 samples negative and the Class 1 samples positive

Feature scaling

Notice that the value range of the Amount feature differs significantly from that of the V1 to V28 features, so we need to scale the Amount feature

from sklearn.preprocessing import StandardScaler
# print(data['Amount'].values.reshape(-1, 1))
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'], axis=1)  # drop the columns we no longer need
data.head()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 . V21 V22 V23 V24 V25 V26 V27 V28 Class normAmount
0 1.359807 0.072781 2.536347 1.378155 0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 . 0.018307 0.277838 0.110474 0.066928 0.128539 0.189115 0.133558 0.021053 0 0.244964
1 1.191857 0.266151 0.166480 0.448154 0.060018 0.082361 0.078803 0.085102 0.255425 0.166974 . 0.225775 0.638672 0.101288 0.339846 0.167170 0.125895 0.008983 0.014724 0 0.342475
2 1.358354 1.340163 1.773209 0.379780 0.503198 1.800499 0.791461 0.247676 1.514654 0.207643 . 0.247998 0.771679 0.909412 0.689281 0.327642 0.139097 0.055353 0.059752 0 1.160686
3 0.966272 0.185226 1.792993 0.863291 0.010309 1.247203 0.237609 0.377436 1.387024 0.054952 . 0.108300 0.005274 0.190321 1.175575 0.647376 0.221929 0.062723 0.061458 0 0.140534
4 1.158233 0.877737 1.548718 0.403034 0.407193 0.095921 0.592941 0.270533 0.817739 0.753074 . 0.009431 0.798278 0.137458 0.141267 0.206010 0.502292 0.219422 0.215153 0 0.073403

5 rows × 30 columns

Class imbalance problem

Now, let’s see how many negative samples we have and how many positive samples we have

count_class = pd.value_counts(data['Class'],sort=True).sort_index()
print(count_class)
0    284315
1       492
Name: Class, dtype: int64

We find that there are 284,315 negative samples and only 492 positive samples: the positive and negative classes are severely out of proportion. This is the class imbalance problem. Let's talk about it more specifically.

Class imbalance means that when training a classifier, the classes in the sample set are very unevenly distributed. In the problem above, with 284,807 samples, the numbers of positive and negative samples would ideally be roughly equal; instead there are 284,315 negative samples and only 492 positive ones, which is a serious class imbalance.

Why must we deal with this problem? From the perspective of training the model, if one class has very few samples, the "information" provided by that class is very small, which leads to a poorly trained model.

How do we deal with it? There are two ways to balance the samples of the different classes:

  1. Undersampling (downsampling): sample only part of the classes with many samples in the training set (the majority classes), discarding some samples to alleviate the class imbalance.
  2. Oversampling: synthesize new samples for the classes with few samples in the training set (the minority classes) to alleviate the class imbalance. SMOTE, which we will use later, is a classic oversampling algorithm.

Let's first tackle the problem with undersampling

First, feature X and output variable y are separated

# Separate out feature X and output variable y
X = data.iloc[:,data.columns != 'Class']
y = data.iloc[:,data.columns == 'Class']
# print(X.head())
# print(X.shape)
# print(y.head())
# print(y.shape)

So-called undersampling means randomly selecting a small number of samples from the majority class and combining them with the original minority-class samples into a new training data set.

Specifically, since there are only 492 positive samples, we randomly select 492 of the 284,315 negative samples, and then combine them with the 492 positive samples into a new training data set

# Number of positive samples
positive_sample_count = len(data[data.Class == 1])
print("Number of positive samples is:",positive_sample_count)

# indices of the positive samples
positive_sample_index = np.array(data[data.Class == 1].index)
print("The index corresponding to the positive sample in the dataset is (print the first 5):",positive_sample_index[:5])

# indices of the negative samples
negative_sample_index = data[data.Class == 0].index
# numpy.random.choice(a, size=None, replace=True, p=None) draws a random sample from the given 1-D array
# replace=False means sampling without replacement, so no index is picked twice
random_negative_sample_index = np.random.choice(negative_sample_index, positive_sample_count, replace=False)
random_negative_sample_index = np.array(random_negative_sample_index)
print("The index corresponding to the negative sample in the dataset is (print the first 5):",random_negative_sample_index[:5])

under_sample_index = np.concatenate([positive_sample_index,random_negative_sample_index])
under_sample_data = data.iloc[under_sample_index,:]
X_under_sample = under_sample_data.iloc[:,under_sample_data.columns != 'Class']
y_under_sample = under_sample_data.iloc[:,under_sample_data.columns == 'Class']

print('Proportion of positive samples in the new dataset after undersampling:',
      len(under_sample_data[under_sample_data.Class==1])/len(under_sample_data))
print('Proportion of negative samples in the new dataset after undersampling:',
      len(under_sample_data[under_sample_data.Class==0])/len(under_sample_data))
print('Number of samples in the new dataset after undersampling:',len(under_sample_data))
Number of positive samples is: 492
The index corresponding to the positive sample in the dataset is (print the first 5): [541 623 4920 6108 6329]
The index corresponding to the negative sample in the dataset is (print the first 5): [38971 9434 75592 113830 203239]
Proportion of positive samples in the new dataset after undersampling: 0.5
Proportion of negative samples in the new dataset after undersampling: 0.5
Number of samples in the new dataset after undersampling: 984

Train/test split and cross-validation

The next thing we need to do is split the current data set into a training set and a test set.

The training set is used to train the model, and the test set is used to evaluate the final model. Within the training set, we also need to perform an operation called cross-validation, for parameter tuning and model selection.

First of all, it must be emphasized: cross-validation is performed on the training set only; never touch the test set!!

So-called K-fold cross-validation randomly divides the training set into K parts. We then train on K-1 of the parts and validate on the remaining part, repeating K times (a different part is held out each time), and take the average accuracy as the accuracy of the currently trained model.

We can randomly split the training and test sets with train_test_split from the sklearn.cross_validation module:

X_train, X_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.4, random_state=0)

train_data: the sample features to be split

train_target: the sample labels to be split

test_size: the proportion of the test set if given a float, or the absolute number of test samples if given an integer

random_state: the random seed, i.e. which fixed sequence of random numbers to use. Fixing it makes repeated runs reproducible: passing 1 every time (with all other parameters unchanged) yields the same split, while leaving it unset (None) gives a different split on every run.

from sklearn.cross_validation import train_test_split

# Split the original (pre-undersampling) data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Split the undersampled data the same way
X_train_under_sample, X_test_under_sample, y_train_under_sample, y_test_under_sample = train_test_split(X_under_sample,
                                                                                                        y_under_sample,
                                                                                                        test_size=0.3,
                                                                                                        random_state=0)
print('Training set sample size:',len(X_train_under_sample))
print('Test set sample size:',len(X_test_under_sample))
C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
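As the warning says, sklearn.cross_validation was deprecated in 0.18 and removed in 0.20. If you are following along on a newer scikit-learn, a minimal equivalent sketch (assuming scikit-learn >= 0.18) is to import the same function from model_selection instead:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)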

Next, we will use logistic regression to train the model through cross-validation

Cross-validation can be handled with the KFold class in the sklearn.cross_validation module. Its signature is:

class sklearn.cross_validation.KFold(n, n_folds=3, shuffle=False, random_state=None)

Parameter Description:

n: the total number of samples to split

n_folds: the number of folds for cross-validation

shuffle: whether to shuffle the data before splitting
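To make the fold indices concrete, here is a tiny sketch (using the same deprecated sklearn.cross_validation API as the rest of this article) showing that iterating a KFold object yields a (training indices, validation indices) pair for each fold; the Kfold_for_TrainModel function below relies on exactly this:

from sklearn.cross_validation import KFold

fold = KFold(10, n_folds=5, shuffle=False)
for train_index, valid_index in fold:
    # each pass: 8 training indices (4 folds) and 2 validation indices (1 fold)
    print(train_index, valid_index)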

Model training and selection

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report 

def Kfold_for_TrainModel(X_train_data, y_train_data):
    fold = KFold(len(X_train_data), 5, shuffle = False)

    # candidate values for the regularization parameter C
    c_params = [0.01, 0.1, 1, 10, 100]
    # build a DataFrame to hold each C parameter and its mean recall
    result_tables = pd.DataFrame(columns = ['C_parameter','Mean recall score'])
    result_tables['C_parameter'] = c_params
    j = 0
    for c_param in c_params:
        print('-------------------------------------------')
        print('C parameter:', c_param)
        print('-------------------------------------------')
        print(' ')

        recall_list = []
        for iteration, indices in enumerate(fold, start=1):
            # use L1 regularization
            lr = LogisticRegression(C=c_param, penalty = 'l1')
            # indices[0] holds the training indices for this fold (4 of the 5 parts)
            # indices[1] holds the validation indices for this fold (the remaining part)
            lr.fit(X_train_data.iloc[indices[0],:],
                   y_train_data.iloc[indices[0],:].values.ravel())  # .ravel() flattens the labels to 1-D
            # validate on the held-out part (the indices stored in indices[1])
            y_undersample_pred = lr.predict(X_train_data.iloc[indices[1],:].values)

            recall = recall_score(y_train_data.iloc[indices[1],:].values,
                                  y_undersample_pred)
            recall_list.append(recall)
            print('Iteration', iteration, 'recall rate is:', recall)
        print(' ')
        print('Average recall rate is:', np.mean(recall_list))
        print(' ')
        result_tables.loc[j,'Mean recall score'] = np.mean(recall_list)
        j = j + 1

    # print(result_tables['Mean recall score'])
    result_tables['Mean recall score'] = result_tables['Mean recall score'].astype('float64')
    best_c_param = result_tables.loc[result_tables['Mean recall score'].idxmax(), 'C_parameter']
    print('*************************************************************************')
    print('C parameter corresponding to the best model =', best_c_param)
    print('*************************************************************************')
    return best_c_param
best_c_param = Kfold_for_TrainModel(X_train_under_sample, y_train_under_sample)
-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration 1 recall rate is: 0.9315068493150684
Iteration 2 recall rate is: 0.9178082191780822
Iteration 3 recall rate is: 1.0
Iteration 4 recall rate is: 0.972972972972973
Iteration 5 recall rate is: 0.9545454545454546

Average recall rate is: 0.9553666992023157

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration 1 recall rate is: 0.8356164383561644
Iteration 2 recall rate is: 0.863013698630137
Iteration 3 recall rate is: 0.9491525423728814
Iteration 4 recall rate is: 0.9459459459459459
Iteration 5 recall rate is: 0.9090909090909091

Average recall rate is: 0.9005639068792076

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration 1 recall rate is: 0.8493150684931506
Iteration 2 recall rate is: 0.8904109589041096
Iteration 3 recall rate is: 0.9830508474576272
Iteration 4 recall rate is: 0.9459459459459459
Iteration 5 recall rate is: 0.9090909090909091

Average recall rate is: 0.9155627459783485

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration 1 recall rate is: 0.863013698630137
Iteration 2 recall rate is: 0.863013698630137
Iteration 3 recall rate is: 0.9830508474576272
Iteration 4 recall rate is: 0.9459459459459459
Iteration 5 recall rate is: 0.8939393939393939

Average recall rate is: 0.9097927169206482

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration 1 recall rate is: 0.8767123287671232
Iteration 2 recall rate is: 0.863013698630137
Iteration 3 recall rate is: 0.9661016949152542
Iteration 4 recall rate is: 0.9459459459459459
Iteration 5 recall rate is: 0.8939393939393939

Average recall rate is: 0.9091426124395708

*************************************************************************
C parameter corresponding to the best model = 0.01
*************************************************************************

When the regularization parameter C is set to 0.01, the model achieves the highest recall, so we choose the L1-regularized model with C = 0.01 as the final model. Next, we use this model to predict on the final test set and see how it does.
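One caveat if you reproduce this on a newer scikit-learn (0.22 or later): the default solver there is 'lbfgs', which does not support penalty='l1', so the solver must be chosen explicitly. A minimal sketch, assuming a recent scikit-learn:

from sklearn.linear_model import LogisticRegression

# 'liblinear' (or 'saga') supports the L1 penalty; the newer default 'lbfgs' does not
lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')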

Performance measurement

First, a few words about performance measures for classification problems. The key tool is the confusion matrix, from which we can derive many useful performance measures.

In machine learning, confusion matrix is a visual display tool to evaluate the quality of classification models. Where rows represent actual categories and columns represent predicted categories

So let’s explain TP,FP,TN and FN

TP (True Positive): a sample that is actually positive and is correctly judged positive

FP (False Positive): a false positive, that is, a sample that is actually negative but is wrongly judged positive

TN (True Negative): a true negative, that is, a sample that is actually negative and is correctly judged negative

FN (False Negative): a false negative, that is, a sample that is actually positive but is wrongly judged negative

Here are some more commonly used performance metrics

Precision: the proportion of samples predicted positive that really are positive. The formula is:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: the proportion of actual positive samples that are correctly predicted positive. The formula is:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Accuracy: the proportion of all samples that are predicted correctly. The formula is:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
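As a quick sanity check of the three formulas, here is a small worked example on a made-up confusion matrix (the counts are hypothetical, not from this dataset); rows are actual classes and columns are predicted classes, matching the convention above:

import numpy as np

# rows = actual (0, 1), columns = predicted (0, 1); hypothetical counts
cm = np.array([[90, 10],   # TN = 90, FP = 10
               [ 5, 45]])  # FN = 5,  TP = 45

TN, FP, FN, TP = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
print('Precision:', TP / (TP + FP))        # 45 / 55 ≈ 0.818
print('Recall:', TP / (TP + FN))           # 45 / 50 = 0.9
print('Accuracy:', (TP + TN) / cm.sum())   # 135 / 150 = 0.9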


Now, let's plot the confusion matrix

import itertools
def plot_confusion_matrix(confusion_matrix, classes):
    # print(confusion_matrix)
    # plt.imshow renders the matrix as a heat map
    plt.figure()
    plt.imshow(confusion_matrix, interpolation='nearest',cmap=plt.cm.Blues)
    plt.title('confusion matrix')
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = confusion_matrix.max() / 2.
    for i, j in itertools.product(range(confusion_matrix.shape[0]), range(confusion_matrix.shape[1])):
        plt.text(j, i, confusion_matrix[i, j],
                 horizontalalignment="center",
                 color="white" if confusion_matrix[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    print('Precision is:',confusion_matrix[1, 1]/(confusion_matrix[1, 1]+confusion_matrix[0, 1]))
    print('Recall rate is:',confusion_matrix[1, 1]/(confusion_matrix[1, 1]+confusion_matrix[1, 0]))
    print('Accuracy is:',(confusion_matrix[0, 0]+confusion_matrix[1, 1])/(confusion_matrix[0, 0]+confusion_matrix[0, 1]+confusion_matrix[1, 1]+confusion_matrix[1, 0]))
    print('*************************************************************************')
lr = LogisticRegression(C = best_c_param, penalty = 'l1')
lr.fit(X_train_under_sample, y_train_under_sample.values.ravel())

# predict on the undersampled test set
y_undersample_pred = lr.predict(X_test_under_sample.values)
# build the confusion matrix
conf_matrix = confusion_matrix(y_test_under_sample,y_undersample_pred)
# np.set_printoptions(precision=2)
class_names = [0, 1]

plot_confusion_matrix(conf_matrix
                      , classes=class_names)


Recall rate is: 0.9387755102040817
Accuracy is: 0.918918918918919
*************************************************************************

The metrics above are all for the test set drawn from the undersampled data; we also need to evaluate on the test set of the entire sample data. So let's use the model we just trained on the test set that was split off before undersampling

# predict on the test set of the full dataset
y_pred = lr.predict(X_test.values)
# build the confusion matrix
conf_matrix = confusion_matrix(y_test,y_pred)
# np.set_printoptions(precision=2)
class_names = [0, 1]

plot_confusion_matrix(conf_matrix
                      , classes=class_names)

Recall rate is: 0.9251700680272109
Accuracy is: 0.8694217197429865
*************************************************************************

Although the recall and accuracy look good, the precision is far too low: 136 of the 147 positive samples were predicted correctly, but at the cost of predicting 10,894 negative samples as positive, giving a precision of only 136 / (136 + 10894) ≈ 0.012. The model would rather wrongly flag a thousand than let a single one slip through.

Let's think about why this happens. We selected too few negative samples: of the 284,315 negatives, we used only 492 to train the model, so it generalizes poorly. This is the drawback of undersampling: the training set is much smaller than the original sample set, so information is lost, and the discarded samples often carry important information.

Oversampling: SMOTE

Now let's process the data with oversampling instead, using the SMOTE algorithm.

SMOTE stands for Synthetic Minority Oversampling Technique. The idea is to analyze the minority-class samples and synthesize new samples from them to add to the data set.

The algorithm flow is as follows (a small sketch of step 3 follows the list):

  1. For every sample $x$ in the minority class, compute the Euclidean distance from $x$ to all other samples in the minority-class set to obtain its k nearest neighbors.
  2. Determine a sampling ratio; for each minority-class sample $x$, randomly select several samples from its k nearest neighbors. Suppose a selected neighbor is $\tilde{x}$.
  3. For each randomly selected neighbor $\tilde{x}$, build a new sample from it and the original sample according to $x_{\text{new}} = x + \text{rand}(0, 1) \times (\tilde{x} - x)$.
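To make step 3 concrete, here is a minimal numpy sketch of how a single synthetic sample is generated (an illustration under the assumptions above; the experiments below use the imblearn implementation, not this function):

import numpy as np

def smote_synthesize_one(x, neighbors):
    # x: one minority-class sample; neighbors: rows are its k nearest minority-class neighbors
    x_tilde = neighbors[np.random.randint(len(neighbors))]  # pick one neighbor at random
    gap = np.random.rand()                                  # rand(0, 1)
    return x + gap * (x_tilde - x)  # the new sample lies on the segment between x and x_tilde

# toy usage with made-up 2-D points
x = np.array([1.0, 2.0])
neighbors = np.array([[1.5, 2.5], [0.5, 1.5], [1.0, 3.0]])
print(smote_synthesize_one(x, neighbors))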

Now, let’s look at the code

# pip install imblearn
from imblearn.over_sampling import SMOTE
oversampler = SMOTE(random_state = 0)

# note: only the training portion of the full dataset is oversampled
X_over_samples, y_over_samples = oversampler.fit_sample(X_train, y_train.values.ravel())

Let's see how many positive and negative samples we have after SMOTE oversampling

len(y_over_samples[y_over_samples == 1]), len(y_over_samples[y_over_samples == 0])
(199019, 199019)

You can see the positive and negative samples are now perfectly balanced. (The training split holds about 70% of the 284,807 samples, i.e. 199,364, of which 199,019 are negative; SMOTE synthesized enough positive samples to match.) Now we can retrain and select the model on the new sample set

# Kfold_for_TrainModel expects DataFrames, so wrap the resampled arrays
best_c_param = Kfold_for_TrainModel(pd.DataFrame(X_over_samples),
                                    pd.DataFrame(y_over_samples))
-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.912
Iteration 3 recall rate is: 0.9129489124936773
Iteration 4 recall rate is: 0.8972829022573392
Iteration 5 recall rate is: 0.8974462044795055

Average recall rate is: 0.9096498895603901

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.92
Iteration 3 recall rate is: 0.9145422357106727
Iteration 4 recall rate is: 0.8986521285816574
Iteration 5 recall rate is: 0.8987777456756315

Average recall rate is: 0.9121087077078782

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.92
Iteration 3 recall rate is: 0.9146686899342438
Iteration 4 recall rate is: 0.8987777456756315
Iteration 5 recall rate is: 0.8989536096071954

Average recall rate is: 0.9121942947576999

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.92
Iteration 3 recall rate is: 0.9146686899342438
Iteration 4 recall rate is: 0.8988028690944264
Iteration 5 recall rate is: 0.8991294735387592

Average recall rate is: 0.9122344922277715

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration 1 recall rate is: 0.9285714285714286
Iteration 2 recall rate is: 0.92
Iteration 3 recall rate is: 0.9146686899342438
Iteration 4 recall rate is: 0.8991169118293617
Iteration 5 recall rate is: 0.8989661713165929

Average recall rate is: 0.9122646403303254

*************************************************************************
C parameter corresponding to the best model = 100.0
*************************************************************************

We can see that the recall is best when the C parameter is 100, so we use that model to predict on the test set

lr = LogisticRegression(C = best_c_param, penalty = 'l1')
lr.fit(X_over_samples, y_over_samples)
# lr.fit(pd.DataFrame(X_over_samples), pd.DataFrame(y_over_samples).values.ravel())
# predict on the test set of the full dataset
y_pred = lr.predict(X_test.values)
# build the confusion matrix
conf_matrix = confusion_matrix(y_test,y_pred)
# np.set_printoptions(precision=2)
class_names = [0, 1]

plot_confusion_matrix(conf_matrix
                      , classes=class_names)

Recall rate is: 0.9183673469387755
Accuracy is: 0.9752817667918964
*************************************************************************

Compared with the undersampling result above, this is clearly better.
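Putting the three evaluations side by side:

Undersampled model on the undersampled test set: recall 0.9388, accuracy 0.9189
Undersampled model on the full test set: recall 0.9252, accuracy 0.8694
SMOTE model on the full test set: recall 0.9184, accuracy 0.9753

The SMOTE model gives up a little recall but avoids the flood of false positives that undersampling produced.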

Welcome to follow my personal WeChat official account, AI Computer Vision Workshop. It publishes articles on machine learning, deep learning, computer vision, and related topics from time to time. You are welcome to learn and exchange ideas with me.