1. Splitting the data into training and test sets

The following code splits the data into a training set and a test set. The training set is used to train the model, and the test set is used to evaluate how well the trained model predicts.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train and y_train are the feature-variable and target-variable data of the training set; X_test and y_test are the feature-variable and target-variable data of the test set. The 569 samples in this case are not very many, so test_size is set to 0.2, that is, the data is split into training and test sets at an 8:2 ratio: the training set gets 455 samples and the test set gets 114.

The train_test_split() function splits the data randomly each time the program runs. If you want the same split every time, set the random_state parameter to 1 or some other fixed number.
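Note that the code above assumes the feature variables X and the target variable y have already been prepared. The 569 samples here match scikit-learn's built-in breast cancer dataset; assuming that source (an assumption, since the original does not show this step), they could be loaded as follows.

from sklearn.datasets import load_breast_cancer  # assumed data source, not shown in the original
data = load_breast_cancer()  # 569 samples, 30 numeric features
X = data.data    # feature variables
y = data.target  # target variable: 0 = malignant, 1 = benign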

2. Model building

The naive Bayes model is relatively easy to build, and the code is as follows.

from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()  # instantiate a Gaussian naive Bayes classifier
nb_clf.fit(X_train, y_train)  # train the model on the training set

At this point a naive Bayes model has been built and can be used to make predictions. This is where the test set divided earlier comes into play: it is used both to generate predictions and to evaluate how well the model predicts.

3. Model prediction and evaluation

First, feed the test set data into the model to obtain predictions. The code is as follows, where nb_clf is the naive Bayes model built above.

y_pred = nb_clf.predict(X_test)

The first 100 predictions can be viewed by printing y_pred[:100]; in the output, 0 indicates a predicted malignant tumor and 1 a predicted benign tumor.
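For example:

print(y_pred[:100])  # first 100 predictions: 0 = malignant, 1 = benign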

Using basic pandas DataFrame operations, the predicted values y_pred and the actual values y_test of the test set can be put side by side for comparison, as follows.

import pandas as pd
a = pd.DataFrame()  # create an empty DataFrame
a['predicted value'] = list(y_pred)
a['actual value'] = list(y_test)

The first five rows of the generated comparison table can be displayed with a.head().
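For instance:

print(a.head())  # show the first five predicted-vs-actual pairs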

As you can see, the prediction accuracy on the first five samples is 80% (4 of 5 correct). The prediction accuracy on all the test set data can be viewed with the following code.

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # fraction of correct predictions
print(score)

The first line imports the accuracy_score() function, which calculates prediction accuracy; the second line computes the accuracy by passing the actual values y_test and the predicted values y_pred into accuracy_score(). The printed score is 0.947, that is, a prediction accuracy of 94.7%, meaning that of the 114 test samples about 108 were predicted correctly and 6 incorrectly.
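Accuracy alone does not show how those errors are distributed between the two classes. One way to break them down (an addition beyond the original text) is a confusion matrix:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # rows = actual class, columns = predicted class
print(cm)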

Since the naive Bayes model is a classification model, the ROC curve can also be used to evaluate its prediction performance; the evaluation method is the same as for the logistic regression model and the decision tree model.
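The original does not show that code, but a minimal sketch using the standard scikit-learn workflow would look like this: GaussianNB provides predict_proba(), whose second column is the predicted probability of class 1, and that probability can be fed to roc_curve() and roc_auc_score().

from sklearn.metrics import roc_curve, roc_auc_score
y_prob = nb_clf.predict_proba(X_test)[:, 1]  # predicted probability of the benign class (1)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # points on the ROC curve
print(roc_auc_score(y_test, y_prob))  # area under the ROC curve (AUC)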

In conclusion, the naive Bayes model is a very classic machine learning model based mainly on Bayes' formula. In application, the features in the data set are treated as mutually independent, without considering associations between them, so computation is fast. Compared with other classic machine learning models, the naive Bayes model has weaker generalization ability, but its prediction performance is good when the numbers of samples and features are large.
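Written out (an addition for clarity), Bayes' formula combined with the independence assumption gives, for features x1, ..., xn and class y:

P(y | x1, ..., xn) ∝ P(y) × P(x1 | y) × P(x2 | y) × ... × P(xn | y)

The model predicts whichever class y makes this product largest; the Gaussian variant used here additionally assumes each P(xi | y) follows a normal distribution.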