We have said repeatedly that machine learning models are rooted in statistics, and the naive Bayes classification algorithm is probably the one with the strongest statistical flavor.

The basic idea of Bayes’ formula

Naive Bayes refers to the application of Bayes’ formula under the naive assumption.

The core idea of Bayesian prediction can be summed up in one short phrase: "it looks more like".

In Bayes' view, the world is not static and absolute but dynamic and relative, and the hope is to use known experience to make judgments. But to "judge" with "experience", two questions arise: where does the experience come from, and how do we judge with it? This one sentence actually contains two rounds of a process:

  • The first round: the categories are known, and we compute statistics on the features, i.e. the probability that each feature occurs within each class. This is the process of reducing categories to feature probabilities.
  • The second round: the features are known, and we use the statistics from the first round to restore them to a particular category. A minimal sketch of the two rounds follows this list.
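
A minimal sketch of these two rounds, using a made-up two-class example (the class names, feature, and counts below are invented purely for illustration):

from collections import defaultdict

# Made-up labeled samples: (class, does the message contain a link?)
samples = [("spam", True), ("spam", True), ("ham", False), ("ham", True)]

# Round 1 (categories -> feature probabilities): count how often the feature occurs in each class
class_total = defaultdict(int)
feature_hits = defaultdict(int)
for label, has_link in samples:
    class_total[label] += 1
    feature_hits[label] += int(has_link)
p_feature_given_class = {c: feature_hits[c] / class_total[c] for c in class_total}

# Round 2 (features -> category): for a new sample that contains a link,
# score each class by prior * likelihood and pick the larger score
n = len(samples)
scores = {c: (class_total[c] / n) * p_feature_given_class[c] for c in class_total}
print(max(scores, key=scores.get))  # "it looks more like" this class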

An example of Bayesian prediction

How are the prior and the posterior used for prediction? Here I want to show you a "technique" I used in high school: guessing who a girl is by her hairstyle. Suppose there are 10 female students in my class, one of whom is called Angeli. In high school the girls are almost the same height and wear the same uniform, so the probability of guessing Angeli just by looking at her from behind is 10%. But one day I noticed that Angeli particularly likes to tie a ponytail. Of course, the ponytail is not her exclusive patented invention; girls at this age all love ponytails, so not every female student with a ponytail is Angeli. What to do? I spent some class time gathering statistics: the female students in my class wear three kinds of hairstyles, and the probability of a ponytail is about 30%, denoted P(ponytail). Angeli really does favor the ponytail: for her the probability is as high as 70%, denoted P(ponytail | Angeli). This is the conditional probability we introduced earlier, the probability of having a ponytail given that the female student is Angeli, which in Bayes' formula is called the likelihood. With these three statistics in hand I had a pretty good idea: from then on, whenever I saw a female student with a ponytail, there was more than a 20% probability that it was our Angeli.

The secret is Bayes' formula. You may have noticed that the probability that a female classmate with a ponytail is Angeli is also a conditional probability, denoted P(Angeli | ponytail); this is the posterior probability. According to Bayes' formula, we have:


P(ponytail) \cdot P(Angeli | ponytail) = P(Angeli) \cdot P(ponytail | Angeli)

Substituting the above data, it can be calculated:


P(Angeli | ponytail) = 10\% \times 70\% / 30\% = 23.3\%
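
As a quick sanity check of this arithmetic, here is a tiny Python computation using the three statistics from the story:

# Bayes' formula: posterior = prior * likelihood / evidence
p_angeli = 0.10                 # prior: 1 of the 10 female students is Angeli
p_ponytail = 0.30               # evidence: a female student wears a ponytail 30% of the time
p_ponytail_given_angeli = 0.70  # likelihood: Angeli wears a ponytail 70% of the time

p_angeli_given_ponytail = p_angeli * p_ponytail_given_angeli / p_ponytail
print(f"{p_angeli_given_ponytail:.1%}")  # 23.3%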

We said that making a choice with a prior and a posterior is a two-stage process, and now we can explain it with the likelihood: the prior probability is already known, and what we need to learn empirically or experimentally is the likelihood. Once we know the likelihood, combining it with the prior gives us the posterior probability.

Algorithm principle of naive Bayes classification

For classification problems, the data record for each sample may be in the following form:


[feature X_1, feature X_2, feature X_3, … , category C_1]

[feature X_1, feature X_2, feature X_3, … , category C_2]

What we really want to solve for is the posterior probability of category C_1:


P(category C_1 | feature X_1, feature X_2, feature X_3, …)

This formula means the probability that category C_1 occurs under the condition that feature X_1, feature X_2, feature X_3 and so on all occur together. We already know that the posterior probability can be obtained from the likelihood. The likelihood of the features is expressed as follows:


P(feature X_1, feature X_2, feature X_3, … | category C_1)
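
Bayes' formula is what connects the two; written with the class and feature symbols above, it reads:

P(C_1 | X_1, X_2, X_3, …) = \frac{P(C_1) \, P(X_1, X_2, X_3, … | C_1)}{P(X_1, X_2, X_3, …)}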

Mathematical analysis

First, a review of Bayes’ formula:


P(y | x) = \frac{P(y) P(x|y)}{P(x)}

Two forms appear here: P(A), which means the probability that A occurs, and P(A|B), called a conditional probability, which is the probability that A occurs given that B occurs. In Bayes' formula, P(y) is called the prior probability, P(y|x) is called the posterior probability, and P(x|y) is called the likelihood, which is also the object that the naive Bayes classification algorithm needs to "learn".

Naive Bayes makes a "naive" assumption: the features are independent of each other and do not affect each other. Under this assumption, the likelihood of a single feature can be simplified as:


P(x_i | y, x_1, … , x_{i-1}, x_{i+1}, … , x_n) = P(x_i | y)
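
Applying this assumption feature by feature (together with the chain rule of probability), the joint likelihood of all the features factorizes into a product of single-feature likelihoods:

P(x_1, … , x_n | y) = \prod_{i=1}^n P(x_i | y)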

It can be seen that under the "naive" assumption, obtaining the likelihood of a feature becomes much easier. Of course, finding the posterior probability is also easy:


P(y | x_1, … , x_n) \propto P(y) \prod_{i=1}^n P(x_i | y)

Notice the use of the "proportional to" symbol. The naive Bayes algorithm predicts with the posterior probability, and its core method is to obtain the posterior probability through the likelihood; the learning process is the process of continually refining the likelihood estimates. Since the posterior probability is proportional to the likelihood, improving the likelihood naturally improves the posterior. This is another example of simplifying for convenience and utility.
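
Here is a minimal sketch of this scoring, assuming a tiny made-up dataset with two binary features: the prior P(y) and each likelihood P(x_i | y) are estimated by counting, and a new sample is scored with P(y) \prod_i P(x_i | y). (In practice a smoothing term is usually added so that unseen feature values do not force the score to zero.)

from collections import Counter, defaultdict

# Made-up training data: each row is ([x_1, x_2], y) with binary features
data = [([1, 0], "A"), ([1, 1], "A"), ([1, 1], "A"), ([0, 1], "B"), ([0, 0], "B")]

class_count = Counter(y for _, y in data)   # used for the prior P(y)
ones = defaultdict(lambda: [0, 0])          # ones[y][i]: samples of class y with x_i = 1
for x, y in data:
    for i, v in enumerate(x):
        ones[y][i] += v

def score(x, y):
    """Unnormalized posterior: P(y) times the product of P(x_i | y)."""
    result = class_count[y] / len(data)             # prior P(y)
    for i, v in enumerate(x):
        p_one = ones[y][i] / class_count[y]         # P(x_i = 1 | y)
        result *= p_one if v == 1 else 1 - p_one    # P(x_i | y)
    return result

# The scores are only proportional to the true posteriors,
# but that is enough to tell which class is more probable.
print(score([1, 1], "A"), score([1, 1], "B"))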

Naive Bayes optimization method

So much for naive Bayes' mathematical background. You might wonder: why haven't we seen our old acquaintances, the hypothesis function and the loss function? This is related to how the naive Bayes algorithm learns. Rather than iteratively fitting anything, the naive Bayes algorithm collects statistics, compares the likelihood relationship between features and classes, and finally takes the class with the largest (unnormalized) posterior probability as the prediction result, namely:


\hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i | y)

We know that y represents the class. The likelihood of each feature under each class, P(x_i | y), differs from class to class, while the prior probability P(y) is a fixed value estimated once from the data, so making the overall expression as large as possible depends on the P(x_i | y) terms. You can think of the statistics as one big table: the learning algorithm's job is to find the y for which P(y) \prod_i P(x_i | y) reaches its maximum, and that y is the final prediction output.

It can be seen that the naive Bayes algorithm is actually a table lookup process, rather than the previous iterative approximation, so the hypothesis function and loss function that drive the iterative approximation process are no longer needed.

The concrete steps of naive Bayes classification algorithm

The naive Bayes classification algorithm is a supervised classification algorithm. The input is again a vector of sample feature values together with the corresponding class label, and the output is a model with classification capability that can predict the class from input feature values.

According to the Naive Bayes classification algorithm, there are three specific steps:

  1. Collect statistics from the sample data: the prior probability P(y) and the likelihood P(x | y).
  2. According to the features contained in the sample to be predicted, compute the posterior probability for each class. For example, suppose there are three possible features A, B and C, but the sample under test contains only A and B; then the posterior probability of class y_1 is computed as P(y_1) P(A | y_1) P(B | y_1).
  3. Compare the posterior probabilities of y_1, y_2, … , y_n, and output the class with the highest probability value as the prediction. A sketch of these three steps follows this list.
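
Putting the three steps together, here is a small hand-rolled sketch with made-up categorical data (features A, B, C as in the example above; all names and counts are invented): step 1 counts the priors and likelihoods, step 2 computes a posterior score for each class from the features present in the test sample, and step 3 picks the class with the highest score.

from collections import Counter, defaultdict

# Made-up training data: each row is (set of features present, class label)
train = [({"A", "B"}, "y1"), ({"A", "C"}, "y1"), ({"B"}, "y2"), ({"B", "C"}, "y2"), ({"A"}, "y1")]

# Step 1: statistics -- prior P(y) and likelihood P(feature | y)
prior_count = Counter(label for _, label in train)
feat_count = defaultdict(Counter)
for feats, label in train:
    feat_count[label].update(feats)

def posterior_score(feats, label):
    """Step 2: unnormalized posterior P(y) * product of P(feature | y) over the observed features."""
    score = prior_count[label] / len(train)
    for f in feats:
        # add-one (Laplace) smoothing so an unseen feature does not zero out the score
        score *= (feat_count[label][f] + 1) / (prior_count[label] + 2)
    return score

# Step 3: compare the posterior scores of all classes and output the largest
test_sample = {"A", "B"}   # the sample under test contains features A and B only
prediction = max(prior_count, key=lambda label: posterior_score(test_sample, label))
print(prediction)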

In the scikit-learn library, the algorithms based on Bayesian models are collected in the sklearn.naive_bayes package. Naive Bayes is further divided into several specific algorithm branches according to how the likelihood is calculated.

The naive Bayes classification algorithm introduced in this chapter can be invoked using the MultinomialNB class as follows:

from sklearn.datasets import load_iris
# Import the multinomial naive Bayes classifier from scikit-learn
from sklearn.naive_bayes import MultinomialNB

# Load iris data set
X, y = load_iris(return_X_y=True)
# Train the model
clf = MultinomialNB().fit(X, y)
# Use models for classification prediction
clf.predict(X)

The following output is displayed: