1. Bayesian classifier concept

The Bayesian classifier is the classifier that minimizes the probability of classification error or, when misclassification costs are given in advance, the average risk. Its design method is one of the most basic statistical classification methods. The classification principle is to compute the posterior probability of an object using Bayes' formula, that is, the probability that the object belongs to each category, and to assign the object to the class with the largest posterior probability.

Advantages of the Bayesian classifier: it is simple and easy to understand, learning is efficient, and the time and space costs of classification are small.

Disadvantages of the Bayesian classifier: the algorithm relies on the assumptions that the input variables are conditionally independent given the class and that continuous attributes are normally distributed, which affects its accuracy to some extent.

2. Related concepts

This section introduces some of the probability concepts used in the Bayesian classifier, to aid understanding.

2.1 Prior probability

Definition: the probability assigned based on past experience and analysis (the probability we assign from experience when we know nothing about the features of the data).

For example, suppose we buy a watermelon and know nothing about its color, texture, root, or other features; from experience, there is a 60% chance that it is a good melon. This probability, P(good melon), is called the prior probability.
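
As a minimal sketch, a prior such as P(good melon) can be estimated as a simple frequency over previously observed melons; the label list below is hypothetical and chosen only so that the frequency comes out to 0.6:

```python
# Estimate the prior P(good melon) as a frequency over past observations.
# The labels below are hypothetical, chosen so that the prior comes out to 0.6.
labels = ["good", "good", "bad", "good", "bad", "good", "good", "bad", "good", "bad"]

prior_good = labels.count("good") / len(labels)
print(prior_good)  # 0.6
```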

2.2 Posterior probability

Definition: the probability, given that an event has already occurred, that it was caused by a particular factor (a form of conditional probability).

For example, suppose we already know that texture is a good indicator of watermelon quality and that 75% of watermelons have a clear texture. If we use a watermelon's texture to infer whether it is good or bad, then P(good melon | texture clear) is called the posterior probability.

2.3 Joint probability

Definition: let the two-dimensional discrete random variable (X, Y) take the possible values (xi, yj), i, j = 1, 2, …. The probabilities P(X = xi, Y = yj) = pij, i, j = 1, 2, …, are called the joint probabilities of the random variables X and Y.

They are calculated as follows: P(X = xi, Y = yj) = P(Y = yj | X = xi) · P(X = xi), i, j = 1, 2, ….

For example, in the watermelon case above, P(good melon, texture clear) is a joint probability: the probability that the melon both has a clear texture and is a good melon. It satisfies the following formula.


P(good melon, texture clear) = P(texture clear | good melon) · P(good melon)
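
As a small illustration (using the same figures that appear in the example of section 2.5: a 0.6 prior for good melons and a 0.8 chance that a good melon has a clear texture), the joint probability is simply the product:

```python
# Joint probability as conditional probability times prior.
p_good = 0.6                 # P(good melon)
p_clear_given_good = 0.8     # P(texture clear | good melon)

p_joint = p_clear_given_good * p_good   # P(good melon, texture clear)
print(p_joint)  # 0.48
```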

2.4 Total probability formula

If the events A1, A2, …, An form a complete event group (a partition of the sample space) and each has positive probability, then for any event B the following holds: P(B) = P(BA1) + P(BA2) + … + P(BAn) = P(B|A1)P(A1) + P(B|A2)P(A2) + … + P(B|An)P(An). This is the total probability formula. (When P(B) is difficult to compute directly but P(Ai) and P(B|Ai), i = 1, 2, …, n, are easy to compute, the total probability formula can be used to obtain P(B).)

Example: continuing the watermelon example used for the joint probability above, we also need to know the probability P(texture clear). So how do we calculate the probability of a clear texture? It can be split into two cases: the probability that the texture is clear and the melon is good, and the probability that the texture is clear and the melon is bad. The probability of a clear texture is the sum of these two cases. From this we obtain the total probability formula:


P(texture clear) = P(texture clear | good melon) · P(good melon) + P(texture clear | bad melon) · P(bad melon)

2.5 Bayes’ theorem

Bayes' theorem is a theorem about the conditional (and marginal) probabilities of two random events A and B. P(A|B) is the probability that A occurs given that B has occurred, and the formula is as follows:

P(A|B) = P(B|A) · P(A) / P(B)

For a sample with feature vector X, we want to know which category the sample belongs to, that is, we look for the class label c with the largest posterior probability P(c|X). Based on Bayes' formula, we can conclude:

P(c|X) = P(X|c) · P(c) / P(X)

Let’s use watermelon as an example to sort things out:

A watermelon can be in one of two states, good melon or bad melon, with probabilities 0.6 and 0.4 respectively; the probability that a good melon has a clear texture is 0.8, and the probability that a bad melon has a clear texture is 0.4. Now that I have picked a melon with a clear texture, what is the probability that it is a good melon?

Obviously, this is a posterior probability problem, and we can give the formula directly:

P(good melon | texture clear) = P(texture clear | good melon) · P(good melon) / P(texture clear), where P(texture clear) = P(texture clear | good melon) · P(good melon) + P(texture clear | bad melon) · P(bad melon).

Analyze the probabilities in the formula one by one:

Conditional probability: P(texture clear | good melon) = 0.8

Prior probability: P(good melon) = 0.6

Conditional probability: P(texture clear | bad melon) = 0.4

Prior probability: P(bad melon) = 0.4

Based on the values analyzed above, the result of the above equation is (0.8 × 0.6) / (0.8 × 0.6 + 0.4 × 0.4) = 0.48 / 0.64 = 0.75.
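
For readers who want to check the arithmetic, here is a minimal Python sketch that reproduces the 0.75 result by combining the total probability formula from section 2.4 with Bayes' theorem:

```python
# Posterior P(good melon | texture clear) via Bayes' theorem.
p_good, p_bad = 0.6, 0.4        # prior probabilities
p_clear_given_good = 0.8        # P(texture clear | good melon)
p_clear_given_bad = 0.4         # P(texture clear | bad melon)

# Total probability formula: P(texture clear)
p_clear = p_clear_given_good * p_good + p_clear_given_bad * p_bad   # 0.64

# Bayes' theorem
p_good_given_clear = p_clear_given_good * p_good / p_clear
print(p_good_given_clear)  # 0.75
```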

3. Naive Bayes classifier

It is not hard to see that the main difficulty in estimating the posterior probability P(c|x) from Bayes' formula is that the class-conditional probability P(x|c) is a joint probability over all attributes (x here represents multiple attributes), which is difficult to estimate directly from a limited number of training samples. To get around this obstacle, the naive Bayes classifier adopts the "attribute conditional independence assumption": for a known category, all attributes are assumed to be independent of one another. In other words, each attribute is assumed to influence the classification result independently.

Naive Bayes algorithm steps

  1. Let a sample be described by the attribute set x = {x1, x2, …, xn}, where n is the number of attributes and xi is the value of x on the i-th attribute.
  2. The sample is to be assigned to one of the categories in the set C = {y1, y2, …, ym}.
  3. Calculate the posterior probability of each category:

P(yi | x) = P(x | yi) · P(yi) / P(x)

where

P(x | yi) = P(x1 | yi) · P(x2 | yi) · … · P(xn | yi)

(each feature is conditionally independent given the category).

To calculate P(xi | yi): take a training set whose samples already have known categories, and within each category count the frequency of each feature value; these frequencies are the conditional probabilities P(xi | yi) we need.

Note: the denominator P(x) in the above formula is the same for every category, so it can be omitted; for different yi we only need to compare the numerators of P(yi | x).

  1. If P(yi | x) = max{P(y1 | x), P(y2 | x), …, P(ym | x)}, then the sample with attribute set x belongs to category yi. A minimal code sketch of these steps follows.
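
As a rough sketch of these steps (the class name and structure are my own, not from any particular library), the following Python code estimates the priors and per-attribute conditional probabilities by counting, and classifies by comparing the numerators P(yi) · P(x1|yi) ⋯ P(xn|yi):

```python
from collections import Counter, defaultdict

class DiscreteNaiveBayes:
    """Naive Bayes for discrete attributes, estimated by counting frequencies."""

    def fit(self, samples, labels):
        # samples: list of attribute tuples, labels: list of category names
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # feature_counts[class][attribute index][value] -> count
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        for x, y in zip(samples, labels):
            for i, value in enumerate(x):
                self.feature_counts[y][i][value] += 1
        return self

    def predict(self, x):
        best_class, best_score = None, -1.0
        for y, class_count in self.class_counts.items():
            score = class_count / self.total               # prior P(y)
            for i, value in enumerate(x):
                # conditional P(xi | y), estimated as a within-class frequency
                score *= self.feature_counts[y][i][value] / class_count
            if score > best_score:
                best_class, best_score = y, score
        return best_class
```

Note that this sketch uses no smoothing: if some attribute value never appears within a class, the whole product becomes 0 (see the last section).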

4. An example of naive Bayes classification

Let's continue with the example of buying a watermelon. We now have a data set of 10 samples, in which each melon is labeled good or bad based on its texture, color, and knock sound. The data set is as follows: the texture is either clear or fuzzy, the color is either green or black, and the knock sound is muffled, dull, or crisp. Different combinations of feature values correspond to the two categories, good melon or bad melon.

Now I pick a watermelon from the supermarket that has a clear texture, a green color, and a dull knock sound. Using the sample data set and the naive Bayes algorithm, we can work out whether this watermelon is a good melon or a bad one.

(1) First, calculate the good-melon case:

Prior probability: P(good melon) = 6/10 = 0.6

Conditional probability: P(texture clear | good melon) = 4/6 = 2/3

Conditional probability: P(color green | good melon) = 4/6 = 2/3

Conditional probability: P(knock dull | good melon) = 2/6 = 1/3

Calculate the numerator of the posterior probability P(good melon | texture clear, color green, knock dull):

P(good melon) × P(texture clear | good melon) × P(color green | good melon) × P(knock dull | good melon) = 0.6 × (2/3) × (2/3) × (1/3) = 4/45.

(2) Then, calculate the bad-melon case:

Prior probability: P(bad melon) = 4/10 = 0.4

Conditional probability: P(texture clear | bad melon) = 1/4 = 0.25

Conditional probability: P(color green | bad melon) = 1/4 = 0.25

Conditional probability: P(knock dull | bad melon) = 1/4 = 0.25

Calculate the numerator of the posterior probability P(bad melon | texture clear, color green, knock dull):

P(bad melon) × P(texture clear | bad melon) × P(color green | bad melon) × P(knock dull | bad melon) = 0.4 × 0.25 × 0.25 × 0.25 = 1/160.

(3) Compare the posterior probabilities of the good-melon and bad-melon categories:

P(good melon | texture clear, color green, knock dull) > P(bad melon | texture clear, color green, knock dull), since 4/45 > 1/160, so a watermelon with a clear texture, green color, and dull knock is classified as a good melon.
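
As a quick check, this comparison can be reproduced in a few lines of Python using only the probabilities stated above (the normalizing denominator is the same for both classes and is omitted):

```python
# Unnormalized posteriors (numerators) for the picked melon:
# texture clear, color green, knock dull.
num_good = 0.6 * (2/3) * (2/3) * (1/3)   # P(good) * product of conditionals ≈ 0.0889 (4/45)
num_bad = 0.4 * 0.25 * 0.25 * 0.25       # P(bad)  * product of conditionals = 0.00625 (1/160)

print("good melon" if num_good > num_bad else "bad melon")  # good melon
```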

5. Points about naive Bayes that are easily overlooked

  1. As the calculation above shows, estimating each conditional probability P(xi | yi) is the key step. When the attributes take discrete values, naive Bayes can very conveniently estimate P(xi | yi) from the frequency of each attribute value within each category of the training samples. When an attribute takes continuous values, its values are usually assumed to follow a Gaussian (normal) distribution; then it suffices to compute the mean μ and standard deviation σ of that attribute within each category of the training samples and substitute them into the Gaussian density formula P(xi | y) = (1 / (√(2π)·σ)) · exp(−(xi − μ)² / (2σ²)) to obtain the required estimate.
  2. Another issue worth discussing is what to do when P(xi | yi) = 0, which happens when some feature value never appears together with some category in the training set; this can greatly degrade the quality of the classifier. To solve this problem we introduce Laplace smoothing (also called Laplace correction). The idea is very simple: add 1 to the count of every feature value under each category (and enlarge the denominator accordingly). If the training set is large enough, the results are barely affected, and the awkward zero-frequency problem is avoided. A small sketch of this is shown below.
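
Here is a minimal sketch of Laplace smoothing for one discrete attribute; the function name and the tiny list of knock sounds are hypothetical, chosen only to show that a value with zero count no longer receives probability zero:

```python
from collections import Counter

def smoothed_conditional(values_in_class, value, num_distinct_values, alpha=1.0):
    """Estimate P(attribute = value | class) with Laplace smoothing.

    values_in_class:      the attribute's values among training samples of one class
    num_distinct_values:  how many distinct values the attribute can take
    alpha:                smoothing strength (1.0 is the classic Laplace correction)
    """
    counts = Counter(values_in_class)
    return (counts[value] + alpha) / (len(values_in_class) + alpha * num_distinct_values)

# Hypothetical knock sounds of the good melons in a tiny training set:
good_knocks = ["muffled", "dull", "muffled", "dull", "muffled", "dull"]

# "crisp" never occurs with the good-melon class, yet its smoothed estimate is non-zero.
print(smoothed_conditional(good_knocks, "crisp", num_distinct_values=3))    # 1/9 ≈ 0.11
print(smoothed_conditional(good_knocks, "muffled", num_distinct_values=3))  # 4/9 ≈ 0.44
```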