The Naive Bayes algorithm is a classification method based on probability and statistics; it relies mainly on Bayes' theorem and the assumption of conditional independence between features. Naive Bayes has a long history, its logic is relatively simple, and it is robust in practice; it is commonly used for text classification, credit evaluation, and similar tasks.

1. Naive Bayes

1.1 Bayes’ formula

Given a sample, we use X to represent the features of the sample and Y to represent the category of the sample. We can first write down the joint probability P(X, Y) of X and Y:

P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)

From the joint probability P(X, Y), Bayes' formula can be derived:

P(Y | X) = P(X | Y) P(Y) / P(X)

P(Y) is called the prior probability, and P(Y | X) is called the posterior probability.

For a binary classification problem, if P(Y = 0 | X) > P(Y = 1 | X), we judge the class label of sample X to be Y = 0. Because P(X) is the same for a given X regardless of the class, we only need to compare P(X | Y = 0) P(Y = 0) with P(X | Y = 1) P(Y = 1) to know which class X belongs to.

P(Y = 0) and P(Y = 1) can be calculated by the following formulas, where D represents the entire dataset, D0 represents all data belonging to class 0, and D1 represents all data belonging to class 1:

P(Y = 0) = |D0| / |D|,  P(Y = 1) = |D1| / |D|

P(X | Y = 0) and P(X | Y = 1), however, are generally not easy to calculate directly. Naive Bayes therefore adopts the conditional independence assumption, which makes P(X | Y) much easier to compute and also gives better robustness.
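As a minimal sketch of estimating the priors by counting, assuming a small labeled toy dataset (the labels below are made up for illustration):

```python
from collections import Counter

# Hypothetical toy class labels: 0 for class 0, 1 for class 1
labels = [0, 0, 1, 0, 1, 1, 0]

counts = Counter(labels)       # |D0| and |D1|
total = len(labels)            # |D|

prior_y0 = counts[0] / total   # P(Y = 0) = |D0| / |D|
prior_y1 = counts[1] / total   # P(Y = 1) = |D1| / |D|

print(prior_y0, prior_y1)      # 4/7 and 3/7
```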

1.2 Conditional independence hypothesis

Assuming that the feature vector X has N dimensions, X = [X1, X2, ..., XN], Bayes' formula can be written as:

P(Y | X1, X2, ..., XN) = P(X1, X2, ..., XN | Y) P(Y) / P(X1, X2, ..., XN)

P(X | Y = 0) and P(X | Y = 1) can then be represented as:

P(X | Y = 0) = P(X1 | Y = 0) P(X2 | Y = 0) ... P(XN | Y = 0)
P(X | Y = 1) = P(X1 | Y = 1) P(X2 | Y = 1) ... P(XN | Y = 1)

Naive Bayes obtains this factorization from the conditional independence assumption, which says that the features of sample X are independent of each other given the class Y.

The probability P(Xi = x | Y = y) can be estimated in different ways for different data and tasks. A common way for discrete features is the frequency estimate:

P(Xi = x | Y = y) = |D(Xi = x, Y = y)| / |D(Y = y)|

Here D(Xi = x, Y = y) denotes the subset of the dataset whose class label equals y and whose ith attribute value equals x, and |D| denotes the number of samples in a set D. Through the above formulas we can compute P(Y) and P(X | Y), compare the resulting P(Y | X) values, and find which class the sample belongs to.
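A minimal sketch of this counting-based estimation and prediction, assuming a small toy dataset with discrete features (all values below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical toy dataset: each row is a list of discrete feature values, with a 0/1 label
X = [["sunny", "hot"], ["sunny", "cool"], ["rainy", "cool"], ["rainy", "hot"]]
y = [1, 0, 0, 1]

def train(X, y):
    prior = defaultdict(int)   # counts |D(Y = c)|
    cond = defaultdict(int)    # counts |D(Xi = x, Y = c)|
    for features, label in zip(X, y):
        prior[label] += 1
        for i, value in enumerate(features):
            cond[(i, value, label)] += 1
    return prior, cond, len(y)

def predict(features, prior, cond, n):
    best_class, best_score = None, -1.0
    for c, count_c in prior.items():
        score = count_c / n                             # P(Y = c)
        for i, value in enumerate(features):
            score *= cond[(i, value, c)] / count_c      # P(Xi = x | Y = c), 0 if unseen (see smoothing below)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

prior, cond, n = train(X, y)
print(predict(["sunny", "hot"], prior, cond, n))
```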

2. Naive Bayes optimization techniques

2.1 Use logarithmic addition instead of probability multiplication

Under the conditional independence assumption, P(X | Y) is computed as the product of per-feature probabilities. Since each probability is less than 1, multiplying many of them together can produce an extremely small number (numerical underflow), and multiplication is also relatively slow, so logarithms are usually taken and added instead.

Computing logarithms also has a relatively high time cost, so the values of log P corresponding to the probability values P are usually calculated in advance and saved; in subsequent calculations, log P is read directly from the saved table.
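A small sketch of the idea, using made-up probability values: the product of probabilities becomes a sum of cached logarithms.

```python
import math

# Hypothetical per-feature probabilities P(Xi | Y) for one class, plus the prior P(Y)
probs = [0.2, 0.05, 0.7, 0.01]
prior = 0.4

# Precompute log values once and reuse them (here a simple dict keyed by probability)
log_table = {p: math.log(p) for p in set(probs) | {prior}}

# Class score: log P(Y) + sum of log P(Xi | Y), read from the saved table
score = log_table[prior] + sum(log_table[p] for p in probs)
print(score)
```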

2.2 Conversion to feature weight addition

We need to compare P(Y = 0 | X) and P(Y = 1 | X), which is equivalent to comparing log P(Y = 0 | X) and log P(Y = 1 | X). If log P(Y = 0 | X) - log P(Y = 1 | X) > 0, then X is more likely to belong to class Y = 0.

This can eventually be converted to:

log P(Y = 0) - log P(Y = 1) + [ log P(X1 | Y = 0) - log P(X1 | Y = 1) ] + ... + [ log P(XN | Y = 0) - log P(XN | Y = 1) ] > 0

For the ith feature, we can precompute the weight log P(Xi = xi | Y = 0) - log P(Xi = xi | Y = 1) of every possible value xi and store it in a hash table, so that prediction only needs table lookups and additions, which speeds up the algorithm.
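A rough sketch, assuming the conditional probabilities have already been estimated (the values below are made up): the per-value weights go into a hash table, and classification reduces to lookups and additions.

```python
import math

# Hypothetical conditional probabilities for each (feature index, value), and priors
p_given_y0 = {(0, "sunny"): 0.6, (0, "rainy"): 0.4, (1, "hot"): 0.5, (1, "cool"): 0.5}
p_given_y1 = {(0, "sunny"): 0.3, (0, "rainy"): 0.7, (1, "hot"): 0.2, (1, "cool"): 0.8}
prior_y0, prior_y1 = 0.5, 0.5

# Hash table of per-value weights: log P(Xi = x | Y = 0) - log P(Xi = x | Y = 1)
weights = {key: math.log(p_given_y0[key]) - math.log(p_given_y1[key]) for key in p_given_y0}
bias = math.log(prior_y0) - math.log(prior_y1)

def classify(features):
    # Prediction is just lookups and additions; a positive score means class Y = 0
    score = bias + sum(weights[(i, value)] for i, value in enumerate(features))
    return 0 if score > 0 else 1

print(classify(["sunny", "hot"]))
```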

2.3 Smoothing

The original way to calculate P(Xi = x | Y = y) is:

P(Xi = x | Y = y) = |D(Xi = x, Y = y)| / |D(Y = y)|

With this formula, P(Xi = x | Y = y) can easily be 0 when a feature value never co-occurs with a class in the training set, which forces the posterior probability P(Y | X) to 0. Therefore, in practical use, the estimate is usually smoothed.

Assuming that the ith feature Xi has k possible values, the following (Laplace) smoothing formula can be used:

P(Xi = x | Y = y) = ( |D(Xi = x, Y = y)| + 1 ) / ( |D(Y = y)| + k )
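A one-line sketch of this smoothing (the counts below are made up):

```python
def smoothed_prob(count_xy, count_y, k):
    # Laplace smoothing: add 1 to the joint count and k (the number of possible
    # values of feature Xi) to the class count, so the estimate is never 0
    return (count_xy + 1) / (count_y + k)

# Hypothetical counts: this value never appeared with this class in training
print(smoothed_prob(count_xy=0, count_y=10, k=3))   # 1/13 instead of 0
```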

3. Text classification

We take spam classification as an example to understand the application of naive Bayes in text classification.

3.1 Spam

Our sample is an email: the feature X of the sample is the content of the email, and the class Y indicates whether the email is spam. X = content, e.g. "the last stock earned, add WeChat to get more stock information"; Y = category, e.g. "spam" or "normal mail".

So given an email "the last stock earned, add WeChat to get more stock information", the probability that this email is spam can be calculated according to Bayes' formula.

3.2 Word segmentation and the elimination of stop words

Directly estimating P("the last stock earned, add WeChat to get more stock information" | "spam") from the dataset is not practical, because the content of every email is different: if the next email reads "the last stock earned, add QQ to get more stock information" instead, its estimated probability would be 0.

Therefore, word segmentation is usually performed and stop words are removed. Stop words are words that do not help with classification, usually very neutral ones such as "and" or "I".

After word segmentation: "last time", "stock", "earned", "add", "WeChat", "get", "more", "stock", "information". After removing stop words: "last time", "stock", "earned", "WeChat", "get", "more", "stock", "information".
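A minimal sketch of this preprocessing step, with a made-up stop-word list and simple whitespace tokenization (real Chinese text would need a dedicated word segmenter):

```python
# Hypothetical stop-word list; whitespace tokenization stands in for real word segmentation
stop_words = {"and", "i", "to", "the", "add"}

def tokenize(text):
    return text.lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in stop_words]

email = "the last stock earned add WeChat to get more stock information"
tokens = remove_stop_words(tokenize(email))
print(tokens)   # ['last', 'stock', 'earned', 'wechat', 'get', 'more', 'stock', 'information']
```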

So the final P("the last stock earned, add WeChat to get more stock information" | "spam") can be converted to:

P("last time", "stock", "earned", "WeChat", "get", "more", "stock", "information" | "spam")

According to the conditional independence assumption, it finally becomes:

P("last time" | "spam") P("stock" | "spam") P("earned" | "spam") P("WeChat" | "spam") P("get" | "spam") P("more" | "spam") P("stock" | "spam") P("information" | "spam")

After the above steps, we can calculate P(Y = "spam" | X) and P(Y = "normal mail" | X), and finally judge whether X is spam.
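Putting the pieces together, a rough sketch of the final scoring step; the word probabilities and priors below are made-up values standing in for the smoothed training estimates:

```python
import math

# Made-up smoothed word probabilities P(word | class) and class priors, for illustration only
p_word = {
    "spam":   {"last": 0.01, "stock": 0.05, "earned": 0.02, "wechat": 0.03,
               "get": 0.02, "more": 0.02, "information": 0.02},
    "normal": {"last": 0.01, "stock": 0.001, "earned": 0.002, "wechat": 0.001,
               "get": 0.01, "more": 0.01, "information": 0.005},
}
prior = {"spam": 0.3, "normal": 0.7}

def log_posterior(tokens, label):
    # log P(Y) + sum of log P(word | Y); unknown words are simply skipped in this sketch
    return math.log(prior[label]) + sum(
        math.log(p_word[label][t]) for t in tokens if t in p_word[label]
    )

tokens = ["last", "stock", "earned", "wechat", "get", "more", "stock", "information"]
scores = {label: log_posterior(tokens, label) for label in prior}
print(max(scores, key=scores.get))   # predicted class
```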

4. Summary

Naive Bayes is a relatively simple algorithm with good robustness.

Naive Bayes mainly relies on Bayes' formula and the conditional independence assumption.

A naive Bayes text classifier is essentially a bag-of-words model: the text is treated as a bag containing many words, and word order is ignored. So the sentences "I like to hear Jacky Cheung sing" and "Jacky Cheung likes to hear me sing" are indistinguishable to naive Bayes.

5. References

Statistical Learning Methods by Li Hang