Theoretical basis

Naive Bayes is a classification model in machine learning. This section introduces the theoretical basis of naive Bayes.

Relevant knowledge

The Bayesian school of thought:


Here is an example to give you a simple idea: estimating the probability that a coin comes up heads, given that 100 flips produced 20 heads.

  • Frequentist school: estimate the probability of heads purely from the observed data, i.e. 20/100 = 0.2.
  • Bayesian school:
    • Prior probability: this coin came from a bank, so I believe its probability of coming up heads is 0.5; suppose it has been tossed 1,000 times before, with 500 heads and 500 tails.
    • Data distribution: the result of this round of tosses: 20 heads and 80 tails.
    • Posterior probability: the prior belief is updated with the observed data, giving an estimate of the probability of heads that lies between 0.5 and 0.2 (a worked calculation follows this list).
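
To make the contrast concrete, here is a minimal worked calculation. It assumes the 1,000 historical tosses are encoded as a Beta prior, which is an illustrative choice rather than something fixed by the example:

$$\hat{p}_{\text{frequentist}} = \frac{20}{100} = 0.2$$

$$p \sim \mathrm{Beta}(500, 500) \;\Rightarrow\; p \mid \text{20 heads, 80 tails} \sim \mathrm{Beta}(520, 580), \qquad \mathbb{E}[p \mid \text{data}] = \frac{520}{1100} \approx 0.47$$

The Bayesian estimate is pulled from the prior 0.5 toward the observed 0.2, but only slightly, because the prior was built from ten times as many tosses as the new data.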

What the Bayesian school was criticized for by the frequentist school was precisely this so-called prior probability. Generally speaking, the prior is our historical experience in the domain the data come from, but such experience is often difficult to quantify or model, so Bayesians boldly assume prior distribution models such as the normal distribution or the beta distribution. These assumptions generally have no concrete basis, and the frequentist school has long regarded them as absurd.

Model features

  • Naive Bayes model capability: classification

    Given the features $X$ of a sample, predict the sample's label.
  • Naive Bayes learning method: supervised learning; training samples are required to train the model.
  • Naive Bayes model type: generative model

    The joint distribution $P(X, Y)$ of the features $X$ and the labels $Y$ is computed from the training samples. This joint distribution $P(X, Y)$ can be viewed as the naive Bayes model itself, where each probability is a parameter of the model (see the formula after this list).
  • The defining characteristic of naive Bayes: it is "naive"

    Each feature dimension is assumed to influence the label independently.
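
Concretely, the parameters referred to in the list above are the probabilities in the factorized joint distribution (for $n$ feature dimensions, under the independence assumption):

$$P(X, Y) = P(Y)\,P(X \mid Y) = P(Y)\prod_{d=1}^{n} P(x_d \mid Y)$$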

Mathematical basis

First, the conditional independence formula: if $X$ and $Y$ are independent of each other, then $P(X, Y) = P(X)\,P(Y)$.

According to the conditional probability formula:

$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}, \qquad P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$$

Bayes' formula can be derived:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$

Here $P(Y)$ represents the prior distribution of the category labels, $P(X \mid Y)$ represents the conditional probability of the features $X$ taking their values given the category label $Y$, and $P(X)$ represents the distribution of the features, which is the same under every category.

For a binary classification problem, classification is achieved by comparing $P(Y = 1 \mid X)$ and $P(Y = 0 \mid X)$.
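
Since the denominator $P(X)$ is the same for every label, the comparison is equivalent to the usual decision rule (a standard restatement of the formula above):

$$\hat{y} = \arg\max_{y} P(Y = y \mid X) = \arg\max_{y} P(Y = y)\,P(X \mid Y = y)$$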

Model derivation

Suppose we have the following $m$ samples, each with $n$-dimensional features:

$$\left(x_1^{(1)}, \ldots, x_n^{(1)}, y^{(1)}\right),\; \left(x_1^{(2)}, \ldots, x_n^{(2)}, y^{(2)}\right),\; \ldots,\; \left(x_1^{(m)}, \ldots, x_n^{(m)}, y^{(m)}\right)$$

Training process: establish the joint distribution $P(X, Y)$ from the training samples.

Step 1: From the training samples, the distribution of the labels can be computed. By maximum likelihood estimation, we take the distribution of category labels in the sample as the prior distribution of the labels. (Maximum likelihood estimation, in plain terms, means choosing the parameters that maximize the probability, i.e. the likelihood, of the observed sample.) Suppose the number of training samples with label $Y = 1$ is $m_1$ and the number with label $Y = 0$ is $m_0$; then $P(Y = 1) = m_1 / m$ and $P(Y = 0) = m_0 / m$.

Step 2: For the samples under each label, collect statistics for each feature dimension $x_d$ to obtain the conditional distribution $P(x_d \mid Y)$. For example, among all training samples with label $Y = 1$, consider the $d$-th feature $x_d$ and suppose it takes two values, 0 and 1, with the value 0 appearing $k$ times in total. Then the conditional probability is $P(x_d = 0 \mid Y = 1) = k / m_1$.

Step 3: From the prior distribution of the labels $P(Y)$ and the conditional distribution of each feature dimension given the label $P(x_d \mid Y)$, establish the joint distribution $P(X, Y)$. For example, $P(x_d = 0, Y = 1) = P(x_d = 0 \mid Y = 1)\,P(Y = 1)$. Note: the joint distribution $P(X, Y)$ obtained here can be regarded as the parameters produced by training.
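
The three training steps boil down to counting. Below is a minimal sketch of that counting for discrete features, with no smoothing; the function and variable names are illustrative and are not part of the project code shown later.

```python
from collections import Counter, defaultdict

def nb_train(samples, labels):
    """Estimate P(Y) and P(x_d | Y) by counting."""
    m = len(labels)
    label_counts = Counter(labels)                       # Step 1: count each label
    prior = {y: c / m for y, c in label_counts.items()}  # P(Y) by maximum likelihood

    # Step 2: conditional distributions P(x_d = v | Y = y)
    cond = defaultdict(dict)  # cond[(y, d)][v] = P(x_d = v | Y = y)
    n = len(samples[0])
    for y in label_counts:
        rows = [x for x, lab in zip(samples, labels) if lab == y]
        for d in range(n):
            value_counts = Counter(row[d] for row in rows)
            cond[(y, d)] = {v: c / len(rows) for v, c in value_counts.items()}

    # Step 3: the prior and the conditionals together form the joint distribution,
    # i.e. the parameters of the trained model
    return prior, cond
```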

Prediction process: based on the joint distribution $P(X, Y)$ and the features $X$ of the sample to be predicted, compare the magnitudes of $P(Y = 1 \mid X)$ and $P(Y = 0 \mid X)$.

Step 1: From the value of each feature dimension $x_d$ of the sample to be predicted and the joint distribution, we can compute $P(x_d \mid Y = 1)$ and $P(x_d \mid Y = 0)$.

Step 2: According to the independent influence of the features on the label, we have:

$$P(Y \mid X) \propto P(Y)\,P(X \mid Y) = P(Y)\prod_{d=1}^{n} P(x_d \mid Y)$$

Step 3: Compare $P(Y = 1 \mid X)$ and $P(Y = 0 \mid X)$ and output the label with the larger value.
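
Prediction is then just a product and a comparison. Continuing the sketch above (still without smoothing):

```python
def nb_predict(x, prior, cond):
    """Return the label y that maximizes P(Y = y) * prod_d P(x_d | Y = y)."""
    scores = {}
    for y, p_y in prior.items():
        score = p_y
        for d, value in enumerate(x):
            # A feature value never seen under this label gets probability 0 here;
            # the smoothing discussed later avoids this
            score *= cond[(y, d)].get(value, 0.0)
        scores[y] = score
    return max(scores, key=scores.get)

# Usage: prior, cond = nb_train(X_train, Y_train); y_hat = nb_predict(x_new, prior, cond)
```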

A worked example

I’m going to give you a simple example to help you understand the process.

Suppose we are given information about 10 boys and ask a girl to use this information to decide whether she would consider marrying each of them (the examples below are made up).

| Sample  | Height | Wealth | Looks        | Temperament | Marry or not |
|---------|--------|--------|--------------|-------------|--------------|
| Boy 1   | tall   | rich   | handsome     | gentle      | marry        |
| Boy 2   | tall   | poor   | not handsome | gentle      | not marry    |
| Boy 3   | short  | poor   | handsome     | not gentle  | not marry    |
| Boy 4   | short  | poor   | not handsome | gentle      | not marry    |
| Boy 5   | short  | rich   | handsome     | not gentle  | not marry    |
| Boy 6   | tall   | poor   | handsome     | gentle      | marry        |
| Boy 7   | short  | poor   | handsome     | gentle      | not marry    |
| Boy 8   | short  | rich   | not handsome | gentle      | marry        |
| Boy 9   | tall   | poor   | not handsome | not gentle  | not marry    |
| Boy 10  | tall   | rich   | handsome     | not gentle  | marry        |

Now there is a boy X (a sample) who is tall, rich, not handsome, and gentle (the features). Will this girl choose to marry him (the classification)? This is a classic classification problem. According to Bayes' formula:

$$P(\text{marry} \mid \text{tall, rich, not handsome, gentle}) = \frac{P(\text{marry})\,P(\text{tall, rich, not handsome, gentle} \mid \text{marry})}{P(\text{tall, rich, not handsome, gentle})}$$

Step 1: We need to consider the girl's subjective willingness to marry, that is, the prior distribution of the label.

That is, regardless of the boy's attributes, how willing is this girl to marry in the first place? Based on these 10 samples, the probability that this girl wants to marry is $P(\text{marry}) = 4/10 = 0.4$, and the probability that she does not is $P(\text{not marry}) = 6/10 = 0.6$.

A point to clarify: we do not actually know this prior distribution, nor can we compute it directly (we cannot read the girl's mind). But the observed reality is that, out of 10 boys, the girl would marry 4. So we assume a prior distribution that maximizes the probability of this observed outcome, and that gives us the prior.
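
In other words, if the girl marries with some unknown probability $p$, the likelihood of observing 4 "marry" decisions out of 10 is maximized exactly at $p = 0.4$:

$$\hat{p} = \arg\max_{p}\; p^{4}(1 - p)^{6} = \frac{4}{10} = 0.4$$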

Consider an extreme case: if the girl chooses not to marry any of the 10 boys, we can infer that at this stage she may simply not want to marry at all and will not consider anyone, even a candidate far outside the training data such as Principal Wang. That is, even if the sample is Principal Wang, naive Bayes would still predict that the girl chooses not to marry.

Step 2: Knowing the girl's general willingness to marry is not enough; we also need to know how much boy X's being tall, rich, not handsome, and gentle influences the marriage outcome, that is, the conditional probability.

Here the "naivety" of naive Bayes comes in. Naive Bayes assumes that the influences of height, wealth, looks, and temperament (the features) on the marriage decision (the label) are independent of one another, i.e.

$$P(\text{tall, rich, not handsome, gentle} \mid \text{marry}) = P(\text{tall} \mid \text{marry})\,P(\text{rich} \mid \text{marry})\,P(\text{not handsome} \mid \text{marry})\,P(\text{gentle} \mid \text{marry})$$

A caveat: why assume that the features influence the label independently? If they are not independent, the feature space must be modelled as a whole: with three binary features that is 2 × 2 × 2 = 8 combinations; if they are independent, we only need three one-dimensional spaces, 2 + 2 + 2 = 6 values. A product versus a sum. In real problems there are often many features, and each feature dimension can take many values, so this "naivety" greatly reduces the amount of computation at the cost of some accuracy.
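
More generally, for $n$ binary features the number of probabilities to estimate per label is

$$\underbrace{2^{n}}_{\text{modelling the features jointly}} \qquad \text{versus} \qquad \underbrace{2n}_{\text{modelling them independently}}$$

which is why the naive assumption keeps the model tractable as $n$ grows.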

We rearrange the samples and look only at the four boys the girl would marry:

| Sample  | Height | Wealth | Looks        | Temperament | Marry or not |
|---------|--------|--------|--------------|-------------|--------------|
| Boy 1   | tall   | rich   | handsome     | gentle      | marry        |
| Boy 6   | tall   | poor   | handsome     | gentle      | marry        |
| Boy 8   | short  | rich   | not handsome | gentle      | marry        |
| Boy 10  | tall   | rich   | handsome     | not gentle  | marry        |

Among all the boys the girl would marry, tall boys account for 3 out of 4, so $P(\text{tall} \mid \text{marry}) = 3/4$. In the same way we get $P(\text{rich} \mid \text{marry}) = 3/4$, $P(\text{not handsome} \mid \text{marry}) = 1/4$, and $P(\text{gentle} \mid \text{marry}) = 3/4$.

For the same sample, the denominator $P(\text{tall, rich, not handsome, gentle})$ is the same for both labels, so we only need to compare the numerators of Bayes' formula, $P(\text{marry})\,P(\text{tall, rich, not handsome, gentle} \mid \text{marry})$ and $P(\text{not marry})\,P(\text{tall, rich, not handsome, gentle} \mid \text{not marry})$. We can calculate:

$$P(\text{marry})\,P(\text{tall, rich, not handsome, gentle} \mid \text{marry}) = \frac{4}{10} \times \frac{3}{4} \times \frac{3}{4} \times \frac{1}{4} \times \frac{3}{4} \approx 0.042$$

$$P(\text{not marry})\,P(\text{tall, rich, not handsome, gentle} \mid \text{not marry}) = \frac{6}{10} \times \frac{2}{6} \times \frac{1}{6} \times \frac{3}{6} \times \frac{3}{6} \approx 0.008$$

We can get: $0.042 > 0.008$.

The naive Bayes model can therefore tell the girl: for boy X, marry!
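
For completeness, here is a minimal sketch that reproduces the calculation above directly from the 10-sample table; the string encodings and variable names are illustrative.

```python
from collections import Counter

# The 10 samples from the table: (height, wealth, looks, temperament) -> decision
data = [
    (("tall",  "rich", "handsome",     "gentle"),     "marry"),
    (("tall",  "poor", "not handsome", "gentle"),     "not marry"),
    (("short", "poor", "handsome",     "not gentle"), "not marry"),
    (("short", "poor", "not handsome", "gentle"),     "not marry"),
    (("short", "rich", "handsome",     "not gentle"), "not marry"),
    (("tall",  "poor", "handsome",     "gentle"),     "marry"),
    (("short", "poor", "handsome",     "gentle"),     "not marry"),
    (("short", "rich", "not handsome", "gentle"),     "marry"),
    (("tall",  "poor", "not handsome", "not gentle"), "not marry"),
    (("tall",  "rich", "handsome",     "not gentle"), "marry"),
]

boy_x = ("tall", "rich", "not handsome", "gentle")  # the sample to classify

label_counts = Counter(label for _, label in data)
scores = {}
for label, count in label_counts.items():
    rows = [features for features, lab in data if lab == label]
    score = count / len(data)  # prior P(label)
    for d, value in enumerate(boy_x):
        # conditional P(x_d = value | label), estimated by counting
        score *= sum(1 for row in rows if row[d] == value) / len(rows)
    scores[label] = score

print(scores)                       # {'marry': 0.0421875, 'not marry': 0.00833...}
print(max(scores, key=scores.get))  # marry
```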

Advanced optimization

Since the features are assumed to be mutually independent in the naive Bayes model, the joint probability can be computed as a product of the probabilities over each feature dimension. The fatal drawback of such a product is that as soon as one factor is zero, the whole product is zero.

When there are many features, the data easily become sparse: some feature value may never appear under a given label, so $P(x_d \mid Y) = 0$, which drives the whole product, and hence the final result, to 0. This is clearly unreasonable. To avoid it, a smoothing strategy can be adopted.

The most common smoothing strategy is Laplace smoothing: when estimating the probabilities by counting, we pretend by default that every value of every feature appears at least once under each category. If the training set is large enough, this barely affects the results, and it removes the embarrassing zero-probability situation described above.
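
With add-one (Laplace) smoothing, the conditional probability used in the implementation below becomes the following, where $N_y$ is the number of samples with label $y$, $N_{y,\,d=v}$ is the number of those whose $d$-th feature equals $v$, and $K_d$ is the number of distinct values of feature $d$ (this notation is ours, introduced for illustration):

$$P(x_d = v \mid Y = y) = \frac{N_{y,\,d=v} + 1}{N_y + K_d}$$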

Advantages and disadvantages

Advantages

  • Simple idea.
  • Low space complexity. (The joint probability $P(X, Y)$ can be stored as a two-dimensional table: one row per category and one column per feature dimension.)
  • Low time complexity. (Predictions can be computed directly by looking up the joint distribution $P(X, Y)$.)

Disadvantages

  • It is assumed that each feature dimension influences the result independently, which rarely holds in reality. (For example, being rich has an effect on being handsome: it is easier for the rich to look handsome.) When the interactions between features are strong, naive Bayes does not perform well.

Project implementation

NB_classify is the naive Bayes model class in the code; the full code is as follows.

Code

# -*- coding: utf-8 -*-
""" Created on Thu Aug 23 19:52:02 2018 @author: huangzhaolong """
# Utilize sklearn's own iris data set
from sklearn.datasets import load_iris
import random

def load_data():
    # Load iris data set
    iris = load_iris()
    X = iris.data
    Y = iris.target
    # Feature discretization: round each feature value to the nearest integer
    X = X.round()
    # Randomly assign roughly 80% of the samples to the training set
    data_num = X.shape[0]
    index_list = [random.random() >= 0.2 for _ in range(data_num)]
    train_index_list = []
    test_index_list = []
    for i in range(len(index_list)):
        if index_list[i]:
            train_index_list.append(i)
        else:
            test_index_list.append(i)
    X_train = X[train_index_list]
    X_test = X[test_index_list]
    Y_train = Y[train_index_list]
    Y_test = Y[test_index_list]
    return X_train, X_test, Y_train, Y_test

class NB_classify(object):
    
    def NB_train(self, X_train, Y_train):
        '''Given the known X and Y, construct the joint distribution P(X, Y); training amounts to
        computing these statistics. According to Bayes' formula, P(X, Y) = P(X | Y) * P(Y).'''
        p_xy = []
        # For each category
        p_y = {}
        for k in set(Y_train):
            total_xi_y = list(Y_train).count(k)
            print("For a category of",k,"Sample common",total_xi_y,"个")
            # Compute P(Y)
            p_y.setdefault(k,list(Y_train).count(k)/Y_train.shape[0])
            p_x_y_sub = []
            # For each feature dimension, compute P(x | Y)
            for i in range(X_train.shape[1]):
                # Values of the i-th feature among the samples of category k
                feature_col = list(X_train[Y_train == k, i])
                p_x_sub_y_sub = {}
                print("Category", k, ": feature dimension", i, "has", len(set(feature_col)), "distinct discrete values")
                # For each value of this feature, estimate p(x = xi | Y) with Laplace smoothing
                for j in set(feature_col):
                    # Count how many samples of this category take this feature value
                    print("Category", k, ": feature dimension", i, "equals", j, "in", feature_col.count(j), "samples")
                    # Store p(y) * p(x | y), i.e. the (smoothed) joint probability p(x, y)
                    p_x_sub_y_sub.setdefault(j, p_y[k] * (feature_col.count(j) + 1) / (total_xi_y + len(set(feature_col))))
                p_x_y_sub.append(p_x_sub_y_sub)
            '''The model parameters P(X | Y) form a matrix with M rows and n columns, where M is the
            number of categories and n is the feature dimension. Each element M_ij is a dictionary:
            its keys are the values taken by the j-th feature, and its values are the smoothed
            probabilities P(x_j = k_j, Y = i) = P(Y = i) * P(x_j = k_j | Y = i), where the conditional
            part is estimated as (samples of category i with j-th feature equal to k_j) / (samples of category i).'''
            p_xy.append(p_x_y_sub) 
        print("Prior probability of P(Y) : \n",p_y)
        print("Joint probability of P(X,Y) : \n",p_xy)
        self.p_xy = p_xy
        
    def NB_test(self, X_test):
        p_xy = self.p_xy
        pred = []
        # for each sample
        for z in range(len(X_test)): 
            # For each category
            pj = [1]*3
            for j in range(3):
                # For each feature, look up the joint probability of its value under category j
                for i in range(len(X_test[z])):
                    if X_test[z][i] in p_xy[j][i]:
                        p_xi_yj = p_xy[j][i][X_test[z][i]]
                    else:
                        # If this feature value never appeared in the training set for this category,
                        # the constructed joint distribution P(X, Y) is not accurate here; a generative
                        # model needs a sufficiently large and complete training set.
                        # Fall back to a tiny probability instead of zero.
                        p_xi_yj = 0.0000001
                    # Features are assumed independent given the category, so the probabilities multiply
                    pj[j] = pj[j]*p_xi_yj
            # Pick the category with the highest score, i.e. argmax_y P(y | X)
            pred.append(pj.index(max(pj)))
        
        return pred
        
def cal_accuracy(pred, Y_test):
    acc = list(pred-Y_test).count(0)/len(pred-Y_test)
        
    return acc
    
if __name__=="__main__":
    acc = 0
    
    for i in range(1):
        X_train, X_test, Y_train, Y_test = load_data()
        
        nb = NB_classify()
        nb.NB_train(X_train, Y_train)
        pred = nb.NB_test(X_test)
        acc += cal_accuracy(pred,Y_test)
        
    acc = acc/1
    
    print("acc:",acc)
        

Results

During training, NB_train prints, for each category and each feature dimension, the number of samples and the number of distinct discrete feature values (42 samples of category 0, 40 of category 1, 41 of category 2; the full per-dimension log is omitted here). It then prints the model parameters, and the test run prints the accuracy:

Prior probability of P(Y) :
 {0: 0.34146341463414637, 1: 0.3252032520325203, 2: 0.3333333333333333}
Joint probability of P(X,Y) :
 [[{4.0: 0.03794037940379404, 5.0: 0.2655826558265583, 6.0: 0.03794037940379404},
   {3.0: 0.2017738359201774, 4.0: 0.13968957871396895},
   {1.0: 0.13968957871396895, 2.0: 0.2017738359201774},
   {0.0: 0.3259423503325942, 1.0: 0.015521064301552107}],
  [{5.0: 0.0453..., 6.0: 0.2117..., 7.0: 0.0680...}, ...],
  [{8.0: 0.0444..., 5.0: 0.0148..., 6.0: 0.1777..., 7.0: 0.0962...}, ...]]
acc: 0.9629629629629629