Naive Bayes is one of the most widely used classification methods. It is based on probability theory, Bayes' theorem, and the assumption that features are conditionally independent.

The principle of Naive Bayes

Naive Bayes is a classification method built on Bayes' theorem together with the assumption that features are conditionally independent. Given the features of a sample, it computes the probability of each class and assigns the sample to the class with the highest probability. It is a machine learning classification method grounded in probability theory; because the classification targets are known in advance, it belongs to supervised learning.

Probability measures how likely an event is to happen. Probability theory and statistics work in opposite directions: statistics takes a sample and estimates properties of the whole from that part, whereas probability theory uses known data to predict unknown events. For example, if we see a weather pattern (F) with overcast clouds, thunder and lightning, and gusts of wind, we infer that the probability of rain is greater than the probability of no rain, that is, P(rain) > P(no rain), and therefore conclude that rain is coming. This is an empirical judgment of probability. The weather bureau, on the other hand, has accumulated data over many years and computes, say, P(rain) = 85% and P(no rain) = 15% for today; again P(rain) > P(no rain), so the forecast calls for rain. Here the rain event is judged by calculating the probabilities in a well-defined way.

Why it is called Naive Bayes: it is simple and easy to apply, and it rests on the assumption of feature independence, that is, the features are assumed to be independent of each other and not to influence one another.

Conditional probability

Conditional probability is the probability that one event occurs given that another event has already occurred. The formula is P(A|B) = P(A∩B) / P(B). An intuitive way to see it: draw a Venn diagram with two overlapping circles A and B; since B has occurred, B becomes the new sample space, and the probability that A occurs given B is the part of A inside B divided by B, i.e. P(A∩B) / P(B).

Multiplication rule: P(A∩B) = P(A)P(B|A) = P(B)P(A|B). For independent events: P(A∩B) = P(A)P(B).
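As a quick sanity check of these formulas, here is a minimal sketch using a made-up die-rolling example (not from the original article):

# Hypothetical example: one roll of a fair six-sided die.
# A = "the roll is even", B = "the roll is greater than 3".
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {4, 5, 6}

p_B = len(B) / len(outcomes)            # P(B) = 3/6
p_A_and_B = len(A & B) / len(outcomes)  # P(A∩B) = 2/6
p_A_given_B = p_A_and_B / p_B           # P(A|B) = P(A∩B)/P(B) = 2/3
print(p_A_given_B)

# Multiplication rule: P(A∩B) = P(B) * P(A|B)
print(abs(p_A_and_B - p_B * p_A_given_B) < 1e-12)  # True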

Bayes’ theorem

If the events B1, B2, …, Bk partition the sample space, so that P(B1) + P(B2) + ⋯ + P(Bk) = 1, then for any event A Bayes' theorem gives P(Bi|A) = P(A|Bi)P(Bi) / (P(A|B1)P(B1) + P(A|B2)P(B2) + ⋯ + P(A|Bk)P(Bk)). In other words, the posterior probability of Bi is the likelihood times the prior, divided by the total probability of A.
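As an illustration, here is a minimal sketch that applies Bayes' theorem to the weather pattern F from the earlier example; the numbers are hypothetical, not from the weather bureau:

# Hypothetical priors and likelihoods for the weather example.
p_rain = 0.3              # P(rain), prior
p_no_rain = 0.7           # P(no rain), prior; the two priors sum to 1
p_F_given_rain = 0.9      # P(F | rain): pattern F is common before rain
p_F_given_no_rain = 0.2   # P(F | no rain)

# Law of total probability for the denominator P(F)
p_F = p_F_given_rain * p_rain + p_F_given_no_rain * p_no_rain

# Bayes' theorem: P(rain | F) = P(F | rain) * P(rain) / P(F)
p_rain_given_F = p_F_given_rain * p_rain / p_F
print(p_rain_given_F)     # about 0.66, so rain is more likely than not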

Classification principle

Based on probability theory, a binary classification problem is handled as follows: if P1 > P2, classify into category 1; otherwise, classify into category 2.

Next, by Bayes' theorem, p(ci|x, y) = p(x, y|ci) * p(ci) / p(x, y), where x and y are the feature variables (the words in the example below) and ci stands for a category. p(ci|x, y) is the probability of belonging to category ci given the features x and y. Combining this with the rule above: if p(ci|x, y) > p(cj|x, y), classify into category i; otherwise classify into category j.

The main benefit of Bayes' theorem is that the unknown probability can be computed from three known probabilities. Moreover, if we only need to compare p(ci|x, y) with p(cj|x, y), the denominator p(x, y) is the same for both, so it suffices to compare p(x, y|ci) * p(ci) with p(x, y|cj) * p(cj).
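Since only the ordering matters, the decision reduces to picking the class with the largest p(x, y|ci) * p(ci); a minimal sketch with hypothetical numbers:

# Hypothetical likelihoods p(x, y|ci) and priors p(ci) for two classes.
likelihood = {'c1': 0.05, 'c2': 0.02}
prior = {'c1': 0.4, 'c2': 0.6}

# The shared denominator p(x, y) cancels out, so compare the products directly.
scores = {c: likelihood[c] * prior[c] for c in likelihood}
print(scores)                       # about {'c1': 0.02, 'c2': 0.012}
print(max(scores, key=scores.get))  # c1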

Feature conditional independence assumption

Naive Bayes is commonly used to classify documents: the category of an article is determined from the words that appear in it. The feature vector w has one entry per word, so its length equals the size of the vocabulary built from the training set. The Bayesian formula can then be written as p(ci|w) = p(w|ci) * p(ci) / p(w). Since the occurrences of individual words are assumed not to influence each other, p(w|ci) = p(w0|ci) * p(w1|ci) * … * p(wk|ci).
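The factorization can be sketched directly; the three-word vocabulary and the per-word probabilities below are made up for illustration:

import numpy as np

# Hypothetical per-word probabilities p(wk|ci) for a 3-word vocabulary.
p_w_given_ci = np.array([0.1, 0.3, 0.05])
w = np.array([1, 0, 1])  # set-of-words vector: words 0 and 2 appear in the document

# Conditional independence: p(w|ci) is the product of the probabilities
# of the words that appear (the same simplification used by the classifier below).
p_doc_given_ci = np.prod(p_w_given_ci[w == 1])
print(p_doc_given_ci)    # 0.1 * 0.05 = 0.005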

Algorithm implementation

import numpy as np
np.seterr(divide='ignore', invalid='ignore')  # Remove the warning of dividing by 0 in vectors
# Fetch data
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1] #1 means insulting and 0 means normal
    return postingList, classVec

Construct word vector based on document vocabulary:

def createVocabList(dataSet):
    vocabSet = set([])  # start with an empty vocabulary set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with the words of each document
    return list(vocabSet)

# Construct word vectors for the input vocabulary
def setOfWords2Vec(vocabList, inputSet):
    returnVec = np.zeros(len(vocabList))  # Generate an array of zeros
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1 # If there is a word, fill the position with 1
        else:
            print("the word: %s is not in my Vocabulary" % word)
            # pass
    return returnVec  # return the vector 0,1

if __name__ == '__main__':
    listPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listPosts)
    print(myVocabList)
 

The following output is displayed: ['flea', 'ate', 'how', 'licks', 'quit', 'problems', 'dog', 'I', 'garbage', 'help', 'is', 'cute', 'steak', 'to', 'worthless', 'please', 'has', 'posting', 'buying', 'love', 'food', 'so', 'my', 'take', 'dalmation', 'stop', 'park', 'not', 'stupid', 'him', 'mr', 'maybe']. A word vector for a document then looks like [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. …].

As noted above, this method only records whether each word appears, not how often; this is called the set-of-words model. If the frequency of each word is recorded instead, the construction is called the bag-of-words model, shown below. This article uses only the set-of-words model.

# Word bag model
def bagofWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec  # return a word vector of non-negative integer counts
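To see the difference between the two models, a quick usage sketch with a made-up input containing a repeated word:

if __name__ == '__main__':
    listPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listPosts)
    sample = ['stupid', 'stupid', 'dog']          # hypothetical input with a repeated word
    print(setOfWords2Vec(myVocabList, sample))    # the 'stupid' position is 1 (presence only)
    print(bagofWords2VecMN(myVocabList, sample))  # the 'stupid' position is 2 (count)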

Use word vector to calculate probability:

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)  # Number of documents
    numWord = len(trainMatrix[0])  # number of vocabularies
    print(numTrainDocs, numWord)
    pAbusive = sum(trainCategory) / len(trainCategory) #p1, probability of insulting comments [0, 1, 0, 1, 0, 1]
    p0Num = np.zeros(numWord)
    p1Num = np.zeros(numWord)

    p0Denom = 0
    p1Denom = 0

    for i in range(numTrainDocs):
        if trainCategory[i] == 0:
            p0Num += trainMatrix[i]  # vector addition: accumulate word counts for class 0
            p0Denom += sum(trainMatrix[i])  # add up the total number of words in the document
        else:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
    p0Vec = p0Num / p0Denom
    p1Vec = p1Num / p1Denom

    return p0Vec, p1Vec, pAbusive

if __name__ == '__main__':
    listPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listPosts)
    trainMat = []
    for postinDoc in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    print(trainMat)
    p0Vec, p1Vec, pAbusive = trainNB0(trainMat, listClasses)
    print(p0Vec, p1Vec, pAbusive)

trainMat holds the set-of-words vectors of the six training documents, i.e. which vocabulary words occur in each document:

array([ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]), array([ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.]), array([ 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1.]), array([ 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]), array([ 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1.]), array([ 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.])

print(p0Vec, p1Vec, pAbusive): pAbusive is the prior probability that a document is insulting (class 1), while p0Vec gives the probability of each vocabulary word appearing in class 0 (non-insulting) documents:

[0. 0.04166667 0.04166667 0.04166667 0.04166667 0.04166667 0. 0.08333333 0.04166667 0. 0.04166667 0. 0.04166667 0.04166667 0.04166667 0.04166667 0.04166667 0. 0.04166667 0.04166667 0.04166667 0. 0.125 0. 0.04166667 0.04166667 0.04166667]

Algorithm improvement:

  1. Some of the conditional probabilities are 0, and since the independent feature probabilities are multiplied together, the whole product becomes 0. Therefore the count of every word is initialized to 1 and the denominators are initialized to 2 (Laplace smoothing).
  2. Because the calculated probabilities are very small, multiplying many of them together can underflow to 0. Taking the logarithm preserves monotonicity, so it does not affect the comparison of the results. The improved function is as follows:
def trainNB1(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)  # Number of documents
    numWord = len(trainMatrix[0])  # number of vocabularies
    pAbusive = sum(trainCategory) / len(trainCategory) #p1, the probability of an abusive comment
    p0Num = np.ones(numWord)  # initialize word counts to 1 (Laplace smoothing)
    p1Num = np.ones(numWord)

    p0Denom = 2  # initialize denominators to 2
    p1Denom = 2

    for i in range(numTrainDocs):
        if trainCategory[i] == 0:
            p0Num += trainMatrix[i]  # vector addition: accumulate word counts for class 0
            p0Denom += sum(trainMatrix[i])  # add up the total number of words in the document
        else:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
    p0Vec = np.log(p0Num / p0Denom)  # take the logarithm
    p1Vec = np.log(p1Num / p1Denom)

    return p0Vec, p1Vec, pAbusive

Note: the entries of p0Vec are now log probabilities (negative numbers), which may look unusual, but this has no effect on the final probability comparison.
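A tiny sketch (independent of the article's data) showing that taking logarithms keeps the ordering of products while avoiding underflow:

import numpy as np

a = np.array([0.01, 0.02, 0.005])  # hypothetical small probabilities for class 0
b = np.array([0.03, 0.01, 0.002])  # hypothetical small probabilities for class 1

# The product comparison and the log-sum comparison give the same answer,
# but the log form does not underflow when there are many factors.
print(np.prod(a) > np.prod(b))                # True
print(np.sum(np.log(a)) > np.sum(np.log(b)))  # True (same ordering)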

Classify documents with the classifier function

def classifyNB(vec2Classify, p0Vc,  p1Vc, pClass1):
    # p0Vc and p1Vc are log probabilities (from trainNB1), so the sum over the
    # appearing words plus the log of the class prior gives log(p(w|ci) * p(ci))
    p1 = sum(vec2Classify * p1Vc) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vc) + np.log(1 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

To clarify: vec2Classify is the word vector of the document to be classified. By the formula p(ci|w) = p(w|ci) * p(ci) / p(w), classifying a known feature vector amounts to computing p(w|ci) * p(ci) for each class; the denominator can be ignored:

p(ci) is the number of class-ci samples in the training set divided by the total number of samples. For p(w|ci), because the feature conditions are independent of each other and of equal status, p(w|ci) = p(w0|ci) * p(w1|ci) * p(w2|ci) * … * p(wN|ci), so p(w|ci) is obtained by computing p(w0|ci), p(w1|ci), p(w2|ci), …, p(wN|ci) separately. Each p(wk|ci) is the probability that the single word wk appears in the documents of class ci.

Test classification function

Use two different samples to test the classification function:


# Construct sample tests
def testingNB():
    listPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listPosts)
    trainMat = []
    for postinDoc in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0v, p1v, pAb = trainNB1(trainMat, listClasses)  # use the improved (smoothed, log) version
    # print(p0v, p1v, pAb)
    testEntry = ['love']
    thisDoc = setOfWords2Vec(myVocabList, testEntry)
    print(testEntry, 'classified as', classifyNB(thisDoc, p0v, p1v, pAb))

    testEntry = ['stupid', 'garbage']
    thisDoc = (setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0v, p1v, pAb))

if __name__ == '__main__':
    testingNB()

As a result, you can see that the two documents are classified correctly. See the complete code:
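As a cross-check (not part of the original article), the same experiment can be reproduced with scikit-learn's BernoulliNB, assuming scikit-learn is installed; it works on 0/1 feature vectors like the set-of-words vectors built above:

import numpy as np
from sklearn.naive_bayes import BernoulliNB  # assumes scikit-learn is available

listPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listPosts)
trainMat = np.array([setOfWords2Vec(myVocabList, doc) for doc in listPosts])

clf = BernoulliNB()
clf.fit(trainMat, listClasses)
test = np.array([setOfWords2Vec(myVocabList, ['stupid', 'garbage'])])
print(clf.predict(test))  # expected: [1] (insulting)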

github:naive_bayes

Conclusion

  • Naive Bayes classification
  • Conditional probability
  • Bayes’ theorem
  • Feature conditional independence assumption
  • Build word vectors from documents
  • Set-of-words model and bag-of-words model
  • Improvements: smoothing to avoid zero probabilities and logarithms to prevent underflow

See article: Naive Bayes (NB) Classification algorithm for Machine Learning and Python implementation