A brief introduction to Bayes’ formula (from Mao’s book)

Bayes’ theorem is a theorem about the conditional (and marginal) probabilities of random events A and B: P(A | B) is the probability of A given that B has occurred. Mathematically, Bayes’ formula follows from the total probability formula and the multiplication formula; the derivation is as follows.

Start from the definition of conditional probability:

$$P(B_i \mid A) = \frac{P(A B_i)}{P(A)}$$

Apply the multiplication formula to the numerator and the total probability formula to the denominator:


$$P(A B_i) = P(B_i)\, P(A \mid B_i)$$


$$P(A) = \sum_{j=1}^{n} P(B_j)\, P(A \mid B_j)$$

Substituting these into the original formula gives Bayes’ formula:

$$P(B_i \mid A) = \frac{P(B_i)\, P(A \mid B_i)}{\sum_{j=1}^{n} P(B_j)\, P(A \mid B_j)}$$
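
To make the formula concrete, here is a minimal Python sketch that applies Bayes’ formula to a made-up two-hypothesis example; the prior and likelihood numbers are purely illustrative assumptions, not values from the text:

# Bayes' formula with assumed (illustrative) numbers.
# B1, B2 partition the sample space; A is the observed event.
priors = {'B1': 0.3, 'B2': 0.7}        # P(B_i), assumed
likelihoods = {'B1': 0.8, 'B2': 0.1}   # P(A | B_i), assumed

# Denominator: total probability formula P(A) = sum_j P(B_j) P(A | B_j)
p_a = sum(priors[b] * likelihoods[b] for b in priors)

# Posteriors: P(B_i | A) = P(B_i) P(A | B_i) / P(A)
posteriors = {b: priors[b] * likelihoods[b] / p_a for b in priors}
print(posteriors)  # {'B1': 0.774..., 'B2': 0.225...}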

The Naive Bayes Method (from Li Hang’s book)

The naive Bayes method is a classification method based on Bayes’ theorem together with the assumption of conditional independence between features.

Theoretical derivation

Symbol explanation: the output class space is $\mathcal{Y} = \{c_1, c_2, \cdots, c_K\}$, and the input is a feature vector $x$.

In the theoretical derivation, the conditional probability distribution is required to satisfy the following conditional independence assumption:

$$P(X = x \mid Y = c_k) = P(X^{(1)} = x^{(1)}, \cdots, X^{(n)} = x^{(n)} \mid Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k) \tag{1}$$

When the naive Bayes method is used for classification, for an input $x$ we use the learned model to compute the posterior probability distribution $P(Y = c_k \mid X = x)$ and output the class with the largest posterior probability as the class of $x$. The posterior probability is computed with Bayes’ formula:

$$P(Y = c_k \mid X = x) = \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{\sum_{k} P(X = x \mid Y = c_k)\, P(Y = c_k)} \tag{2}$$

Combining formulas (1) and (2) gives:

$$P(Y = c_k \mid X = x) = \frac{P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_{k} P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}, \quad k = 1, 2, \cdots, K$$

Thus, the classifier can be expressed as:

$$y = f(x) = \arg\max_{c_k} \frac{P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_{k} P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)} \tag{3}$$

Since the denominator in formula (3) is the same for every class, the classifier is equivalent to:

$$y = \arg\max_{c_k} P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
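
As a tiny worked illustration of this argmax rule (the numbers below are assumed purely for illustration), suppose there are two classes and two features whose estimated probabilities are:

$$P(Y = c_1) = 0.6, \quad P(x^{(1)} \mid c_1) = 0.5, \quad P(x^{(2)} \mid c_1) = 0.2$$

$$P(Y = c_2) = 0.4, \quad P(x^{(1)} \mid c_2) = 0.3, \quad P(x^{(2)} \mid c_2) = 0.7$$

The unnormalized scores are $0.6 \times 0.5 \times 0.2 = 0.06$ for $c_1$ and $0.4 \times 0.3 \times 0.7 = 0.084$ for $c_2$, so the classifier outputs $c_2$ even though $c_1$ has the larger prior.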

The derivation via expected risk minimization, which shows that maximizing the posterior probability is equivalent to minimizing the expected risk under 0-1 loss, is omitted here.

Parameter estimation of naive Bayes method

In the naive Bayes method there are two main ways to estimate the probabilities: 1. maximum likelihood estimation; 2. Bayesian estimation. Only the formulas are given here, without proofs.

Maximum likelihood estimation

The maximum likelihood estimate of the prior probability $P(Y = c_k)$ is:

$$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \cdots, K$$

Suppose the set of possible values of the $j$-th feature $x^{(j)}$ is $\{a_{j1}, a_{j2}, \cdots, a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)} = a_{jl} \mid Y = c_k)$ is:

$$P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}$$

$$j = 1, 2, \cdots, n; \quad l = 1, 2, \cdots, S_j; \quad k = 1, 2, \cdots, K$$
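
Both maximum likelihood estimates are just frequency counts. Below is a minimal counting sketch in Python; the toy data set is an assumption made up for illustration, not data from the text:

from collections import Counter

# Assumed toy data: each sample is (value of feature x^(1), class label)
samples = [('S', -1), ('S', -1), ('M', -1), ('S', 1), ('M', 1), ('L', 1)]

labels = [y for _, y in samples]
class_counts = Counter(labels)
prior = {c: n / len(samples) for c, n in class_counts.items()}   # P(Y = c_k)

# Conditional MLE: P(X^(1) = a | Y = c) = count(a, c) / count(c)
pair_counts = Counter(samples)
cond = {(a, c): n / class_counts[c] for (a, c), n in pair_counts.items()}

print(prior)             # {-1: 0.5, 1: 0.5} for the assumed data
print(cond[('S', -1)])   # 2/3 for the assumed data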

Bayesian estimation

Although maximum likelihood estimation gives probability estimates from the observed data, it can produce estimates equal to 0, which distorts the subsequent classification computation. Bayesian estimation solves this problem by introducing a smoothing parameter $\lambda \geq 0$ (usually $\lambda = 1$, known as Laplace smoothing). The Bayesian estimate of the conditional probability is:

$$P_{\lambda}(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda}$$

Similarly, the Bayesian estimate of the prior probability is:

$$P_{\lambda}(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K \lambda}$$
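
The counting sketch above extends directly to the smoothed estimate; here is a hedged one-function sketch (the function name and arguments are my own, chosen for illustration):

def smoothed_cond(pair_counts, class_counts, a, c, s_j, lam=1.0):
    # Bayesian (Laplace-smoothed) estimate of P(X^(j) = a | Y = c):
    #   pair_counts[(a, c)] -- count of samples with feature value a and class c
    #   class_counts[c]     -- count of samples with class c
    #   s_j                 -- number of possible values S_j of feature j
    #   lam                 -- smoothing parameter lambda (lam = 1 is Laplace smoothing)
    return (pair_counts.get((a, c), 0) + lam) / (class_counts[c] + s_j * lam)

# With the toy counts above, an unseen pair such as ('L', -1) no longer gets probability 0:
# smoothed_cond(pair_counts, class_counts, 'L', -1, s_j=3)  ->  (0 + 1) / (3 + 3) = 1/6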

Steps of naive Bayes algorithm

Step 1: input the training data $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$ and compute the corresponding probability estimates (the prior probability and the conditional probabilities):


$$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \cdots, K$$

$$P(X^{(j)} = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\, y_i = c_k)}{\sum_{i=1}^{N} I(y_i = c_k)}$$

$$j = 1, 2, \cdots, n; \quad l = 1, 2, \cdots, S_j; \quad k = 1, 2, \cdots, K$$

Step 2: input the instance $x$ to be classified and compute its posterior probability distribution:


$$P(Y = c_k \mid X = x) = \frac{P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_{k} P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}, \quad k = 1, 2, \cdots, K$$

Step 3: output the class of $x$:


$$y = \arg\max_{c_k} P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)$$
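
Putting the three steps together, here is a short from-scratch sketch for categorical features; the class name, the toy data, and the choice of $\lambda = 1$ smoothing are my own assumptions for illustration and not part of the original text:

from collections import Counter

class ToyNaiveBayes:
    # Naive Bayes for categorical features with Laplace (lambda = 1) smoothing.

    def fit(self, X, y, lam=1.0):
        self.lam = lam
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.N = len(y)
        n_features = len(X[0])
        # feature_values[j]: observed values of feature j (estimate of S_j)
        self.feature_values = [set(x[j] for x in X) for j in range(n_features)]
        # pair_counts[j][(value, class)]: count of samples with x^(j) = value and label = class
        self.pair_counts = [Counter() for _ in range(n_features)]
        for x, c in zip(X, y):
            for j, v in enumerate(x):
                self.pair_counts[j][(v, c)] += 1
        return self

    def _prior(self, c):
        # Step 1: smoothed prior P(Y = c)
        return (self.class_counts[c] + self.lam) / (self.N + len(self.classes) * self.lam)

    def _cond(self, j, v, c):
        # Step 1: smoothed conditional P(X^(j) = v | Y = c)
        s_j = len(self.feature_values[j])
        return (self.pair_counts[j].get((v, c), 0) + self.lam) / (self.class_counts[c] + s_j * self.lam)

    def predict(self, x):
        # Steps 2 and 3: score each class and return the argmax
        scores = {c: self._prior(c) for c in self.classes}
        for c in self.classes:
            for j, v in enumerate(x):
                scores[c] *= self._cond(j, v, c)
        return max(scores, key=scores.get)

# Assumed toy data: two categorical features, two classes
X = [(1, 'S'), (1, 'M'), (1, 'M'), (2, 'S'), (2, 'L'), (3, 'L')]
y = [-1, -1, 1, -1, 1, 1]
clf = ToyNaiveBayes().fit(X, y)
print(clf.predict((2, 'S')))   # predicted class of the new instance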

Algorithm summary

The content above is mainly taken from Chapter 4 of Statistical Learning Methods by Professor Li Hang. The derivation of the naive Bayes method rests on the assumption that the input features are all conditionally independent given the class; if there are probabilistic dependencies between them, the model becomes a Bayesian network, which requires further study.

Hands-on examples: text analysis

Overview

Whether it is books, historical documents, social media, email, or any other text-based form of communication, text contains a great deal of information. Extracting features from text data sets for classification is not easy, but general methods for text mining have been worked out.

Here, I try to solve a few practical problems using the powerful but surprisingly simple naive Bayes algorithm, which, for simplicity, assumes that features are independent of one another when calculating the probabilities used for classification. The main tools used here are Python’s natural language processing library NLTK and its data sets.

Identify gender

The names corpus from the NLTK library and the naive Bayes algorithm are used for binary gender classification. The main idea is a heuristic: the last few characters of a name serve as the gender feature. For example, if a name ends in “la”, it is more likely to be a female name, such as “Angela” or “Layla”; if a name ends in “im”, it is more likely to be a male name, such as “Tim” or “Jim”. The experiment below searches over how many trailing characters to use. Here is the code to identify gender.

import random
from nltk.corpus import names
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy

# Extract the features of the input word: the last num_letters characters
def gender_features(word, num_letters=2):
    return {'feature': word[-num_letters:].lower()}

if __name__=='__main__':
    # Extract labeled names from the NLTK names corpus
    labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
            [(name, 'female') for name in names.words('female.txt')])

    random.seed(7)
    random.shuffle(labeled_names)
    input_names = ['Leonardo', 'Amy', 'Sam']

    # Search the parameter space (number of trailing letters)
    for i in range(1, 5):
        print('\nNumber of letters:', i)
        featuresets = [(gender_features(n, i), gender) for (n, gender) in labeled_names]
        train_set, test_set = featuresets[500:], featuresets[:500]
        classifier = NaiveBayesClassifier.train(train_set)

        # Print classifier accuracy
        print('Accuracy ==>', str(100 * nltk_accuracy(classifier, test_set)) + '%')

        # Predict the result for the new inputs
        for name in input_names:
            print(name, '==>', classifier.classify(gender_features(name, i)))



Number of letters: 1
Accuracy ==> 76.2%
Leonardo ==> male
Amy ==> female
Sam ==> male

Number of letters: 2
Accuracy ==> 78.60000000000001%
Leonardo ==> male
Amy ==> female
Sam ==> male

Number of letters: 3
Accuracy ==> 76.6%
Leonardo ==> male
Amy ==> female
Sam ==> female

Number of letters: 4
Accuracy ==> 70.8%
Leonardo ==> male
Amy ==> female
Sam ==> female

Sentiment analysis

Sentiment analysis is the process of determining whether a given text is positive or negative. In some scenarios, “neutral” is added as a third option. Sentiment analysis is often used to discover how people feel about a particular topic, and it is applied in many settings such as marketing campaigns, social media, and e-commerce customer feedback.

Principle: NLTK’s naive Bayes classifier is used for the classification. The feature extraction function simply extracts all unique words from the input. However, the NLTK classifier expects its data in dictionary format, so the features are stored as a dictionary to make them easy for the NLTK classifier object to read. After splitting the data into training and test sets, the classifier is trained to label sentences as positive or negative. Looking at the most informative words, you can see, for example, that the word “outstanding” indicates a positive review and “insulting” indicates a negative one. This is interesting because it shows which words the model relies on to express sentiment.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
 
def extract_features(word_list):
    return dict([(word, True) for word in word_list])
 
if __name__=='__main__':
    # Load positive and negative comments
    positive_fileids = movie_reviews.fileids('pos')
    negative_fileids = movie_reviews.fileids('neg')
     
    features_positive = [(extract_features(movie_reviews.words(fileids=[f])), 
            'Positive') for f in positive_fileids]
    features_negative = [(extract_features(movie_reviews.words(fileids=[f])), 
            'Negative') for f in negative_fileids]
     
    # Training to test data set ratio is (80/20)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))
    threshold_negative = int(threshold_factor * len(features_negative))
     
    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]  
    print("\nNumber of training datapoints:".len(features_train))
    print("Number of test datapoints:".len(features_test))
     
    # Train naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))

    print("\nTop 10 most informative words:")
    for item in classifier.most_informative_features()[:10]:
        print(item[0])

    # Input some sample reviews to classify
    input_reviews = [
        "but it worked great. Dried my hair in about 15 minutes",
        "Excellent value for travel dryer on a budget",
        "Great hair dryer.",
        "The direction was terrible and the story was all over the place"
    ]

    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))
Number of training datapoints: 1600
Number of test datapoints: 400

Accuracy of the classifier: 0.735

Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
avoids
astounding
fascination
symbol
animators

Predictions:

Review: but it worked great. Dried my hair in about 15 minutes
Predicted sentiment: Negative
Probability: 0.63

Review: Excellent value for travel dryer on a budget
Predicted sentiment: Negative
Probability: 0.62

Review: Great hair dryer.
Predicted sentiment: Positive
Probability: 0.54

Review: The direction was terrible and the story was all over the place
Predicted sentiment: Negative
Probability: 0.63

Summary

In the process of learning, the author kept making mistakes that were only obvious in hindsight: when programming machine learning I found my mathematical knowledge was lacking, and when studying the theory I found my ability to build things from scratch was lacking. Fortunately, there is still time to learn again. I hope that after the postgraduate entrance exam is over, I will come back to chew through these books and gain new insights and a deeper understanding.

References

  • Mao Shisong, Probability and Mathematical Statistics
  • Li Hang, Statistical Learning Methods
  • Prateek Joshi, Python Machine Learning Cookbook