Example – forward probability

Or take the example of a quality inspector. Suppose I am a quality inspector and have just received three boxes of parts to check: the first box has 10 parts, the second box has 20 parts, and the third box has 15 parts. Half an hour later the inspection results are in: the first box has 1 unqualified part, the second box has 3, and the third box has 2.

| Box | Total parts | Unqualified |
|-----|-------------|-------------|
| A   | 10          | 1           |
| B   | 20          | 3           |
| C   | 15          | 2           |

Now, suppose I pick one of the three boxes at random (each with probability 1/3) and take a part from it: what is the probability that this part is qualified? Let event D be that the part is qualified; then by the law of total probability:

$$P(D) = P(D|A)P(A) + P(D|B)P(B) + P(D|C)P(C) = \frac{9}{10}\cdot\frac{1}{3} + \frac{17}{20}\cdot\frac{1}{3} + \frac{13}{15}\cdot\frac{1}{3} \approx 0.872$$
That is how the probability that a part is qualified is calculated. In machine learning, however, we often want to know the opposite: given a sample, which class does it belong to? This is the classification problem, and it involves the inverse probability problem.

Inverse probability – naive Bayes theory

Now let’s imagine a scenario: you pick up a part; which box did it come from? The analogous problem in machine learning is: you are given a sample with a set of feature values, and the model tells you which category the sample belongs to. That is how we will approach Bayesian theory.

Conditional probability


$P(A|B)$ is the conditional probability of event A given that event B has already occurred, $P(AB)$ is the probability that events A and B occur together, and $P(B)$ is the probability of event B. The definition of conditional probability is:

$$P(A|B) = \frac{P(AB)}{P(B)}$$

Now suppose we know that a part is qualified: which of the three boxes A, B and C did it come from? We cannot know the specific box, because all three boxes contain qualified parts, but we can compute the probability that this qualified part came from each box, that is, find $P(A|D)$, $P(B|D)$ and $P(C|D)$, where D is the event that the part is qualified. From the definition of conditional probability:

$$P(A|D) = \frac{P(AD)}{P(D)}$$

Here $P(D)$ was already calculated above, and $P(AD)$ is the probability that the part both comes from box A and is qualified, which by the multiplication rule is:

$$P(AD) = P(D|A)\,P(A) = \frac{9}{10}\cdot\frac{1}{3} = 0.3$$

Then we can calculate the probability that a qualified part came from each box:

$$P(A|D) = \frac{P(AD)}{P(D)} = \frac{0.3}{0.872} \approx 0.344,\qquad P(B|D) \approx 0.325,\qquad P(C|D) \approx 0.331$$
Thus, we know that a qualified part is most likely to come from box A, because the probability of coming from box A is the highest, reaching 0.344. From this example we can derive Bayes’ theorem:
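As a sanity check, here is a minimal Python sketch of the box example, assuming (as above) that each box is chosen with equal probability 1/3 and a part is then drawn from the chosen box; it reproduces the figures above:

```python
# Minimal sketch of the box example: law of total probability + Bayes' theorem.
boxes = {
    "A": {"total": 10, "unqualified": 1},
    "B": {"total": 20, "unqualified": 3},
    "C": {"total": 15, "unqualified": 2},
}

prior = {name: 1 / 3 for name in boxes}  # P(box): each box equally likely
likelihood = {                           # P(D | box): share of qualified parts
    name: (info["total"] - info["unqualified"]) / info["total"]
    for name, info in boxes.items()
}

# Law of total probability: P(D) = sum of P(D | box) * P(box)
p_d = sum(likelihood[name] * prior[name] for name in boxes)

# Bayes' theorem: P(box | D) = P(D | box) * P(box) / P(D)
posterior = {name: likelihood[name] * prior[name] / p_d for name in boxes}

print(f"P(D) = {p_d:.3f}")                 # about 0.872
for name, p in posterior.items():
    print(f"P({name} | D) = {p:.3f}")      # A: 0.344, B: 0.325, C: 0.331
```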

Suppose event D happens, and we want to find the probability that event D happened because of event A; that is, we want the reverse probability. How do we work that out?

1. Suppose the conditional probability of event A leading to event D is:

$$P(D|A) = \frac{P(DA)}{P(A)}$$

so that $P(DA) = P(D|A)\,P(A)$.

2. Now event D has occurred, and it may have been caused by event A, event B, or other events. We want to know the probability that event A caused event D, so the conditional probability that event D was caused by event A is expressed as:

$$P(A|D) = \frac{P(AD)}{P(D)}$$

Plugging the result of step 1 into this formula, replacing the numerator $P(AD)$ with $P(D|A)\,P(A)$, we get:

$$P(A|D) = \frac{P(D|A)\,P(A)}{P(D)}$$

The derivation is as above. After these two steps we arrive at the final Bayes formula:

$$P(A|D) = \frac{P(D|A)\,P(A)}{P(D)} = \frac{P(D|A)\,P(A)}{P(D|A)P(A) + P(D|B)P(B) + \cdots}$$

where the denominator expands $P(D)$ by the law of total probability over all the possible causes.
Some people may ask: why do we rearrange the formulas like this? The result is simply derived from the two conditional probabilities: they share the same numerator $P(AD)$, so one can be substituted into the other. The reason we want this form is that, in daily life, the result is usually easier to observe than the cause. Just as when you reach into a box blindfolded and pull out a part, whether it is qualified or unqualified is easy to judge, but how do you know which box it came from? So our question becomes: given a part drawn from one of the three boxes, what is the probability that it belongs to a particular box? The perspective has changed: we no longer care whether it is qualified or unqualified, which is easy to know, but which box it belongs to, which is not easy to know. Bayes’ theorem was proposed to solve exactly this problem.

Analogy to machine learning

Ok, now that we have Bayesian theory, how does it translate to machine learning? The analogy is: I give you a sample, I give you its features and the specific value of each feature, and you tell me which category it belongs to. Of course, this must be a test-set sample, because in the training set every record is already marked with its result, that is, labeled. In a binary classification task, samples belonging to category A are labeled 1 and those not belonging to category A are labeled 0. The formal description is as follows:

Input: a training set and a test set, where each sample has $h$ feature values, and each sample in the training set additionally has a label annotating its classification result.

This is a formal description, but it is abstract. We can understand it this way: we give you a sample from the test set and tell you the value of each of its features, which is equivalent to telling you that the part is qualified; you then tell me which category the sample belongs to. In other words, the feature values of the sample play the role that “the part is qualified” played in the box example above. Meanwhile, since the training set has feature values and every sample in it has a category result, we can start from the training set and easily calculate the probabilities in Bayes’ formula. In the next section, we make the specific calculation.
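Before that, here is a hypothetical sketch of what such a training set and test sample might look like as plain Python data; the feature names and values are invented purely for illustration:

```python
# Hypothetical toy version of the formal setup: every training record has
# feature values plus a label, while the test sample has feature values only.
training_set = [
    {"features": {"size": "small", "surface": "smooth"}, "label": 1},
    {"features": {"size": "large", "surface": "rough"},  "label": 0},
    {"features": {"size": "small", "surface": "smooth"}, "label": 1},
    {"features": {"size": "large", "surface": "smooth"}, "label": 0},
]

# Test sample: we know its feature values but not its category.
test_sample = {"size": "small", "surface": "smooth"}
```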

Bayesian theory in machine learning

We continue to apply Bayesian theory to the field of machine learning, and assume the following scenario:

Background condition: a training set whose possible classification results are $C_1, C_2, \ldots$; every training-set record has a classification result.
Input: a test-set sample.
Output: the category this sample belongs to.

Let’s think about it with a single classification result first. We can calculate the probability that the sample belongs to each classification result and select the category with the highest probability as the sample’s classification. For the sample we know every feature value, and we treat these feature values as coming from the training set, where every sample has a classification result. So the analogy is: a feature value plays the role of “the part is qualified”, and we ask which category this “qualified” sample comes from. Consider the following:

1. The sample’s data is its feature values; let us take a single feature, say $t_h$. According to Bayesian theory, the formula for $P(C_i \mid t_h)$ is defined as follows:

$$P(C_i \mid t_h) = \frac{P(t_h \mid C_i)\,P(C_i)}{P(t_h)}$$
2. Let us break the formula in step 1 into its parts:

  • (1) $P(t_h \mid C_i)$ denotes the probability of feature value $t_h$ within class $C_i$; we can calculate it from the training set, since the training set contains the feature values, as the proportion of class-$C_i$ samples that have this feature value.
  • (2) $P(C_i)$ represents the probability of category $C_i$ occurring, calculated directly as the proportion of category-$C_i$ samples among all samples in the training set.
  • (3) $P(t_h)$ represents the probability that feature value $t_h$ appears, analogous to the probability of drawing a qualified part from the boxes; here it is the probability of drawing this feature value across all categories.

Through the three quantities above we can calculate the probability $P(C_i \mid t_h)$ that, given the feature $t_h$, the sample belongs to category $C_i$. We now generalize: a sample record has multiple features, so we make the following assumption:

Assumption: the features in a sample are independent of one another.

This is an important assumption, because we are now going to calculate the probability that a whole sample, with all its features, belongs to a class.

According to the above assumption the features are independent, so:

(1) $P(t_1 t_2 \cdots t_h \mid C_i) = P(t_1 \mid C_i)\,P(t_2 \mid C_i)\cdots P(t_h \mid C_i)$

(2) $P(t_1 t_2 \cdots t_h) = P(t_1)\,P(t_2)\cdots P(t_h)$

so the formula for $P(C_i \mid t_1 t_2 \cdots t_h)$ becomes:

$$P(C_i \mid t_1 t_2 \cdots t_h) = \frac{P(t_1 \mid C_i)\,P(t_2 \mid C_i)\cdots P(t_h \mid C_i)\,P(C_i)}{P(t_1)\,P(t_2)\cdots P(t_h)}$$

According to this formula we can calculate $P(C_1 \mid t_1 \cdots t_h), P(C_2 \mid t_1 \cdots t_h), \ldots$ for every classification result, and by comparing their sizes we know which category the sample belongs to. Since the point is only to compare the probabilities, we do not actually need to compute the denominator: the denominator $P(t_1)\,P(t_2)\cdots P(t_h)$ is the same for every classification result, so we simply compare the numerators directly.
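Putting this together, here is a minimal sketch of a discrete naive Bayes classifier under the assumption above, using a hypothetical toy training set (the feature names, values and class labels are invented); it estimates $P(C_i)$ and $P(t_h \mid C_i)$ by counting and compares only the numerators:

```python
from collections import Counter

# Hypothetical toy training set: (feature values, class label) per record.
training_set = [
    ({"size": "small", "surface": "smooth"}, "C1"),
    ({"size": "large", "surface": "rough"},  "C2"),
    ({"size": "small", "surface": "smooth"}, "C1"),
    ({"size": "large", "surface": "smooth"}, "C2"),
    ({"size": "small", "surface": "rough"},  "C2"),
]

def naive_bayes_classify(sample, training_set):
    """Pick the class with the largest numerator P(t_1|C_i)...P(t_h|C_i)P(C_i)."""
    class_counts = Counter(label for _, label in training_set)
    n = len(training_set)

    scores = {}
    for c, count_c in class_counts.items():
        score = count_c / n                              # P(C_i)
        rows_c = [feats for feats, label in training_set if label == c]
        for name, value in sample.items():
            matches = sum(1 for feats in rows_c if feats.get(name) == value)
            score *= matches / count_c                   # P(t_h | C_i)
        scores[c] = score                                # numerator only
    return max(scores, key=scores.get), scores

label, scores = naive_bayes_classify({"size": "small", "surface": "smooth"}, training_set)
print(label, scores)  # the denominator is identical for every class, so it is skipped
```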

Calculating $P(t_h \mid C_i)$

Above, we studied how Bayesian theory classifies samples. There is a problem in the calculation of $P(t_h \mid C_i)$: if the features are all discrete values, we can directly compute the proportion of each discrete value, but if a feature takes continuous values this is hard to do. Generally, when a feature attribute is continuous, we assume its values obey the Gaussian distribution (also known as the normal distribution), that is:

$$P(t_h \mid C_i) = g(t_h, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(t_h - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$

Therefore, as long as we compute the mean and standard deviation of this feature within each category of the training set, we can substitute them into the formula above to obtain the required estimate. Another issue that needs to be discussed is what to do when $P(t_h \mid C_i) = 0$. This happens when a particular feature value never appears under a category, and it greatly degrades the quality of the classifier. To solve this problem we introduce Laplace smoothing (Laplace calibration), whose idea is very simple: add 1 to the count of every feature value in each category. If the number of training samples is sufficiently large, this does not affect the results, and it avoids the embarrassing zero-frequency situation.
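As a brief sketch of both points (with invented numbers), a Gaussian estimate of $P(t_h \mid C_i)$ for a continuous feature and a Laplace-smoothed estimate for a discrete feature might look like this:

```python
import math

def gaussian_likelihood(x, values_in_class):
    """Estimate P(t_h | C_i) for a continuous feature by fitting a Gaussian
    to that feature's values within one class of the training set."""
    mu = sum(values_in_class) / len(values_in_class)
    var = sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def laplace_smoothed_prob(value, values_in_class, n_distinct_values):
    """Estimate P(t_h | C_i) for a discrete feature, adding 1 to every count
    so an unseen value never gets probability zero."""
    count = sum(1 for v in values_in_class if v == value)
    return (count + 1) / (len(values_in_class) + n_distinct_values)

# Hypothetical usage; the numbers are invented for illustration.
print(gaussian_likelihood(5.1, [4.9, 5.0, 5.4, 5.2]))
print(laplace_smoothed_prob("red", ["blue", "blue", "green"], n_distinct_values=3))
```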

Conclusion

Much of this is my own understanding; some concepts and terms may not be entirely accurate, since this is what came to mind, and where the summary falls short I hope you will point it out. Next time, I will explain how to combine Bayes theory with actual scenarios. In short, thanks to the excellent blogs online, from which I have learned a lot. This article is my own thinking and understanding, and I hope it is helpful to you.

Reference blog

Machine Learning: Naive Bayes Classification Algorithm