Probability theory is one of the most basic tools we have for describing the universe. It is especially relevant to statistical classification and leads to a number of important results that sharpen our understanding of the world around us. In this article, Peter Mills gives a brief introduction to probability theory and Bayes’ theorem and shows how they apply to statistical classification, helping you improve and simplify classification models.


Working from Bayesian learning through to statistical classification, I will present a number of applications of Bayes’ theorem and probability theory to statistical classification. The article also covers important topics beyond basic probability theory, such as calibration and validation.

Although intended for beginners, this article assumes first-year and some second-year university mathematics, especially linear algebra along with some single- and multivariable calculus. If some of the equations seem confusing, try to focus instead on the process of solving the actual problem.

You will learn far more about probability and statistical classification by working through examples than by merely reading or skimming equations, which is why we have prepared a set of problems for you at the end of this article.


Review of fundamentals of probability theory

When you roll a die, there are six possible outcomes, each with a probability of 1/6:

$$P_i = \frac{1}{6}$$

where i is the number on the top face of the die. Since at least one side has to land face up:

$$\sum_{i=1}^{n} P_i = 1 \qquad (1)$$

where n = 6 is the total number of possible outcomes.

Now roll two dice. The joint probability of any one of the 36 possible pairs of outcomes is:

$$P_{ij} = \frac{1}{36}$$

where i is the number on the first die and j is the number on the second.

If we ignore the number on the second die, the probability of getting a particular result on the first die (say, a six) is:

$$P_i = \sum_j P_{ij} = \frac{1}{6} \qquad (2)$$

This is known as a prior probability.

This is where things start to get more interesting: given the number on one die, what is the probability that the other die shows a certain number? In this case the two events are unrelated, so the probability is always 1/6, but that is not always how it works.

Consider instead a game of blackjack. What is the probability that the next card dealt is worth ten (a 10, jack, queen, or king), given that the previous card was also worth ten?

Suppose there are 34 cards left in the deck and seven of them are worth ten. The probabilities now depend on past events: if the previous card was worth ten, there is a 6/33 chance that the next one is too; otherwise the chance is 7/33.

Since the probability that the previous card was worth ten is 7/34, the joint probability of both events is:

$$P_{ij} = P(j|i)\,P_i = \frac{6}{33} \cdot \frac{7}{34} = \frac{7}{187}$$

where $P_i$ is the probability that the previous card was worth ten and $P(j|i)$ is the conditional probability that the next card is worth ten, given the previous one.

Having defined prior, joint, and conditional probabilities, we are ready to write down Bayes’ theorem. Note that these definitions are symmetric in i and j, so:

$$P(j|i)\,P_i = P(i|j)\,P_j \qquad (3)$$

This is the symmetric form of Bayes’ theorem.
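
As a quick numerical check of these definitions, the blackjack example can be enumerated by brute force. The sketch below is only an illustration (it is not part of the original article) and assumes the deck composition described above: 34 cards, seven of them worth ten.

```python
from fractions import Fraction
from itertools import permutations

# Minimal sketch: enumerate all ordered pairs of (previous card, next card)
# drawn from an assumed 34-card deck with seven ten-valued cards, then check
# the symmetric form of Bayes' theorem, equation (3).
deck = [1] * 7 + [0] * 27          # 1 = worth ten, 0 = anything else
pairs = list(permutations(range(len(deck)), 2))

def prob(event):
    """Probability of an event over all equally likely ordered pairs."""
    return Fraction(sum(event(a, b) for a, b in pairs), len(pairs))

p_i  = prob(lambda a, b: deck[a] == 1)                   # P(previous is ten) = 7/34
p_j  = prob(lambda a, b: deck[b] == 1)                   # P(next is ten)     = 7/34
p_ij = prob(lambda a, b: deck[a] == 1 and deck[b] == 1)  # joint              = 7/187

p_j_given_i = p_ij / p_i                                 # 6/33 = 2/11
p_i_given_j = p_ij / p_j

print(p_i, p_j_given_i, p_ij)            # 7/34 2/11 7/187
# Symmetric form of Bayes' theorem: P(j|i) P_i = P(i|j) P_j = P_ij
assert p_j_given_i * p_i == p_i_given_j * p_j == p_ij
```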


Continuous probabilities

The generalization to continuous probabilities, or probability densities, is straightforward. Suppose I have a continuous random variable, x, governed by a probability distribution, P(x). The probability that x takes on a value between x_0 and x_0 + dx is:

$$\lim_{dx \to 0} P(x_0 \le x \le x_0 + dx) = P(x_0)\,dx \qquad (4)$$

When a random variable is continuous, sums become integrals, so equation (2) becomes:

$$P(y) = \int P(x, y)\,dx \qquad (5)$$

where P(x, y) is the joint probability of x and y, and integrating over all x yields the marginal probability of y.
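
To make equation (5) concrete, here is a small numerical sketch (an assumed example, not taken from the article): the marginal of y is recovered by integrating an illustrative joint density over x.

```python
import numpy as np

# Minimal sketch of equation (5): integrate a joint density P(x, y) over x to
# recover the marginal P(y). The joint used here is an assumed correlated
# bivariate Gaussian, chosen only for illustration.
x = np.linspace(-6, 6, 601)
y = np.linspace(-6, 6, 601)
X, Y = np.meshgrid(x, y, indexing="ij")

rho = 0.6                                     # assumed correlation coefficient
norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
P_xy = norm * np.exp(-(X**2 - 2 * rho * X * Y + Y**2) / (2 * (1 - rho**2)))

P_y = np.trapz(P_xy, x, axis=0)               # marginal probability of y

# For this particular joint, the marginal should be a standard 1-D Gaussian.
P_y_exact = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(P_y - P_y_exact)))        # small discretization error
```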

The probabilities that come up in statistical classification have a particular form. One variable, the class, is scalar and discrete, while the other, the feature vector, is continuous and vector-valued:

$$P(i, \bar x) \qquad (6)$$

where i is the class or class label and $\bar x$ is the vector of attributes or features.

Bayesian statistical classification usually works by estimating either the joint probability, $P(\bar x, i)$, or the conditional probability, $P(i|\bar x)$. Classification is then typically done by maximum likelihood:

$$c = \arg\max_i P(i|\bar x) \qquad (7)$$

where c is the maximum-likelihood estimate of the class, that is, the class with the largest conditional probability.

Note that $P(\bar x)$ is the same for any given test point, so using the joint or the conditional probability produces the same result. The conditional probabilities of the feature space, $P(\bar x|i)$, are equally important: they describe the distribution of each individual class, that is, the distribution that remains after every class label except i has been removed.

By simply removing the limit sign, the definition of probability density in equation (4) can be used to derive one of the oldest and most basic statistical classification techniques: take a radius around the test point, $\bar x$, and count the number of training samples of each class that fall inside it.

The problem with this method is that sometimes the enclosed volume contains no samples at all, while at other times it contains a great many. So rather than fixing the distance, it is better to fix the number of samples and measure the distance to them instead. This is the k-nearest-neighbours (KNN) classifier, where k is the number of nearest neighbours used.
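
A minimal KNN sketch follows, using assumed synthetic data rather than anything from the article: the conditional probabilities at a test point are estimated as the fraction of the k nearest training samples belonging to each class.

```python
import numpy as np

# Minimal KNN sketch on assumed synthetic data: estimate P(i | x) at a test
# point as the fraction of the k nearest training samples in each class.
rng = np.random.default_rng(0)

n = 200                                                # samples per class (assumed)
x_train = np.vstack([rng.normal(-1.0, 1.0, (n, 2)),    # class -1
                     rng.normal(+1.0, 1.0, (n, 2))])   # class +1
y_train = np.hstack([-np.ones(n), np.ones(n)])

def knn_cond_prob(x_test, k=15):
    """Return (P(-1|x), P(+1|x)) estimated from the k nearest neighbours."""
    d = np.sum((x_train - x_test) ** 2, axis=1)        # squared Euclidean distances
    nearest = y_train[np.argsort(d)[:k]]               # labels of the k closest samples
    p_plus = np.mean(nearest == 1)
    return 1.0 - p_plus, p_plus

print(knn_cond_prob(np.array([0.0, 0.0])))   # roughly (0.5, 0.5) midway between the classes
```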


Binary classifier

Binary classifiers are special because in many cases a hyperplane can be drawn through the feature space to separate the two classes. A hyperplane is a subspace with one dimension fewer than the space it is embedded in, so for a two-dimensional feature space the boundary is a line, while in three dimensions it is a plane.

Most binary classifiers do not simply return one of two integers; they return a continuous decision function. A convenient form of decision function is the difference in conditional probabilities:

$$R(\bar x) = P(+1|\bar x) - P(-1|\bar x) \qquad (8)$$

where, for convenience, the class values have been set to -1 and +1.

Unfortunately, most statistical classifiers do not return a decision function that estimates this quantity well, so much of this article is concerned with calibrating decision functions so that they do.

Consider a pair of equally sized, one-dimensional Gaussians, each of width h, with centres at -b and +b. Assuming equal class priors, the difference in conditional probabilities is:

$$R(x) = \frac{e^{-(x-b)^2/2h^2} - e^{-(x+b)^2/2h^2}}{e^{-(x-b)^2/2h^2} + e^{-(x+b)^2/2h^2}}$$

After some rearrangement, this reduces to:

$$R(x) = \tanh\!\left(\frac{bx}{h^2}\right) \qquad (9)$$

In other words, the decision function for a pair of equally sized, one-dimensional Gaussians is a hyperbolic tangent.

This example may seem trivial, but the tanh function is in fact ubiquitous in machine learning. In statistical classification it is often used to calibrate decision functions so that they better estimate conditional probabilities.

It is applied, for instance, in the LIBSVM library as well as in my own machine learning library, libAGF. The example shows why: the difference in conditional probabilities, R, is frequently sigmoidal close to the class border.
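
The sketch below double-checks equation (9) numerically; the parameter values are assumed purely for illustration.

```python
import numpy as np

# Minimal check of equation (9): for two equal-width, equal-prior 1-D Gaussians
# centred at -b and +b, the difference in conditional probabilities reduces to
# a hyperbolic tangent. Parameter values here are assumed for illustration.
b, h = 1.0, 0.75
x = np.linspace(-4, 4, 401)

g_plus  = np.exp(-(x - b) ** 2 / (2 * h ** 2))    # proportional to P(x|+1)
g_minus = np.exp(-(x + b) ** 2 / (2 * h ** 2))    # proportional to P(x|-1)

R_direct = (g_plus - g_minus) / (g_plus + g_minus)
R_tanh   = np.tanh(b * x / h ** 2)

print(np.max(np.abs(R_direct - R_tanh)))          # effectively zero: the two forms agree
```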

Consider logistic regression as an example. In logistic regression, the decision function is:

$$f(\bar x) = \tanh(\bar v \cdot \bar x + a) \qquad (10)$$

where $\bar v$ is a vector of coefficients and a is a constant offset.

The parameters of the function are fitted by minimizing a cost function, for example a least-squares cost:

$$J = \sum_i \left[y_i - f(\bar x_i)\right]^2 \qquad (11)$$

The fitting, or training, requires training data: a set of ordered pairs mapping points in the feature space to class values, $\{\bar x_i : y_i\}$, where each $y_i$ takes the value -1 or +1, that is, $y_i \in \{-1, +1\}$.
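
A minimal training sketch for equations (10) and (11) is shown below, using assumed synthetic data and an off-the-shelf optimizer; it is an illustration of the procedure, not the article's own code.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of equations (10) and (11) on assumed synthetic data:
# fit f(x) = tanh(v.x + a) by minimizing the least-squares cost over the
# training pairs {x_i: y_i}, with y_i in {-1, +1}.
rng = np.random.default_rng(1)
n = 200
x = np.vstack([rng.normal(-1.0, 1.0, (n, 2)),     # class -1
               rng.normal(+1.0, 1.0, (n, 2))])    # class +1
y = np.hstack([-np.ones(n), np.ones(n)])

def cost(params):
    v, a = params[:2], params[2]
    f = np.tanh(x @ v + a)                        # decision function, eq. (10)
    return np.sum((y - f) ** 2)                   # least-squares cost, eq. (11)

# The fit has no closed-form solution, so an iterative nonlinear minimizer is used.
res = minimize(cost, x0=np.zeros(3))
v_fit, a_fit = res.x[:2], res.x[2]
print(v_fit, a_fit)

# Classify a test point by the sign of the fitted decision function.
x_test = np.array([0.5, -0.2])
print(np.sign(np.tanh(x_test @ v_fit + a_fit)))
```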

Training data represents the “ground truth” and can be obtained in a variety of ways. Consider a land-classification problem: instruments on a satellite measure electromagnetic radiation from the surface in several frequency bands, and these measurements are used to classify the surface type as field, forest, city, water, and so on.

One possibility is a forward model: an algorithm that, given parameters describing the surface type, outputs the radiation values the satellite would measure. In that case the amount of training data that can be generated is essentially unlimited, although it may not be especially accurate.

Alternatively, the training data can come from real measurements that are classified by hand, for instance with a simple app that displays the image and lets you label pixels by clicking on a colour.

Equations (10) and (11) provide a compact illustration of the full statistical classification process. There is a training phase, given by (11), which produces a model, in this case a small set of fitted function parameters; this places the method within the scope of parametric statistics.

By contrast, a non-parametric statistical model, such as KNN, uses all of the training data for every classification. For the logistic classifier the fitting is nonlinear, another common technique in machine learning.

Nonlinear optimization usually assumes there is no closed-form solution to the problem and relies on iterative numerical algorithms. It is a broad field that we won't go into here; see the problem set at the end for further exploration.


Calibration

The advantage of using a continuous decision function for binary classification is that it allows some degree of calibration. Consider the following classifier:

$$c(\bar x) = \begin{cases} -1, & f(\bar x) < f_0 \\ +1, & f(\bar x) > f_0 \end{cases}$$

Varying the classification threshold, $f_0$, lets us adjust the sensitivity of the classifier, which is particularly important in medical diagnostics.

Note that the classifier is undefined when $f = f_0$. To remove bias, a random value should be returned when this case occurs in numerical computations.

Suppose the classifier is trained with data drawn from a prior class distribution, P'(i), while the true population distribution is P(i). Assuming that f accurately estimates R, the threshold that makes the classification consistent with the population distribution rather than the training distribution is:

$$f_0 = \frac{P'(+1)\,P(-1) - P(+1)\,P'(-1)}{P'(+1)\,P(-1) + P(+1)\,P'(-1)} \qquad (12)$$

To dig further into this, consider the confusion matrix. The element in the ith row and jth column of the confusion matrix tells us, for all of the test data, how many test samples belong to the ith class but were assigned by the classifier to the jth class.

By dividing by the number of test samples, the confusion matrix can be interpreted as an approximation of the joint probability. For a binary classifier it takes the form:

$$\begin{pmatrix} n_{TN} & n_{FP} \\ n_{FN} & n_{TP} \end{pmatrix}$$

where:

  • n_t = n_TN + n_FP + n_FN + n_TP is the total number of test samples
  • n_TN is the number of true negatives
  • n_FP is the number of false positives
  • n_FN is the number of false negatives
  • n_TP is the number of true positives

A perfect classifier would return a diagonal matrix: an element is non-zero only when i = j.

From these five parameters you can write down all of the possible skill scores for a simple binary classifier. The receiver operating characteristic (ROC) curve is produced by plotting two such scores against one another as the classification threshold is varied. The first is the hit rate:

$$H = \frac{n_{TP}}{n_{TP} + n_{FN}}$$

The second is the false-alarm rate:

$$F = \frac{n_{FP}}{n_{FP} + n_{TN}}$$


The figure above plots ROC curves for the one-dimensional logistic classifier in (9) with h = 1, for several values of b, assuming the classifier estimates the conditional probabilities exactly.
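
The sketch below shows how such a curve can be traced numerically; the synthetic data and parameter values are assumed, so it reproduces the idea of the figure rather than the figure itself.

```python
import numpy as np

# Minimal ROC sketch on assumed synthetic data: sweep the threshold f_0 over a
# continuous decision function, record the hit rate H and false-alarm rate F,
# and estimate the area under the resulting curve.
rng = np.random.default_rng(2)
b, h, n = 1.0, 1.0, 5000

y = rng.choice([-1, 1], size=n)                   # true classes, equal priors
x = rng.normal(b * y, h)                          # 1-D feature drawn from each class
f = np.tanh(b * x / h ** 2)                       # decision function of equation (9)

thresholds = np.linspace(-1, 1, 201)
H = [np.mean(f[y == +1] > f0) for f0 in thresholds]   # hit rates
F = [np.mean(f[y == -1] > f0) for f0 in thresholds]   # false-alarm rates

# Area under the ROC curve (F decreases as f_0 increases, hence the sign flip).
auc = -np.trapz(H, F)
print(auc)                                        # well above 0.5 for b/h = 1
```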

A more sophisticated calibration exercise transforms the decision function so that it accurately represents the difference in conditional probabilities. Consider the following equation, which can be derived rigorously from the material in the first two sections:

$$f_0 = \frac{\int \delta\!\left[f(\bar x) - f_0\right] P(\bar x)\,R(\bar x)\,d\bar x}{\int \delta\!\left[f(\bar x) - f_0\right] P(\bar x)\,d\bar x} \qquad (13)$$


where δ is the Dirac delta function:

$$\delta(x) = \begin{cases} \infty, & x = 0 \\ 0, & x \ne 0 \end{cases}, \qquad \int_{-\infty}^{\infty} \delta(x)\,dx = 1$$

A well-calibrated estimate of the conditional probabilities should obey this relation.
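
In practice this condition can be approximated empirically. The sketch below uses a simple binning approach on assumed synthetic data: samples are grouped by their raw decision-function value and the average class label within each group, an estimate of R, becomes the calibrated output. This is only an illustration of the idea, not the article's own calibration procedure.

```python
import numpy as np

# Minimal calibration sketch (an assumed binning approach): estimate the average
# of R within narrow bands of the raw decision function f, then map f onto that
# average. A well-calibrated f should already lie close to the resulting curve.
rng = np.random.default_rng(3)
n = 20000
y = rng.choice([-1, 1], size=n)            # true classes, equal priors
x = rng.normal(1.0 * y, 1.0)               # 1-D feature; true R(x) = tanh(x) here

f_raw = np.tanh(0.4 * x)                   # a deliberately under-confident decision function

bins = np.linspace(-1, 1, 21)
centres = 0.5 * (bins[:-1] + bins[1:])
idx = np.clip(np.digitize(f_raw, bins) - 1, 0, len(centres) - 1)

# Average class label (an estimate of R) within each bin of the raw function;
# empty bins fall back to the bin centre.
r_binned = np.array([y[idx == k].mean() if np.any(idx == k) else c
                     for k, c in enumerate(centres)])

# Calibrated decision function: interpolate the binned averages.
f_cal = np.interp(f_raw, centres, r_binned)

R_true = np.tanh(x)                        # exact R for this synthetic setup, as in eq. (9)
print(np.mean(np.abs(f_raw - R_true)),     # error of the raw decision function
      np.mean(np.abs(f_cal - R_true)))     # error after calibration (smaller)
```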


Validation

Once you have derived a classifier, you need to validate it on test data. The test data should be different from the data used to train the classifier, otherwise the skill scores will be unduly optimistic; this is known as cross-validation. The confusion matrix expresses everything about the accuracy of a discrete classifier on a given dataset, and any possible skill score can be composed from it. Here I will discuss two measures that are relatively rare in the literature; their value should become clear in what follows.

The most basic skill score is the accuracy:

$$A = \frac{n_{TP} + n_{TN}}{n_t}$$

The maximum-likelihood classification algorithm maximizes the accuracy. Accuracy nonetheless has its limitations, which the alternative measures below help to address.

The first is the uncertainty coefficient. This measure is based on the channel capacity from Shannon's information theory, so we first need to define the information entropy. For a discrete probability distribution this is:

$$H(i) = -\sum_i P_i \log_2 P_i$$

which tells us how many bits are needed to represent event i when its prior distribution is $P_i$. The measure can be extended to multivariate distributions; the conditional entropy is:

$$H(i|j) = -\sum_{i,j} P_{ij} \log_2 \frac{P_{ij}}{P_j}$$

Once these two quantities are defined, we can write down the uncertainty coefficient:

$$U(i|j) = \frac{H(i) - H(i|j)}{H(i)} \qquad (14)$$

This tells us how much information a single classification result, j, gives us about the true class value, i. That makes it an excellent skill score: the lowest possible value, 0, means the classifier provides no information about the true classes, while the highest possible value, 1, means it provides complete information.

For binary classifiers, I also recommend the Pearson correlation coefficient:

$$r = \frac{n_{TP}\,n_{TN} - n_{FP}\,n_{FN}}{\sqrt{(n_{TP}+n_{FP})\,(n_{TP}+n_{FN})\,(n_{TN}+n_{FP})\,(n_{TN}+n_{FN})}} \qquad (15)$$
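
The sketch below computes these skill scores from a binary confusion matrix; the counts are assumed purely for illustration.

```python
import numpy as np

# Minimal sketch: accuracy, uncertainty coefficient and correlation coefficient
# computed from an assumed binary confusion matrix.
n_TN, n_FP, n_FN, n_TP = 50, 10, 5, 35
n_t = n_TN + n_FP + n_FN + n_TP

acc = (n_TP + n_TN) / n_t                          # accuracy

# Joint probability approximation of the confusion matrix (rows = true class).
P = np.array([[n_TN, n_FP],
              [n_FN, n_TP]]) / n_t
P_i = P.sum(axis=1)                                # true-class marginals
P_j = P.sum(axis=0)                                # predicted-class marginals

H_i = -np.sum(P_i * np.log2(P_i))                  # entropy of the true classes
H_i_given_j = -np.sum(P * np.log2(P / P_j))        # conditional entropy H(i|j)
U = (H_i - H_i_given_j) / H_i                      # uncertainty coefficient, eq. (14)

# Pearson correlation coefficient of equation (15).
r = (n_TP * n_TN - n_FP * n_FN) / np.sqrt(
        (n_TP + n_FP) * (n_TP + n_FN) * (n_TN + n_FP) * (n_TN + n_FN))

print(acc, U, r)
```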


Finally, for a continuous decision function rather than a discrete binary classifier, we can measure the average skill over all possible thresholds by calculating the area under the ROC curve.

For a perfect discriminator, the ROC curve jumps to H = 1 at F = 0 and stays there across the unit square, so the area is 1. A classifier with an area of 0 is also a perfect discriminator, except its decisions are exactly reversed, while a classifier with no discriminating power at all sits along the diagonal, with an area of 0.5.

Note, for instance, how the area under the example curves grows as the distance between the classes increases.


Multi-class classification

We have spent a lot of time on binary classifiers. Suppose the only suitable statistical classifier we have is a binary one; how do we generalize it to classification problems with more than two classes? Probability theory can be used to derive an answer.

Suppose we design a set of binary classifiers by repeatedly partitioning the classes into two sets. A coding matrix, A, describes the partitioning: the element in the ith row and jth column tells us how the jth class label is treated by the ith binary classifier; it is mapped to -1 or +1 for training, or to 0 if the class is excluded entirely.

The conditional probabilities of the multi-class problem are related to those of the binary classifiers as follows:

$$R_i(\bar x) = \frac{\sum_j a_{ij}\,P(j|\bar x)}{\sum_j |a_{ij}|\,P(j|\bar x)} \qquad (16)$$

With some rearrangement, and using the fact that the multi-class conditional probabilities sum to one when no classes are excluded, this can be turned into a linear system:

$$\sum_j a_{ij}\,P(j|\bar x) = R_i(\bar x) \qquad (17)$$

where $R_i$ is the difference in conditional probabilities returned by the ith binary classifier.

As an example, consider multi-class classification by the “one versus the rest” method, in which each class in turn is compared against all of the others. The coding matrix is closely related to the Kronecker delta:

$$a_{ij} = 2\delta_{ij} - 1 = \begin{cases} +1, & i = j \\ -1, & i \ne j \end{cases} \qquad (18)$$

So far we have assumed that the conditional probabilities of the binary classifiers are estimated correctly. Otherwise, we need to apply constraints to the multi-class probabilities we obtain. Ignoring the second argument, a conditional probability has the same properties as a univariate probability. First, the probabilities should all sum to one:

$$\sum_i P(i|\bar x) = 1 \qquad (19)$$

Second, they should all be non-negative:

$$P(i|\bar x) \ge 0$$

The equality constraint in (19), the normalization constraint, is the easiest to enforce. One way to do this is to introduce a “slack” variable:

$$\begin{bmatrix} Q & \mathbf{1} \\ \mathbf{1}^{\mathrm T} & 0 \end{bmatrix} \begin{bmatrix} \bar p \\ \lambda \end{bmatrix} = \begin{bmatrix} \bar b \\ 1 \end{bmatrix} \qquad (20)$$

where $Q\bar p = \bar b$ is the linear system for the unconstrained problem and λ is the slack variable, which plays the role of a Lagrange multiplier.

When we classify multiple classes by the one-versus-the-rest method, comparing each class in turn against all of the others, this is all we need: once the normalization constraint is enforced, everything else falls into place and the solution contains only positive elements.

Note that because the system of equations is over-determined, it has to be solved as a least-squares problem, with one caveat: the normalization must be applied separately from the least-squares minimization. In other words, we first form the normal equations from (17) and then plug these into (20). To learn more about normal equations, check out my follow-up article, Mastering Singular Value Decomposition.
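
The sketch below puts equations (17) through (20) together for an assumed four-class, one-versus-the-rest problem: the binary decision values are simulated from a known probability vector, and the normalization-constrained least-squares system recovers it. It is only an illustration of the algebra, not code from libAGF.

```python
import numpy as np

# Minimal sketch of the multi-class reconstruction, using an assumed 4-class,
# one-versus-the-rest coding. The binary outputs r are simulated from a known
# probability vector; the constrained system of equation (20) recovers it.
n_cls = 4
A = 2 * np.eye(n_cls) - 1                  # coding matrix of equation (18)

p_true = np.array([0.1, 0.2, 0.3, 0.4])    # assumed multi-class conditional probabilities
rng = np.random.default_rng(4)
r = A @ p_true + 0.02 * rng.normal(size=n_cls)   # slightly noisy R_i values

# Normal equations of the linear system A p = r (equation 17) ...
Q = A.T @ A
b = A.T @ r

# ... bordered with the normalization constraint sum(p) = 1, as in equation (20).
M = np.zeros((n_cls + 1, n_cls + 1))
M[:n_cls, :n_cls] = Q
M[:n_cls, n_cls] = 1.0
M[n_cls, :n_cls] = 1.0
rhs = np.append(b, 1.0)

sol = np.linalg.solve(M, rhs)
p_est, lam = sol[:n_cls], sol[n_cls]
print(p_est, p_est.sum())                  # close to p_true; sums to one exactly
```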


Problems

This list of problems will help you learn Bayes' theorem and probability theory and derive some useful formulas related to statistical classification. It should also get you thinking about some of the more fundamental questions in the field.

1. Why does fitting the logistic classifier in (10) have to be nonlinear? What is the advantage of this?

2. Do some research online to find nonlinear optimization algorithms suitable for fitting the logistic classifier.

3. Derive equation (12). (This one is quite difficult.) How important do you think it is to correct for the class distribution? Explain your reasoning.

4. How would you calculate the ROC curves in the figure? Fill in the missing steps between equations (8) and (9) and then calculate the ROC curves.

5. Derive equation (13).

6. List the advantages of the uncertainty coefficient and of the correlation coefficient (for binary classifiers) as measures of classification skill. Consider: (a) what happens when the class labels are rearranged, and (b) what happens when the distribution of class labels in the test data is changed? How does this affect the outcome?

7. Derive equation (15) from the general formula for the Pearson correlation coefficient. Note: this is not a trivial exercise.

8. Correlation coefficients are generally not applicable to multi-class classification problems. Why? What kind of problem is the exception?

9. Derive equation (17) from equation (16). Hint: what property of $P(i|\bar x)$ do you need to complete the derivation?

10. The “one versus the rest” coding matrix in (18) allows a simplified version of equation (17) to be used. Explain why.

11. Write down the coding matrix for the “one versus one” method of multi-class classification.

12. Find some statistical classification data online, or create some yourself, for example by classifying the pixels in an image. Perform a statistical classification by fitting a multi-dimensional Gaussian to each class:

$$P(\bar x|i) = \frac{1}{\sqrt{(2\pi)^D |\Sigma_i|}} \exp\left[-\frac{1}{2}\,(\bar x - \bar\mu_i)^{\mathrm T}\,\Sigma_i^{-1}\,(\bar x - \bar\mu_i)\right]$$

where $\Sigma_i$ is the covariance matrix, $\bar\mu_i$ is the arithmetic mean, and D is the dimension of the features. Don't forget to evaluate the accuracy of your results by dividing the data into separate test and training sets.


Conclusion

More than Bayesian statistical classification itself, what I want you to take away from this article is that mathematical models are not closed systems: they can be extended, re-purposed, and recombined.

The applications of probability theory and Bayes' theorem, and the problems we can solve with them, are limited only by our imagination. Here we have shown a number of ways to use these tools to help sophisticated machine learning algorithms gain a foothold.

Original link: blog.statsbot.co/bayesian-le…


Compiled by Heart of the Machine

Contributors: Liu Xiaokun, Lu Xue

This article was compiled by Heart of the Machine. Please contact this official account for authorization before reprinting.