This article is part of the foundational material for the CS229 machine learning course at Stanford University; the original file can be downloaded at [1].

Arian Maleki, Tom Do

Translation: Shi Zhenyu [2]

Review and revision: Huang Haiguang [3]

Note: Please follow the GitHub repository [4] for updates. For the companion translation of the linear algebra review, see (this article).

CS229 Machine Learning Recitation Materials – Probability Theory

Review of probability theory

Probability theory is the study of uncertainty. In this course, we will rely on concepts from probability theory to derive machine learning algorithms. These notes attempt to cover the fundamentals of probability theory as they apply to CS229. The mathematics of probability theory is very sophisticated and involves a branch of analysis known as measure theory. In these notes, we provide a basic treatment of probability, but we won’t get into these more complex details.

1. The basics of probability

In order to define a probability on a set, we need a few basic elements:

  • Sample space $\Omega$: The set of all possible outcomes of a random experiment. Here, each outcome $\omega \in \Omega$ can be considered a complete description of the state of the real world at the end of the experiment.

  • Event space (set of events) $\mathcal{F}$: A set whose elements $A \in \mathcal{F}$ (called events) are subsets of $\Omega$ (that is, each $A \subseteq \Omega$ is a collection of possible outcomes of an experiment).

    Note: $\mathcal{F}$ must satisfy the following three conditions:

    (1) $\emptyset \in \mathcal{F}$

    (2) $A \in \mathcal{F} \Longrightarrow \Omega \setminus A \in \mathcal{F}$

    (3) $A_1, A_2, \ldots \in \mathcal{F} \Longrightarrow \cup_i A_i \in \mathcal{F}$

  • Probability measure: A function $P : \mathcal{F} \to \mathbb{R}$ that satisfies the following properties:

  • For every $A \in \mathcal{F}$, $P(A) \geq 0$, and $P(\Omega) = 1$

  • If $A_1, A_2, \ldots$ are disjoint events (i.e., $A_i \cap A_j = \emptyset$ whenever $i \neq j$), then:

$$P(\cup_i A_i) = \sum_i P(A_i)$$

These three properties are known as the axioms of probability.

For example:

Consider the experiment of rolling a six-sided die. The sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$. The simplest event space is the trivial event space $\mathcal{F} = \{\emptyset, \Omega\}$. Another event space is the set of all subsets of $\Omega$. For the first event space, the unique probability measure satisfying the above requirements is given by $P(\emptyset) = 0$, $P(\Omega) = 1$. For the second event space, a valid probability measure assigns each event $A$ in the event space the probability $\frac{|A|}{6}$, where $|A|$ is the number of elements in that event; for example, $P(\{1, 2, 3, 4\}) = \frac{4}{6}$ and $P(\{1, 2, 3\}) = \frac{3}{6}$.
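To make this concrete, here is a minimal Python sketch (an illustration added to this translation, not part of the original notes) that builds the power-set event space for the die and checks the three axioms for $P(A) = |A|/6$:

```python
from itertools import combinations

omega = {1, 2, 3, 4, 5, 6}                     # sample space for one die roll

# Event space: the set of all subsets of omega (the power set)
F = [set(c) for r in range(len(omega) + 1) for c in combinations(sorted(omega), r)]

P = lambda A: len(A) / len(omega)              # probability measure P(A) = |A|/6

assert all(P(A) >= 0 for A in F)               # axiom 1: non-negativity
assert P(omega) == 1                           # axiom 2: P(Omega) = 1
A, B = {1, 2}, {3, 4}                          # disjoint events
assert P(A | B) == P(A) + P(B)                 # axiom 3: additivity for disjoint events
print(P({1, 2, 3, 4}))                         # 0.666..., i.e. 4/6, matching the example
```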

Properties:

  • If $A \subseteq B$, then $P(A) \leq P(B)$.
  • (Union bound, i.e., Boole’s inequality): $P(A \cup B) \leq P(A) + P(B)$.
  • (Law of total probability): If $A_1, \ldots, A_k$ are disjoint events whose union is $\Omega$, then $\sum_{i=1}^{k} P(A_i) = 1$.

1.1 Conditional probability and independence

Assuming that $A$ is an event with non-zero probability, we define the conditional probability of $B$ given $A$ as:

$$P(B \mid A) \triangleq \frac{P(A \cap B)}{P(A)}$$

In other words, $P(B \mid A)$ is a measure of the probability of event $B$ after event $A$ has been observed. Two events are said to be independent if and only if $P(A \cap B) = P(A)P(B)$ (or, equivalently, $P(A \mid B) = P(A)$). Thus, independence is equivalent to saying that observing event $B$ has no effect on the probability of event $A$.
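As a quick numerical illustration (added here, not in the original notes), the following sketch computes a conditional probability for the die example and checks an independence relation:

```python
omega = {1, 2, 3, 4, 5, 6}
P = lambda A: len(A) / len(omega)

A = {2, 4, 6}                    # "the roll is even"
B = {4, 5, 6}                    # "the roll is at least 4"
print(P(A & B) / P(A))           # P(B | A) = P(A ∩ B) / P(A) = (2/6)/(3/6) = 2/3

C = {1, 2}                       # "the roll is at most 2"
print(P(A & C), P(A) * P(C))     # both 1/6: A and C are independent
```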

2. Random variables

Consider an experiment where we flip 10 coins and we want to know the number of heads. Here, the elements of the sample space $\Omega$ are length-10 sequences of heads and tails. For example, we might have $\omega_0 = \langle H, H, T, H, T, H, H, T, T, T \rangle \in \Omega$. In practice, however, we generally do not care about the probability of obtaining any particular sequence of heads and tails. Instead, we usually care about real-valued functions of the outcome, such as the number of heads among our 10 flips, or the length of the longest run of tails. Under some technical conditions, these functions are called random variables.

More formally, a random variable $X$ is a function $X : \Omega \to \mathbb{R}$. In general, we will use capital letters $X(\omega)$ or simply $X$ (with the dependence on the random outcome $\omega$ implied) to represent random variables. We will use lowercase letters $x$ to represent the values of random variables.

For example: In our experiment above, suppose $X(\omega)$ is the number of heads that occur in the sequence of flips $\omega$. Given that only 10 coins are flipped, $X(\omega)$ can take only a finite number of values, so it is called a discrete random variable. Here, the probability of the set associated with the random variable $X$ taking a particular value $k$ is:

$$P(X = k) \triangleq P(\{\omega : X(\omega) = k\})$$

Example: Suppose $X(\omega)$ is a random variable that represents the time required for a radioactive particle to decay. In this case, $X(\omega)$ takes infinitely many possible values, so it is called a continuous random variable. We express the probability that $X$ lies between two real constants $a$ and $b$ (where $a < b$) as:

$$P(a \leq X \leq b) \triangleq P(\{\omega : a \leq X(\omega) \leq b\})$$
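A random variable is literally a function on the sample space. Here is a minimal Python sketch of the coin-flip example above (illustrative only, not from the original notes):

```python
import random

# An outcome omega: a length-10 sequence of heads ('H') and tails ('T')
def draw_outcome():
    return tuple(random.choice("HT") for _ in range(10))

# The random variable X maps each outcome to a real number: the number of heads
X = lambda outcome: outcome.count("H")

omega0 = draw_outcome()
print(omega0)        # e.g. ('H', 'H', 'T', 'H', 'T', 'H', 'H', 'T', 'T', 'T')
print(X(omega0))     # e.g. 5
```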

2.1 Cumulative distribution function

In order to specify the probability measure used when dealing with random variables, it is often convenient to specify alternative functions (the CDF, PDF, and PMF) from which the probability measure governing an experiment immediately follows. In this section and the next two, we describe each of these types of functions in turn.

The cumulative distribution function (CDF) is a function $F_X : \mathbb{R} \to [0, 1]$ that specifies the probability measure as:

$$F_X(x) \triangleq P(X \leq x)$$

By using this function, we can calculate the probability of any event. Figure 1 shows a sample CDF.

Figure 1: A sample cumulative distribution function (CDF).

Properties:

  • $0 \leq F_X(x) \leq 1$
  • $\lim_{x \to -\infty} F_X(x) = 0$
  • $\lim_{x \to \infty} F_X(x) = 1$
  • $x \leq y \Longrightarrow F_X(x) \leq F_X(y)$

2.2 Probability mass function

When a random variable takes a finite number of possible values (i.e., is a discrete random variable), a simpler way to represent the probability measure associated with the random variable is to directly specify the probability of each value that the random variable can assume. In particular, the probability mass function (PMF) is a function $p_X : \mathbb{R} \to [0, 1]$ such that:

$$p_X(x) \triangleq P(X = x)$$

In the case of discrete random variables, we use the notation $Val(X)$ for the set of possible values that the random variable $X$ may assume. For example, if $X(\omega)$ is a random variable representing the number of heads in ten coin flips, then $Val(X) = \{0, 1, 2, \ldots, 10\}$.

Properties:

  • $0 \leq p_X(x) \leq 1$
  • $\sum_{x \in Val(X)} p_X(x) = 1$
  • $\sum_{x \in A} p_X(x) = P(X \in A)$
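The following sketch (an addition for this translation) checks these properties numerically for the PMF of the coin-flip example, using the binomial formula introduced in Section 2.6:

```python
from math import comb

n, p = 10, 0.5
# PMF of X = number of heads in 10 fair coin flips, i.e. X ~ Binomial(10, 0.5)
p_X = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

assert all(0 <= v <= 1 for v in p_X.values())   # 0 <= p_X(x) <= 1
print(sum(p_X.values()))                        # 1.0: the PMF sums to one over Val(X)

A = {0, 1, 2}                                   # the event "at most two heads"
print(sum(p_X[k] for k in A))                   # P(X in A) ≈ 0.0547
```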

2.3 Probability density function

For some continuous random variables, the cumulative distribution function $F_X(x)$ is differentiable everywhere. In these cases, we define the probability density function (PDF) as the derivative of the cumulative distribution function, namely:

$$f_X(x) \triangleq \frac{dF_X(x)}{dx}$$

Note that the probability density function of a continuous random variable may not always exist (namely, when $F_X(x)$ is not differentiable everywhere).

By the properties of differentiation, for a very small $\Delta x$:

$$P(x \leq X \leq x + \Delta x) \approx f_X(x)\,\Delta x$$

Both CDFs and PDFs (when they exist!) can be used to calculate the probabilities of different events. It should be emphasized, however, that the value of the probability density function (PDF) at any given point $x$ is not the probability of that event, i.e., $f_X(x) \neq P(X = x)$. For example, $f_X(x)$ can take values greater than 1 (but the integral of $f_X(x)$ over any subset of $\mathbb{R}$ is at most 1).

Properties:

  • $f_X(x) \geq 0$
  • $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
  • $\int_{x \in A} f_X(x)\,dx = P(X \in A)$
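A short numerical check of these properties (an illustration added to this translation, assuming SciPy is available), using the exponential density from Section 2.6; note that the density value at 0 exceeds 1 even though all probabilities are at most 1:

```python
import numpy as np
from scipy.integrate import quad

lam = 2.0
f = lambda x: lam * np.exp(-lam * x)          # Exponential(lambda) density, x >= 0

total, _ = quad(f, 0, np.inf)
print(total)                                  # ~1.0: the PDF integrates to one
prob, _ = quad(f, 0.5, 1.5)
print(prob)                                   # P(0.5 <= X <= 1.5) as an integral of the PDF
print(f(0.0))                                 # 2.0 > 1: a density value is not a probability
```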

2.4 Expectation

Suppose $X$ is a discrete random variable whose PMF is $p_X(x)$, and $g : \mathbb{R} \to \mathbb{R}$ is an arbitrary function. In this case, $g(X)$ can be regarded as a random variable, and we define the expected value of $g(X)$ as:

$$E[g(X)] \triangleq \sum_{x \in Val(X)} g(x)\,p_X(x)$$

If $X$ is a continuous random variable whose PDF is $f_X(x)$, then the expected value of $g(X)$ is defined as:

$$E[g(X)] \triangleq \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx$$

Intuitively, the expected value of $g(X)$ can be thought of as a “weighted average” of the values $g(x)$ takes for different values of $x$, where the weights are given by $p_X(x)$ or $f_X(x)$. As a special case, note that the expected value of the random variable itself, $E[X]$, is obtained by letting $g(x) = x$; this is also referred to as the mean of the random variable $X$.

Properties:

  • For an arbitrary constant $a \in \mathbb{R}$, $E[a] = a$.
  • For an arbitrary constant $a \in \mathbb{R}$, $E[a\,g(X)] = a\,E[g(X)]$.
  • (Linearity of expectation): $E[f(X) + g(X)] = E[f(X)] + E[g(X)]$.
  • For a discrete random variable $X$, $E[\mathbf{1}\{X = k\}] = P(X = k)$.
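These properties can be verified numerically for a discrete random variable; a minimal sketch (an addition for this translation), reusing the binomial PMF from the earlier example:

```python
from math import comb

n, p = 10, 0.5
p_X = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

def E(g):
    """E[g(X)] = sum over Val(X) of g(x) * p_X(x)."""
    return sum(g(x) * px for x, px in p_X.items())

print(E(lambda x: x))                                        # mean of X: 5.0
# Linearity of expectation: E[f(X) + g(X)] = E[f(X)] + E[g(X)]
print(E(lambda x: x + x**2), E(lambda x: x) + E(lambda x: x**2))
# Indicator property: E[1{X = k}] = P(X = k)
print(E(lambda x: 1.0 if x == 3 else 0.0), p_X[3])
```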

2.5 Variance

The variance of a random variable $X$ is a measure of how concentrated the distribution of $X$ is around its mean. Formally, the variance of the random variable $X$ is defined as:

$$Var[X] \triangleq E[(X - E[X])^2]$$

Using the properties in the previous section, we can derive an alternative expression for the variance:

$$E[(X - E[X])^2] = E[X^2 - 2E[X]X + (E[X])^2] = E[X^2] - 2E[X]E[X] + (E[X])^2 = E[X^2] - (E[X])^2$$

The second equality follows from the linearity of expectation and the fact that $E[X]$ is actually a constant with respect to the outer expectation.

Properties:

  • For an arbitrary constant $a \in \mathbb{R}$, $Var[a] = 0$.
  • For an arbitrary constant $a \in \mathbb{R}$, $Var[a X] = a^2\,Var[X]$.

For example:

Calculate the mean and variance of the uniform random variable $X$ whose PDF is $f_X(x) = 1$ for all $x \in [0, 1]$, and 0 elsewhere.

$$E[X] = \int_0^1 x\,dx = \frac{1}{2} \qquad E[X^2] = \int_0^1 x^2\,dx = \frac{1}{3}$$

$$Var[X] = E[X^2] - (E[X])^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$$
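A quick Monte Carlo sanity check of this calculation (illustrative only, added for this translation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # samples from Uniform(0, 1)

print(x.mean())    # ~0.5     (analytic: E[X] = 1/2)
print(x.var())     # ~0.0833  (analytic: Var[X] = 1/12 ≈ 0.08333)
```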

For example:

Suppose that $g(x) = \mathbf{1}\{x \in A\}$ for some subset $A \subseteq \Omega$. What is $E[g(X)]$?

Discrete case:

$$E[g(X)] = \sum_{x \in Val(X)} \mathbf{1}\{x \in A\}\,p_X(x) = \sum_{x \in A} p_X(x) = P(X \in A)$$

Continuous case:

$$E[g(X)] = \int_{-\infty}^{\infty} \mathbf{1}\{x \in A\}\,f_X(x)\,dx = \int_{x \in A} f_X(x)\,dx = P(X \in A)$$

2.6 Some common random variables

Discrete random variables

  • Bernoulli distribution, $X \sim \text{Bernoulli}(p)$ (where $0 \leq p \leq 1$): 1 if a coin with heads probability $p$ comes up heads, 0 otherwise; $p(x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$.
  • Binomial distribution, $X \sim \text{Binomial}(n, p)$ (where $0 \leq p \leq 1$): the number of heads in $n$ independent flips of a coin with heads probability $p$; $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x \in \{0, 1, \ldots, n\}$.
  • Geometric distribution, $X \sim \text{Geometric}(p)$ (where $p > 0$): the number of flips of a coin with heads probability $p$ until the first heads; $p(x) = p(1-p)^{x-1}$ for $x = 1, 2, \ldots$.
  • Poisson distribution, $X \sim \text{Poisson}(\lambda)$ (where $\lambda > 0$): a probability distribution over the non-negative integers used to model the frequency of rare events; $p(x) = e^{-\lambda}\frac{\lambda^x}{x!}$ for $x = 0, 1, \ldots$.

Continuous random variables

  • Uniform distribution, $X \sim \text{Uniform}(a, b)$ (where $a < b$): equal probability density at every point in the interval $(a, b)$; $f(x) = \frac{1}{b-a}$ for $a \leq x \leq b$, and 0 elsewhere.
  • Exponential distribution, $X \sim \text{Exponential}(\lambda)$ (where $\lambda > 0$): a decaying probability density on the non-negative reals; $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$, and 0 elsewhere.
  • Normal distribution, $X \sim \mathcal{N}(\mu, \sigma^2)$: also known as the Gaussian distribution; $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.

The shapes of probability density functions and cumulative distribution functions of some random variables are shown in Figure 2.

Figure 2: Probability density functions (PDF) and cumulative distribution functions (CDF) of some random variables.

The following table summarizes some properties of these distributions:

| Distribution | PDF or PMF | Mean | Variance |
| --- | --- | --- | --- |
| $\text{Bernoulli}(p)$ | $p^x(1-p)^{1-x}$, $x \in \{0,1\}$ | $p$ | $p(1-p)$ |
| $\text{Binomial}(n, p)$ | $\binom{n}{x} p^x (1-p)^{n-x}$, $x \in \{0,\ldots,n\}$ | $np$ | $np(1-p)$ |
| $\text{Geometric}(p)$ | $p(1-p)^{x-1}$, $x = 1, 2, \ldots$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ |
| $\text{Poisson}(\lambda)$ | $e^{-\lambda}\frac{\lambda^x}{x!}$, $x = 0, 1, \ldots$ | $\lambda$ | $\lambda$ |
| $\text{Uniform}(a, b)$ | $\frac{1}{b-a}$, $x \in (a, b)$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
| $\text{Gaussian}(\mu, \sigma^2)$ | $\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mu$ | $\sigma^2$ |
| $\text{Exponential}(\lambda)$ | $\lambda e^{-\lambda x}$, $x \geq 0$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |
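The means and variances in the table can be cross-checked with SciPy’s frozen distributions; a short sketch (an addition for this translation, assuming SciPy is available; the parameter values are arbitrary):

```python
from scipy import stats

# Frozen distributions matching the table's parameterizations
dists = {
    "Bernoulli(p=0.3)":   stats.bernoulli(0.3),
    "Binomial(10, 0.3)":  stats.binom(10, 0.3),
    "Geometric(0.3)":     stats.geom(0.3),
    "Poisson(4)":         stats.poisson(4),
    "Uniform(2, 5)":      stats.uniform(loc=2, scale=3),   # a=2, b=5
    "Gaussian(0, 1)":     stats.norm(0, 1),
    "Exponential(lam=2)": stats.expon(scale=1 / 2),        # scale = 1/lambda
}
for name, d in dists.items():
    print(f"{name:20s} mean={d.mean():.4f} var={d.var():.4f}")
```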

3. Two random variables

So far, we have considered single random variables. In many cases, however, a randomized experiment may involve more than one quantity of interest. For example, in an experiment where we flip a coin ten times, we may care both about the number of heads that come up and about the length of the longest run of consecutive heads. In this section, we consider the setting of two random variables.

3.1 Joint and marginal distributions

Suppose we have two random variables $X$ and $Y$. One way to study them is to consider each of them separately; if we do that, all we need are $F_X(x)$ and $F_Y(y)$. But if we want to know the values that $X$ and $Y$ simultaneously assume in the outcome of a random experiment, we need a more complex structure called the joint cumulative distribution function of $X$ and $Y$, defined as follows:

$$F_{XY}(x, y) = P(X \leq x, Y \leq y)$$

It can be shown that by knowing the joint cumulative distribution function, we can calculate the probability of any event involving $X$ and $Y$.

The joint CDF $F_{XY}(x, y)$ and the distribution functions $F_X(x)$ and $F_Y(y)$ of each variable are related, respectively, by the following formulas:

$$F_X(x) = \lim_{y \to \infty} F_{XY}(x, y) \qquad F_Y(y) = \lim_{x \to \infty} F_{XY}(x, y)$$

Here, we call $F_X(x)$ and $F_Y(y)$ the marginal cumulative distribution functions of $F_{XY}(x, y)$.

Properties:

  • $0 \leq F_{XY}(x, y) \leq 1$
  • $\lim_{x, y \to \infty} F_{XY}(x, y) = 1$
  • $\lim_{x, y \to -\infty} F_{XY}(x, y) = 0$
  • $F_X(x) = \lim_{y \to \infty} F_{XY}(x, y)$

3.2 Joint and marginal probability mass functions

If $X$ and $Y$ are discrete random variables, then the joint probability mass function $p_{XY} : \mathbb{R} \times \mathbb{R} \to [0, 1]$ is defined by the following formula:

$$p_{XY}(x, y) = P(X = x, Y = y)$$

Here, $0 \leq p_{XY}(x, y) \leq 1$ for all $x, y$, and $\sum_{x \in Val(X)} \sum_{y \in Val(Y)} p_{XY}(x, y) = 1$.

How is the joint PMF of the two variables related to the probability mass function of each variable separately? It turns out that:

$$p_X(x) = \sum_y p_{XY}(x, y)$$

and similarly for $p_Y(y)$. In this case, we call $p_X(x)$ the marginal probability mass function of $X$. In statistics, the process of summing out one variable to obtain the marginal distribution of another is often called “marginalization”.
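Marginalization is easy to see with a small joint PMF table; a minimal sketch (the numbers are hypothetical, added for this translation):

```python
import numpy as np

# Joint PMF p_XY as a table: rows index Val(X) = {0, 1}, columns Val(Y) = {0, 1, 2}
p_XY = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert np.isclose(p_XY.sum(), 1.0)      # a valid joint PMF sums to one

p_X = p_XY.sum(axis=1)   # marginalize out Y: p_X(x) = sum over y of p_XY(x, y)
p_Y = p_XY.sum(axis=0)   # marginalize out X
print(p_X, p_Y)          # [0.4 0.6] [0.35 0.35 0.3]
```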

3.3 Joint and marginal probability density functions

Suppose that $X$ and $Y$ are two continuous random variables with joint distribution function $F_{XY}$. In the case where $F_{XY}(x, y)$ is differentiable everywhere in both $x$ and $y$, we can define the joint probability density function:

$$f_{XY}(x, y) = \frac{\partial^2 F_{XY}(x, y)}{\partial x\,\partial y}$$

As in the one-dimensional case, $f_{XY}(x, y) \neq P(X = x, Y = y)$, but rather:

$$\iint_{(x, y) \in A} f_{XY}(x, y)\,dx\,dy = P((X, Y) \in A)$$

Note that the values of the probability density function $f_{XY}(x, y)$ are always non-negative, but they may be greater than 1. One thing is certain, though: $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1$.

Analogous to the discrete case, we define:

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy$$

as the marginal probability density function (or marginal density) of $X$, and similarly for $f_Y(y)$.

3.4 Conditional probability distribution

Conditional distributions try to answer the question: what is the probability distribution of $Y$ when we know that $X$ must take a certain value $x$? In the discrete case, the conditional probability mass function of $Y$ given $X$ is simple:

$$p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)}$$

assuming that the denominator $p_X(x) \neq 0$.

In the continuous case, the situation is technically a little more complicated, because the probability that a continuous random variable $X$ takes on any specific value $x$ is zero. Ignoring this technical point, we simply define, by analogy with the discrete case, the conditional probability density of $Y$ given $X = x$ as:

$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}$$

assuming that the denominator $f_X(x) \neq 0$.
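Continuing the small joint-PMF table from Section 3.2, the discrete conditional distribution is just a row-wise normalization; a minimal sketch (hypothetical numbers, added for this translation):

```python
import numpy as np

p_XY = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
p_X = p_XY.sum(axis=1)

# Conditional PMF: p_{Y|X}(y | x) = p_XY(x, y) / p_X(x), defined when p_X(x) != 0
p_Y_given_X = p_XY / p_X[:, None]
print(p_Y_given_X)               # row x holds the distribution of Y given X = x
print(p_Y_given_X.sum(axis=1))   # [1. 1.]: each conditional distribution sums to one
```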

3.5 Bayes’ theorem

A useful formula that often comes up when trying to derive a conditional probability expression for one variable given another is Bayes’ theorem.

For discrete random variables $X$ and $Y$:

$$P_{Y|X}(y|x) = \frac{P_{XY}(x, y)}{P_X(x)} = \frac{P_{X|Y}(x|y)\,P_Y(y)}{\sum_{y' \in Val(Y)} P_{X|Y}(x|y')\,P_Y(y')}$$

For continuous random variables $X$ and $Y$:

$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{X|Y}(x|y)\,f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x|y')\,f_Y(y')\,dy'}$$
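A classic discrete application of Bayes’ theorem, sketched with hypothetical numbers (this example is an addition for this translation): a test for a rare condition.

```python
# Prior P(Y = y): the condition is present (1) or absent (0)
p_y = {1: 0.01, 0: 0.99}
# Likelihood P(X = 1 | Y = y): probability of a positive test in each case
p_x_given_y = {1: 0.95, 0: 0.05}

# P(Y = 1 | X = 1) = P(X=1|Y=1) P(Y=1) / sum over y' of P(X=1|Y=y') P(Y=y')
evidence = sum(p_x_given_y[y] * p_y[y] for y in p_y)
posterior = p_x_given_y[1] * p_y[1] / evidence
print(posterior)   # ~0.161: a modest posterior despite a 95%-sensitive test
```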

3.6 Independence

Two random variables $X$ and $Y$ are independent if $F_{XY}(x, y) = F_X(x)\,F_Y(y)$ for all values of $x$ and $y$. Equivalently:

  • For discrete random variables, $p_{XY}(x, y) = p_X(x)\,p_Y(y)$ for all $x \in Val(X)$, $y \in Val(Y)$.
  • For discrete random variables, $p_{Y|X}(y|x) = p_Y(y)$ for all $y \in Val(Y)$ whenever $p_X(x) \neq 0$.
  • For continuous random variables, $f_{XY}(x, y) = f_X(x)\,f_Y(y)$ for all $x, y \in \mathbb{R}$.
  • For continuous random variables, $f_{Y|X}(y|x) = f_Y(y)$ for all $y \in \mathbb{R}$ whenever $f_X(x) \neq 0$.

Informally, two random variables $X$ and $Y$ are independent if “knowing” the value of one variable never has any effect on the conditional probability distribution of the other variable, that is, you know everything the pair $(X, Y)$ can tell you as long as you know $f_X(x)$ and $f_Y(y)$. The following lemma formalizes this observation:

Lemma 3.1

If $X$ and $Y$ are independent, then for any subsets $A, B \subseteq \mathbb{R}$, we have:

$$P(X \in A, Y \in B) = P(X \in A)\,P(Y \in B)$$

Using the lemma above, we can show that if $X$ is independent of $Y$, then any function of $X$ is independent of any function of $Y$.

3.7 Expectations and covariance

Suppose we have two discrete random variables $X$ and $Y$, and $g : \mathbb{R}^2 \to \mathbb{R}$ is a function of these two random variables. Then the expected value of $g$ is defined as follows:

$$E[g(X, Y)] \triangleq \sum_{x \in Val(X)} \sum_{y \in Val(Y)} g(x, y)\,p_{XY}(x, y)$$

For continuous random variables $X$ and $Y$, the analogous expression is:

$$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\,f_{XY}(x, y)\,dx\,dy$$

We can use the concept of expectation to study the relationship between two random variables. In particular, the covariance of two random variables $X$ and $Y$ is defined as:

$$Cov[X, Y] \triangleq E[(X - E[X])(Y - E[Y])]$$

Using a derivation similar to that for the variance, we can rewrite this as:

$$Cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY - X E[Y] - Y E[X] + E[X] E[Y]] = E[XY] - E[X] E[Y] - E[Y] E[X] + E[X] E[Y] = E[XY] - E[X] E[Y]$$

Here, the key step in showing that the two forms of covariance are equal is the third equality, where we use the fact that $E[X]$ and $E[Y]$ are actually constants and can be pulled out of the expectation. When $Cov[X, Y] = 0$, we say that $X$ and $Y$ are uncorrelated.

Properties:

  • (Linearity of expectation): $E[f(X, Y) + g(X, Y)] = E[f(X, Y)] + E[g(X, Y)]$.
  • If $X$ and $Y$ are independent of each other, then $Cov[X, Y] = 0$.
  • If $X$ and $Y$ are independent of each other, then $E[f(X)\,g(Y)] = E[f(X)]\,E[g(Y)]$.
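These covariance facts can be checked by sampling; a minimal Monte Carlo sketch (an addition for this translation, with arbitrarily chosen variables):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)            # X ~ N(0, 1)
z = rng.normal(size=1_000_000)            # Z ~ N(0, 1), independent of X
y = 2 * x + z                             # Y depends on X

cov = lambda a, b: np.mean(a * b) - a.mean() * b.mean()   # Cov = E[XY] - E[X]E[Y]
print(cov(x, y))                          # ~2.0: X and Y are correlated
print(cov(x, z))                          # ~0.0: independence implies zero covariance
# Independence also factorizes expectations: E[f(X) g(Z)] = E[f(X)] E[g(Z)]
print(np.mean(x**2 * z**2), np.mean(x**2) * np.mean(z**2))   # both ~1.0
```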

4. Multiple random variables

The concepts and ideas introduced in the previous section can be generalized to more than two random variables. In particular, suppose we have $n$ continuous random variables $X_1(\omega), X_2(\omega), \ldots, X_n(\omega)$. In this section, for the sake of simplicity, we focus only on the continuous case; the generalization to discrete random variables works similarly.

4.1 Basic Properties

The joint cumulative distribution function, joint probability density function, and marginal probability density function can be defined as:

$$F_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = P(X_1 \leq x_1, \ldots, X_n \leq x_n)$$

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \frac{\partial^n F_{X_1, \ldots, X_n}(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n}$$

$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\,dx_2 \cdots dx_n$$

To calculate the probability of an event $A \subseteq \mathbb{R}^n$, we have:

$$P((x_1, \ldots, x_n) \in A) = \int_{(x_1, \ldots, x_n) \in A} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$

The chain rule: From the definition of conditional probability for multiple random variables, it can be shown that:

$$f(x_1, \ldots, x_n) = f(x_n | x_1, \ldots, x_{n-1})\,f(x_1, \ldots, x_{n-1}) = \cdots = f(x_1) \prod_{i=2}^{n} f(x_i | x_1, \ldots, x_{i-1})$$

Independence: For multiple events $A_1, \ldots, A_k$, we say that they are mutually independent if for any subset $S \subseteq \{1, 2, \ldots, k\}$, we have:

$$P(\cap_{i \in S} A_i) = \prod_{i \in S} P(A_i)$$

Similarly, we say that the random variables $X_1, \ldots, X_n$ are independent if:

$$f(x_1, \ldots, x_n) = f(x_1)\,f(x_2) \cdots f(x_n)$$

Here, the definition of mutual independence is just a natural extension of the independence of two random variables to multiple random variables.

Independence assumptions appear frequently in machine learning algorithms, where we assume that the training examples belonging to the training set represent independent samples from some unknown probability distribution. To illustrate the importance of independence, consider a “bad” training set constructed by first drawing a single training example $(x^{(1)}, y^{(1)})$ from some unknown distribution and then adding $m - 1$ exact copies of that same example to the training set. In this case, we have (with some abuse of notation):

$$P((x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})) \neq \prod_{i=1}^{m} P(x^{(i)}, y^{(i)})$$

Although the training set has size $m$, these examples are not independent! While the process described here is clearly not a sensible way to build a training set for a machine learning algorithm, it turns out that non-independence of samples does often arise in practice, and it has the effect of reducing the “effective size” of the training set.

4.2 Random Vectors

Suppose we have $n$ random variables. When working with all of these random variables together, we often find it convenient to put them in a vector $X = [X_1\ X_2\ \cdots\ X_n]^T$. We call the resulting vector a random vector (more formally, a random vector is a mapping from $\Omega$ to $\mathbb{R}^n$). It should be clear that random vectors are simply an alternative notation for dealing with $n$ random variables, so the notions of joint PDF and joint CDF also apply to random vectors.

Expectation:

Consider an arbitrary function $g : \mathbb{R}^n \to \mathbb{R}$. The expected value of this function is defined as:

$$E[g(X)] = \int_{\mathbb{R}^n} g(x_1, \ldots, x_n)\,f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$$

where the integral is taken over the $n$-dimensional space $\mathbb{R}^n$. If $g$ is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$, then the expected value of $g$ is the vector of the expected values of the elements of the output vector, i.e., if $g$ is:

$$g(x) = \begin{bmatrix} g_1(x) \\ g_2(x) \\ \vdots \\ g_m(x) \end{bmatrix}$$

then

$$E[g(X)] = \begin{bmatrix} E[g_1(X)] \\ E[g_2(X)] \\ \vdots \\ E[g_m(X)] \end{bmatrix}$$

Covariance matrix: For a given random vector $X : \Omega \to \mathbb{R}^n$, its covariance matrix $\Sigma$ is the $n \times n$ square matrix whose entries are given by $\Sigma_{ij} = Cov[X_i, X_j]$. From the definition of covariance, we have:

$$\Sigma = E[(X - E[X])(X - E[X])^T] = E[XX^T] - E[X]E[X]^T$$

where the expectation of a matrix is defined in the obvious (element-wise) way. Covariance matrices have many useful properties:

  • $\Sigma \succeq 0$; in other words, $\Sigma$ is positive semidefinite.
  • $\Sigma = \Sigma^T$; in other words, $\Sigma$ is symmetric.
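Both properties can be checked numerically on an empirical covariance matrix; a minimal sketch (an addition for this translation, with an arbitrary mixing matrix to make the components correlated):

```python
import numpy as np

rng = np.random.default_rng(2)
# 100,000 samples of a 3-dimensional random vector with correlated components
X = rng.standard_normal((100_000, 3)) @ np.array([[1.0, 0.0, 0.0],
                                                  [0.5, 1.0, 0.0],
                                                  [0.2, 0.3, 1.0]])

# Empirical covariance: Sigma = E[X X^T] - E[X] E[X]^T
mu = X.mean(axis=0)
Sigma = (X.T @ X) / len(X) - np.outer(mu, mu)

print(np.allclose(Sigma, Sigma.T))                  # True: symmetric
print(np.all(np.linalg.eigvalsh(Sigma) >= -1e-10))  # True: positive semidefinite
```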

4.3 Multivariate Gaussian distribution

A particularly important example of a probability distribution over a random vector $X$ is the multivariate Gaussian or multivariate normal distribution. A random vector $X \in \mathbb{R}^n$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in S_{++}^n$ (where $S_{++}^n$ refers to the space of symmetric positive definite $n \times n$ matrices) if:

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

We write this as $X \sim \mathcal{N}(\mu, \Sigma)$. Note that in the case $n = 1$, this reduces to the ordinary normal distribution, where the mean parameter is $\mu_1$ and the variance is $\Sigma_{11}$.
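The density formula above can be evaluated directly and compared against a library implementation; a minimal sketch (an addition for this translation, assuming SciPy is available; $\mu$, $\Sigma$, and the query point are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # symmetric positive definite
x = np.array([0.5, 0.0])

# Density from the formula: (2*pi)^(-n/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
n = len(mu)
d = x - mu
quad_form = d @ np.linalg.solve(Sigma, d)
pdf_formula = np.exp(-0.5 * quad_form) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

print(pdf_formula)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # matches the formula
```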

In general, Gaussian random variables are useful in machine learning and statistics for two main reasons:

First, they are very common when modeling “noise” in statistical algorithms. Often, noise can be considered as the accumulation of a large number of small independent random perturbations affecting the measurement process; by the central limit theorem, sums of independent random variables tend to “look Gaussian”.

Secondly, Gaussian random variables are convenient for many analytical manipulations, because many integrals involving Gaussian distributions that arise in practice have simple closed-form solutions. We will encounter this situation later in the course.

5. Other resources

A good textbook at the level of probability required for CS229 is A First Course in Probability by Sheldon Ross.

References

[1] Original file download: cs229.stanford.edu/summer2019/…

[2] Shi Zhenyu: github.com/szy2120109

[3] Huang Haiguang: github.com/fengdu78

[4] GitHub: github.com/fengdu78/Da…

This article was first published on the public account “Machine Learning Beginners”