From the most basic notions of probability theory to the common probability distributions, this article summarizes the fundamental probability knowledge and concepts that can help us understand machine learning or broaden our horizons. These concepts lie at the heart of data science and crop up in a wide range of topics. It is always good to brush up on the basics, so that we can discover new things we did not understand before.


Introduction

In this series of articles, I want to explore some introductory concepts in statistics that might help us understand machine learning or broaden our horizons. These concepts are at the heart of data science and frequently crop up on a variety of topics. It's always good to brush up on the basics so that we can discover new things we didn't understand before. Let's get started.

Part ONE will cover the basics of probability theory.



What is probability?

Why do we need to learn probability theory when we already have such powerful mathematical tools? We use calculus to deal with functions that change infinitesimally and calculate their changes. We use algebra to solve equations, and we have dozens of other areas of mathematics to help us solve almost any kind of puzzle we can think of.

The difficulty is that we all live in a chaotic world, and most of the time it's impossible to measure things exactly. When we study real-world processes, we want to understand the many random events that influence the results of our experiments. Uncertainty is everywhere, and we must tame it to suit our needs. This is where probability theory and statistics come into play.

These disciplines are now at the centre of artificial intelligence, particle physics, social sciences, bioinformatics and everyday life.

If we’re going to talk about statistics, it’s best to decide what probability is. Actually, there is no absolute answer to this question. We are going to talk about various ideas of probability theory.

Frequentist probability

Imagine that we have a coin and want to verify whether heads and tails come up with the same frequency. How can we solve this problem? Let's run an experiment: record 1 each time heads comes up and 0 each time tails comes up. Repeat the toss 1,000 times and count the zeros and ones. Suppose that after this tedious experiment we got 600 heads (1) and 400 tails (0). If we compute the relative frequency of heads and tails, we get 60% and 40%, respectively. These frequencies can be interpreted as the probabilities of the coin coming up heads or tails. This is called the frequentist interpretation of probability.
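As a quick illustration, here is a minimal Python sketch of this experiment; the 0.6 bias is an assumption chosen only to mirror the 600/400 result described above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulate 1000 tosses of a biased coin: 1 = heads, 0 = tails.
# The 0.6 bias is assumed here just to mirror the 600/400 example above.
tosses = rng.binomial(n=1, p=0.6, size=1000)

heads = tosses.sum()
tails = len(tosses) - heads
print(f"heads: {heads} ({heads / len(tosses):.1%})")
print(f"tails: {tails} ({tails / len(tosses):.1%})")
```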

Conditional probability

In general, we often want to know the probability of one event occurring given that another event has occurred. The conditional probability of A given that B has occurred is written P(A | B). Take rain as an example:

  • What’s the probability of rain when it thunders?
  • What’s the probability of rain on a sunny day?
From the Euler diagram, we can see that P(Rain | Thunder) = 1: whenever we see thunder, it always rains (of course, this isn't quite true in reality, but we take it as given in this example).

What is P(Rain | Sunny)? Intuitively this probability is small, but how can we calculate it exactly? Conditional probability is defined as:

P(Rain | Sunny) = P(Rain, Sunny) / P(Sunny)

In other words, we divide the joint probability of Rain and Sunny by the probability of Sunny.
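To make the definition concrete, here is a small sketch using entirely made-up counts of observed days; the conditional probability is just the joint probability divided by the marginal:

```python
# Hypothetical counts out of 1000 observed days (made-up numbers).
rain_and_sunny = 10     # days that were sunny but it still rained
sunny = 400             # all sunny days
total = 1000

p_rain_and_sunny = rain_and_sunny / total   # P(Rain, Sunny)
p_sunny = sunny / total                     # P(Sunny)

# P(Rain | Sunny) = P(Rain, Sunny) / P(Sunny)
p_rain_given_sunny = p_rain_and_sunny / p_sunny
print(p_rain_given_sunny)  # 0.025
```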



Dependent events and independent events

If the occurrence of one event does not affect the probability of the other in any way, the events are said to be independent. Take the probability of rolling a die and getting a 2 twice in a row: these two events are independent. We can write this as

P(2 on the 1st roll, 2 on the 2nd roll) = P(2 on the 1st roll) * P(2 on the 2nd roll)

But why does this formula work? First, we rename the first-roll and second-roll events A and B to remove the semantic clutter, and then explicitly write the joint probability of the two rolls as the product of their individual probabilities:

P(A, B) = P(A) * P(B)

Now multiply and divide by P(B) (nothing changes, it cancels out) and recall the definition of conditional probability:

P(A | B) = P(A, B) / P(B) = P(A) * P(B) / P(B) = P(A)

Reading this chain from right to left, we find that P(A | B) = P(A). This means that event A is independent of event B! The same argument works for P(B | A), and that is the meaning of independent events.
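We can check this numerically. In the sketch below, two independent die rolls are simulated, and the estimated P(A | B) comes out approximately equal to P(A):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1_000_000

first = rng.integers(1, 7, size=n)   # first die roll, values 1..6
second = rng.integers(1, 7, size=n)  # second die roll, values 1..6

a = second == 2                      # event A: second roll is a 2
b = first == 2                       # event B: first roll is a 2

p_a = a.mean()
p_a_given_b = a[b].mean()            # estimate of P(A | B)
print(p_a, p_a_given_b)              # both close to 1/6 ≈ 0.1667
```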



Bayesian probability theory

Bayesian inference offers an alternative way of understanding probability. The frequentist approach assumes that there exists one optimal, concrete combination of model parameters that we are trying to find. The Bayesian approach, on the other hand, treats parameters probabilistically, regarding them as random variables in their own right. In Bayesian statistics, every parameter has its own probability distribution, which tells us how plausible different parameter values are given the data we already have. Mathematically, we can write this as P(θ | data), the distribution of the parameters θ given the observed data.

It all starts with a simple theorem that allows us to calculate conditional probabilities from prior knowledge:

P(A | B) = P(B | A) * P(A) / P(B)

Despite its simplicity, Bayes' theorem has enormous value and a vast range of applications; there is even a dedicated branch of statistics called Bayesian statistics. There are great blog posts on Bayes' theorem if you are interested in its derivation: it is not that hard.
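As a small worked example (the numbers are invented purely for illustration), Bayes' theorem lets us invert a conditional probability, for instance going from P(Thunder | Rain) to P(Rain | Thunder):

```python
# Invented numbers for illustration only.
p_rain = 0.2                  # prior P(Rain)
p_thunder_given_rain = 0.5    # likelihood P(Thunder | Rain)
p_thunder = 0.11              # evidence P(Thunder)

# Bayes' theorem: P(Rain | Thunder) = P(Thunder | Rain) * P(Rain) / P(Thunder)
p_rain_given_thunder = p_thunder_given_rain * p_rain / p_thunder
print(round(p_rain_given_thunder, 3))  # ≈ 0.909
```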



Sampling and statistics

Suppose we are studying the distribution of human height and are eager to publish an exciting scientific paper. We measure the heights of some strangers on the street; because the people are chosen at random, the measurements are independent. The process of randomly selecting a subset of data from the true population is called sampling. A statistic is a function used to summarize a pattern in the sampled values. One statistic you have probably seen is the sample mean:

mean(x) = (x_1 + x_2 + … + x_n) / n

Another example is the sample variance:

s^2 = Σ (x_i - mean(x))^2 / (n - 1)

This formula shows how far all data points deviate from the average.
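Both statistics are one-liners in NumPy. Here is a sketch with made-up height data; ddof=1 gives the n - 1 denominator of the sample variance:

```python
import numpy as np

# Made-up sample of heights in cm.
heights = np.array([172.0, 165.5, 180.2, 158.9, 175.3, 169.8])

sample_mean = heights.mean()
sample_var = heights.var(ddof=1)  # divide by n - 1, not n
print(sample_mean, sample_var)
```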



Distributions

What is a probability distribution? It is a law, expressed as a mathematical function, that tells us the probabilities of the different possible outcomes of an experiment. Each distribution may have some parameters that adjust its behavior.

When we calculated the relative frequencies of the coin flips, we were actually computing what is called an empirical probability distribution. It turns out that many uncertain processes in the world can be described in terms of probability distributions. For example, a single flip of our coin follows a Bernoulli distribution, and if we want to calculate the probability of getting a certain number of heads in n trials, we can use the binomial distribution.

In a probabilistic setting it is convenient to introduce an analogue of a variable: a random variable. Every random variable has a distribution. By convention, random variables are written with capital letters, and we use the ~ symbol to assign a distribution to a variable:

X ~ Bernoulli(0.6)

The above formula says that the random variable X follows a Bernoulli distribution with success rate (probability of heads) 0.6.



Continuous and discrete probability distributions

Probability distributions fall into two types. Discrete distributions deal with random variables that take finitely many values, such as coin flips and the Bernoulli distribution; they are defined by what is known as a probability mass function (PMF). Continuous distributions deal with random variables that can (in theory) take infinitely many values; think of the speed or acceleration measured by a sensor. Continuous distributions are defined by a probability density function (PDF).

The two types of distribution are treated differently mathematically: continuous distributions use integrals where discrete distributions use sums. Take the expected value:

Discrete: E(X) = Σ x * p(x)
Continuous: E(X) = ∫ x * f(x) dx
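The sketch below contrasts the two computations: a sum over the PMF of a fair die versus a numerical integral over the PDF of a uniform distribution (both expectations come out to 3.5 by symmetry):

```python
import numpy as np
from scipy.integrate import quad

# Discrete: E[X] = sum over x of x * p(x), for a fair six-sided die.
faces = np.arange(1, 7)
pmf = np.full(6, 1 / 6)
e_discrete = np.sum(faces * pmf)                    # 3.5

# Continuous: E[X] = integral of x * f(x) dx, for Uniform(1, 6).
a, b = 1.0, 6.0
pdf = lambda x: 1.0 / (b - a)
e_continuous, _ = quad(lambda x: x * pdf(x), a, b)  # 3.5
print(e_discrete, e_continuous)
```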

Below, we introduce the common probability distributions in detail. As mentioned above, they divide into distributions of discrete random variables and distributions of continuous random variables. The Bernoulli, binomial and Poisson distributions are common discrete distributions; the uniform, exponential and normal distributions are common continuous ones.



Common data types

Before explaining the various distributions, let’s look at the common data types, which can be divided into discrete and continuous types.

Discrete data: the data can only take specific values. For example, when you roll a die, the possible outcomes are 1, 2, 3, 4, 5 and 6, not 1.5 or 2.45.

Continuous data: the data can take any value within a given range, which may be finite or infinite. Examples include a girl's weight or height, or the length of a road. A girl's weight could be 54 kg, 54.5 kg, or 54.5436 kg.



Types of distributions

Bernoulli distribution

The simplest discrete random variable distribution is the Bernoulli distribution, and that’s where we start.

A Bernoulli distribution has only two possible outcomes, denoted 1 (success) and 0 (failure), and consists of a single trial. If a random variable X has a Bernoulli distribution, it takes the value 1 with success probability p and the value 0 with failure probability q = 1 - p.

If the random variable X follows a Bernoulli distribution, its probability function is:

P(X = x) = p^x * (1 - p)^(1 - x), x ∈ {0, 1}

The probabilities of success and failure do not have to be equal. For example, if I fight a professional athlete, his chances of winning are much higher than mine: say my probability of success is 0.15 and my probability of failure is 0.85.

The diagram below shows the Bernoulli distribution of our battles.

As shown in the figure above, my probability of success is 0.15 and my probability of failure is 0.85. The expected value is E(X) = 1*p + 0*(1-p) = p, and the variance is V(X) = E(X^2) - [E(X)]^2 = p - p^2 = p(1-p).

There are actually lots of examples of Bernoulli distributions: whether it will be sunny or rainy tomorrow, whether a team will win or lose a game, and so on.
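In SciPy, this distribution is available directly. Here is a minimal sketch using the 0.15 success probability from the fight example above:

```python
from scipy.stats import bernoulli

p = 0.15                   # probability of success from the example
X = bernoulli(p)

print(X.pmf(1), X.pmf(0))  # 0.15, 0.85
print(X.mean(), X.var())   # p = 0.15, p*(1-p) = 0.1275
```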



The binomial distribution

Now let's go back to the coin toss. A single flip can be repeated, so we have multiple Bernoulli trials. Getting heads the first time doesn't mean we'll get heads the next time. Let X be a random variable that counts how many heads we get. What values can X take? Any non-negative integer up to the total number of flips.

If we have a set of identical random events, i.e. a sequence of Bernoulli trials (in the example above, several coin flips in a row), then the number of times the event occurs follows the binomial distribution, also known as the multiple Bernoulli distribution.

Each trial is independent of the others; a previous trial does not affect the outcome of the current one. An experiment with the same two possible outcomes repeated n times is called a multiple Bernoulli trial. The parameters of the binomial distribution are n and p, where n is the total number of trials and p is the probability of success in each trial.

According to the above, a binomial distribution has the following properties:



1. Each trial is independent;

2. There are only two possible outcomes;

3. The same trial is repeated n times;

4. The probability of success is the same in every trial, as is the probability of failure.

The mathematical expression of the binomial distribution is:

P(X = x) = C(n, x) * p^x * (1 - p)^(n - x), x = 0, 1, …, n

where C(n, x) = n! / (x! * (n - x)!) counts the ways to choose x successes out of n trials.

The binomial distribution with unequal probability of success and failure looks like this:

The binomial distribution with equal probability of success and failure looks like this:

The mean of the binomial distribution is µ = n*p, and the variance is Var(X) = n*p*q (where q = 1 - p).
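Here is a sketch of the binomial distribution in SciPy, computing for example the probability of getting exactly 6 heads in 10 flips of a fair coin (the 10/0.5 parameters are just an illustrative choice):

```python
from scipy.stats import binom

n, p = 10, 0.5
X = binom(n, p)

print(X.pmf(6))           # P(exactly 6 heads) ≈ 0.205
print(X.mean(), X.var())  # n*p = 5.0, n*p*q = 2.5
```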



Poisson distribution

If you work in a call center, how many calls do you receive in a day? It can be any number! The Poisson distribution can be used to model the number of calls a call center receives in a day. Here are a few more examples:

1. The number of emergency calls received by the hospital in a day;

2. The number of theft incidents reported in the local area within a day;

3. The number of people who visited the salon in an hour;

4. The number of reported suicides in a particular city;

5. The number of misprints per page of a book.



You can construct many other examples in the same way. The Poisson distribution applies when events are randomly distributed in time or space and we are interested only in the number of times they occur. Its main assumptions are as follows:

1. No successful event affects the other successful events;

2. The probability of success over a short interval must equal the probability of success over a longer interval;

3. The probability of success in an interval approaches zero as the interval becomes infinitesimally small.



The symbols defined in the Poisson distribution are:

  • λ is the rate at which an event occurs;
  • t is the length of a time interval;
  • X is the number of events in that interval.
Let X be a Poisson random variable; then the probability distribution of X is called a Poisson distribution. Let µ denote the average number of events in an interval of length t; then µ = λ*t.

The probability distribution function of X is:

P(X = x) = e^(-µ) * µ^x / x!, x = 0, 1, 2, …

The probability distribution of the Poisson distribution is shown as follows, where µ is the parameter of the Poisson distribution:

The following graph shows how the distribution curve changes as the mean increases:

As shown above, the curve shifts to the right as the mean increases. The mean and variance of the Poisson distribution are:

Mean: E(X) = µ

Variance: Var(X) = µ
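A minimal sketch with SciPy, assuming a call center that receives on average µ = 10 calls per day (a made-up rate):

```python
from scipy.stats import poisson

mu = 10                   # assumed mean number of calls per day
X = poisson(mu)

print(X.pmf(8))           # P(exactly 8 calls) ≈ 0.113
print(X.cdf(5))           # P(at most 5 calls) ≈ 0.067
print(X.mean(), X.var())  # both equal mu = 10
```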



Uniform distribution

Suppose we pick a point on a line segment from a to b, with every location equally likely. Then the probability is spread evenly over the interval [a, b], and the probability density function is constant: it does not change as the variable changes. Unlike the Bernoulli distribution, all values of the random variable are equally likely, and the probability of landing in a sub-interval is proportional to its length; for example, if we take half of the range of possible values, the probability of falling into it is one half.

If the random variable X follows a uniform distribution, the probability density function is:

f(x) = 1 / (b - a) for a ≤ x ≤ b, and f(x) = 0 otherwise

The uniform distribution curve is shown below; the area under the probability density curve is the probability that the random variable falls in the corresponding range:

We can see that the graph of the uniform density is a rectangle, which is why the uniform distribution is also called the rectangular distribution. Its parameters are a and b, the endpoints of the range of the random variable.

A random variable X that follows a uniform distribution also has a mean and a variance: the mean is E(X) = (a+b)/2 and the variance is V(X) = (b-a)^2/12.

The standard uniform distribution has parameters a = 0 and b = 1, so its probability density can be expressed as:

f(x) = 1 for 0 ≤ x ≤ 1, and f(x) = 0 otherwise
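In SciPy the uniform distribution is parameterized by loc = a and scale = b - a. Here is a sketch on the (arbitrarily chosen) interval [0, 10]:

```python
from scipy.stats import uniform

a, b = 0.0, 10.0
X = uniform(loc=a, scale=b - a)  # Uniform(a, b)

print(X.pdf(5.0))                # 1 / (b - a) = 0.1
print(X.cdf(5.0))                # P(X <= 5) = 0.5
print(X.mean(), X.var())         # (a+b)/2 = 5.0, (b-a)^2/12 ≈ 8.33
```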

Exponential distribution



Now let's consider the call center again: what is the distribution of the time between calls? It may well be an exponential distribution, because the exponential distribution models time intervals such as those between phone calls. Other examples include modeling the time until the next metro train arrives or the service life of air-conditioning equipment.

In deep learning, we often want a distribution with a sharp peak at x = 0. The exponential distribution achieves this:

f(x; λ) = λ * exp(-λ*x) * 1_{x ≥ 0}

The exponential distribution uses the indicator function 1_{x ≥ 0} to assign probability zero to negative values of x.

Here λ > 0 is the parameter of the probability density function. If the random variable X follows an exponential distribution, its mean is E(X) = 1/λ and its variance is Var(X) = (1/λ)^2. When λ is large, the curve of the exponential distribution falls off more steeply; when λ is small, the curve is flatter, as shown below:

Here is a simple expression derived from the exponential distribution function:

P{X ≤ x} = 1 - exp(-λx), the area under the density curve to the left of x.

P{X > x} = exp(-λx), the area under the density curve to the right of x.

P{x1 < X ≤ x2} = exp(-λx1) - exp(-λx2), the area under the density curve between x1 and x2.
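These three expressions are easy to verify with SciPy, which parameterizes the exponential distribution by scale = 1/λ. Here is a sketch with λ = 0.05, the same rate used in question 4 of the test at the end:

```python
from scipy.stats import expon

lam = 0.05
X = expon(scale=1 / lam)      # SciPy parameterizes by scale = 1/lambda

x1, x2 = 10, 15
print(X.cdf(x1))              # P(X <= 10) = 1 - exp(-0.5) ≈ 0.393
print(X.sf(x1))               # P(X > 10)  = exp(-0.5)     ≈ 0.607
print(X.cdf(x2) - X.cdf(x1))  # P(10 < X <= 15)            ≈ 0.134
```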



Normal distribution (Gaussian distribution)

The most common distribution over the real numbers is the normal distribution, also known as the Gaussian distribution. It is ubiquitous largely because of the central limit theorem: the sum of many small independent random variables is approximately normally distributed. The normal distribution has the following characteristics:

1. The mean, median and mode of the distribution coincide.

2. The distribution curve is bell-shaped and symmetric about the line x = µ.

3. The sum of the areas under the curve is 1.

4. Exactly half of the values lie to the left of the center and half to the right.



The normal and Bernoulli distributions look very different, but as the number of Bernoulli trials approaches infinity, the binomial distribution they generate becomes essentially normal.



If the random variable X follows a normal distribution, the probability density of X can be expressed as:

f(x) = (1 / (σ * sqrt(2π))) * exp(-(x - µ)^2 / (2σ^2))

The mean of the random variable X is E(X) = µ and its variance is Var(X) = σ^2. The mean µ and the standard deviation σ are the parameters of the Gaussian distribution.

A random variable X that follows the normal distribution N(µ, σ^2) can be written as:

X ~ N(µ, σ^2)

The standard normal distribution is defined as a normal distribution with mean 0 and variance 1. The probability density function and graph of the standard normal distribution are shown below:
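Here is a sketch of the standard normal distribution in SciPy, where loc = µ and scale = σ:

```python
from scipy.stats import norm

Z = norm(loc=0, scale=1)  # standard normal: mean 0, std 1

print(Z.pdf(0.0))         # peak density = 1/sqrt(2*pi) ≈ 0.399
print(Z.cdf(1.96))        # ≈ 0.975
print(Z.mean(), Z.std())  # 0.0, 1.0
```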



The relationship between distributions



Bernoulli distribution and binomial distribution

1. The Bernoulli distribution is a special case of the binomial distribution with a single trial, i.e. one Bernoulli trial;

2. In both the binomial and Bernoulli distributions, each trial has only two possible outcomes;

3. The trials of a binomial distribution are independent of one another, and each individual trial can be regarded as a Bernoulli distribution.



Poisson distribution and binomial distribution

The Poisson distribution is the limiting form of the binomial distribution under the following conditions:

1. The number of trials is very large or approaches infinity, i.e. n → ∞;

2. The probability of success in each trial is the same and approaches zero, i.e. p → 0;

3. np = λ is finite.
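We can see this limit numerically: holding n*p = λ fixed while n grows, the binomial PMF converges to the Poisson PMF. A quick sketch (the values λ = 4 and k = 3 are arbitrary choices):

```python
from scipy.stats import binom, poisson

lam, k = 4.0, 3                        # fixed lambda = n*p; compare P(X = 3)
for n in (10, 100, 10_000):
    p = lam / n
    print(n, binom.pmf(k, n, p))       # approaches the Poisson value
print("poisson", poisson.pmf(k, lam))  # ≈ 0.1954
```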



The relation between the normal and binomial distributions & the relation between the normal and Poisson distributions

The normal distribution is a limiting form of the binomial distribution under the following conditions:

1. The number of tests is very large or approaches infinity, that is, n → ∞;

2. Neither p nor q is infinitesimally small.

When the parameter λ approaches infinity, the normal distribution is the limiting form of the Poisson distribution.



Exponential distribution and Poisson distribution

If the time intervals between random events follow an exponential distribution with parameter λ, then the total number of events occurring within a time period t follows a Poisson distribution with parameter λt.



Test

Readers can check their understanding of the above probability distribution by completing the following simple test:

1. The formula for converting a random variable that follows a normal distribution into a standard normal variable is:

A. (x + µ) / σ

B. (x - µ) / σ

C. (x - σ) / µ



2. In a Bernoulli distribution, the formula for the standard deviation is:

A. p(1 - p)

B. sqrt(p(p - 1))

C. sqrt(p(1 - p))



3. For a normal distribution, an increase in the mean means:

A. The curve shifts to the left

B. The curve shifts to the right

C. The curve flattens



4. Assuming that a battery's life follows an exponential distribution with λ = 0.05, the probability that the battery's life is between 10 and 15 hours is:

A. 0.1341

B. 0.1540

C. 0.0079



Conclusion

In this article, we built up an understanding of probability starting from the most basic random events and their concepts. We then discussed fundamental methods and notions of probability calculation, such as conditional probability and Bayesian probability, along with the independence and conditional independence of random variables. In addition, the article introduced probability distributions in detail, covering both discrete and continuous random variable distributions. This article focused on the basic theorems and concepts of probability, which are usually explained in detail in university courses on probability theory and mathematical statistics. For machine learning, an understanding of probability and statistics is essential for understanding models, and on that foundation we can go on to learn newer concepts such as structured probabilistic models.





Original link:



From Medium & AnalyticsVidhya

Compiled by Heart of the Machine

The Heart of the Machine editorial department





This article was compiled by Heart of the Machine. Please contact this official account for authorization before reprinting.