Fundamentals of probability theory

The joint probability of mutually independent events

If the events in the set $A = \{A_1, A_2, \cdots, A_n\}$ are mutually independent, then the probability of all events occurring simultaneously is:


$$P(A_n A_{n-1} \cdots A_1) = \prod_{i=1}^{n} P(A_i)$$
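
As a quick numerical check (a minimal sketch in Python; the event probabilities below are made-up values), the joint probability of independent events is simply the product of the individual probabilities:

```python
import math

# Hypothetical probabilities of three mutually independent events A1, A2, A3.
p = [0.5, 0.8, 0.9]

# P(A1 A2 A3) = P(A1) * P(A2) * P(A3) for mutually independent events.
joint = math.prod(p)
print(joint)  # 0.36
```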

Conditional probability

Given events $A$ and $B$, suppose we know the probability $P(AB)$ that $A$ and $B$ occur simultaneously and the probability $P(B)$ that $B$ occurs. The conditional probability $P(A|B)$ denotes the probability that $A$ occurs given that $B$ has occurred:


$$P(A|B) = \frac{P(AB)}{P(B)}$$

Or, equivalently, the probability of $A$ and $B$ occurring simultaneously is:


$$P(AB) = P(A|B)\,P(B)$$
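
A small worked example (with made-up numbers) shows both directions of the formula:

```python
# Hypothetical joint and marginal probabilities.
p_ab = 0.12   # P(AB): probability that A and B occur together
p_b = 0.3     # P(B)

# Conditional probability: P(A|B) = P(AB) / P(B).
p_a_given_b = p_ab / p_b
print(p_a_given_b)  # 0.4

# Going the other way: P(AB) = P(A|B) * P(B).
print(p_a_given_b * p_b)  # 0.12
```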

Total probability formula

Suppose the set of events $B = \{B_1, B_2, \cdots, B_n\}$ is a partition of the sample space $S$, and $A$ is an event in $S$. Then the probability of $A$ occurring is:


$$P(A) = \sum_{i=1}^{n} P(A|B_i)\,P(B_i)$$

Here $P(A)$ is also called the prior probability: it is inferred from the conditional probabilities $P(A|B_i)$ before we observe whether $A$ has actually occurred.
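
The sketch below illustrates the total probability formula with a hypothetical three-way partition; all priors and conditional probabilities are made-up values:

```python
# Hypothetical partition B1, B2, B3 of the sample space S.
p_b = [0.2, 0.5, 0.3]          # priors P(B1), P(B2), P(B3); they sum to 1
p_a_given_b = [0.9, 0.4, 0.1]  # conditional probabilities P(A|B1), P(A|B2), P(A|B3)

# Total probability: P(A) = sum_i P(A|Bi) * P(Bi)
p_a = sum(pa * pb for pa, pb in zip(p_a_given_b, p_b))
print(p_a)  # 0.9*0.2 + 0.4*0.5 + 0.1*0.3 = 0.41
```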

Bayes formula

Using the definitions from the total probability formula, suppose we know that event $A$ has occurred and that $P(A|B_i)$ is known for every $B_i \in B$. We now want to know the probability of event $B_i$. This probability is written $P(B_i|A)$, the probability that $B_i$ occurs given that $A$ has occurred; note that $A$ has already happened here, which is why a conditional probability is used. $P(B_i|A)$ is also known as the posterior probability, because it is computed after the event has happened.

From the definition of conditional probability and the total probability formula, we have the following derivation:


$$P(B_i|A) = \frac{P(B_iA)}{P(A)} = \frac{P(A|B_i)P(B_i)}{P(A)} = \frac{P(A|B_i)P(B_i)}{\sum_{j=1}^{n} P(A|B_j)P(B_j)}$$

Thus, the final form of Bayes’ formula is:


$$P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_{j=1}^{n} P(A|B_j)P(B_j)}$$
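
Reusing the hypothetical numbers from the total probability sketch above, the posterior over the partition follows directly from Bayes' formula:

```python
# Same made-up numbers as in the total probability example.
p_b = [0.2, 0.5, 0.3]          # priors P(Bi)
p_a_given_b = [0.9, 0.4, 0.1]  # likelihoods P(A|Bi)

# Denominator: P(A) via the total probability formula.
p_a = sum(pa * pb for pa, pb in zip(p_a_given_b, p_b))

# Posterior: P(Bi|A) = P(A|Bi) * P(Bi) / P(A) for each i.
posterior = [pa * pb / p_a for pa, pb in zip(p_a_given_b, p_b)]
print(posterior)  # about [0.439, 0.488, 0.073]; the posteriors sum to 1
```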

Maximum likelihood estimation

Suppose we have an observational data set:


$$T = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_n, y_n)\}$$

Here $\mathbf{x}_i \in \mathbb{R}^n$, $y_i \in \{c_1, c_2, \cdots, c_k\}$, and the samples are independent of each other. We assume the samples obey a known probability distribution whose probability density function is $\rho$ with parameter set $\theta$, and we want to estimate the parameter set $\theta$ from the observed data set.

The likelihood function is defined as:


$$L_{\theta}(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n|\theta) = \prod_{i=1}^{n}\rho(\mathbf{x}_i, \theta)$$

$L_{\theta}$ is the probability that the samples $\{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$ occur simultaneously under the parameter $\theta$.

We then look for the parameter $\hat{\theta}$ that makes $L_{\theta}$ attain its maximum over all possible parameters, i.e.


$$\hat{\theta} = \arg\max_{\theta} L_{\theta}(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n|\theta) = \arg\max_{\theta}\prod_{i=1}^{n}\rho(\mathbf{x}_i, \theta)$$

Maximizing $L_{\theta}$ is equivalent to maximizing its logarithm $H_{\theta} = \log L_{\theta}$:


$$H_{\theta} = \sum_{i=1}^{n}\ln{\rho(\mathbf{x}_i, \theta)}$$

Therefore, we only need to solve:


$$\frac{\partial H_{\theta}}{\partial \theta} = 0$$

to obtain the estimate $\hat{\theta}$.
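
As a minimal sketch of the whole procedure, assume a one-dimensional Gaussian density $\rho(x; \theta)$ with $\theta = (\mu, \sigma^2)$; setting $\partial H_{\theta}/\partial\theta = 0$ gives the sample mean and (biased) sample variance in closed form, which we can check against the log-likelihood numerically. The data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observations drawn from N(mu=2, sigma=1.5); in practice these
# would be the observed data set T.
x = rng.normal(loc=2.0, scale=1.5, size=1000)

def log_likelihood(x, mu, sigma2):
    """H_theta = sum_i ln rho(x_i; mu, sigma2) for a Gaussian density."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Solving dH/d(mu) = 0 and dH/d(sigma2) = 0 in closed form gives:
mu_hat = x.mean()                        # sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # (biased) sample variance

print(mu_hat, sigma2_hat)

# The estimates should score at least as high as nearby parameter values.
print(log_likelihood(x, mu_hat, sigma2_hat) >= log_likelihood(x, mu_hat + 0.1, sigma2_hat))
```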

Naive Bayes algorithm

To be added, updated tonight