1. Understanding probabilistic graphical models

A probabilistic graphical model is a framework that uses a graph to represent the probabilistic dependencies among variables. It combines probability theory with graph theory, using a graph to express the joint probability distribution of the variables in a model. The field was developed largely by Turing Award winner Judea Pearl.

If one word could describe probabilistic graphical models, it would be "elegant". For a real problem, we want to mine the knowledge implicit in the data. A probabilistic graphical model builds a graph in which observation nodes represent observed data, hidden nodes represent latent knowledge, and edges describe the relationships between knowledge and data. A probability distribution is then derived from this graph, solving the problem very "elegantly".

Nodes in a probabilistic graph are divided into hidden nodes and observation nodes, and edges are divided into directed and undirected edges. From the perspective of probability theory, nodes correspond to random variables and edges correspond to dependence or correlation between random variables: directed edges represent one-way dependence, while undirected edges represent mutual dependence.

Probabilistic graphical models fall into two categories: Bayesian networks, which are represented by directed graphs, and Markov networks, which are represented by undirected graphs. More specifically, probabilistic graphical models include the naive Bayes model, the maximum entropy model, hidden Markov models, conditional random fields, topic models, and so on, and they are widely used in many machine learning scenarios.

2. A closer look at Bayesian networks

2.1 The frequentist view

For a long time, people treated the probability of an event happening or not happening as a fixed 0 or 1: either it happens or it does not, and the probability itself, though possibly unknown, was at least regarded as a definite value. For example, if people back then were asked, "There is a bag containing several white balls and black balls. What is the probability θ of drawing a white ball from the bag?", they would answer right away, without even thinking, that the probability of drawing a white ball is 1/2: you either draw a white ball or you do not, θ can only take one value, and no matter how many times you draw, the probability θ of drawing a white ball is always 1/2, that is, it does not change with the observations X.

This frequentist view dominated until a figure named Thomas Bayes came along.

2.2 The Bayesian school

During his lifetime, Thomas Bayes (1702-1763) was not well known to his contemporaries. He seldom published papers or books and had little contact with the academic circles of his time; in today's terms, Bayes was a true amateur scholar. Yet this amateur eventually left behind an essay titled "An Essay Towards Solving a Problem in the Doctrine of Chances", that is, a solution to a problem in the theory of chance. You might expect me to say that the publication of this paper caused an immediate sensation and cemented Bayes' place in academic history.

This paper can be illustrated with the example above: "There is a bag containing several white and black balls. What is the probability θ of drawing a white ball from the bag?" Bayes believed that the probability of drawing a white ball is an uncertain value, because it contains an element of chance. For example, suppose a friend starts a business. You know there are two outcomes, success or failure, but you still cannot help estimating the probability of his success. If you know him well, and he is methodical, clear-thinking, persevering, and able to rally people around him, you might estimate that the probability of his success is more than 80%. This way of thinking, different from the original "black or white, zero or one" view, is the Bayesian way of thinking.

First, a brief summary of the different modes of thinking of the frequentist and Bayesian schools:

  • The frequentist school regards the parameter θ to be inferred as a fixed unknown constant: although the probability is unknown, it is at least a definite value. Meanwhile, the sample X is random, so the frequentist school focuses on the sample space, and most probability calculations concern the distribution of the sample X.
  • The Bayesian school, on the other hand, regards the parameter as a random variable, while the sample X is fixed. Since the sample is fixed, Bayesians focus on the distribution of the parameter.

Since Bayesians regard θ as a random variable, they must compute its distribution, starting with the unconditional distribution known in advance: what is the distribution of θ before any sample is drawn (that is, before X is observed)?

If you throw a ball onto a billiard table, where does it land? If the ball is thrown impartially, it has the same chance of landing anywhere on the table; that is, its landing position follows a uniform distribution. This distribution, which belongs to the basic premises of the problem, is called the prior distribution, or unconditional distribution.

Prior information generally comes from experience and historical data. For example, when Lin Dan plays against a certain opponent, the commentator usually makes a rough judgment about the outcome of the match based on the results of Lin Dan's previous matches. Similarly, suppose a factory inspects product quality every day in order to assess the defect rate θ. After a period of time, a large amount of historical data accumulates, and this historical data is prior knowledge. With this prior knowledge, when deciding whether a product really needs to be inspected every day, if the historical data show that the defect rate of a product is only 0.01%, the product can be treated as trustworthy or inspection-free and only sampled once or twice a month, saving a great deal of manpower and material resources.

The posterior distribution π(θ|X) is generally taken to be the conditional distribution of θ given the sample X, and the value θ_MD that maximizes π(θ|X) is called the maximum a posteriori (MAP) estimate, which is analogous to the maximum likelihood estimate in classical statistics.
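As a concrete illustration (not from the original article; the draw counts and the Beta(2, 2) prior below are made up), here is a minimal Python sketch contrasting the maximum likelihood estimate with the MAP estimate of the white-ball probability θ:

```python
# A minimal sketch (illustrative numbers): MLE vs. MAP estimation of the
# probability theta of drawing a white ball, assuming a Beta(alpha, beta) prior.

def mle_estimate(white, black):
    """Frequentist view: theta is a fixed constant, estimated by the sample frequency."""
    return white / (white + black)

def map_estimate(white, black, alpha=2.0, beta=2.0):
    """Bayesian view: theta is a random variable with a Beta(alpha, beta) prior.
    The posterior is Beta(alpha + white, beta + black); its mode is the MAP estimate."""
    return (white + alpha - 1) / (white + black + alpha + beta - 2)

if __name__ == "__main__":
    white, black = 3, 7                         # hypothetical draws from the bag
    print("MLE:", mle_estimate(white, black))   # 3/10 = 0.30
    print("MAP:", map_estimate(white, black))   # (3+1)/(10+2) = 0.333...
```

With few observations the prior pulls the MAP estimate toward 0.5; as more draws accumulate, the two estimates converge, which matches the intuition in the next paragraph.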

Taken together, it is as if human beings started with only meager prior knowledge of nature, but through continuous observation and experiment obtained more samples and results, and so gained an increasingly thorough understanding of its laws. The Bayesian method therefore conforms both to people's everyday way of thinking and to the way people come to understand nature. After continuous development, it eventually came to occupy half of the field of statistics, standing alongside classical statistics.

2.3 Bayes’ theorem

Conditional probability (also called posterior probability) is the probability of event A occurring given that another event B has already occurred. It is written P(A|B) and read "the probability of A given B".

For example, in the figure above, A and B are events (subsets) in the same sample space Ω. If an element randomly selected from Ω belongs to B, then the probability that this element also belongs to A is defined as the conditional probability of A given B:

$$P(A\mid B)=\frac{P(A\cap B)}{P(B)}$$

Joint probability: P(A∩B), also written P(A, B) = P(A|B)·P(B) = P(B|A)·P(A)

Marginal probability (prior probability): P(A) or P(B)
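Combining the three quantities above gives Bayes' theorem, which the rest of this article relies on (a standard statement, written out here for reference):

$$P(A\mid B)=\frac{P(A\cap B)}{P(B)}=\frac{P(B\mid A)\,P(A)}{P(B)}$$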

2.4 Bayesian networks

A Bayesian network, also known as a belief network or a directed acyclic graphical model, is a type of probabilistic graphical model. It was first proposed by Judea Pearl in 1985. It models causality in the human reasoning process, and its network topology is a directed acyclic graph (DAG).

Nodes in directed acyclic graphs of Bayesian networks represent random variables

They can be observable variables, hidden variables, unknown parameters, and so on. Variables or propositions that are considered causally related (or not conditionally independent) are connected by arrows. If two nodes are connected by a single arrow, one node is the "parent" and the other the "child", and a conditional probability value is associated with the pair.

For example, suppose node E directly affects node H. This is expressed by drawing an arrow from E to H, that is, a directed arc (E, H), whose weight (connection strength) is the conditional probability P(H|E), as shown in the figure below:

In short, the random variables involved in a system under study are drawn in a directed graph according to whether they are conditionally independent, forming a Bayesian network. It is mainly used to describe conditional dependencies between random variables, with circles denoting random variables and arrows denoting conditional dependencies.

In addition, for any set of random variables, the joint probability can be obtained by multiplying their local conditional probability distributions:

$$P(x_1,\dots,x_K)=P(x_K\mid x_1,\dots,x_{K-1})\cdots P(x_2\mid x_1)\,P(x_1)$$
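As an illustration of multiplying local conditional distributions (the three-node network Cloudy → Rain → WetGrass and all numbers below are hypothetical, not from the article):

```python
# A tiny hypothetical Bayesian network  Cloudy -> Rain -> WetGrass,
# represented by its local conditional probability tables (CPTs).
# The joint probability is the product of each node's probability given its parents.
from itertools import product

P_cloudy = {True: 0.5, False: 0.5}              # P(Cloudy)
P_rain_given_cloudy = {True: 0.8, False: 0.2}   # P(Rain=True | Cloudy)
P_wet_given_rain = {True: 0.9, False: 0.1}      # P(Wet=True | Rain)

def joint(cloudy: bool, rain: bool, wet: bool) -> float:
    """P(Cloudy, Rain, Wet) = P(Cloudy) * P(Rain | Cloudy) * P(Wet | Rain)."""
    p = P_cloudy[cloudy]
    p_rain_true = P_rain_given_cloudy[cloudy]
    p *= p_rain_true if rain else 1 - p_rain_true
    p_wet_true = P_wet_given_rain[rain]
    p *= p_wet_true if wet else 1 - p_wet_true
    return p

if __name__ == "__main__":
    print(joint(True, True, True))   # 0.5 * 0.8 * 0.9 = 0.36
    # The joint probabilities over all assignments sum to 1, as they should.
    total = sum(joint(c, r, w) for c, r, w in product([True, False], repeat=3))
    print(round(total, 10))          # 1.0
```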
2.4.1 Structural forms of Bayesian networks

1. head-to-head

Following the figure above, we have P(a,b,c) = P(a)·P(b)·P(c|a,b). That is, when c is unknown, a and b are blocked and therefore independent; this is called head-to-head conditional independence.

2. tail-to-tail

Consider two cases: c unknown and c given:

  1. When c is unknown, we have P(a,b,c) = P(c)·P(a|c)·P(b|c), from which we cannot conclude that P(a,b) = P(a)·P(b). So when c is unknown, a and b are not independent.
  2. When c is known, we have P(a,b|c) = P(a,b,c)/P(c). Substituting P(a,b,c) = P(c)·P(a|c)·P(b|c) gives P(a,b|c) = P(c)·P(a|c)·P(b|c)/P(c) = P(a|c)·P(b|c). So given c, a and b are independent; this is called tail-to-tail conditional independence.

3. head-to-tail

Again, consider two cases: c unknown and c given:

  1. When c is unknown, we have P(a,b,c) = P(a)·P(c|a)·P(b|c), from which we cannot conclude that P(a,b) = P(a)·P(b). So when c is unknown, a and b are not independent.

  2. When c is known, we have P(a,b|c) = P(a,b,c)/P(c), and using P(a,c) = P(a)·P(c|a) = P(c)·P(a|c), we obtain:


    P(a,b|c) = P(a,b,c) / P(c)

             = P(a)·P(c|a)·P(b|c) / P(c)

             = P(a,c)·P(b|c) / P(c)

             = P(a|c)·P(b|c)

    Therefore, given c, a and b are blocked and therefore independent; this is called head-to-tail conditional independence.

    This head-to-tail structure is in fact a chain network, as shown in the following figure:

    Given x_i, the distribution of x_{i+1} is conditionally independent of x_1, x_2, …, x_{i-1}. What does that mean? It means that the distribution of x_{i+1} depends only on x_i and is independent of the other variables. In layman's terms, the current state depends only on the previous state, not on any earlier state. A random process that evolves sequentially in this way is called a Markov chain. We will say more about Markov chains later.
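A small self-contained numeric check of the head-to-tail case (the CPT values below are made up): marginally a and b are dependent, but once c is given they become independent.

```python
# Numeric check (hypothetical CPTs) for the chain a -> c -> b:
# a and b are dependent marginally, but conditionally independent given c.
from itertools import product

P_a = {True: 0.3, False: 0.7}              # P(a)
P_c_given_a = {True: 0.9, False: 0.2}      # P(c=True | a)
P_b_given_c = {True: 0.8, False: 0.1}      # P(b=True | c)

def joint(a, c, b):
    p = P_a[a]
    p *= P_c_given_a[a] if c else 1 - P_c_given_a[a]
    p *= P_b_given_c[c] if b else 1 - P_b_given_c[c]
    return p

def prob(**fixed):
    """Probability of the fixed variables, summing out the others."""
    total = 0.0
    for a, c, b in product([True, False], repeat=3):
        assign = {"a": a, "c": c, "b": b}
        if all(assign[k] == v for k, v in fixed.items()):
            total += joint(a, c, b)
    return total

# Marginally: P(a,b) != P(a) * P(b)  ->  a and b are dependent.
print(prob(a=True, b=True), prob(a=True) * prob(b=True))

# Given c: P(a,b|c) == P(a|c) * P(b|c)  ->  a and b are conditionally independent.
lhs = prob(a=True, b=True, c=True) / prob(c=True)
rhs = (prob(a=True, c=True) / prob(c=True)) * (prob(b=True, c=True) / prob(c=True))
print(round(lhs, 10), round(rhs, 10))   # equal
```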

2.4.2 Factor graphs

Wikipedia defines a factor graph as a bipartite graph that represents the factorization of a multivariable global function into a product of several local functions.

Generally speaking, a factor graph is a probabilistic graph obtained by factorizing a function. It has two types of nodes: variable nodes and function (factor) nodes. As we know, a global function can be decomposed into a product of many local functions by factorization, and the relationships between these local functions and the corresponding variables are what the factor graph shows.

For example, suppose a global function factorizes as:

$$g(x_1,x_2,x_3,x_4,x_5)=f_A(x_1)\,f_B(x_2)\,f_C(x_1,x_2,x_3)\,f_D(x_3,x_4)\,f_E(x_3,x_5)$$
Here fA, fB, fC, fD and fE are all functions describing relationships between variables; these relationships may be conditional probabilities or something else. The corresponding factor graph is as follows:

In probabilistic graphs, computing the marginal distribution of a variable is a common problem. There are many ways to solve it; one of them is to convert the Bayesian network or Markov random field into a factor graph and then apply the sum-product algorithm. In other words, based on the factor graph, the sum-product algorithm can efficiently compute the marginal distribution of each variable.

For a detailed summary of the sum-product algorithm, see the blog post: From Bayesian Methods to Bayesian networks
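To make "marginal distribution of a variable" concrete, here is a brute-force sketch over the hypothetical factorization above (made-up binary factors). It is not the sum-product algorithm itself, which would compute the same marginals by passing messages on the factor graph instead of enumerating every assignment:

```python
# Brute-force marginalization over the hypothetical factorization shown above
# (binary variables, arbitrary factor values chosen for illustration).
from itertools import product

fA = lambda x1: 0.6 if x1 else 0.4
fB = lambda x2: 0.7 if x2 else 0.3
fC = lambda x1, x2, x3: 0.9 if x3 == (x1 and x2) else 0.1
fD = lambda x3, x4: 0.8 if x3 == x4 else 0.2
fE = lambda x3, x5: 0.5   # x5 unaffected by x3 in this toy example

def g(x1, x2, x3, x4, x5):
    """The global function as a product of the local factors."""
    return fA(x1) * fB(x2) * fC(x1, x2, x3) * fD(x3, x4) * fE(x3, x5)

def marginal(var_index):
    """Marginal of one variable: sum g over all the other variables, then normalize."""
    totals = {False: 0.0, True: 0.0}
    for assignment in product([False, True], repeat=5):
        totals[assignment[var_index]] += g(*assignment)
    z = sum(totals.values())
    return {value: weight / z for value, weight in totals.items()}

print(marginal(2))   # marginal distribution of x3
```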

2.5 Naive Bayes

Naive Bayes is one of the classic machine learning algorithms and one of the few classification algorithms based on probability theory. Its principle is simple and easy to implement, and it is often used for text classification, such as spam filtering. **Naive Bayes can be regarded as a special case of a Bayesian network: the network has no edges and all nodes are independent.**

Where is naive Bayes "naive"? In two assumptions:

  • The probability that one feature occurs is independent of the other features (conditional independence);
  • Each feature is equally important.

Bayes' formula is as follows:

$$P(\text{class}\mid\text{features})=\frac{P(\text{features}\mid\text{class})\,P(\text{class})}{P(\text{features})}$$
Here is an example to explain naive Bayes, given the following data:

Now the question is: a boy and a girl are dating, and the boy wants to propose to the girl. The boy's four characteristics are: not handsome, bad personality, short, and not ambitious. Should the girl marry him or not?

This is a typical classification problem. Turned into a mathematical problem, it is to compare P(marry | not handsome, bad personality, short, not ambitious) with P(not marry | not handsome, bad personality, short, not ambitious); whichever probability is larger gives the answer, marry or not marry. Here we bring in the naive Bayes formula:

$$P(\text{marry}\mid\text{not handsome, bad personality, short, not ambitious})=\frac{P(\text{not handsome, bad personality, short, not ambitious}\mid\text{marry})\,P(\text{marry})}{P(\text{not handsome, bad personality, short, not ambitious})}$$

We need P(marry | not handsome, bad personality, short, not ambitious), which we do not know directly, but the Bayes formula turns it into three quantities that can all be obtained by counting from the data.
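The conditional-independence assumption discussed in the next paragraph is exactly what lets the likelihood term above factorize feature by feature (a standard step, written out here for clarity):

$$P(\text{not handsome, bad personality, short, not ambitious}\mid\text{marry})=P(\text{not handsome}\mid\text{marry})\,P(\text{bad personality}\mid\text{marry})\,P(\text{short}\mid\text{marry})\,P(\text{not ambitious}\mid\text{marry})$$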

Wait, why is this valid? As those of you who have studied probability theory may sense, factorizing the likelihood this way requires the features to be mutually independent. Right! That is where the name naive Bayes comes from: the naive Bayes algorithm assumes that the features are independent of each other, and then the factorization holds!

But why assume that features are independent of each other?

  1. Think about it this way: without this assumption, the probabilities on the right-hand side would be practically impossible to estimate. In this example we have four features: looks takes values in {handsome, not handsome}, personality in {bad, good, very good}, height in {tall, short, medium}, and ambition in {not ambitious, ambitious}. The joint distribution of the four features is then over a 4-dimensional space with 2 × 3 × 3 × 2 = 36 combinations in total.

    With 36 combinations, a computer can still count them by scanning the data. But in real life there are often far more features, and each feature can take far more values, so it becomes almost impossible to estimate these probabilities by counting. That is why the feature-independence assumption is needed.

  2. If we do not assume independence between features, then when counting we must search the full feature space. For example, to estimate P(not handsome, bad personality, short, not ambitious | marry) we must find, among the people who married, those who simultaneously satisfy all four conditions: not handsome, bad personality, short, and not ambitious. Because of data sparsity, such counts can easily be zero, which is not appropriate.

For these two reasons, naive Bayes makes the conditional-independence assumption on the conditional probability distributions. It is a strong assumption, hence the name "naive"! The assumption makes the naive Bayes method simple, but it sometimes sacrifices some classification accuracy.
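Below is a minimal hand-rolled sketch of this idea on a made-up stand-in for the dataset mentioned earlier (in practice one might instead use scikit-learn's CategoricalNB). It estimates the prior and the per-feature conditional probabilities by counting, with Laplace smoothing to avoid the zero-count problem described above:

```python
# A minimal naive Bayes sketch on a hypothetical version of the "marry" example.
# Each row: (looks, personality, height, ambitious) -> label ("marry"/"not marry").
from collections import Counter, defaultdict

data = [  # made-up rows, for illustration only
    (("handsome", "good", "tall", "yes"), "marry"),
    (("not handsome", "good", "short", "yes"), "marry"),
    (("handsome", "bad", "short", "yes"), "marry"),
    (("not handsome", "bad", "short", "no"), "not marry"),
    (("handsome", "bad", "tall", "no"), "not marry"),
    (("not handsome", "good", "short", "no"), "not marry"),
]

label_counts = Counter(label for _, label in data)
# feature_counts[label][feature_index][value] = how often the value appears with the label
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1

def posterior_score(features, label, alpha=1.0):
    """Unnormalized P(label | features) = P(label) * prod_i P(feature_i | label),
    with Laplace smoothing alpha to avoid zero counts."""
    score = label_counts[label] / len(data)                 # prior P(label)
    for i, value in enumerate(features):
        counts = feature_counts[label][i]
        score *= (counts[value] + alpha) / (label_counts[label] + alpha * len(counts))
    return score

query = ("not handsome", "bad", "short", "no")
print(max(label_counts, key=lambda lbl: posterior_score(query, lbl)))  # -> "not marry"
```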

Advantages of Naive Bayes:

  • The algorithm logic is simple and easy to implement (the idea is very simple: just apply the Bayes formula transformation!)
  • Low time and space overhead during classification (since features are assumed independent of each other, only two-dimensional storage is involved)

Disadvantages of Naive Bayes:

In theory, the naive Bayes model has the lowest error rate compared with other classification methods. In practice this is not always the case, because the model assumes that attributes are independent of each other, which often does not hold in real applications. When the number of attributes is large or the correlation between attributes is strong, the classification performance is poor.

In the naive Bayes model, **"naive" means "very simple and naive"** and refers to the assumption that sample features are independent of each other. This assumption rarely holds exactly in reality, but there are plenty of cases where the correlations between features are small, so the model still works well.

3. Some Bayes-related questions

  1. Explain prior probability, likelihood estimation and marginal likelihood estimation in the naive Bayes algorithm. (A small worked example follows this list.)
    • **Prior probability:** the proportion of the dependent variable (for binary classification) in the data set. It is the closest guess you can make about the class without any further information.
    • **Likelihood estimate:** the probability of classifying a given observation as 1 in the presence of some other variable. For example, the probability that the word "FREE" is used in previous spam messages is a likelihood estimate.
    • **Marginal likelihood estimate:** the probability that the word "FREE" is used in any message.
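Putting the three quantities together (the numbers below are made up purely for illustration): if P(spam) = 0.2, P("FREE" | spam) = 0.3 and P("FREE") = 0.08, then

$$P(\text{spam}\mid\text{FREE})=\frac{P(\text{FREE}\mid\text{spam})\,P(\text{spam})}{P(\text{FREE})}=\frac{0.3\times 0.2}{0.08}=0.75$$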

4. The difference between generative and discriminative models

  • A discriminative model predicts y by solving for the conditional probability distribution P(y|x), or by directly computing the value of y from x.

    Examples: Linear Regression, Logistic Regression, Support Vector Machines, traditional Neural Networks, Linear Discriminant Analysis, Conditional Random Fields.

  • A generative model predicts y by modeling the joint probability distribution P(x, y) of the observations and labels (see the short formula note after this list).

    Examples: Naive Bayes, Hidden Markov Models (HMM), Bayesian Networks, Latent Dirichlet Allocation, Gaussian Mixture Models.
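A one-line note on how the two relate (standard Bayes-rule reasoning, added for clarity): a generative model can recover the quantity a discriminative model targets via

$$P(y\mid x)=\frac{P(x,y)}{P(x)}=\frac{P(x\mid y)\,P(y)}{\sum_{y'}P(x\mid y')\,P(y')}$$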

5. Code implementation

News classification (GitHub): Click to enter

6. References

From Bayesian methods to Bayesian networks

Author: @mantchs

GitHub:github.com/NLP-LOVE/ML…

Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]