These are my study notes, compiled from quality articles I found online.

In Bayesian statistics we constantly use Bayes' formula to compute the posterior distribution:

$$p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)} = \frac{p(x \mid z)\,p(z)}{\int p(x \mid z)\,p(z)\,dz}$$

But computing the posterior the Bayesian way is hard, because we have to compute the evidence $p(x) = \int p(x \mid z)\,p(z)\,dz$, and $z$ is usually a high-dimensional random variable, which makes the integral very difficult. In Bayesian statistics, all inference problems about unknowns can be framed as computing a posterior. Variational Inference was therefore proposed as a way to approximate the posterior distribution.

So what does Variational Inference actually do? Its core idea consists of two steps:

  1. Pick a distribution $q(z; \lambda)$ from some tractable family. (This distribution must be one we can actually work with; otherwise the whole exercise is pointless.)
  2. Adjust the parameter $\lambda$ of the distribution to make $q(z; \lambda)$ close to the true posterior $p(z \mid x)$.

In summary, we introduce a parameterized distribution to stand in for the true posterior; that is, we fit a complex distribution with a simple one.

This strategy transforms the computation of $p(z \mid x)$ into an optimization problem:

$$q^*(z) = \arg\min_{q \in \mathcal{Q}} \operatorname{KL}\big(q(z)\,\|\,p(z \mid x)\big)$$

Once the optimization converges, $q^*(z)$ can be used in place of $p(z \mid x)$.
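To make this concrete, here is a minimal sketch of what a variational family $q(z; \lambda)$ might look like in code. It is a hypothetical illustration, not taken from the referenced articles: a diagonal Gaussian whose parameters $\lambda = (\mu, \log\sigma)$ are the knobs we will later turn, with the two operations the algorithms below need, sampling and log-density evaluation.

```python
import numpy as np

class DiagonalGaussianFamily:
    """A variational family q(z; lambda): a diagonal Gaussian whose
    free parameters are lambda = (mu, log_sigma)."""

    def __init__(self, dim):
        self.mu = np.zeros(dim)          # variational mean
        self.log_sigma = np.zeros(dim)   # log of variational std

    def sample(self, n, rng):
        """Draw n samples z ~ q(z; lambda), one per row."""
        sigma = np.exp(self.log_sigma)
        return self.mu + sigma * rng.standard_normal((n, len(self.mu)))

    def log_prob(self, z):
        """Evaluate log q(z; lambda) for each row of z."""
        sigma = np.exp(self.log_sigma)
        return -0.5 * np.sum(((z - self.mu) / sigma) ** 2
                             + 2 * self.log_sigma
                             + np.log(2 * np.pi), axis=1)
```

Turning the knobs `mu` and `log_sigma` to bring this $q$ close to $p(z \mid x)$ is exactly the optimization problem above.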

KL divergence

Fitting one distribution to another usually requires a measure of similarity between the two. KL divergence is the usual choice, though other measures such as JS divergence exist. KL divergence is introduced below:

One of the most important concepts in machine learning is relative entropy. Relative entropy, also known as Kullback-Leibler divergence or information divergence, is an asymmetric measure of the difference between two probability distributions. If one distribution is the true distribution and the other is a theoretical (fitted) distribution, then relative entropy equals the cross entropy minus the entropy of the true distribution; it represents the information lost when the fitted distribution is used to approximate the true one. The formula is as follows:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p) = -\sum_{x} p(x)\log q(x) - \Big(-\sum_{x} p(x)\log p(x)\Big)$$

Merging the two sums, this can be written as:

$$D_{KL}(p \,\|\, q) = \sum_{x} p(x)\log \frac{p(x)}{q(x)}$$
Suppose the fitted distribution $q$ is exactly the same as the true distribution $p$. Then $H(p, q)$ is simply the entropy of the true distribution, which is fairly obvious: when the fitted distribution matches the true one exactly, the relative entropy equals 0. When the fit is imperfect, the relative entropy is greater than 0. The proof is as follows:

$$-D_{KL}(p \,\|\, q) = \sum_{x} p(x)\log \frac{q(x)}{p(x)} \le \sum_{x} p(x)\left(\frac{q(x)}{p(x)} - 1\right) = \sum_{x} q(x) - \sum_{x} p(x) = 0$$

The inequality follows from $\log t \le t - 1$, with equality only when $t = 1$, i.e. only when $q(x) = p(x)$ everywhere.

This property is critical, because it is exactly what gradient descent in deep learning relies on: if the neural network's fit were perfect, the loss would stop decreasing, whereas an imperfect fit keeps the divergence greater than zero, so gradient descent can continue.

But KL divergence has a downside: it is asymmetric. Using $q$ to fit $p$ and using $p$ to fit $q$ give different relative entropies, $D_{KL}(p \,\|\, q) \ne D_{KL}(q \,\|\, p)$ in general, even though the "distance" between the two distributions is the same. In other words, relative entropy does not correspond one-to-one with a distance. A numerical check follows below.
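The check uses two made-up discrete distributions (the numbers are arbitrary, chosen purely for illustration) to confirm both properties: positivity for an imperfect fit, and asymmetry.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.6, 0.3, 0.1])  # "true" distribution
q = np.array([0.4, 0.4, 0.2])  # "fitted" distribution

print(kl(p, q))  # ~0.0877, > 0: an imperfect fit gives positive KL
print(kl(q, p))  # ~0.0915, a different value: KL is asymmetric
print(kl(p, p))  # 0.0: identical distributions give zero divergence
```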

Solving the problem

Having introduced KL divergence, let's get back on track: the purpose of this article is variational inference. Here are the equivalent formula transformations involved:

$$\log p(x) = \log \frac{p(x, z)}{p(z \mid x)} = \log \frac{p(x, z)}{q(z)} - \log \frac{p(z \mid x)}{q(z)}$$

Taking the expectation of both sides with respect to $q(z)$ (the left side does not depend on $z$, so it is unchanged), we get:

$$\log p(x) = \mathbb{E}_{q(z)}\left[\log \frac{p(x, z)}{q(z)}\right] + \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$
At this point let's revisit our problem. Our goal is to make $q(z)$ close to $p(z \mid x)$, that is, to solve:

$$q^*(z) = \arg\min_{q} \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$

But because $p(z \mid x)$ is contained inside the KL term, this is very difficult to compute directly. Rearranging the identity derived above, we conclude:

$$\operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \log p(x) - \mathbb{E}_{q(z)}\left[\log \frac{p(x, z)}{q(z)}\right]$$

Treating $q$ as the variable, $\log p(x)$ is a constant, so minimizing the KL divergence is equivalent to:

$$q^*(z) = \arg\max_{q} \mathbb{E}_{q(z)}\left[\log \frac{p(x, z)}{q(z)}\right]$$
Now the goal of variational inference becomes:

$$q^*(z) = \arg\max_{q} \operatorname{ELBO}(q), \qquad \operatorname{ELBO}(q) = \mathbb{E}_{q(z)}\left[\log \frac{p(x, z)}{q(z)}\right]$$

  This quantity is referred to as the Evidence Lower Bound (ELBO). $\log p(x)$ is commonly referred to as the evidence, and because $\operatorname{KL}(q \,\|\, p) \ge 0$, we have $\log p(x) \ge \operatorname{ELBO}(q)$. That is why it is called the evidence lower bound.
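The identity $\log p(x) = \operatorname{ELBO} + \operatorname{KL}$ that this naming rests on can be checked numerically. Below is a sketch on a toy discrete model with one observed $x$ and three possible values of $z$; the joint probabilities and the variational $q$ are made-up numbers.

```python
import numpy as np

# Toy joint p(x, z): one observed x, three discrete values of z
p_xz = np.array([0.2, 0.15, 0.05])        # p(x, z) for z = 0, 1, 2
p_x = p_xz.sum()                          # evidence p(x) = 0.4
p_z_given_x = p_xz / p_x                  # true posterior p(z | x)

q = np.array([0.5, 0.3, 0.2])             # an arbitrary variational q(z)

elbo = np.sum(q * np.log(p_xz / q))       # E_q[log p(x, z) - log q(z)]
kl = np.sum(q * np.log(q / p_z_given_x))  # KL(q || p(z | x))

print(np.log(p_x), elbo + kl)             # equal (~ -0.916): the identity holds
print(np.log(p_x) >= elbo)                # True: the ELBO is a lower bound
```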

ELBO

The ELBO formula can be expanded as:

$$\operatorname{ELBO}(q) = \mathbb{E}_{q(z)}\big[\log p(x, z)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big]$$

The original identity can then be expressed as:

$$\log p(x) = \operatorname{ELBO}(q) + \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$

Equivalently, the ELBO can be expressed as:

$$\operatorname{ELBO}(q) = \log p(x) - \operatorname{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$
In fact, the EM algorithm (Expectation-Maximization) exploits exactly this property. It alternates two steps: in the E step, it assumes the model parameters $\theta$ are fixed and sets $q(z) = p(z \mid x; \theta)$, so that the ELBO touches the log-likelihood; in the M step, it optimizes the ELBO with respect to the model parameters. Compared with the variational method, the EM algorithm assumes that, when the model parameters are fixed, $p(z \mid x; \theta)$ has an easily computable form; the variational method has no such limitation and remains valid when the conditional probability is hard to compute. A minimal sketch follows below.
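To illustrate that alternation, here is a minimal EM sketch for a two-component 1D Gaussian mixture with unit variances. The model and data are made up for the example; the point is only to show the E step computing the tractable posterior $p(z \mid x; \theta)$ and the M step maximizing the ELBO over $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1D data drawn from two Gaussian clusters
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Model parameters theta: mixing weights pi and component means mu
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])

for _ in range(50):
    # E step: with theta fixed, p(z | x; theta) is tractable (the
    # responsibilities); setting q = p(z | x; theta) makes the ELBO tight.
    log_lik = -0.5 * (x[:, None] - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    r = pi * np.exp(log_lik)              # unnormalized p(z | x; theta)
    r /= r.sum(axis=1, keepdims=True)

    # M step: maximize the ELBO with respect to theta, in closed form
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(pi, mu)  # approaches the generating weights (0.4, 0.6) and means (-2, 3)
```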

So how do we solve this optimization problem? The mean-field, MCMC, and Black Box Variational Inference approaches are introduced below.

Mean-field variational family

We said before that we need to pick an appropriate family of approximate distributions $\mathcal{Q}$. So in a real problem, what form should we choose?

A simple and effective variational family is the mean-field variational family. It assumes that the hidden variables are mutually independent:

$$q(z) = \prod_{j=1}^{m} q_j(z_j)$$
This assumption may look rather strong, but it is still widely used in practice. It can also be relaxed: hidden variables that are actually correlated can be grouped together, writing $q$ as a product of joint distributions, one per group.

Using the ELBO and the mean-field assumption, we can handle this with the Coordinate Ascent Variational Inference (CAVI) method. Two facts are needed:

  • The chain rule of conditional probability gives: $p(z_{1:m}, x_{1:n}) = p(x_{1:n}) \prod_{j=1}^{m} p(z_j \mid z_{1:(j-1)}, x_{1:n})$

  • The expectation under the factorized variational distribution is: $\mathbb{E}_{q(z)}\big[\log q(z)\big] = \sum_{j=1}^{m} \mathbb{E}_{q_j}\big[\log q_j(z_j)\big]$

Substituting these into the definition of the ELBO, we get:

$$\operatorname{ELBO}(q) = \log p(x_{1:n}) + \sum_{j=1}^{m} \Big( \mathbb{E}_{q}\big[\log p(z_j \mid z_{1:(j-1)}, x_{1:n})\big] - \mathbb{E}_{q_j}\big[\log q_j(z_j)\big] \Big)$$
Taking the derivative with respect to $q_k(z_k)$ and setting it to zero gives:

$$\frac{d \operatorname{ELBO}}{d q_k(z_k)} = \mathbb{E}_{q_{-k}}\big[\log p(z_k \mid z_{-k}, x)\big] - \log q_k(z_k) - 1 = 0$$
Thus the coordinate-ascent update rule is obtained:

$$q_k^*(z_k) \propto \exp\Big\{ \mathbb{E}_{q_{-k}}\big[\log p(z_k \mid z_{-k}, x)\big] \Big\}$$
We can use this rule to repeatedly fix all the other factors $q_{-k}$ and update the factor $q_k$ for the current coordinate. The process is similar to Gibbs sampling, but where Gibbs sampling keeps drawing samples from the conditional probability, the CAVI algorithm keeps updating with the following equivalent form (equivalent because $p(z_k \mid z_{-k}, x)$ and $p(z_k, z_{-k}, x)$ differ only by a factor that does not depend on $z_k$):

$$q_k^*(z_k) \propto \exp\Big\{ \mathbb{E}_{q_{-k}}\big[\log p(z_k, z_{-k}, x)\big] \Big\}$$
The complete algorithm iterates these coordinate updates until the ELBO converges; a sketch is given below.
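The sketch is a concrete, hypothetical instance of CAVI. The target is a correlated 2D Gaussian $p(z) = \mathcal{N}(\mu, \Lambda^{-1})$ with precision matrix $\Lambda$ (toy numbers), approximated by a mean-field $q(z) = q_1(z_1)\,q_2(z_2)$. For a Gaussian target the update rule above works out in closed form: the optimal $q_k$ is Gaussian with variance $1/\Lambda_{kk}$ and mean $\mu_k - \Lambda_{kj}(\mathbb{E}[z_j] - \mu_j)/\Lambda_{kk}$, so only the variational means need to be iterated.

```python
import numpy as np

# Target: p(z) = N(mu, inv(Lam)), a correlated 2D Gaussian (toy numbers)
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.0]])   # precision matrix

# Mean-field q(z) = q1(z1) q2(z2); for this target each optimal factor
# q_k is Gaussian with variance 1 / Lam[k, k] and the mean updated below.
m = np.zeros(2)                # current variational means E[z_k]

for _ in range(20):            # coordinate ascent: update q1, then q2
    for k in range(2):
        j = 1 - k
        m[k] = mu[k] - Lam[k, j] * (m[j] - mu[j]) / Lam[k, k]

print(m)  # converges to the true posterior mean [1, -1]
```

Note the classic mean-field behavior: the means converge to the truth, but each factor's variance $1/\Lambda_{kk}$ understates the true marginal variance, because a factorized $q$ cannot represent the correlation.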

MCMC

The MCMC method approximates the posterior probability with Markov chain sampling, while the variational method approximates it with the result of an optimization. So when should we use MCMC, and when the variational method?

First of all, MCMC is computationally more expensive than the variational method, but it guarantees that the samples obtained are (asymptotically) distributed as the target, a guarantee the variational method lacks: it can only find a density close to the target distribution. On the other hand, the variational method is faster: since we have transformed the problem into optimization, results can be obtained quickly with techniques such as stochastic or distributed optimization. Therefore, when the amount of data is small, we can afford MCMC's extra computation to obtain more accurate samples; when the amount of data is large, the variational method is more appropriate.

The shape of the posterior also affects the choice. For example, for a mixture model with multiple modes, MCMC may get stuck exploring only one mode and fail to describe the others well. For such problems, the variational method may outperform MCMC even with a small amount of data.

Black box variational inference (BBVI)

The ELBO formula is expressed as:

$$L(\lambda) = \mathbb{E}_{q(z; \lambda)}\big[\log p(x, z) - \log q(z; \lambda)\big]$$

Here we have written the variational distribution with its parameters $\lambda$ made explicit, $q(z; \lambda)$ in place of $q(z)$. Taking the derivative with respect to $\lambda$:

$$\nabla_\lambda L = \nabla_\lambda \, \mathbb{E}_{q(z; \lambda)}\big[\log p(x, z) - \log q(z; \lambda)\big]$$
Expanding the expectation as an integral and differentiating directly:

$$\nabla_\lambda L = \int \nabla_\lambda q(z; \lambda)\,\big(\log p(x, z) - \log q(z; \lambda)\big)\,dz - \int q(z; \lambda)\,\nabla_\lambda \log q(z; \lambda)\,dz$$

Due to:

$$\int q(z; \lambda)\,\nabla_\lambda \log q(z; \lambda)\,dz = \int \nabla_\lambda q(z; \lambda)\,dz = \nabla_\lambda \int q(z; \lambda)\,dz = \nabla_\lambda 1 = 0$$

and using $\nabla_\lambda q(z; \lambda) = q(z; \lambda)\,\nabla_\lambda \log q(z; \lambda)$, we therefore have:

$$\nabla_\lambda L = \mathbb{E}_{q(z; \lambda)}\Big[\nabla_\lambda \log q(z; \lambda)\,\big(\log p(x, z) - \log q(z; \lambda)\big)\Big]$$
Estimating this expectation with Monte Carlo samples and updating $\lambda$ by stochastic gradient ascent is called Black Box Variational Inference (BBVI):

$$\nabla_\lambda L \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q(z_s; \lambda)\,\big(\log p(x, z_s) - \log q(z_s; \lambda)\big)$$

where $z_s \sim q(z; \lambda)$.
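Here is a minimal BBVI sketch implementing exactly this estimator. The model is a made-up conjugate toy, prior $z \sim \mathcal{N}(0, 1)$, likelihood $x \mid z \sim \mathcal{N}(z, 1)$, one observation $x = 2$, chosen so the result can be compared against the exact posterior $\mathcal{N}(1, 1/2)$. The variational family is $q(z; \lambda) = \mathcal{N}(m, s^2)$ with $\lambda = (m, \log s)$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 2.0  # the single observed data point

def log_joint(z):
    """log p(x, z): prior z ~ N(0, 1), likelihood x | z ~ N(z, 1)."""
    return -0.5 * z ** 2 - 0.5 * (x - z) ** 2 - np.log(2 * np.pi)

# Variational parameters lambda = (m, log_s) of q(z; lambda) = N(m, s^2)
m, log_s = 0.0, 0.0
lr, S = 0.05, 200            # step size and Monte Carlo sample count

for _ in range(2000):
    s = np.exp(log_s)
    z = m + s * rng.standard_normal(S)   # z_s ~ q(z; lambda)
    log_q = -0.5 * ((z - m) / s) ** 2 - log_s - 0.5 * np.log(2 * np.pi)
    w = log_joint(z) - log_q             # log p(x, z_s) - log q(z_s)
    # Score functions grad_lambda log q(z; lambda) for each parameter:
    g_m = (z - m) / s ** 2
    g_logs = ((z - m) / s) ** 2 - 1
    # Monte Carlo gradient estimate; ascend, since we maximize the ELBO
    m += lr * np.mean(g_m * w)
    log_s += lr * np.mean(g_logs * w)

print(m, np.exp(log_s))  # ~1.0 and ~0.71: close to the exact N(1, 1/2)
```

The score-function estimator is unbiased but noisy; in practice BBVI is usually paired with variance-reduction techniques such as Rao-Blackwellization and control variates.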

References

  • www.jianshu.com/p/b8ecabf9a…
  • blog.csdn.net/weixinhum/a…
  • zhuanlan.zhihu.com/p/51567052
  • zhuanlan.zhihu.com/p/49401976