The original address of this article is https://blog.oliverxu.cn

Importance sampling and KL divergence: analysis and implementations

Recently I have been reading the paper Guided Policy Search, which makes use of Importance Sampling, KL Divergence, and related techniques. Although I had used these before, I had never systematically organized notes on them. Guided Policy Search was published in 2013, yet the techniques used in later algorithms such as TRPO and PPO essentially already appear in it; my initial impression is that it is a classic paper.

The examples cited in this article are all policies over continuous action spaces in reinforcement learning.

Two policies are generated for validation:

As my research direction is reinforcement learning, the example here is also illustrated with a continuous-action-space policy. Taking the PPO algorithm as an example: with an actor-critic network structure, the input of the actor is the state and the output is the mean $\mu$ of a normal distribution for each action dimension. From that mean and a computed (or learned) variance, the full action distribution is obtained. PyTorch is used below.
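Before the distribution demo below, here is a minimal sketch of such an actor (my own illustration, not the exact network of any paper; state_dim, action_dim and the hidden size are hypothetical):

import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal

class GaussianActor(nn.Module):
    """Maps a state to a Gaussian action distribution: mean from the network, learnable log-std."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        # State-independent log standard deviation, a common choice in PPO-style actors
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mu = self.mean_net(state)                  # mean of the action distribution
        cov = torch.diag(self.log_std.exp() ** 2)  # diagonal covariance matrix
        return MultivariateNormal(mu, cov)

# Hypothetical usage with state_dim=3, action_dim=1:
# dist = GaussianActor(3, 1)(torch.zeros(3))
# action = dist.sample(); log_prob = dist.log_prob(action)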

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal

# Construct two distributions, p(x) and q(x); for simplicity the action dimension is set to 1
px = MultivariateNormal(torch.zeros(1), torch.eye(1))
qx = MultivariateNormal(torch.tensor([0.5]), torch.tensor([[2.0]]))

# Draw the probability density functions of the two distributions
## matplotlib is used here to plot the probability density functions, together with scipy.stats.norm. The main idea: a normal distribution has a closed-form probability density formula, so the density can be evaluated directly on the x-axis values, with no sampling needed.
x_axis = np.arange(-20, 20, 0.01)
px_mean = 0
px_std = 1

qx_mean = 0.5
qx_std = 2.0

fig = plt.figure()
ax = plt.subplot(1, 1, 1)
l1, = ax.plot(x_axis, norm.pdf(x_axis, px_mean, px_std))
l2, = ax.plot(x_axis, norm.pdf(x_axis, qx_mean, qx_std))
plt.legend([l1, l2], ['p(x)', 'q(x)'])
plt.savefig("pdf.pdf")

Function reference:

  1. PyTorch MultivariateNormal: https://pytorch.org/docs/stable/distributions.html ("Creates a multivariate normal (also called Gaussian) distribution parameterized by a mean vector and a covariance matrix.")

  2. For constructing the actor in PPO, you can refer to: github.com/nikhilbarha…

Importance sampling theory

Consider the case where we want to compute the expectation of a function $f(x)$ under some distribution $x \sim p(x)$. From the definition of the expectation of a continuous probability density function:


$$E[f(x)] = \int f(x)p(x)\,dx \approx \frac{1}{n}\sum_i f(x_i)$$

If the Monte Carlo method is used to compute this expectation, we repeatedly draw samples of $x$ from the distribution $p(x)$ and average $f$ over them. But when $p(x)$ is a hard-to-sample distribution (when does a hard-to-sample distribution arise in practice? what exactly counts as one?), the question becomes how to estimate its expectation by instead sampling from some known, simple distribution.
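As a minimal sketch of the plain Monte Carlo estimate (my own toy choice of f and p: f is a sigmoid and p(x) is a standard normal):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 1.0 / (1.0 + np.exp(-x))          # hypothetical f(x), a sigmoid
x = rng.normal(loc=0.0, scale=1.0, size=10000)  # x_i ~ p(x) = N(0, 1)
print("Monte Carlo estimate of E[f(x)]:", f(x).mean())  # close to 0.5 for this f and p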

The corresponding solution is the Importance Sampling technique; as the paper "Biased Importance Sampling for Deep Neural Network Training" puts it, importance sampling "has been successfully used to accelerate stochastic optimization in many convex problems."


$$E[f(x)] = \int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx \approx \frac{1}{n}\sum_i f(x_i)\frac{p(x_i)}{q(x_i)}$$

Here $x$ is sampled from the distribution $q(x)$, and $\frac{p(x)}{q(x)}$ is known as the sampling ratio or sampling weight. It acts as "a correction weight to offset the probability of sampling from a different distribution."

The zhihu article zhuanlan.zhihu.com/p/41217212 demonstrates importance sampling with the example of the area under a curve. When the integral is hard to solve analytically, it is usually estimated numerically: partition the integration interval, sample, sum, and take a limit, following the idea of calculus, to approximate the integral, i.e. the area under the curve.

If the sampling is uniform, an estimate is obtained, and it becomes more and more accurate as the number of samples increases. Conversely, when the number of samples is fixed, is there a way to improve the accuracy of the integral estimate and reduce its variance? The example given in that article is instructive: where the probability density is large, the function values there also contribute heavily to the integral, so allocating more samples to that region improves the accuracy accordingly.
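Here is a small sketch of that idea (my own toy example, not the one from the linked article): the area under a sharply peaked curve on [0, 1] is estimated once with uniform samples and once with a Beta proposal concentrated where the integrand is large; over repeated runs the importance-sampling estimate has noticeably lower variance.

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
f = lambda x: np.exp(-(x - 0.8) ** 2 / (2 * 0.05 ** 2))  # integrand peaked near x = 0.8

def uniform_estimate(n=1000):
    x = rng.uniform(0.0, 1.0, n)                   # q(x) = 1 on [0, 1]
    return f(x).mean()

def importance_estimate(n=1000, a=8.0, b=2.0):
    x = beta.rvs(a, b, size=n, random_state=rng)   # proposal concentrated near the peak
    return (f(x) / beta.pdf(x, a, b)).mean()       # weight each sample by 1/q(x)

runs_u = [uniform_estimate() for _ in range(200)]
runs_i = [importance_estimate() for _ in range(200)]
print("uniform:    mean %.4f std %.4f" % (np.mean(runs_u), np.std(runs_u)))
print("importance: mean %.4f std %.4f" % (np.mean(runs_i), np.std(runs_i)))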

Importance sampling implementation

When an off-policy algorithm is used, there are essentially two policies (setting TD3 aside): a behavior policy and a target policy.

  • Behavior policy: updated frequently; it is used to generate the episodes needed during learning.
  • Target policy: updated slowly; typically, the parameters of the behavior policy are copied to the target policy after the behavior policy has been updated for some time.

In other words, there are really two policies, one updated slowly and one updated quickly; ideally, after enough iterations both converge to the optimum. Using the importance sampling technique, the expected return of the target policy can be estimated from samples generated by the behavior policy.
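A rough sketch of that idea at a single fixed state (the two action distributions and the per-action return are hypothetical; this is not a full off-policy evaluation routine):

import torch
from torch.distributions import MultivariateNormal

# Hypothetical action distributions of the two policies at one fixed state
behavior = MultivariateNormal(torch.tensor([0.0]), torch.eye(1))
target = MultivariateNormal(torch.tensor([0.3]), torch.eye(1))

actions = behavior.sample((10000,))            # actions generated by the behavior policy
returns = -(actions ** 2).sum(dim=-1)          # hypothetical return for each action
weights = torch.exp(target.log_prob(actions) - behavior.log_prob(actions))  # pi_target / pi_behavior
print("Estimated expected return of the target policy:", (weights * returns).mean())

The standalone demo below applies the same weighted average to two generic Gaussian distributions p(x) and q(x):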

    def fx(x):
        return 1/ (1+np.exp(-x))
        #return x

    px = MultivariateNormal(torch.tensor([3.2], dtype=torch.float32), torch.tensor([[1.0]], dtype=torch.float32))
    qx = MultivariateNormal(torch.tensor([3.0], dtype=torch.float32), torch.tensor([[1.0]], dtype=torch.float32))
    x_axis = np.arange(-20, 20, 0.01)
    px_mean = 3.2
    px_std = 1.0
    qx_mean = 3.0
    qx_std = 1.0
    fig = plt.figure()
    ax = plt.subplot(1, 1, 1)
    l1, = ax.plot(x_axis, norm.pdf(x_axis, px_mean, px_std))
    l2, = ax.plot(x_axis, norm.pdf(x_axis, qx_mean, qx_std))
    plt.legend([l1, l2], ['p(x)', 'q(x)'])
    plt.savefig("pdf1.pdf")

    n = 10000
    s = 0
    for i in range(n):
        #pdb.set_trace()
        x_i = px.sample()
        s += fx(x_i)
    print("Simulation Value", s/n)

    # Estimate the same expectation by sampling from q(x) with importance sampling
    s = 0
    for i in range(n):
        x_i = qx.sample()
        s += fx(x_i) * torch.exp(px.log_prob(x_i) - qx.log_prob(x_i))  # importance weight p(x_i)/q(x_i)
    print("IS results: ", s/n)
Simulation Value tensor([0.9414])
IS results:  tensor([0.9398])

Question: how is IS used in RL (for example, in PPO)? What is fx in that case?

Answer: fx corresponds to the advantage; the final objective function is the mean of the importance-weighted advantage, and this mean is maximized (a sketch follows below).

You can refer to: zhuanlan.zhihu.com/p/388707220
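As a rough sketch of how the importance ratio and the advantage enter the objective, here is the clipped PPO surrogate loss; the tensors new_log_probs, old_log_probs and advantages are hypothetical inputs that a training loop would supply:

import torch

def ppo_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO objective: mean of the importance-weighted advantage (negated for minimization)."""
    ratio = torch.exp(new_log_probs - old_log_probs)                      # importance weight pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Hypothetical usage inside a training loop:
# loss = ppo_surrogate_loss(new_lp, old_lp, adv); loss.backward()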

KL divergence theory

Criteria for quantification of information:

  • Events that are highly likely to occur should have less information, and in extreme cases, events that are guaranteed to occur should have no information.
  • Less likely events are more informative.
  • Independent events should carry additive information. For example, flipping a coin heads twice should convey twice as much information as flipping it heads once.

The self-information of an event $X = x$ is defined as:


$$I(x) = -\log P(x)$$

Here $\log$ is the natural logarithm, with base $e$. Intuitively, the higher the probability of an event, the less surprising it is; in other words, the self-information defined above is smaller.

The definition of Entropy

In information theory, entropy is the measure of information; in physics and thermodynamics, it is the measure of disorder.

Shannon entropy gives the total amount of uncertainty of the entire probability distribution to which the event belongs:

$$H(x) = E_{x \sim P}[I(x)] = -E_{x \sim P}[\log P(x)] = -\sum_x P(x)\log P(x)$$
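A small sketch checking these definitions on a discrete distribution (my own example, a fair coin): the self-information of heads is log 2 ≈ 0.693 nats, and the hand-computed entropy matches PyTorch's Categorical.entropy():

import torch
from torch.distributions import Categorical

p = torch.tensor([0.5, 0.5])                                  # fair coin
print("self-information of heads:", -torch.log(p[0]))         # ~0.6931 nats
print("entropy by hand:          ", -(p * torch.log(p)).sum())
print("entropy from PyTorch:     ", Categorical(probs=p).entropy())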

Relative entropy (KL divergence)

For a random variable $x$ with two distributions $P(x)$ and $Q(x)$, the KL divergence can be used to measure the difference between the two distributions. Note that what follows is the divergence of $Q$ relative to $P$:

$$D_{KL}(P \| Q) = E_{x \sim P}\left[\log\frac{P(x)}{Q(x)}\right] = E_{x \sim P}[\log P(x) - \log Q(x)] = \sum_x P(x)\,(\log P(x) - \log Q(x))$$
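The discrete form can be checked directly against PyTorch (a minimal sketch with two made-up categorical distributions):

import torch
from torch.distributions import Categorical, kl_divergence

P = torch.tensor([0.7, 0.2, 0.1])
Q = torch.tensor([0.5, 0.3, 0.2])
manual = (P * (torch.log(P) - torch.log(Q))).sum()   # sum_x P(x) (log P(x) - log Q(x))
print(manual, kl_divergence(Categorical(probs=P), Categorical(probs=Q)))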

KL divergence implementation

from torch.distributions import MultivariateNormal, kl_divergence

px = MultivariateNormal(torch.tensor([5.0], dtype=torch.float32), torch.tensor([[1.0]], dtype=torch.float32))
qx = MultivariateNormal(torch.tensor([2.0], dtype=torch.float32), torch.tensor([[1.0]], dtype=torch.float32))
kl_div = kl_divergence(px, qx)
print(kl_div)
tensor(4.5000)

Reference: pytorch.org/docs/stable…

TORCH.DISTRIBUTIONS: this module is modeled on the TensorFlow Distributions package. Since backpropagating directly through random samples is not feasible, there are two main ways to construct surrogate functions that can be differentiated (these correspond to the surrogate objectives discussed in the TRPO and PPO papers): the score function estimator and the pathwise derivative estimator.

  • Score function: uses the log-probability of a sample, as in REINFORCE; with torch.distributions this corresponds to sample() plus log_prob().
  • Pathwise derivative: uses the reparameterization trick so that gradients flow through the sample itself; this corresponds to rsample(). Both estimators are sketched below.
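A minimal sketch of the two estimators on a toy objective E[x^2] under a learnable Normal (my own example, following the pattern described in the PyTorch distributions docs):

import torch
from torch.distributions import Normal

mu = torch.tensor(1.0, requires_grad=True)

# Score function (REINFORCE): the gradient flows through log_prob, not through the sample
dist = Normal(mu, 1.0)
x = dist.sample()
loss_score = dist.log_prob(x) * (x ** 2).detach()
loss_score.backward()
print("score-function gradient:", mu.grad)

# Pathwise derivative: the gradient flows through the reparameterized sample
mu.grad = None
x = Normal(mu, 1.0).rsample()
loss_path = x ** 2
loss_path.backward()
print("pathwise gradient:      ", mu.grad)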

torch.distributions.kl.kl_divergence(p, q)

Validation: for one-dimensional Gaussian distributions, the derivation of the KL divergence can be found at zhuanlan.zhihu.com/p/22464760; its final expression is:


$$\int p_1(x)\log\frac{p_1(x)}{p_2(x)}\,dx = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
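Plugging the two distributions used above (mu1 = 5, mu2 = 2, sigma1 = sigma2 = 1) into this closed-form expression as a quick check:

import torch

mu1, sigma1 = 5.0, 1.0
mu2, sigma2 = 2.0, 1.0
kl_closed_form = (torch.log(torch.tensor(sigma2 / sigma1))
                  + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
                  - 0.5)
print(kl_closed_form)   # tensor(4.5000), matching kl_divergence(px, qx) above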

Comparing the two approaches above, the results are identical; a follow-up article is still needed to analyze the source code of torch.distributions.kl.kl_divergence and how it computes the divergence.
