1. Introduction

Unlike ordinary neural networks, the weight parameters of a Bayesian neural network (BNN) are random variables rather than fixed values.

In other words, in contrast to traditional neural networks, which use loss functions such as cross-entropy or MSE to fit label values, Bayesian neural networks fit a posterior distribution over the weights.

This has the advantage of reducing overfitting.

2. BNN model

BNNs differ from DNNs in that they learn a distribution over predictions: they give not only a predicted value but also the uncertainty of that prediction. This is crucial for many problems, such as the well-known exploration-exploitation (EE) trade-off in machine learning. In reinforcement learning, an agent must decide whether to act on existing knowledge or to try something unknown. In experimental design, Bayesian optimization is used to tune hyperparameters: the next point is chosen either near the optimum of the current model or by exploring regions of high uncertainty. And in tasks such as anomaly detection and adversarial example detection, BNNs are highly robust thanks to their ability to quantify uncertainty.

Probabilistic modeling: here, a conjugate prior of the likelihood distribution is chosen so that the posterior can be computed analytically. For example, a Beta prior combined with a Bernoulli likelihood yields a posterior that also follows a Beta distribution.
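To make this concrete, here is a minimal sketch of the Beta-Bernoulli conjugate update (using scipy; the numbers are illustrative and not from the original text): a Beta(a, b) prior combined with k successes out of n Bernoulli trials gives a Beta(a + k, b + n − k) posterior in closed form.

from scipy.stats import beta

a, b = 2, 2                           # Beta(2, 2) prior over the Bernoulli success probability
k, n = 7, 10                          # observed: 7 successes in 10 trials
posterior = beta(a + k, b + (n - k))  # conjugacy: posterior is Beta(a + k, b + n - k)
print(posterior.mean())               # posterior mean = (a + k)/(a + b + n) ≈ 0.643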

Conjugacy, however, constrains the choice of prior. When it does not hold, we instead approximate the posterior distribution with sampling or variational inference.


Neural network: fitting data with a fully connected network amounts to stacking many linear and nonlinear transformations. But such a network overfits easily, generalizes poorly, and cannot attach a confidence to its predictions.

BNN: Combines probabilistic modeling with neural networks and can give confidence in the predicted results.

The prior describes the key parameters and serves as the input to the neural network. The output of the neural network describes the likelihood of a particular probability distribution. The posterior distribution is then computed by sampling or variational inference. Meanwhile, unlike in an ordinary neural network, the weight W is no longer a fixed value but a probability distribution.


BNN is modeled as follows:

Assume the network parameters of the NN are $W$, and that $p(W)$ is the prior distribution over the parameters. Given observed data $\mathcal{D} = \{X, Y\}$, where $X$ is the input data and $Y$ the label data, BNN aims to give the following distribution:

That is, our predicted value is:


$$P(Y^\star \vert X^\star, \mathcal{D}) = \int P(Y^\star \vert X^\star, W)\, P(W \vert \mathcal{D})\, dW \quad (1)$$

Since $W$ is a random variable, our predicted value is also a random variable.

where:


$$P(W \vert \mathcal{D}) = \frac{P(W)\, P(\mathcal{D} \vert W)}{P(\mathcal{D})} \quad (2)$$

Here $P(W \vert \mathcal{D})$ is the posterior distribution, $P(\mathcal{D} \vert W)$ is the likelihood function, and $P(\mathcal{D})$ is the marginal likelihood.

It can be seen from formula (1) that the core of BNN probabilistic modeling and prediction lies in efficient approximate posterior inference, for which variational inference (VI) and sampling are well-suited methods.

If sampling is used: we draw samples $w$ from the posterior distribution $P(W \vert \mathcal{D})$ and evaluate $f(X \vert w)$ for each sample, where $f$ is our neural network.

It is precisely because our output is a distribution, not a value, that we can estimate the uncertainty of our predictions.
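To make this concrete, here is a minimal sketch (not from the original text; the network `f` and the list `posterior_samples` are hypothetical placeholders) of approximating the integral in formula (1) with a Monte Carlo average over posterior weight samples:

import torch

def mc_predict(f, x, posterior_samples):
    # Monte Carlo estimate of formula (1): average the network output f(x; w)
    # over weight samples w drawn from the posterior P(W | D)
    preds = torch.stack([f(x, w) for w in posterior_samples])
    return preds.mean(dim=0), preds.std(dim=0)  # predictive mean and its uncertainty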

3. BNN training based on variational inference

If we directly sample the posterior $p(W \vert \mathcal{D})$ to evaluate $p(Y \vert X, \mathcal{D})$, we face the problem that the posterior is high-dimensional. The idea of variational inference is instead to approximate the posterior with a simple distribution.

Let $\theta = (\mu, \sigma)$, and sample each weight $w_i$ from the normal distribution $\mathcal{N}(\mu_i, \sigma_i)$.

We want $q(W \vert \theta)$ and $P(W \vert \mathcal{D})$ to be close, and use the KL divergence to measure the distance between the two distributions. That is, we optimize:


$$\theta^* = \underset{\theta}{\mathrm{argmin}}\ \text{KL}\left[q(w \vert \theta)\, \Vert\, P(w \vert \mathcal{D})\right] \quad (3)$$

Further derivation:


$$
\begin{aligned}
\theta^* &= \underset{\theta}{\mathrm{argmin}}\ \text{KL}\left[q(w \vert \theta)\, \Vert\, P(w \vert \mathcal{D})\right] \\
&= \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{q(w \vert \theta)}\left[\log \frac{q(w \vert \theta)}{P(w \vert \mathcal{D})}\right] && \text{(definition of KL divergence)} \\
&= \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{q(w \vert \theta)}\left[\log \frac{q(w \vert \theta)\, P(\mathcal{D})}{P(\mathcal{D} \vert w)\, P(w)}\right] && \text{(Bayes' theorem)} \\
&= \underset{\theta}{\mathrm{argmin}}\ \mathbb{E}_{q(w \vert \theta)}\left[\log \frac{q(w \vert \theta)}{P(\mathcal{D} \vert w)\, P(w)}\right] && \text{(drop } P(\mathcal{D}) \text{, which does not depend on } \theta\text{)}
\end{aligned}
\quad (4)
$$

In this formula, $q(w \vert \theta)$ is the distribution of the weights given the parameters of the normal distribution; $P(\mathcal{D} \vert w)$ is the likelihood of the observed data given the network parameters; and $P(w)$ is the prior over the weights, which can act as a regularizer for the model.

We use


$$\mathcal{L} = -\mathbb{E}_{q(w \vert \theta)}\left[\log \frac{q(w \vert \theta)}{P(\mathcal{D} \vert w)\, P(w)}\right] \quad (5)$$

to denote the variational lower bound (ELBO); formula (4) is thus equivalent to maximizing the ELBO. In practice we minimize its negative, approximating the expectation with Monte Carlo samples drawn from $q(w \vert \theta)$:


$$\mathcal{L} = \sum_i \log q(w_i \vert \theta_i) - \sum_i \log P(w_i) - \sum_j \log P(y_j \vert w, x_j) \quad (6)$$

where $\mathcal{D} = \{(x_j, y_j)\}$.

We need to take the derivative of the expectation in formula (4) with respect to $\theta$; to do so, we use the reparameterization trick for the weights:


$$w_i = \mu_i + \sigma_i \times \epsilon_i \quad (7)$$

where $\epsilon_i \sim \mathcal{N}(0, 1)$.

So, replacing the expectation over $w$ with an expectation over $\epsilon$:


$$\frac{\partial}{\partial \theta}\mathbb{E}_{q(\epsilon)}\left[\log \frac{q(w \vert \theta)}{P(\mathcal{D} \vert w)\, P(w)}\right] = \mathbb{E}_{q(\epsilon)}\left[\frac{\partial}{\partial \theta}\log \frac{q(w \vert \theta)}{P(\mathcal{D} \vert w)\, P(w)}\right] \quad (8)$$

In other words, we can draw several $\epsilon \sim \mathcal{N}(0, 1)$, compute $\frac{\partial}{\partial \theta} \log\left[\frac{q(w \vert \theta)}{P(\mathcal{D} \vert w)\, P(w)}\right]$ for each, and average the results to approximate the derivative of the KL divergence with respect to $\theta$.
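Here is a minimal PyTorch sketch of this reparameterized Monte Carlo gradient; the quadratic objective is a stand-in for the log-ratio in formula (8), chosen only to keep the example self-contained:

import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

grads = []
for _ in range(1000):                  # several draws of epsilon ~ N(0, 1)
    eps = torch.randn(())              # epsilon carries all the randomness
    w = mu + sigma * eps               # w is a differentiable function of (mu, sigma)
    loss = (w - 2.0) ** 2              # stand-in objective, not the true log-ratio
    g_mu, g_sigma = torch.autograd.grad(loss, (mu, sigma))
    grads.append(g_mu)
print(torch.stack(grads).mean())       # approximates dE[loss]/dmu = 2*(mu - 2) = -3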

In addition to reparameterizing $w$, and to let the variational parameters range over the whole real axis while keeping $\sigma$ positive, $\sigma$ is itself reparameterized as


$$\sigma = \log(1 + e^{\rho}) \quad (9)$$

Then $\theta = (\mu, \rho)$; note that this $\theta$ differs from the original definition $\theta = (\mu, \sigma)$.
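As a quick sanity check (a small sketch, not part of the original text): $\log(1 + e^{\rho})$ is the softplus function, which is strictly positive for every real $\rho$:

import torch

rho = torch.tensor([-5.0, 0.0, 5.0])   # rho may be any real number
sigma = torch.log(1 + torch.exp(rho))  # softplus; always strictly positive
print(sigma)                           # tensor([0.0067, 0.6931, 5.0067])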

4. BNN practice

Algorithm:

  1. Sample $w$ from $\mathcal{N}(\mu, \log(1 + e^{\rho}))$.
  2. Compute $\log q(w \vert \theta)$, $\log p(w)$, and $\log p(y \vert w, x)$. In practice, $\log p(y \vert w, x)$ is computed as $\log p(y \vert y_{pred})$ with $y_{pred} = w * x$. This gives $\mathcal{L} = \sum_i \log q(w_i \vert \theta_i) - \sum_i \log P(w_i) - \sum_j \log P(y_j \vert w, x_j)$.
  3. Repeat, updating the parameters with $\theta = \theta - \alpha \nabla_\theta \mathcal{L}$.

PyTorch implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

class Linear_BBB(nn.Module):
    """ Layer of our BNN. """
    def __init__(self, input_features, output_features, prior_var=1.):
        """ Initialization of our layer: the prior is a normal distribution centered at 0 with scale prior_var. """
        # initialize layers
        super().__init__()
        # set input and output dimensions
        self.input_features = input_features
        self.output_features = output_features

        # initialize mu and rho parameters for the weights of the layer
        self.w_mu = nn.Parameter(torch.zeros(output_features, input_features))
        self.w_rho = nn.Parameter(torch.zeros(output_features, input_features))

        #initialize mu and rho parameters for the layer's bias
        self.b_mu =  nn.Parameter(torch.zeros(output_features))
        self.b_rho = nn.Parameter(torch.zeros(output_features))        

        #initialize weight samples (these will be calculated whenever the layer makes a prediction)
        self.w = None
        self.b = None

        # initialize prior distribution for all of the weights and biases
        # (note: Normal takes the scale, i.e. standard deviation, as its second argument)
        self.prior = torch.distributions.Normal(0, prior_var)

    def forward(self, input):
        """ Forward pass: sample weights and biases, then apply the linear map. """
        # sample weights via the reparameterization trick: w = mu + softplus(rho) * epsilon
        w_epsilon = Normal(0, 1).sample(self.w_mu.shape)
        self.w = self.w_mu + torch.log(1+torch.exp(self.w_rho)) * w_epsilon

        # sample bias
        b_epsilon = Normal(0, 1).sample(self.b_mu.shape)
        self.b = self.b_mu + torch.log(1+torch.exp(self.b_rho)) * b_epsilon

        # record log prior by evaluating log pdf of prior at sampled weight and bias
        w_log_prior = self.prior.log_prob(self.w)
        b_log_prior = self.prior.log_prob(self.b)
        self.log_prior = torch.sum(w_log_prior) + torch.sum(b_log_prior)

        # record the log variational posterior by evaluating the log pdf of the normal
        # distribution defined by (mu, softplus(rho)) at the sampled weight and bias values
        self.w_post = Normal(self.w_mu.data, torch.log(1+torch.exp(self.w_rho)))
        self.b_post = Normal(self.b_mu.data, torch.log(1+torch.exp(self.b_rho)))
        self.log_post = self.w_post.log_prob(self.w).sum() + self.b_post.log_prob(self.b).sum()

        return F.linear(input, self.w, self.b)

class MLP_BBB(nn.Module):
    def __init__(self, hidden_units, noise_tol=1., prior_var=1.):

        # initialize the network like you would with a standard multilayer perceptron, but using the BBB layer
        super().__init__()
        self.hidden = Linear_BBB(1,hidden_units, prior_var=prior_var)
        self.out = Linear_BBB(hidden_units, 1, prior_var=prior_var)
        self.noise_tol = noise_tol # we will use the noise tolerance to calculate our likelihood

    def forward(self, x):
        # again, this is equivalent to a standard multilayer perceptron
        x = torch.sigmoid(self.hidden(x))
        x = self.out(x)
        return x

    def log_prior(self):
        # calculate the log prior over all the layers
        return self.hidden.log_prior + self.out.log_prior

    def log_post(self):
        # calculate the log posterior over all the layers
        return self.hidden.log_post + self.out.log_post

    def sample_elbo(self, input, target, samples):
        # we calculate the negative elbo, which will be our loss function
        #initialize tensors
        outputs = torch.zeros(samples, target.shape[0])
        log_priors = torch.zeros(samples)
        log_posts = torch.zeros(samples)
        log_likes = torch.zeros(samples)
        # make predictions and calculate prior, posterior, and likelihood for a given number of samples
        for i in range(samples):
            outputs[i] = self(input).reshape(-1) # make predictions
            log_priors[i] = self.log_prior() # get log prior
            log_posts[i] = self.log_post() # get log variational posterior
            log_likes[i] = Normal(outputs[i], self.noise_tol).log_prob(target.reshape(-1)).sum() # calculate the log likelihood
        # calculate monte carlo estimate of prior posterior and likelihood
        log_prior = log_priors.mean()
        log_post = log_posts.mean()
        log_like = log_likes.mean()
        # calculate the negative elbo (which is our loss function)
        loss = log_post - log_prior - log_like
        return loss

def toy_function(x):
    return -x**4 + 3*x**2 + 1

# toy dataset we can start with
x = torch.tensor([-2, -1.8, -1, 1, 1.8, 2]).reshape(-1, 1)
y = toy_function(x)

net = MLP_BBB(32, prior_var=10)
optimizer = optim.Adam(net.parameters(), lr=.1)
epochs = 2000
for epoch in range(epochs):  # loop over the dataset multiple times
    optimizer.zero_grad()
    # forward + backward + optimize
    loss = net.sample_elbo(x, y, 1)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print('epoch: {}/{}'.format(epoch+1,epochs))
        print('Loss:', loss.item())
print('Finished Training')


# samples is the number of "predictions" we make for 1 x-value.
samples = 100
x_tmp = torch.linspace(-5, 5, 100).reshape(-1, 1)
y_samp = np.zeros((samples,100))
for s in range(samples):
    y_tmp = net(x_tmp).detach().numpy()
    y_samp[s] = y_tmp.reshape(-1)
plt.plot(x_tmp.numpy(), np.mean(y_samp, axis = 0), label='Mean Posterior Predictive')
plt.fill_between(x_tmp.numpy().reshape(-1), np.percentile(y_samp, 2.5, axis = 0), np.percentile(y_samp, 97.5, axis = 0), alpha = 0.25, label='95% Confidence')
plt.legend()
plt.scatter(x, toy_function(x))
plt.title('Posterior Predictive')
plt.show()

The area plot shows the mean of the 100 sampled predictions, together with the band between their 2.5th and 97.5th percentiles (i.e., a 95% confidence interval).
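Building on the arrays above (a small sketch, not part of the original code), a per-input scalar uncertainty can also be read off as the standard deviation across the sampled predictions:

y_std = y_samp.std(axis=0)  # per-x predictive standard deviation across the 100 samples
print(y_std.max())          # the spread is largest for inputs far from the training points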

