
Please follow my official account [Jizhi Vision] for more notes like this.

Hi, I’m Jizhi Vision. This article discusses Caffe’s BN and Scale operators.

The BN operator in Caffe is a bit different from that of other frameworks such as PyTorch and Darknet. I have written about BN before; if you are interested, see “[Model Inference] BN and IN operators from the perspective of deployment”.

Here we will look at how Caffe’s BN compares with other frameworks, and at Caffe’s CPU and GPU implementations of BN forward inference.

1. What makes Caffe’s BN different

In my article “[Model Inference] BN and IN operators from the perspective of deployment”, I wrote that the computation of the BN operator is mathematically expressed as follows.
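For a mini-batch with m values per channel, the standard formulation is:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i,\qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i-\mu\right)^2$$

$$\hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}},\qquad y_i = \gamma\,\hat{x}_i + \beta$$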

The basic process is:

(1) Compute the mean;

(2) Compute the variance;

(3) Normalize;

(4) Scale and bias.

Steps (1) and (2), computing the mean and variance, are already done during training, so for inference it is only necessary to load the offline weights. Now for the difference: BN in Caffe only performs (3) normalization, while (4) scaling and bias are handled by the Scale operator, so a complete BN computation in Caffe requires two operators, BatchNorm + Scale. This is different from other frameworks like PyTorch or Darknet.
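Written out, the split looks like this:

$$\underbrace{\hat{x} = \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}}_{\text{Caffe BatchNorm}}\qquad\qquad \underbrace{y = \gamma\,\hat{x}+\beta}_{\text{Caffe Scale}}$$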

Looking at PyTorch’s conv + BN + activation block, the three operators are separate and ordered, and the BN there is a complete BN.

The same goes for ONNX. ONNX operators are fine-grained, yet its BatchNormalization is still a complete BN.

If you look at Darknet, BN is folded into the conv layer by default, so a conv + BN + activation block in Darknet (please ignore that the activation here is not ReLU, I just picked one at random) is a fused operator; the batch_normalize flag determines whether BN is applied, and again the BN here is a full BN.

Finally, look at the conv + BN + activation block in Caffe, and you can see that things are different here: BN is broken down into BatchNorm + Scale, so a conv + BN + activation block becomes conv + BatchNorm + Scale + activation in Caffe. The BatchNorm here is not the full BN, and this is where the BN in Caffe differs from the BN in other frameworks.
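As a concrete illustration, a minimal prototxt sketch of such a block might look like this (the layer and blob names are made up for the example):

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param { num_output: 64  kernel_size: 3  pad: 1 }
}
layer {
  name: "conv1_bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param { use_global_stats: true }  # inference: use stored stats
}
layer {
  name: "conv1_scale"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }  # scaling + bias completes the BN
}
layer {
  name: "conv1_relu"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}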

Look at the BatchNorm layer’s use_global_stats parameter: if it is true, the saved mean and variance are used; otherwise the moving average is used to compute a new mean and variance. Either way, this BN only normalizes.

Then look at the Scale layer’s bias_term parameter, and you may suddenly see it: a plain Scale only does scaling, while a Scale with bias does scaling + bias. Combine this with BatchNorm’s normalization and you get exactly the whole BN process.
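For reference, the corresponding ScaleParameter in caffe.proto looks roughly like this (abridged: the filler fields are omitted, and the comments are mine):

message ScaleParameter {
  // the axis to scale along; default 1 is the channel axis of an NCHW blob
  optional int32 axis = 1 [default = 1];
  // how many axes the scale covers; default 1 means one factor per channel
  optional int32 num_axes = 2 [default = 1];
  // if true, also add a learned bias after scaling
  optional bool bias_term = 4 [default = false];
}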

Now let’s look at Caffe’s source code.

2. Caffe BN Forward_cpu implementation

Start with the BatchNormParameter definition in caffe.proto:

message BatchNormParameter {
  optional bool use_global_stats = 1;

  optional float moving_average_fraction = 2 [default = .999];
  
  optional float eps = 3 [default = 1e-5];
}

moving_average_fraction is the moving-average coefficient. If use_global_stats is true, moving_average_fraction is not used; otherwise moving_average_fraction is used to update the mean and variance. eps is a hyperparameter that prevents the denominator from being zero.

Next, take a look at the BN layer header (I have omitted the enclosing caffe namespace):

template <typename Dtype>
class BatchNormLayer : public Layer<Dtype> {
 public:
  explicit BatchNormLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "BatchNorm"; }
  virtual inline int ExactNumBottomBlobs() const { return 1; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
     const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  Blob<Dtype> mean_, variance_, temp_, x_norm_;
  bool use_global_stats_;
  Dtype moving_average_fraction_;
  int channels_;
  Dtype eps_;

  // extra temporary variables used to carry out sums/broadcasting
  // using BLAS
  Blob<Dtype> batch_sum_multiplier_;
  Blob<Dtype> num_by_chans_;
  Blob<Dtype> spatial_sum_multiplier_;
};

Several key functions are declared in BatchNormLayer: Forward_cpu, Forward_gpu, Backward_cpu and Backward_gpu. Backward is used for propagating gradients during training, which we don’t care about for inference. The use_global_stats_ member controls the behavior: if it is true, the stored mean and variance are used directly; if it is false, the moving average is used to compute the mean first, and the variance is then updated from that mean. The latter is what happens during training.

First, let’s talk about the moving-average update of the mean and variance. Training does not end after a single forward pass: mini-batches are repeatedly sampled from the full training set for many forward passes, so we have to decide how to combine the mean and variance obtained each time. Caffe does not simply add up the statistics from every pass. Instead, it decays the contribution of the previous statistics (multiplying them by a factor less than one) and adds the result of the current pass; this is what is meant by a moving-average update.
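Reading ahead in the code below, the update amounts to the following, where λ is moving_average_fraction, s is the accumulated scale factor stored in blobs_[2], and m is the number of values per channel:

$$\mu_{\text{stored}} \leftarrow \lambda\,\mu_{\text{stored}} + \mu_{\text{batch}},\qquad \sigma^2_{\text{stored}} \leftarrow \lambda\,\sigma^2_{\text{stored}} + \frac{m}{m-1}\,\sigma^2_{\text{batch}},\qquad s \leftarrow \lambda\,s + 1$$

At inference time the stored sums are divided by s to recover the estimates; that is exactly the scale_factor computation at the top of the forward code below.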

The forward mean calculation code is as follows:

if (use_global_stats_) {
  // use the stored mean/variance estimates.
  const Dtype scale_factor = this->blobs_[2]->cpu_data()[0] == 0 ?
      0 : 1 / this->blobs_[2]->cpu_data()[0];
  caffe_cpu_scale(variance_.count(), scale_factor,
      this->blobs_[0]->cpu_data(), mean_.mutable_cpu_data());
  caffe_cpu_scale(variance_.count(), scale_factor,
      this->blobs_[1]->cpu_data(), variance_.mutable_cpu_data());
} else {
  // compute mean
  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
      1. / (num * spatial_dim), bottom_data,
      spatial_sum_multiplier_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
      mean_.mutable_cpu_data());
}
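The two gemv calls can be hard to read; written as plain loops, what they compute is roughly the following (channel_mean_sketch is a hypothetical helper for illustration, not Caffe code):

// num_by_chans_[n*channels + c] = (1/(num*spatial_dim)) * spatial sum
// mean_[c] = sum over n of num_by_chans_[n*channels + c]
template <typename Dtype>
void channel_mean_sketch(const Dtype* x, int num, int channels,
    int spatial_dim, Dtype* mean) {
  for (int c = 0; c < channels; ++c) mean[c] = 0;
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels; ++c) {
      const Dtype* plane = x + (n * channels + c) * spatial_dim;
      Dtype sum = 0;
      for (int s = 0; s < spatial_dim; ++s) sum += plane[s];  // spatial sum
      mean[c] += sum / (num * spatial_dim);  // accumulate over the batch
    }
  }
}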

Then the variance is computed based on this mean, using var(X) = E((X−EX)²) on the already mean-subtracted data, and this is where the code gets more involved. Note also the bias_correction_factor m/(m−1), which turns the batch variance into an unbiased estimate before it enters the moving average.

if (!use_global_stats_) {
  // compute variance using var(X) = E((X-EX)^2)
  caffe_sqr<Dtype>(top[0]->count(), top_data,
      temp_.mutable_cpu_data());  // (X-EX)^2
  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
      1. / (num * spatial_dim), temp_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
      variance_.mutable_cpu_data());  // E((X-EX)^2)

  // compute and save moving average
  this->blobs_[2]->mutable_cpu_data()[0] *= moving_average_fraction_;
  this->blobs_[2]->mutable_cpu_data()[0] += 1;
  caffe_cpu_axpby(mean_.count(), Dtype(1), mean_.cpu_data(),
      moving_average_fraction_, this->blobs_[0]->mutable_cpu_data());
  int m = bottom[0]->count()/channels_;
  Dtype bias_correction_factor = m > 1 ? Dtype(m)/(m-1) : 1;
  caffe_cpu_axpby(variance_.count(), bias_correction_factor,
      variance_.cpu_data(), moving_average_fraction_,
      this->blobs_[1]->mutable_cpu_data());
}

Here’s the normalization process, starting with subtracting the mean (in the source file this subtraction actually comes before the variance block above, which is why top_data there already holds X − EX):

// subtract computed_mean
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
    batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
    num_by_chans_.mutable_cpu_data());
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
    spatial_dim, 1, -1, num_by_chans_.cpu_data(),
    spatial_sum_multiplier_.cpu_data(), 1., top_data);

Let’s see what caffe_cpu_gemm does:

template<>
void caffe_cpu_gemm<float>(const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
    const float alpha, const float* A, const float* B, const float beta,
    float* C) {
  int lda = (TransA == CblasNoTrans) ? K : M;
  int ldb = (TransB == CblasNoTrans) ? N : K;
  cblas_sgemm(CblasRowMajor, TransA, TransB, M, N, K, alpha, A, lda, B,
      ldb, beta, C, N);
}

caffe_cpu_gemm computes a general matrix product, C = alpha * A * B + beta * C, where A, B and C are matrices stored as flat one-dimensional arrays. The parameters are:

- CblasRowMajor: the data is row-major (two-dimensional data stored in a one-dimensional array);
- TransA, TransB: whether to transpose A and B (CblasTrans / CblasNoTrans);
- M: number of rows of A and C;
- N: number of columns of B and C;
- K: number of columns of A, number of rows of B;
- lda: leading dimension of A, i.e. K without transpose, M with transpose;
- ldb: leading dimension of B, i.e. N without transpose, K with transpose.

So:

The first call, (CblasNoTrans, CblasNoTrans, num, channels_, 1, 1, batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0., num_by_chans_.mutable_cpu_data()), computes num_by_chans_ = 1 * batch_sum_multiplier_ * mean_ + 0 * num_by_chans_, which broadcasts the per-channel mean to every sample in the batch. The second call, caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num, spatial_dim, 1, -1, num_by_chans_.cpu_data(), spatial_sum_multiplier_.cpu_data(), 1., top_data), computes top_data = -1 * num_by_chans_ * spatial_sum_multiplier_ + 1 * top_data, i.e. it subtracts the per-channel mean from every spatial position. (Earlier in Forward_cpu, if (bottom[0] != top[0]), the input was first copied into top_data with caffe_copy, so top_data holds the input X.)

So that’s a little bit clearer.

Then the variance is smoothed with eps and the square root is taken to get the standard deviation:

// normalize variance
caffe_add_scalar(variance_.count(), eps_, variance_.mutable_cpu_data());
caffe_sqrt(variance_.count(), variance_.cpu_data(),
    variance_.mutable_cpu_data());

caffe_add_scalar is implemented as follows: it adds alpha to every element of Y. So what happens here is that a very small eps is added to each variance to keep it from being zero. caffe_sqrt then takes the element-wise square root, so after this pair of operations the standard deviation is stored in variance_.cpu_data().

template <>
void caffe_add_scalar(const int N, const float alpha, float* Y) {
  for (int i = 0; i < N; ++i) {
    Y[i] += alpha;
  }
}

After subtracting the mean, the feature map data sits in top_data and the standard deviation sits in variance_.cpu_data(). Next, the standard deviation is broadcast to the input size and divided out:

// replicate variance to input size
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
    batch_sum_multiplier_.cpu_data(), variance_.cpu_data(), 0.,
    num_by_chans_.mutable_cpu_data());
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
    spatial_dim, 1, 1., num_by_chans_.cpu_data(),
    spatial_sum_multiplier_.cpu_data(), 0., temp_.mutable_cpu_data());
caffe_div(temp_.count(), top_data, temp_.cpu_data(), top_data);
// TODO(cdoersch): The caching is only needed because later in-place layers
//                 might clobber the data.  Can we skip this if they won't?
caffe_copy(x_norm_.count(), top_data, x_norm_.mutable_cpu_data());

The key is:

caffe_div(temp_.count(), top_data, temp_.cpu_data(), top_data);

Take a look at caffe_div’s implementation: it performs element-wise division (vsDiv is Intel MKL’s vector division routine). So this divides the mean-subtracted data by the broadcast standard deviation, which completes the normalization.

template <>
void caffe_div<float>(const int n, const float* a, const float* b,
    float* y) {
  vsDiv(n, a, b, y);
}

Caffe’s BN Forward_cpu ends at this point. You can see that there is only a normalization operation, with no scaling or bias; this further confirms that Caffe’s BN is broken down into BatchNorm + Scale.
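To complete the picture, here is a minimal sketch of what the companion Scale layer (with bias_term: true) computes in its forward pass for an NCHW blob. This is an illustrative loop, not the actual scale_layer.cpp, which does the broadcasting with BLAS just as BatchNormLayer does:

template <typename Dtype>
void scale_forward_sketch(const Dtype* bottom, Dtype* top,
    int num, int channels, int spatial_dim,
    const Dtype* gamma, const Dtype* beta) {
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels; ++c) {
      const int offset = (n * channels + c) * spatial_dim;
      for (int s = 0; s < spatial_dim; ++s) {
        // per-channel scaling + bias: y = gamma[c] * x + beta[c]
        top[offset + s] = gamma[c] * bottom[offset + s] + beta[c];
      }
    }
  }
}

Together with the normalization above, this yields exactly the full BN computation of other frameworks.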

3. Caffe BN Forward_gpu implementation

High-performance inference depends on CUDA, so let’s look at Caffe’s BN Forward_gpu implementation. The logic is exactly the same as the CPU path; here is the whole function first:

template <typename Dtype>
void BatchNormLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->gpu_data();
  Dtype* top_data = top[0]->mutable_gpu_data();
  int num = bottom[0]->shape(0);
  int spatial_dim = bottom[0]->count()/(channels_*bottom[0]->shape(0));
  if (bottom[0] != top[0]) {
    caffe_copy(bottom[0]->count(), bottom_data, top_data);
  }

  if (use_global_stats_) {
    // use the stored mean/variance estimates.
    const Dtype scale_factor = this->blobs_[2]->cpu_data()[0] == 0 ?
        0 : 1 / this->blobs_[2]->cpu_data()[0];
    caffe_gpu_scale(variance_.count(), scale_factor,
        this->blobs_[0]->gpu_data(), mean_.mutable_gpu_data());
    caffe_gpu_scale(variance_.count(), scale_factor,
        this->blobs_[1]->gpu_data(), variance_.mutable_gpu_data());
  } else {
    // compute mean
    caffe_gpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
        1. / (num * spatial_dim), bottom_data,
        spatial_sum_multiplier_.gpu_data(), 0.,
        num_by_chans_.mutable_gpu_data());
    caffe_gpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
        num_by_chans_.gpu_data(), batch_sum_multiplier_.gpu_data(), 0.,
        mean_.mutable_gpu_data());
  }

  // subtract mean
  caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.gpu_data(), mean_.gpu_data(), 0.,
      num_by_chans_.mutable_gpu_data());
  caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, -1, num_by_chans_.gpu_data(),
      spatial_sum_multiplier_.gpu_data(), 1., top_data);

  if (!use_global_stats_) {
    // compute variance using var(X) = E((X-EX)^2)
    caffe_gpu_mul(top[0]->count(), top[0]->gpu_data(), top[0]->gpu_data(),
        temp_.mutable_gpu_data());  // (X-EX)^2
    caffe_gpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
        1. / (num * spatial_dim), temp_.gpu_data(),
        spatial_sum_multiplier_.gpu_data(), 0.,
        num_by_chans_.mutable_gpu_data());
    caffe_gpu_gemv<Dtype>(CblasTrans, num, channels_, Dtype(1.),
        num_by_chans_.gpu_data(), batch_sum_multiplier_.gpu_data(), Dtype(0.),
        variance_.mutable_gpu_data());  // E((X-EX)^2)

    // compute and save moving average
    this->blobs_[2]->mutable_cpu_data()[0] *= moving_average_fraction_;
    this->blobs_[2]->mutable_cpu_data()[0] += 1;
    caffe_gpu_axpby(mean_.count(), Dtype(1), mean_.gpu_data(),
        moving_average_fraction_, this->blobs_[0]->mutable_gpu_data());
    int m = bottom[0]->count()/channels_;
    Dtype bias_correction_factor = m > 1 ? Dtype(m)/(m-1) : 1;
    caffe_gpu_axpby(variance_.count(), bias_correction_factor,
        variance_.gpu_data(), moving_average_fraction_,
        this->blobs_[1]->mutable_gpu_data());
  }

  // normalize variance
  caffe_gpu_add_scalar(variance_.count(), eps_, variance_.mutable_gpu_data());
  caffe_gpu_sqrt(variance_.count(), variance_.gpu_data(),
      variance_.mutable_gpu_data());

  // replicate variance to input size
  caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.gpu_data(), variance_.gpu_data(), 0.,
      num_by_chans_.mutable_gpu_data());
  caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, 1., num_by_chans_.gpu_data(),
      spatial_sum_multiplier_.gpu_data(), 0., temp_.mutable_gpu_data());
  caffe_gpu_div(temp_.count(), top_data, temp_.gpu_data(), top_data);
  // TODO(cdoersch): The caching is only needed because later in-place layers
  //                 might clobber the data.  Can we skip this if they won't?
  caffe_copy(x_norm_.count(), top_data, x_norm_.mutable_gpu_data());
}

Logically, Forward_gpu is identical to Forward_cpu; the CPU math routines are simply swapped for their GPU counterparts: caffe_gpu_scale, caffe_gpu_gemv, caffe_gpu_gemm, and so on.

So there isn’t much more to say about the Forward_gpu code; if you understand Forward_cpu, you understand it too.

The above has covered Caffe’s BN and Scale operator implementations and some experience along the way. I hope my sharing can be helpful to your learning.

