Learning-based coding (I): loop filtering using CNN

This article introduces a technique for using a CNN in VVC loop filtering. The algorithm comes from JVET-O0056.

CNNLF

The convolutional network model, called CNNLF, is applied after the ALF stage of VVC loop filtering, as shown in the figure below.

 

The input of CNNLF is the output of ALF, referred to here as the reconstructed pixels. The output of CNNLF is referred to as the restored pixels.

To reduce the cost of transmitting the CNN parameters and the decoding time of CNNLF, the network consists of only three layers, and the luma and chroma components share the same network, as shown in the figure below.

 

Each layer is labeled MxNxPxQ, where:

  • M: width of the filter kernel (horizontal)
  • N: height of the filter kernel (vertical)
  • P: number of input channels
  • Q: number of filters in the layer
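Because the parameters are transmitted in the bitstream, the MxNxPxQ notation maps directly to a parameter cost. The sketch below counts the filter weights of a layer as M·N·P·Q. The layer shapes in `example_layers` are hypothetical and chosen only for illustration; they are not the values from the proposal.

```python
def conv_weight_count(m, n, p, q):
    """Number of filter weights in a layer described as MxNxPxQ."""
    return m * n * p * q

# Hypothetical layer shapes (assumption, for illustration only).
example_layers = [(3, 3, 6, 8), (3, 3, 8, 8), (3, 3, 8, 6)]

total = sum(conv_weight_count(*shape) for shape in example_layers)
print(total)  # 1440 weights for these assumed shapes
```

Bias and normalization parameters add a few values per filter on top of this count.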

The operations in each layer are convolution, batch normalization, bias addition, and ReLU. The last layer, however, has no ReLU.
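The layer structure can be sketched in NumPy as follows. This is a minimal illustration, not the proposal's implementation: the 3x3 kernels, the 8 intermediate channels, zero padding, the folded scale-and-shift form of batch normalization, and all names (`conv2d_same`, `cnnlf_forward`, `random_layer`) are assumptions made for the example.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded 'same' convolution (cross-correlation, as in CNN frameworks).

    x: (H, W, P) input. w: (N, M, P, Q) weights, i.e. kernel height,
    kernel width, input channels, number of filters. Kernel sizes must be odd.
    """
    n, m, _, q = w.shape
    h, wd = x.shape[:2]
    xp = np.pad(x, ((n // 2, n // 2), (m // 2, m // 2), (0, 0)))
    out = np.zeros((h, wd, q))
    for dy in range(n):
        for dx in range(m):
            # (H, W, P) @ (P, Q) -> (H, W, Q) for this kernel tap
            out += xp[dy:dy + h, dx:dx + wd, :] @ w[dy, dx]
    return out

def cnnlf_forward(x, layers):
    """Convolution, batch norm, bias, ReLU per layer; no ReLU on the last layer."""
    last = len(layers) - 1
    for i, layer in enumerate(layers):
        x = conv2d_same(x, layer["w"])
        # Batch normalization at inference time, folded into scale and shift.
        x = layer["bn_scale"] * x + layer["bn_shift"]
        x = x + layer["bias"]
        if i != last:
            x = np.maximum(x, 0.0)  # ReLU
    return x

def random_layer(rng, n, m, p, q):
    """Hypothetical helper that builds one layer with random weights."""
    return {
        "w": rng.normal(scale=0.1, size=(n, m, p, q)),
        "bn_scale": np.ones(q),
        "bn_shift": np.zeros(q),
        "bias": np.zeros(q),
    }

rng = np.random.default_rng(0)
# Assumed shapes: 6 input channels, 8 hidden channels, 6 output channels.
layers = [
    random_layer(rng, 3, 3, 6, 8),
    random_layer(rng, 3, 3, 8, 8),
    random_layer(rng, 3, 3, 8, 6),
]
out = cnnlf_forward(rng.normal(size=(64, 64, 6)), layers)
print(out.shape)  # (64, 64, 6)
```

With "same" padding the spatial size is preserved, so a 64x64x6 input yields a 64x64x6 output.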

CNNLF operation

Each CTU is processed by CNNLF independently. The CNN input is produced in the Packing stage and has 6 channels: 4 reconstructed luma blocks and 2 reconstructed chroma blocks. The 4 luma blocks are obtained by subsampling the luma CTB by 2×2 four times, once per sampling phase, which gives 4 sub-blocks that are each 1/4 the size of the luma CTB and the same size as a chroma block. These 4 luma sub-blocks and the 2 chroma blocks together form the 6-channel CNN input, as shown in the figure below.
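The Packing stage can be sketched with NumPy strided slicing. This is an illustrative sketch under assumptions: 4:2:0 content, a 128x128 luma CTB, and a raster order of the four sampling phases. The function name `pack_ctu` and the channel order are hypothetical.

```python
import numpy as np

def pack_ctu(luma, cb, cr):
    """Pack one CTU into a 6-channel array.

    luma: (2H, 2W) reconstructed luma CTB; cb, cr: (H, W) chroma blocks.
    Returns an (H, W, 6) array: 4 luma phases followed by Cb and Cr.
    The phase order (0,0), (0,1), (1,0), (1,1) is an assumption.
    """
    subs = [luma[dy::2, dx::2] for dy in (0, 1) for dx in (0, 1)]
    return np.stack(subs + [cb, cr], axis=-1)

luma = np.zeros((128, 128))
cb = np.zeros((64, 64))
cr = np.zeros((64, 64))
print(pack_ctu(luma, cb, cr).shape)  # (64, 64, 6)
```

Each luma sub-block keeps every second sample in both directions, so no luma sample is discarded: the four phases together cover the whole CTB.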

 

After the three CNN layers, the data size is still 64x64x6. In the Unpacking stage, the first four channels are reassembled into the luma CTB, and the last two channels form the two chroma CTBs. Finally, the output of CNNLF is the output of Unpacking plus the output of ALF, so the network effectively produces a correction that is added to the ALF output.
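Unpacking and the final addition can be sketched as follows. The names `unpack_ctu` and `apply_cnnlf` and the order of the four luma phases are assumptions for illustration, and the sketch omits any clipping to the valid sample range, which the text above does not describe.

```python
import numpy as np

def unpack_ctu(packed):
    """Inverse of the packing step: (H, W, 6) -> luma (2H, 2W), Cb (H, W), Cr (H, W).

    The phase order (0,0), (0,1), (1,0), (1,1) is an assumption.
    """
    h, w, _ = packed.shape
    luma = np.empty((2 * h, 2 * w), dtype=packed.dtype)
    phases = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for c, (dy, dx) in enumerate(phases):
        luma[dy::2, dx::2] = packed[..., c]  # interleave the sub-block back
    return luma, packed[..., 4], packed[..., 5]

def apply_cnnlf(alf_luma, alf_cb, alf_cr, cnn_out):
    """CNNLF output = unpacked CNN output + ALF output, per component."""
    res_luma, res_cb, res_cr = unpack_ctu(cnn_out)
    return alf_luma + res_luma, alf_cb + res_cb, alf_cr + res_cr

cnn_out = np.zeros((64, 64, 6))
y, u, v = apply_cnnlf(np.ones((128, 128)), np.ones((64, 64)), np.ones((64, 64)), cnn_out)
print(y.shape, u.shape, v.shape)  # (128, 128) (64, 64) (64, 64)
```

With an all-zero CNN output, the result equals the ALF output, which shows the additive (residual) structure.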

CNN parameter transmission

The CNN parameters are transmitted with the I frame of each RAS (random access segment, about 1 second), as shown in the figure below.

 

The CNN parameters are trained using only the pictures in temporal layers 0 and 1 of the RAS. To improve efficiency, when the encoder decides not to use CNNLF for the I frame, the CNN parameters are not transmitted and all remaining pictures in the RAS are set to CNNLF-off.

For each color component, the use of CNNLF can be decided at the picture level, the CTB level, and the 32×32 block level.

At the picture level, a syntax element with one of three values {PictureAllOff, PictureAllOn, CTBOnOff} is transmitted for each color component to indicate whether CNNLF is enabled. PictureAllOff indicates that CNNLF is not used for the current component of the current picture. PictureAllOn indicates that CNNLF is used for all pixels of the current component of the current picture. CTBOnOff indicates that the use of CNNLF is determined by each CTB's own syntax element.

The CTB level also has three values {CTBAllOff, CTBAllOn, BlockOnOff}. CTBAllOff indicates that CNNLF is not used for the current component of the current CTB. CTBAllOn indicates that CNNLF is used for all pixels of the current component of the current CTB. BlockOnOff indicates that the use of CNNLF is determined by each 32×32 block's own syntax element.

To reduce decoding time, if PictureAllOff is used for the luma component at the picture level, PictureAllOff is also used for both Cb and Cr. Likewise, at the CTB level, if CTBAllOff is used for the luma CTB, CTBAllOff is also used for both Cb and Cr.
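The three-level decision can be expressed as a small lookup. This is a sketch of the logic described above only: the enum classes, their numeric values, and the function names are hypothetical and do not reflect the actual bitstream coding of the syntax elements.

```python
from enum import Enum

class PictureMode(Enum):
    PICTURE_ALL_OFF = 0
    PICTURE_ALL_ON = 1
    CTB_ON_OFF = 2

class CtbMode(Enum):
    CTB_ALL_OFF = 0
    CTB_ALL_ON = 1
    BLOCK_ON_OFF = 2

def cnnlf_enabled(picture_mode, ctb_mode=None, block_flag=None):
    """Return True if CNNLF applies to a 32x32 block of one color component."""
    if picture_mode is PictureMode.PICTURE_ALL_OFF:
        return False
    if picture_mode is PictureMode.PICTURE_ALL_ON:
        return True
    # CTB_ON_OFF: the decision is delegated to the CTB level.
    if ctb_mode is CtbMode.CTB_ALL_OFF:
        return False
    if ctb_mode is CtbMode.CTB_ALL_ON:
        return True
    if ctb_mode is CtbMode.BLOCK_ON_OFF:
        return bool(block_flag)  # delegated to the 32x32 block flag
    raise ValueError("a CTB mode is required when the picture mode is CTB_ON_OFF")

def chroma_picture_mode(luma_mode, signalled_chroma_mode):
    """Chroma is forced off at picture level when luma is PictureAllOff."""
    if luma_mode is PictureMode.PICTURE_ALL_OFF:
        return PictureMode.PICTURE_ALL_OFF
    return signalled_chroma_mode

def chroma_ctb_mode(luma_ctb_mode, signalled_chroma_mode):
    """Chroma is forced off at CTB level when the luma CTB is CTBAllOff."""
    if luma_ctb_mode is CtbMode.CTB_ALL_OFF:
        return CtbMode.CTB_ALL_OFF
    return signalled_chroma_mode

print(cnnlf_enabled(PictureMode.CTB_ON_OFF, CtbMode.BLOCK_ON_OFF, True))  # True
```

The chroma helpers capture the dependency on luma: once luma is off at a given level, no chroma decision at that level has to be evaluated.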

Training phase

To reduce complexity, only the pictures in temporal layers 0 and 1 are used to train the CNN parameters, which means that only these pictures need to be encoded twice. The first encoding pass produces the training data for the CNN, and the second pass generates the final bitstream. The training information is as follows:

 

Inference phase

In the inference phase, the encoder and decoder do not need to rely on any external framework, as shown in the figure below.

 

Experimental results

The following figure shows the experimental results using CNNLF in VTM5.0.

 

In the RA configuration of VTM5.0, BD-rates of 1.20% (Y), -14.17% (Cb), and 14.11% (Cr) are reported, along with an increase in decoding time (117%).
