Learning-based coding (IV): WSE-CNNLF

The algorithm described here comes from JVET-N0133, which proposes a CNN model, WSE-CNNLF (Wide-Activated Squeeze-and-Excitation Neural Network Loop Filter), for loop filtering.

Unlike the algorithms in the previous posts, WSE-CNNLF does not add a stage to VVC loop filtering or replace one of its stages; it replaces VVC loop filtering entirely (DBF, SAO and ALF).

Network inputs and outputs

WSE-CNNLF has six inputs: the three YUV reconstructed components and three auxiliary inputs.

(1) Three reconstructed components

The three reconstructed YUV components are first normalized to [0, 1] according to the bit depth and then input to the network. The normalization method is as follows:

 
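The formula itself is not reproduced here; a common form of bit-depth normalization (an assumption, not copied from the proposal) is:

$$
\hat{I}(x, y) = \frac{I_{rec}(x, y)}{2^{\text{bitdepth}} - 1}
$$

so that, for example, a 10-bit sample value of 1023 maps to 1.0.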

(2) Three auxiliary inputs

The first auxiliary input is the QP; different QPs cause different levels of distortion. With QP as an input, a single model can be trained to handle all QPs. The QP is normalized into a QPmap:

 
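The exact normalization constant is not reproduced here; a plausible form (an assumption, using the maximum VVC QP of 63) is:

$$
\text{QPmap}(x, y) = \frac{QP}{63}
$$

i.e., a constant plane with the same spatial size as the corresponding input component.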

The other two auxiliary inputs are the CU partition information of the luma (Y) component and of the chroma (UV) components. Because blocking artifacts are mainly caused by block partitioning, feeding CU boundary information to the network lets its attention mechanism handle those boundaries more effectively. The CU partition information is converted into CUmaps, normalized, and then input to the network: for each CU of a frame, boundary positions are filled with 2 and all other positions with 1, as shown below. With a normalization factor of 2, two CUmaps are obtained, a Y-CUmap and a UV-CUmap.

 
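A minimal sketch of how such a CUmap could be built; the function name and the (x, y, w, h) format of the CU list are illustrative, not taken from the proposal:

```python
import numpy as np

def build_cu_map(height, width, cu_list, norm=2.0):
    """Build a normalized CUmap: CU boundary samples get 2, interior samples get 1."""
    cu_map = np.ones((height, width), dtype=np.float32)
    for (x, y, w, h) in cu_list:
        # Mark the four edges of each CU with the value 2.
        cu_map[y, x:x + w] = 2.0
        cu_map[min(y + h - 1, height - 1), x:x + w] = 2.0
        cu_map[y:y + h, x] = 2.0
        cu_map[y:y + h, min(x + w - 1, width - 1)] = 2.0
    return cu_map / norm  # normalize by the factor 2 described above

# Example: a 64x64 area split into four 32x32 CUs.
cus = [(0, 0, 32, 32), (32, 0, 32, 32), (0, 32, 32, 32), (32, 32, 32, 32)]
y_cu_map = build_cu_map(64, 64, cus)
```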

Processing module

Wide-activated convolution performs well in super-resolution and denoising tasks. It consists of a wide 3×3 convolution followed by a ReLU and a narrow 1×1 convolution.
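
A minimal PyTorch sketch of such a unit; the class name and the widening factor k = 4 are illustrative, not taken from the proposal:

```python
import torch
import torch.nn as nn

class WideActivationConv(nn.Module):
    """Wide 3x3 convolution -> ReLU -> narrow 1x1 convolution."""

    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        # Expand to k*channels before the activation ("wide activation"),
        # then project back to the original channel count with a 1x1 conv.
        self.wide = nn.Conv2d(channels, k * channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.narrow = nn.Conv2d(k * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.narrow(self.act(self.wide(x)))
```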

SE (Squeeze-and-Excitation) is used to weight the output of each convolutional layer. It can exploit the relationships between different channels and generates a weighting factor for each channel.

Given an H×W×C feature map, SE consists of the following four steps:

(1) A single value is obtained for each channel by Global Average Pooling (GAP).

 
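For an H×W×C feature map x, the standard GAP formula is:

$$
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{i,j,c}, \qquad c = 1, \dots, C
$$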

(2) A fully connected layer followed by a ReLU adds the necessary nonlinearity; the number of output channels is also reduced by a ratio r.

(3) A second fully connected layer followed by a sigmoid gives each channel a smooth gating ratio in the range [0, 1].

(4) Finally, each channel is scaled by its gating ratio.

 
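In the standard Squeeze-and-Excitation formulation that steps (2)–(4) describe, with δ the ReLU, σ the sigmoid, and $W_1 \in \mathbb{R}^{(C/r)\times C}$, $W_2 \in \mathbb{R}^{C\times(C/r)}$ the two fully connected layers:

$$
s = \sigma\!\left(W_2\,\delta(W_1 z)\right), \qquad \tilde{x}_{i,j,c} = s_c \cdot x_{i,j,c}
$$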

In the figure, k denotes the widening ratio of the wide convolution and r denotes the reduction ratio of the SE module.

For a basic processing unit, if the numbers of input and output channels are the same, the input and output can be connected by a skip connection so that the unit directly learns a residual, which accelerates convergence, as shown in the upper-left figure.
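
Putting the pieces together, here is a self-contained PyTorch sketch of one such basic processing unit, a wide-activation convolution followed by SE and a skip connection; the channel count and the defaults k = 4, r = 4 are assumptions:

```python
import torch
import torch.nn as nn

class WideSEBlock(nn.Module):
    """Wide-activation conv + Squeeze-and-Excitation + skip connection (sketch)."""

    def __init__(self, channels: int, k: int = 4, r: int = 4):
        super().__init__()
        # Wide-activation convolution: wide 3x3 -> ReLU -> narrow 1x1.
        self.wide = nn.Conv2d(channels, k * channels, kernel_size=3, padding=1)
        self.narrow = nn.Conv2d(k * channels, channels, kernel_size=1)
        # Squeeze-and-Excitation: GAP -> FC (reduce by r) -> ReLU -> FC -> sigmoid.
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.narrow(torch.relu(self.wide(x)))
        n, c, _, _ = y.shape
        s = self.fc(self.gap(y).view(n, c)).view(n, c, 1, 1)
        # Scale each channel by its gating ratio, then add the skip connection.
        return x + y * s

# The block keeps the channel count, so it can be stacked freely.
block = WideSEBlock(channels=32)
out = block(torch.randn(1, 32, 64, 64))
```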

The network structure

The WSE-CNNLF pipeline consists of three stages; its workflow is shown in the figure below.

 

(1) Stage 1

The three YUV components are each processed by wide-SE blocks, and the corresponding CUmap is fused in by multiplying it pixel by pixel with the matching feature channels. Since U and V are only half the size of Y, size alignment is required.

(2) Stage 2

The feature maps of the three components are concatenated and then processed by several wide-SE blocks.

(3) Stage 3

The three components are processed separately to generate the final residual images, which are then added to the network's input (the reconstructed image).
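
For illustration only, here is a rough, self-contained PyTorch sketch of the three-stage data flow. The channel counts, the number of blocks, where the QPmap is injected, and the use of average pooling / bilinear interpolation for size alignment are all assumptions rather than details from JVET-N0133, and plain convolutions stand in for the wide-SE blocks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSECNNLFSketch(nn.Module):
    def __init__(self, ch: int = 32, n_blocks: int = 4):
        super().__init__()
        # Stand-in for a stack of wide-SE blocks (see the block sketch above).
        def trunk(in_ch, out_ch, n):
            layers = [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            for _ in range(n):
                layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)

        # Stage 1: one branch per component (the QPmap is concatenated with each input here).
        self.branch_y = trunk(2, ch, 1)
        self.branch_u = trunk(2, ch, 1)
        self.branch_v = trunk(2, ch, 1)
        # Stage 2: joint processing of the concatenated feature maps.
        self.fuse = trunk(3 * ch, ch, n_blocks)
        # Stage 3: one residual head per component.
        self.head_y = nn.Conv2d(ch, 1, 3, padding=1)
        self.head_u = nn.Conv2d(ch, 1, 3, padding=1)
        self.head_v = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, y, u, v, qp_map, y_cu_map, uv_cu_map):
        # Stage 1: per-component features, modulated by the CUmaps (pixel-wise product).
        fy = self.branch_y(torch.cat([y, qp_map], dim=1)) * y_cu_map
        fu = self.branch_u(torch.cat([u, F.avg_pool2d(qp_map, 2)], dim=1)) * uv_cu_map
        fv = self.branch_v(torch.cat([v, F.avg_pool2d(qp_map, 2)], dim=1)) * uv_cu_map
        # Size alignment: bring the chroma features up to luma resolution.
        fu = F.interpolate(fu, scale_factor=2, mode="bilinear", align_corners=False)
        fv = F.interpolate(fv, scale_factor=2, mode="bilinear", align_corners=False)
        # Stage 2: concatenate and process jointly.
        f = self.fuse(torch.cat([fy, fu, fv], dim=1))
        # Stage 3: predict a residual per component and add it to the reconstructed input.
        ry = self.head_y(f)
        ru = F.avg_pool2d(self.head_u(f), 2)
        rv = F.avg_pool2d(self.head_v(f), 2)
        return y + ry, u + ru, v + rv

# Example with a 64x64 luma block and 32x32 chroma blocks.
net = WSECNNLFSketch()
y, u, v = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
qp_map = torch.full((1, 1, 64, 64), 32 / 63)
y_cu, uv_cu = torch.ones(1, 1, 64, 64), torch.ones(1, 1, 32, 32)
out_y, out_u, out_v = net(y, u, v, qp_map, y_cu, uv_cu)
```

The pixel-wise multiplication with the CUmaps implements the Stage 1 fusion described above, and the final additions mean the network only has to learn residual corrections to the reconstructed frame.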

Training and Results

The network configuration during training is as follows:

 

 

The above compares WSE-CNNLF with VTM-4.0 when DBF, SAO and ALF are disabled and only WSE-CNNLF is enabled.

If you are interested, please follow the WeChat public account "Video Coding".