1. What is deep learning?

Deep learning, a branch of machine learning, is a family of algorithms that use artificial neural networks as the framework for learning representations of information [1]. In general, simple representations (such as image edges) are learned first; higher-level representations (such as image corners, and then parts of objects) are then learned on top of them; and finally, high-level representations suited to a specific task are obtained. Deep learning gets its name from the fact that the learned representations generally span many layers.

The convolutional neural network (CNN) is a type of neural network commonly used in deep learning. Its architecture makes it well suited to image processing tasks, and CNNs have achieved great success in fields such as image classification, object recognition, and super-resolution.

2. What is loop filtering?

Lossy image/video compression introduces an irreversible quantization operation, which causes distortion between the reconstructed image and the original image. Loop filtering reduces this distortion by filtering the reconstructed images, thereby improving compression efficiency and reconstruction quality. Common filters in video coding standards include the Deblocking Filter [2], Sample Adaptive Offset (SAO) [3], and the Adaptive Loop Filter (ALF) [4]. Note that the image quality enhancement problem here is ill-posed: a distorted image can correspond to an infinite number of original images, and almost no filter can perfectly restore the original.

3. Why can deep learning do loop filtering?

The most common way to solve an ill-posed problem is to turn prior knowledge about the signal into regularization constraints that shrink the solution space. The convolutional neural network architecture itself has been shown to capture the prior knowledge of natural images very well [5], so using it to filter images amounts to applying an implicit regularization. In addition, a CNN can learn further prior knowledge useful for a specific task (here, compression distortion) from massive training data and establish a mapping from distorted images to original images, thereby enhancing image quality.

4. How is it done?

Reference [6] was an early attempt to replace the loop filters (deblocking and SAO) in HEVC (High Efficiency Video Coding) with a CNN; it enhanced the quality of the compressed video and achieved good results.

To accomplish this task, training data must be prepared first, namely distorted images and the corresponding original images, so that the CNN can capture the mapping between them during training. The authors therefore used HEVC to compress the training set images and obtain distorted reconstructed images. Pairing these distorted images with the corresponding original images yields a large number of training samples of the form {distorted image, original image}.

Next, the network structure needs to be determined. Here, the CNN accepts a distorted image of size MxM as input and outputs a filtered image of size MxM. The loop filtering task therefore belongs to the category of image quality enhancement, and can borrow from the network architectures used by low-level computer vision tasks such as denoising and super-resolution. In [6], the authors designed the 4-layer CNN shown in Figure 1, whose specific configuration is given in Table 1. Since the input and output resolutions of such tasks are the same, the network structure generally does not include downsampling, in order to preserve high-precision spatial location information (note that some networks do use downsampling to enlarge the spatial receptive field, and maintain high-precision reconstruction by increasing the number of convolutional channels). In the middle two layers of the network, convolution kernels of two different sizes are used to obtain multi-scale features. In addition, the network contains a skip connection that adds the input image to the output of the last CNN layer. In this way, what the convolutional layers actually learn to output is the difference between the original image and the reconstructed image, which reflects the idea of residual learning [7]. The benefit of residual learning is that it speeds up convergence and improves performance after convergence.

Figure 1. Network structure in reference [6], a 4-layer fully convolutional network

Table 1. Network configuration in reference [6]
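The structure described above can be sketched in PyTorch as follows. This is only an illustration under stated assumptions, not the configuration of [6]: the class name, the channel counts, the kernel sizes (5, 3 and 1), and the ReLU activation are placeholders chosen for the example, and Table 1 of [6] remains the authority for the real values. What the sketch does show is the three ideas from the text: no downsampling, two kernel sizes in each of the middle layers, and a skip connection that makes the convolutional layers predict a residual.

```python
import torch
import torch.nn as nn


class FourLayerResidualCNN(nn.Module):
    """Illustrative 4-layer fully convolutional filter with a global skip connection.

    All channel counts and kernel sizes are assumptions made for this sketch.
    """

    def __init__(self, channels: int = 16):
        super().__init__()
        self.act = nn.ReLU()
        # Layer 1: feature extraction from a single-channel (e.g. luma) image.
        self.conv1 = nn.Conv2d(1, channels, kernel_size=5, padding=2)
        # Layers 2 and 3: two parallel branches with different kernel sizes,
        # whose outputs are concatenated to form multi-scale features.
        self.conv2_a = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.conv2_b = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv3_a = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.conv3_b = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Layer 4: maps the features back to a single-channel residual image.
        self.conv4 = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.act(self.conv1(x))
        f2 = torch.cat([self.act(self.conv2_a(f1)), self.act(self.conv2_b(f1))], dim=1)
        f3 = torch.cat([self.act(self.conv3_a(f2)), self.act(self.conv3_b(f2))], dim=1)
        residual = self.conv4(f3)
        # Skip connection: filtered image = distorted input + predicted residual.
        return x + residual


model = FourLayerResidualCNN()
patch = torch.rand(1, 1, 64, 64)  # one MxM patch with M = 64 (arbitrary)
filtered = model(patch)
print(filtered.shape)  # torch.Size([1, 1, 64, 64])
```

Because every convolution uses stride 1 with padding that matches its kernel size, the output has the same MxM resolution as the input, which is what allows the element-wise addition in the last line of `forward`.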

To drive CNN training, a loss function is also needed. The authors adopt the mean squared error (MSE) loss, in which N is the number of training samples, X and Y denote the distorted images and the corresponding original images, respectively, and 𝜃 denotes all parameters (weights and biases) of the convolutional network.
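Written out with the symbols just defined, an MSE loss of this kind typically takes the form below. The symbol F for the mapping computed by the network and the sample index i are notation assumed here, and the exact normalization constant used in [6] may differ.

```latex
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| F(X_i; \theta) - Y_i \right\|_2^2
```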

At this point, the network weights can be updated by stochastic gradient descent, which is usually handled automatically by deep learning frameworks (such as PyTorch). The parameters obtained after training are optimal in the average sense over the whole training set. The reader may ask: if the parameters are optimal for the training set, how well do they perform on new data? In general, we assume that the training set and the test set are sampled from the same distribution, which ensures that the CNN performs similarly on the test set. To improve generalization, the training set is usually required to contain enough samples to reflect the real data distribution. There are also other ways to keep the network from overfitting, such as imposing an L1-norm or L2-norm constraint on the CNN parameters (i.e., 𝜃 above). Methods of this kind are called regularization, which is an important research area in deep learning.

The authors trained one model for each QP point, namely QP ∈ {22, 27, 32, 37}, and then integrated the models into the reference software for testing. The final performance is shown in Table 2 below. The anchor for comparison is HM-16.0, configured as all-intra. Since the tests were performed only in the all-intra configuration, the CNN filter can also be regarded as a post-processing operation. If the filtered frame is used as a reference for encoding subsequent frames, it becomes a loop filter.

Table 2. BD-rate savings of the algorithm in [6] compared with HEVC
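A minimal PyTorch sketch of such a training step is shown below, assuming a tiny stand-in network and synthetic data rather than the setup of [6]. The two-layer model, the learning rate, the `weight_decay` value, and the noisy random tensors are all assumptions for illustration. `nn.MSELoss` averages over every pixel in the batch, and the `weight_decay` argument of `torch.optim.SGD` adds an L2-norm penalty on the parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in filter network (hypothetical; any image-to-image CNN would do).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
)

criterion = nn.MSELoss()
# weight_decay applies an L2-norm penalty to the parameters theta.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)


def train_step(distorted: torch.Tensor, original: torch.Tensor) -> float:
    """Run one gradient descent update on a batch of {distorted, original} pairs."""
    optimizer.zero_grad()
    loss = criterion(model(distorted), original)
    loss.backward()   # gradients are computed automatically by the framework
    optimizer.step()  # theta <- theta - lr * (gradient + weight_decay * theta)
    return loss.item()


# Synthetic stand-in data: random "originals" plus noise playing the role of distortion.
original = torch.rand(4, 1, 16, 16)
distorted = original + 0.05 * torch.randn_like(original)

losses = [train_step(distorted, original) for _ in range(5)]
print(losses)
```

In real training, each call to `train_step` would receive a different mini-batch drawn from the {distorted image, original image} pairs, which is what makes the gradient descent stochastic.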

5. JVET-U0068

Next, we focus on the loop filter design in a JVET proposal from this year [8]. The algorithm, named DAM (Deep-filters with Adaptive Model-selection), is designed for VTM-9.0, the reference software of the latest generation of video coding standard, and includes new features such as mode selection.

Figure 2. (a) CNN architecture used by DAM, where M represents the number of feature maps and N represents the spatial resolution of the feature maps; (b) structure of the residual unit.

Network architecture: The backbone of the DAM filter is shown in Figure 2. To enlarge the receptive field and reduce complexity, the network includes a convolution layer with a stride of 2, which halves the spatial resolution of the feature maps in both the horizontal and vertical directions relative to the input. The feature maps output by this layer then pass through several sequentially stacked residual units. The last convolution layer takes the feature maps of the last residual unit as input and outputs 4 sub-feature maps. Finally, a shuffle layer generates a filtered image with the same spatial resolution as the input. Other details of this architecture are as follows:

  1. For all the convolution layers, use a convolution kernel of 3 by 3. For the internal convolution layer, the number of feature maps is set to 128. For activation functions, use PReLU.

  2. Different models were trained for different slice types.

  3. When a convolutional neural network filter is trained for intra slices, prediction and partition information is also fed into the network.
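The resolution bookkeeping of this backbone can be illustrated in plain Python: a stride-2 convolution halves each spatial dimension, and the shuffle layer interleaves the 4 half-resolution sub-feature maps back into one full-resolution image. This is a sketch under assumptions, not the DAM implementation: the function names are made up for the example, the padding of 1 for the 3 by 3 stride-2 convolution is assumed, and the interleaving order follows the convention of common pixel-shuffle layers, which may differ from the order used in [8].

```python
def conv_out_size(n, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def pixel_shuffle(sub_maps, r=2):
    """Interleave r*r sub-feature maps of size HxW into one (H*r)x(W*r) image.

    Sample (i, j) of sub-map k is written to output position
    (i*r + k // r, j*r + k % r).
    """
    assert len(sub_maps) == r * r
    h, w = len(sub_maps[0]), len(sub_maps[0][0])
    out = [[0] * (w * r) for _ in range(h * r)]
    for k, plane in enumerate(sub_maps):
        dy, dx = divmod(k, r)
        for i in range(h):
            for j in range(w):
                out[i * r + dy][j * r + dx] = plane[i][j]
    return out


# A 64x64 input becomes 32x32 after the stride-2 layer.
print(conv_out_size(64))  # 32

# Four 2x2 sub-maps, each filled with its own index, become one 4x4 image.
sub_maps = [[[k, k], [k, k]] for k in range(4)]
for row in pixel_shuffle(sub_maps):
    print(row)
# [0, 1, 0, 1]
# [2, 3, 2, 3]
# [0, 1, 0, 1]
# [2, 3, 2, 3]
```

Running the residual units at half resolution cuts the number of positions each convolution has to process by a factor of 4, which is the complexity reduction the text refers to.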

Adaptive model selection: First, each slice or CTU can decide whether or not to use the convolutional neural network-based filter. Second, when a slice or CTU decides to use it, it can further choose which of three candidate models to apply. For this purpose, different models are trained for the QP values in {17, 22, 27, 32, 37, 42}. With the QP used to encode the current sequence denoted as q, the candidates are the three models trained for {q, q-5, q-10}. The selection is based on the rate-distortion cost, and the corresponding mode information is then written into the bitstream.

Inference: PyTorch is used to perform CNN inference online within VTM. Table 3 below shows the relevant network information for the inference stage.
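The selection logic can be sketched in a few lines of Python. This is a simplified illustration, not the encoder logic of [8]: the function names, the mode labels, the Lagrangian form J = D + lambda * R of the rate-distortion cost, and every number in the example are assumptions.

```python
def candidate_qps(q):
    """QPs of the three candidate models for a sequence encoded at QP q."""
    return [q, q - 5, q - 10]


def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R (assumed form)."""
    return distortion + lam * rate_bits


def choose_filter_mode(options, lam):
    """Return the name of the option with the lowest rate-distortion cost.

    options: list of (mode_name, distortion, signaling_bits) tuples, one for
    "filter off" and one for each candidate model.
    """
    best = min(options, key=lambda opt: rd_cost(opt[1], opt[2], lam))
    return best[0]


q = 32
print(candidate_qps(q))  # [32, 27, 22]

# Made-up distortions and signaling costs for one CTU.
options = [
    ("off", 1000.0, 1),
    ("cnn_qp32", 900.0, 3),
    ("cnn_qp27", 880.0, 3),
    ("cnn_qp22", 910.0, 3),
]
print(choose_filter_mode(options, lam=10.0))  # cnn_qp27
```

With these numbers the costs are 1010, 930, 910 and 940, so the model trained for q-5 is selected. The decoder does not repeat the search; it reads the signaled mode from the bitstream.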


Table 3. Network information for inference

Training: With PyTorch as the training platform, separate convolutional neural network filters were trained for intra slices and inter slices, and different models were trained to suit different QP points. The network information for the training stage is listed in Table 4.


Table 4. Network information during training

Experimental results: The authors integrated DAM into VTM-9.0 for testing. Of the existing filters, the deblocking filter and SAO were turned off, while ALF (and CCALF) were placed after the convolutional neural network-based filter. The test results are shown in Tables 5 to 7. Under the AI, RA, and LDB configurations, the BD-rate savings for the Y, Cb, and Cr channels are {8.33%, 23.11%, 23.55%}, {10.28%, 28.22%, 27.97%}, and {9.46%, 23.74%, 23.09%}, respectively. In addition, Figure 3 shows a subjective comparison before and after the DAM algorithm is applied.

RA
            Y          U          V          EncT     DecT
Class A1    -9.25%     -22.08%    -22.88%    248%     21338%
Class A2    -11.20%    -29.17%    -28.76%    238%     19116%
Class B     -9.79%     -31.05%    -29.57%    248%     23968%
Class C     -10.97%    -28.59%    -29.18%    196%     21502%
Class E
Overall     -10.28%    -28.22%    -27.97%    231%     21743%
Class D     -12.17%    -28.99%    -30.27%    187%     20512%
Class F     -5.27%     -16.93%    -16.58%    337%     9883%

Table 5. Performance of DAM on VTM-9.0 (RA)

LDB
            Y          U          V          EncT       DecT
Class A1
Class A2
Class B     -8.69%     -24.51%    -25.26%    235%       10572%
Class C     -10.21%    -24.50%    -24.60%    189%       11145%
Class E     -9.75%     -21.46%    -17.47%    455%       8730%
Overall     -9.46%     -23.74%    -23.09%    258%       10257%
Class D     -11.56%    -26.59%    -27.98%    182%       12071%
Class F     -5.05%     -15.83%    -15.41%    331.70%    5723%

Table 6. Performance of DAM on VTM-9.0 (LDB)

AI
            Y          U          V          EncT     DecT
Class A1    -7.04%     -19.72%    -20.94%    423%     14896%
Class A2    -7.32%     -24.01%    -22.73%    268%     13380%
Class B     -7.48%     -24.24%    -24.06%    240%     13606%
Class C     -8.90%     -22.61%    -25.08%    176%     10061%
Class E     -11.30%    -24.37%    -24.11%    274%     14814%
Overall     -8.33%     -23.11%    -23.55%    257%     13065%
Class D     -8.75%     -22.68%    -24.96%    158%     10571%
Class F     -5.03%     -15.94%    -15.38%    170%     9364%

Table 7. Performance of DAM on VTM-9.0 (AI)

Figure 3. Left: original image (from the JVET sequence BlowingBubbles); middle: VTM-11.0 compression, QP 42, reconstruction quality 27.78 dB; right: VTM-11.0 + DAM, QP 42, reconstruction quality 28.02 dB.

6. Summary and outlook

In recent years, research on loop filtering (including post-processing) has mainly focused on the following aspects, among others:

  1. Use more input information. For example, in addition to reconstructed frames, prediction and partition information can also be used as CNN input [9];
  2. Use a more complex network structure [9, 10, 11];
  3. Exploit the correlation between video frames to improve performance [12, 13];
  4. Design unified models for different quality levels [14].

In general, deep learning-based coding tools are on the rise. While they deliver attractive performance, they also bring high complexity. For these deep coding tools to be deployed in practice, optimizing the trade-off between complexity and performance will be an important research direction.

7. References

[1] “Deep learning,” Wikipedia, the free encyclopedia, 10 Mar. 2021. Web. 10 Mar. 2021. ‹zh.wikipedia.org/w/index.php…›

[2] Norkin, Andrey, et al. “HEVC deblocking filter,” IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012): 1746-1754.

[3] Fu, Chih-Ming, et al. “Sample adaptive offset in the HEVC standard,” IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012): 1755-1764.

[4] Tsai, Chia-Yang, et al. “Adaptive loop filtering for video coding,” IEEE Journal of Selected Topics in Signal Processing 7.6 (2013): 934-945.

[5] Ulyanov, Dmitry, et al. “Deep image prior,” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[6] Dai, Yuanying, et al. “A convolutional neural network approach for post-processing in HEVC intra coding,” International Conference on Multimedia Modeling. Springer, Cham, 2017.

[7] He, Kaiming, et al. “Deep residual learning for image recognition,” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[8] Li, Yue, et al. “Convolutional neural network-based in-loop filter with adaptive model selection,” JVET-U0068. Teleconference, 2021.

[9] Lin, Weiyao, et al. “Partition-aware adaptive switching neural networks for post-processing in HEVC,” IEEE Transactions on Multimedia 22.11 (2019): 2749-2763.

[10] Ma, Di, et al. “MFRNet: a new CNN architecture for post-processing and in-loop filtering,” IEEE Journal of Selected Topics in Signal Processing (2020).

[11] Zhang, Yongbing, et al. “Residual highway convolutional neural networks for in-loop filtering in HEVC,” IEEE Transactions on Image Processing 27.8 (2018): 3827-3841.

[12] Yang, Ren, et al. “Multi-frame quality enhancement for compressed video,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[13] Li, Tianyi, et al. “A deep learning approach for multi-frame in-loop filter of HEVC,” IEEE Transactions on Image Processing 28.11 (2019): 5663-5678.

[14] Zhou, Lulu, et al. “Convolutional neural network filter (CNNF) for intra frame,” JVET-I0022. Gwangju, 2018.