At the end of the day, every video compression method involves trade-offs: if you allow larger file sizes, you can have better image quality; if you want your files to be very small, you have to tolerate visible artifacts. It is now hoped that neural network-based approaches will provide a better trade-off between video file size and quality.

Any AI-enabled technology is seen as a harbinger of tomorrow, carrying a mysterious sense of the future that makes it tempting to take a closer look. Thanks to Professor Ma Zhan of Nanjing University, we were able to interview Liu Haojie, a PhD candidate there. His paper "Neural Video Coding Using Multiscale Motion Compensation and Spatiotemporal Context Model" was accepted at the top artificial intelligence conference AAAI 2020 and selected as a Poster Spotlight; an improved version has since been released as an open-source project (link: https://njuvision.github.io/N.).

Liu, who is now an exchange student at New York University's Tandon School of Engineering, arrived in New York just a day before the outbreak began and the United States closed its borders.

The following is compiled from LiveVideoStack's interview with Liu Haojie.

01 For Liu Haojie


LiveVideoStack: Why machine learning and neural coding?

Liu Haojie: First of all, my advisor has worked on traditional video coding for many years and has rich experience and accumulated technology in the field. When I enrolled for my master's degree in 2016, neural networks and deep learning were becoming increasingly popular, and deep learning-based coding was just getting started.

Given this dual opportunity, I began trying to combine the two, focusing on image and video coding based on deep learning; my main research direction has followed this path ever since.

LiveVideoStack: What is your current research at NYU?

Liu Haojie: At present, I am visiting Professor Yao Wang's Video Lab at New York University's Tandon School of Engineering, working mainly to deepen end-to-end image and video coding algorithms, refine each module in the end-to-end video coding framework, and better combine neural coding with visual tasks, so that the research results are more practical and oriented to real scenarios.

Of course, I have been exploring the design of some interesting neural video coding frameworks that are different from traditional frameworks.

02 For Neural Video Coding

LiveVideoStack: Can you talk more specifically about your end-to-end neural coding approach?

Liu Haojie:

1) From the image coding perspective, our method introduces a non-local module and a self-attention mechanism, which better extract local and non-local information; the implicit self-attention mechanism can also adaptively allocate bit rate.

2) Image coding was further developed by other students in the lab, for example fixed-point networks and single-model multi-rate coverage, making it more oriented toward practical application and deployment.

3) In combination with image segmentation, we have also integrated object-based image coding and analysis into our own system, so that our algorithm can achieve extremely high subjective visual quality at very low bit rates.

4) In the end-to-end video coding system, we combine our own non-local attention image coding algorithm (NLAIC) and use a temporal prediction model (ConvLSTM) to extract and aggregate temporal priors, which are fused with spatial priors to provide a better probability model that significantly reduces the bit rate.

5) In inter-frame prediction, we combine multi-scale motion estimation to generate a multi-scale motion field, and at the same time perform multi-scale motion compensation in the video feature domain, optimizing prediction performance step by step. This approach better handles hard cases such as occlusion and yields better video prediction performance.
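As a rough illustration of why the fused spatiotemporal prior in point 4) reduces bit rate: the entropy model assigns each quantized symbol a probability mass under a Gaussian, and a sharper, better-predicted Gaussian means fewer bits. The sketch below is hypothetical and is not the NLAIC/ConvLSTM implementation; all names are made up.

```python
import numpy as np
from math import erf, sqrt

def gaussian_bit_cost(y_hat, mean, scale):
    """Bits needed to code quantized symbols under a Gaussian whose
    mean/scale come from a (spatial or fused spatiotemporal) prior:
    probability mass of each integer bin [y - 0.5, y + 0.5]."""
    def cdf(v):
        return 0.5 * (1.0 + erf(v / sqrt(2.0)))
    bits = 0.0
    for y, m, s in zip(y_hat, mean, scale):
        p = cdf((y + 0.5 - m) / s) - cdf((y - 0.5 - m) / s)
        bits -= np.log2(max(p, 1e-9))
    return bits

# A sharper prediction (e.g. after fusing a temporal prior into the
# spatial one) concentrates probability on the actual symbols, so the
# same symbols cost fewer bits.
y = np.array([2.0, -1.0, 0.0])
bits_spatial = gaussian_bit_cost(y, mean=np.zeros(3), scale=np.full(3, 4.0))
bits_fused   = gaussian_bit_cost(y, mean=y,           scale=np.full(3, 0.5))
```

Here `bits_fused` comes out far smaller than `bits_spatial`, which is the whole point of conditioning the probability model on richer priors.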

LiveVideoStack: What was the most memorable difficulty in the research process?

Liu Haojie: Compared with pure image enhancement algorithms, the crux of video coding is estimating the rate of the encoded features and jointly optimizing that rate against the video reconstruction loss, i.e., rate-distortion optimization.

How to carry the mode selection of traditional video coding over to multi-frame optimization in an end-to-end system, that is, multi-frame rate-distortion optimization during training, is a difficult problem to solve.
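The objective Liu describes, trading the estimated rate of the encoded features against the reconstruction loss, is commonly written as L = R + λD. A minimal numpy sketch; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def rate_distortion_loss(symbol_probs, frame, frame_rec, lam=0.1):
    """Joint rate-distortion objective L = R + lambda * D: R is the
    estimated bits for the encoded features under the learned
    probability model, D is the reconstruction error of the frame."""
    rate = -np.sum(np.log2(symbol_probs))           # expected code length, bits
    distortion = np.mean((frame - frame_rec) ** 2)  # MSE reconstruction loss
    return rate + lam * distortion

# Toy example: four quantized feature symbols with modelled probabilities
probs = np.array([0.5, 0.25, 0.125, 0.125])         # rate = 1 + 2 + 3 + 3 = 9 bits
x     = np.array([1.0, 2.0, 3.0])
x_hat = np.array([1.1, 1.9, 3.2])
loss = rate_distortion_loss(probs, x, x_hat, lam=0.1)  # 9 + 0.1 * 0.02 = 9.002
```

In a real end-to-end codec both terms are differentiable, so the encoder, decoder, and probability model are trained jointly against this single loss.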

LiveVideoStack: In terms of your current research, what specific issues still need to be addressed?

Liu Haojie:

1) Inter-frame coding is a very important part of video coding. Obtaining better predicted frames from already-coded frames under a limited bit-rate budget is a key problem.

2) Designing better probability prediction models based on spatiotemporal information.

3) Designing better multi-frame rate-distortion optimization. It can effectively address error accumulation and propagation in the actual coding process, which has a great impact on final coding performance.
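Point 3) can be made concrete with a toy simulation: when each frame is predicted from the previous reconstruction rather than the previous original, quantization error propagates forward, so the loss must be summed over the whole group of frames. A hypothetical sketch; the residual coder here is a crude stand-in, not a learned model:

```python
import numpy as np

def toy_encode(residual, step=0.5):
    """Stand-in for a learned residual coder: uniform quantization,
    with the count of nonzero levels as a crude rate proxy."""
    q = np.round(residual / step)
    return q * step, float(np.count_nonzero(q))

def multiframe_rd_loss(frames, encode, lam=0.1):
    """Sum R + lambda * D over a group of frames. Each frame is
    predicted from the previous *reconstruction*, so quantization error
    propagates to later frames; joint multi-frame optimization sees the
    accumulation that per-frame optimization ignores."""
    total_rate, total_dist = 0.0, 0.0
    prev_rec = frames[0]                # assume a losslessly coded first frame
    for frame in frames[1:]:
        residual = frame - prev_rec     # temporal prediction residual
        rec_residual, rate = encode(residual)
        rec = prev_rec + rec_residual   # reconstruction = next reference
        total_rate += rate
        total_dist += np.mean((frame - rec) ** 2)
        prev_rec = rec
    return total_rate + lam * total_dist

frames = [np.full(4, float(t)) for t in range(3)]  # constant-motion toy "video"
loss = multiframe_rd_loss(frames, toy_encode)
```

Because the recursion threads reconstructions through the whole group, gradients (in a differentiable version) from a late frame's distortion reach the coding of earlier frames, which is exactly what per-frame training cannot do.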

LiveVideoStack: What do you think of the idea that neural coding offers a better trade-off?

Liu Haojie: I think there are two sides. For image coding, end-to-end technology is becoming more and more mature: learning-based algorithms can optimize the encoder and decoder jointly, and with feature transforms, probability estimation, quantization, and other components constantly improving, the whole end-to-end framework can perform rate-distortion optimization very well.

For video coding, traditional codecs rely on complex block partitioning and mode selection to optimize the whole system, and it is difficult for a single model to solve all of these problems perfectly in current end-to-end video coding. Many problems remain open, such as how to optimize a multi-frame video encoder during training, whether to adopt multiple models, and rate-distortion selection within and between frames. A better rate-distortion optimization strategy for end-to-end video coding could therefore bring large performance gains.

LiveVideoStack: What do you know about institutions and platforms doing related research in China?

Liu Haojie: In China, Shanghai Jiao Tong University, the University of Science and Technology of China, Peking University, Tencent, and Alibaba have all produced excellent research in this field.

Shanghai Jiao Tong University proposed the earliest end-to-end video coding framework, DVC, and on that basis proposed DVC_Pro to further improve coding performance.

Professor Dong Liu's team at USTC has introduced many deep learning algorithms into the traditional video coding framework, greatly improving its performance by upgrading the corresponding modules. They have also proposed an end-to-end image compression algorithm based on a neural-network wavelet transform, a method that uses ensemble learning to optimize compression models for specific image textures, and MLVC, which uses multiple reference frames in an end-to-end video coding framework to achieve high compression performance.

The Peking University team proposed a hierarchical prior representation of probabilities to further optimize the probabilistic model in end-to-end systems, encoding images more efficiently while keeping codec complexity low.

On the industry side, the multi-frequency feature transformation method proposed by Tencent achieves better image coding performance than VVC.

LiveVideoStack: Are you following foreign research on neural coding?

Liu Haojie: Google's coding team has done a lot of foundational work on the whole end-to-end system, starting from the earliest image coding based on recurrent models; their compression model based on the variational autoencoder (VAE) has become the basis for most current work. On top of it, much work has gone into feature transforms, quantization, and multi-level probability models to obtain better compression performance.

Much of the work of the vision lab at ETH Zurich, including the soft-to-hard quantization method, a 3D probability model, very-low-bit-rate image compression, and their end-to-end video coding system, has contributed greatly to neural coding. They have also replicated and open-sourced the DVC end-to-end video coding work, which has been a great convenience for many researchers.
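The soft-to-hard idea mentioned above can be sketched in a few lines: during training, quantization is relaxed to a softmax-weighted mixture of centers so gradients can flow, and annealing the temperature toward zero recovers hard rounding. A simplified illustration, not ETH Zurich's actual implementation:

```python
import numpy as np

def soft_quantize(z, centers, temperature):
    """Differentiable soft assignment to quantization centers, in the
    spirit of the soft-to-hard method: a softmax over negative
    distances. As temperature -> 0 this approaches hard
    nearest-center rounding."""
    d = -np.abs(z[:, None] - centers[None, :]) / temperature
    w = np.exp(d - d.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)   # per-value softmax weights
    return w @ centers

centers = np.array([-1.0, 0.0, 1.0])
z = np.array([0.4, -0.9])
soft = soft_quantize(z, centers, temperature=1.0)   # smooth mixture of centers
hard = soft_quantize(z, centers, temperature=0.01)  # ~ nearest-center assignment
```

At low temperature the outputs collapse to the nearest centers (here 0.0 and -1.0), matching what the hard quantizer used at inference time would produce.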

I also follow Disney's work. In their ICCV 2019 paper, they introduce a coding constraint to obtain intermediate frames using the idea of video interpolation, and they propose a feature-domain method for residual compensation, finally achieving good coding performance.

LiveVideoStack: What are intermediate codes used for?

Liu Haojie: Neural coding obtains its quantized features through a learned feature-extraction transform, and many computer vision tasks are themselves accomplished through feature extraction and representation. For such tasks, one can therefore work directly on the intermediate coded features, greatly reducing the time and complexity of decoding back to images. This fits many machine vision methods well and improves their efficiency.

03 For Traditional Video Coding


LiveVideoStack: What are the limitations of traditional coding?

Liu Haojie:

1) Traditional video coding has carried the block-based hybrid coding framework forward for more than 20 years with great success, largely thanks to continual hardware advances. However, constrained by Moore's law, hardware development has gradually hit a bottleneck: it is increasingly difficult to buy coding performance with computational complexity, and the cost and difficulty of hardware design keep rising.

2) In addition, video coding is no longer limited to meeting users' viewing needs. As those needs grow and change, analysis and other visual applications downstream of video transmission are becoming more and more abundant, making it especially important to explore and develop novel video coding algorithms and frameworks.

3) Traditional coding focuses mainly on pixel-domain prediction, which cannot exploit correlations in the feature domain to better remove data redundancy. Learning-based video codecs, moreover, can optimize the codec and its related modules end to end.

LiveVideoStack: How do you evaluate the new generation of traditional codecs like VVC?

Liu Haojie: On the whole, VVC still follows the same hybrid coding framework, including block partitioning, intra-frame prediction, inter-frame prediction, transform and quantization, entropy coding, filtering, and so on. At each specific technical point, VVC improves further on the earlier technology.

In terms of objective quality, VVC can save more than 40% bit rate over HEVC for SDR video at its best, with similar gains for HDR and VR video; its subjective quality is also significantly better than HEVC's.

LiveVideoStack: How is neural coding similar to or different from traditional coding?

Liu Haojie: In essence, both neural coding and traditional coding express video information more compactly by exploiting the spatiotemporal correlation of video and the corresponding prior information to remove redundancy, while rate-distortion optimization uses the limited bits as effectively as possible to achieve higher-quality video reconstruction.

In terms of complexity, traditional coding and neural video coding rely on different computing platforms, and neural coding is not yet mature in engineering or hardware. I believe that with the development of AI chips and the maturation of fixed-point neural network quantization, the advantages of neural coding will gradually show in every respect.

At present, many research results can already run image codecs in real time on a GPU, with good subjective reconstruction quality.

04 For the Very Near Future

LiveVideoStack: What are the application scenarios for end-to-end neural coding?

Liu Haojie:

1) Object-based end-to-end image coding: in the course of our research, we found that it performs very well on license plate recognition and pedestrian recognition tasks in surveillance scenarios.

2) Images and videos can be reconstructed with high fidelity at very low bit rates, which is useful in scenarios with extremely limited bandwidth, such as deep-sea exploration and aviation communication.

LiveVideoStack: What are the requirements for the implementation and popularization of neural coding applications?

Liu Haojie:

1) More teams should work together to develop some unified standards for neural coding.

2) More open-source code and more open interfaces, so that other modules can integrate easily.

3) Maturity and development of neural network hardware.

LiveVideoStack: What hard questions about neural coding itself still need to be addressed?

Liu Haojie:

1) As the modules downstream of neural coding continue to multiply, how to carry out end-to-end training across multiple modules remains to be solved.

2) There is as yet no good unified standard for comparing the performance of neural coding methods.

3) Rate-distortion optimization and bit allocation across multiple frames are often difficult to solve during training. It is hard for one model to achieve overall optimal performance on all sequences, so mode selection and training multiple models are particularly important for performance.

LiveVideoStack: What is the future of machine learning in video codecs?

Liu Haojie:

1) Using machine learning to replace modules in traditional video coding. Machine learning and deep learning outperform traditional methods in image and video prediction, denoising, deblocking, and so on; replacing the corresponding modules in a traditional codec can greatly improve its performance.

2) Designing new end-to-end learning-based video coding frameworks: better image transforms, intra-prediction modules, quantization, probability models, and so on.

3) Machine learning can expand the boundaries of video coding applications, from the earliest viewing-only needs of end users to today's various machine vision tasks, as well as more efficient video processing and analysis.

LiveVideoStack: How do you see the future of neural coding?

Liu Haojie: While studying the end-to-end video coding framework, our lab has also proposed hardware-oriented fixed-point designs for neural network coding; our algorithm has been simplified for, and tested on, some neural network chips.

AI chips, and how to optimize end-to-end coding systems for them, are an important issue that bears on the future application of neural coding.

In addition, the end-to-end image coding algorithm Liu Haojie worked on in his laboratory ranked second by the MS-SSIM index among all submitted algorithms in the second image coding contest held by Google (https://openaccess.thecvf.com)… The subsequently open-sourced model stably surpasses the BPG algorithm in both objective and subjective metrics, and reaches or exceeds VVC performance on certain images. Related achievements support object-based coding, multiple visual tasks in the feature domain (PCM best paper finalist), and high-quality image reconstruction at very low bit rates. In video prediction, a variety of methods have been proposed to further improve the performance and efficiency of inter-frame prediction.

Edited by Coco Liang