Zoe Liu, co-founder and Chief Scientist of Visionular Inc, gave a keynote speech at RTC 2018, sharing a cutting-edge exploration of video codecs.

Developers working on real-time audio, video, and codecs are welcome to join the RTC developer community and exchange experience.


Why Video Codecs Matter

As we all know, video encoding and decoding are not symmetrical in technical complexity: the encoder is far more complex than the decoder. So what can machine learning do to optimize the codec?


At present, more than three coding standards are under active development: one line is organized by MPEG; one is the open-source, royalty-free line running from VP9 to AV1; and one is China's domestically developed series from AVS to AVS2 and AVS3.

Coding standards are evolving rapidly, and everyone asks: why is video coding so important?

Take JPEG, an image standard that has evolved over decades. Why has JPEG never been displaced in popularity over all that time? Largely because of its wide range of commercial uses and its ease of implementation. Next, I want to explain why video codecs matter so much.

In 2013, Google introduced VP9 as a replacement for its H.264 encoder. Overseas users watch YouTube on two kinds of phones: Android users receive the VP9 bitstream, while iPhone users receive H.264 streams because Apple does not support VP9 hardware decoding.

Google keeps statistics on how long VP9 and H.264 streams are played worldwide (excluding China). From the figure above, we can see that in India, Africa, and other markets with limited network bandwidth, VP9 greatly improved the user experience: it shortened the time to first frame considerably and significantly reduced the number of playback stalls.

At the same time, adopting the next generation of codecs has the potential to improve the user experience and enable new business, which demonstrates the importance of the video codec.

In encoders, whether HEVC or AV1, there is the concept of partitioning. Those familiar with coding know that both HEVC and AV1 partition blocks with a quadtree.

For example, the superblock in AV1 is 128×128, and the quadtree can keep subdividing it: each 128×128 block can be divided into four 64×64 blocks, each 64×64 into four 32×32 blocks, and so on. In AV1, a block can be decomposed down to a minimum size of 4×4.

For each image block, the encoder builds a partition map. Statistics show that the RDO evaluation of partitions accounts for more than 80% of the encoder's complexity.
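To make that figure concrete, here is a minimal, hedged sketch (not libaom code) of the recursive search an encoder performs: at every level it compares the rate-distortion cost of keeping a block whole against the summed cost of its four quadrants. The rd_cost function is only a placeholder; a real encoder runs prediction, transform, quantization, and entropy coding for every candidate, which is why this search dominates complexity.

```python
import numpy as np

def rd_cost(block):
    # Placeholder cost model; a real encoder evaluates prediction,
    # transform, quantization, and entropy coding here.
    return block.var() * block.size

def best_partition_cost(block, min_size=4):
    """Recursively compare 'no split' against the sum of the four quadrants."""
    cost_whole = rd_cost(block)
    n = block.shape[0]
    if n <= min_size:
        return cost_whole
    half = n // 2
    cost_split = sum(best_partition_cost(block[y:y + half, x:x + half], min_size)
                     for y in (0, half) for x in (0, half))
    return min(cost_whole, cost_split)

superblock = np.random.rand(128, 128) * 255   # one AV1 128x128 superblock
print(best_partition_cost(superblock))
```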

So how do you use machine learning to try to optimize?

As shown in the figure above, the first row of four images shows intra-frame compression and the second row shows inter-frame compression. They illustrate that different image blocks need different partitions.

The reason is that each image block has different content. For intra-frame compression, the more detail and texture a block has, the finer its partitioning. For inter-frame compression, which mainly codes residuals, the partitioning depends on how inter-frame prediction is performed. In this sense, the partitioning is determined by the content and the prediction mode.

Then, for any image block, we can extract features from its content. As we all know, when the QP value is large, that is, when the distortion is high, the content of the whole block tends to be smoothed out, so larger blocks are selected; when QP is small, finer partitions are selected. From this perspective, features for the partition decision can be extracted from the content and the coding mode, and the decision can be produced by a model trained offline with machine learning.

The paper above is work by Professor Xu Mai of Beihang University and his students. It performs the basic partition classification with neural networks (in this case, convolutional neural networks).

For example, for a 64×64 block you must decide whether to split it. If you do, you get four 32×32 blocks, and for each of them you again decide whether to continue partitioning. In other words, the decision is normally made layer by layer.

This paper makes a preliminary attempt: through neural-network training, the output is the block's final, complete partition result, with the multi-level decisions produced in a single pass as the final partition map. The advantage of this approach is that the complexity of the neural network itself is kept to a minimum and the result is produced all at once.
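A minimal sketch of that one-shot idea, not the paper's actual architecture: a small CNN takes the 64×64 luma block and emits all 1 + 4 + 16 = 21 quadtree split decisions in a single forward pass. The layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PartitionMapNet(nn.Module):
    """Predicts the full quadtree split map of a 64x64 block in one pass:
    1 decision at 64->32, 4 at 32->16, 16 at 16->8 (21 outputs)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=4, stride=4),   # 64x64 -> 16x16
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2, stride=2),  # 16x16 -> 8x8
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, 21),
        )

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x)))

block = torch.rand(1, 1, 64, 64)            # luma samples of one 64x64 block
split_probs = PartitionMapNet()(block)      # all 21 split probabilities at once
```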

In addition, the convolutional network's decision process includes an early-termination decision. As the network depth and the number of nodes per layer grow, the neural network itself introduces new complexity. Compared with the HM reference encoder, this paper speeds up the encoder by about 50%.

AV1 is an open standard and an open-source codec. We worked with Google and contributed to the libaom open-source code; the screenshot above is ours. The encoder is further optimized with machine-learning methods.

As the figure shows, this CL (changelist) does not use deep learning but a very simple neural network. Typically the network in such a CL has one or two layers with about 128 nodes per layer. So this is not deep learning; it is a relatively simple network structure.

In the past, when optimizing encoders, we often took an empirical approach. When doing partitioning, we would extract the variance of the current block at the first, second, and third levels, or split the current block into four sub-blocks, extract the variance of each sub-block, and analyze them. Decisions were then made against hard-coded thresholds: if some block statistic fell below or above a given threshold, the partition would proceed. All of these decisions can be replaced by a neural network: a simple network can be trained on large amounts of accumulated data and then used to decide whether the quadtree should continue to split, as sketched below.
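Here is a hedged sketch of the two approaches side by side. The threshold value, feature set, and network size are illustrative assumptions (trained here on random stand-in data), not the actual libaom heuristics or CL; the point is only that a one-hidden-layer, roughly 128-node classifier can take over the role of the hand-tuned rule.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def split_by_threshold(block, thr=900.0):
    """Legacy heuristic: split if any 32x32 sub-block variance exceeds
    a hand-tuned threshold (thr is illustrative)."""
    subs = [block[y:y + 32, x:x + 32] for y in (0, 32) for x in (0, 32)]
    return any(s.var() > thr for s in subs)

def partition_features(block, qp):
    """Features of the kind described above: whole-block variance,
    the four sub-block variances, and the quantizer."""
    subs = [block[y:y + 32, x:x + 32] for y in (0, 32) for x in (0, 32)]
    return [block.var()] + [s.var() for s in subs] + [qp]

# One hidden layer of 128 nodes, standing in for the simple networks in the CLs.
X_train = np.random.rand(1000, 6)           # stand-in feature vectors
y_train = np.random.randint(0, 2, 1000)     # stand-in split / no-split labels
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300).fit(X_train, y_train)

block, qp = np.random.rand(64, 64) * 255, 40
print(split_by_threshold(block), clf.predict([partition_features(block, qp)])[0])
```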

As the figure above shows, encoder speed can be increased by 10-20% using a simple neural network. So when we use machine learning it is not necessarily deep learning; the concept of the neural network has existed for a long time. We mainly use large amounts of data for training, design the network around the data set, and model relatively complex nonlinear relationships, so as to further improve the encoder's speed and coding efficiency.

ML for Coding Performance

AV1 is one example: it uses neural networks and machine-learning concepts to further speed up the encoder, replacing empirical decision making with neural networks.

So, can neural networks help us improve the performance of video compression? Here, I’ll share three examples of ways in which neural networks and deep learning can be used to improve coding performance.

The first example starts from the concept of super-resolution. As we all know, compression causes information loss. After information is lost, we hope to reconstruct it in the in-loop processing at the decoder or encoder. If we can do that, we can further improve coding performance, because compression is one system: either improve picture quality at a given bit rate, or save bit rate at a given quality. If the lost information can be reconstructed at a given bit rate, the picture quality can be further improved, and in turn the bit rate can be further reduced. By improving picture quality, the distorted image produced by bit-rate reduction can be restored toward the quality of the original.

AV1 has more tools than existing coding standards such as HEVC, which lets it improve on them. One of these tools, called loop restoration, provides a Wiener filter and a self-guided projection filter, and this tool alone brings a BD-rate improvement of 1-1.5%.

As you can see in the upper-right figure, this is a description of restoration. How does it recover the lost information? Imagine that every pixel is one dimension: an hour of video has n image frames with m pixels each, so any video is really a single point in a very high-dimensional space. With this picture in mind, Xs in the figure is the original video. After compression we get another point in that high-dimensional space. If the two points coincide, the coding is lossless; the further apart they are, the greater the distortion. The reconstruction process tries to move the compressed point X back closer and closer to the original point.

AV1 uses the concept of a guided filter, through which the degraded point X can be restored to two points X1 and X2, the results of two filters. In practice we find that these two points are still far from the original point. AV1 therefore builds a plane through X1 and X2 and projects the original video's point onto that plane. As you can see, the projected point is much closer to the original point, so it serves as the reconstruction.

Finally, AV1 only needs to transmit two parameters, α and β, with high precision in the bitstream. The decoder can apply the same restoration and recover an image of relatively high quality. Since this one tool achieves more than 1% BD-rate improvement, it is natural to ask whether the concept of learning can give even better image restoration, which leads directly to super-resolution methods.
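A minimal numerical sketch of the projection step just described (my own illustration, not libaom code): given the degraded block X, the two filtered versions X1 and X2, and the source Xs, which is available only at the encoder, the best α and β form a least-squares fit; only those two scalars need to be signalled.

```python
import numpy as np

def self_guided_projection(x, x1, x2, xs):
    """Solve for alpha, beta minimising ||xs - xr||^2 with
    xr = x + alpha*(x1 - x) + beta*(x2 - x)."""
    A = np.stack([(x1 - x).ravel(), (x2 - x).ravel()], axis=1)
    b = (xs - x).ravel()
    (alpha, beta), *_ = np.linalg.lstsq(A, b, rcond=None)
    xr = x + alpha * (x1 - x) + beta * (x2 - x)
    return alpha, beta, xr

# Toy usage: random stand-ins for the degraded block, the two restorations,
# and the source block.
x, x1, x2, xs = (np.random.rand(64, 64) for _ in range(4))
alpha, beta, xr = self_guided_projection(x, x1, x2, xs)
```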

Super-resolution is now widely used in machine learning. After compression, a higher-quality image can be reconstructed through learning, and this can be plugged into our existing coding structure to achieve better coding performance.

This is a paper co-authored with Ding Dandan of Hangzhou Normal University, mainly exploring learned image reconstruction. It can be applied in four places: first, interpolation filtering; second, the in-loop filter; third, reproducing a sharper reference frame from multiple reference frames; and fourth, out-of-loop post-processing filtering.

All of these approach the problem from a learning perspective, using known reference frames to reconstruct reference frames of higher clarity and quality, or using interpolation filtering, since interpolation is also equivalent to reconstructing some of the original information. The information learned from our training data is stored in the neural network structure and its parameters, and the information reconstructed from the existing video data helps us improve coding performance.

The figure above is a further example involving a joint reconstruction of forward and backward frames in space and time.

The final reconstruction is a super-resolution image. At the same bit rate, video quality can be further improved by applying this technique at the decoder. In multi-frame super-resolution of video, each frame carries motion, so the main contribution of this paper is to add pixel-level alignment on top of the original method, which is what distinguishes video processing from image processing.

Another part of Professor Xu's work also uses learning to recover information lost during coding, but instead of increasing resolution it improves image quality and removes artifacts from the coded images.

During encoding and decoding we find that quality fluctuates from frame to frame: each frame has a different quality because of its QP. Some frames, such as key frames, have better quality; the paper calls these PQFs (peak-quality frames).

If we can identify the PQFs and use learning to compensate for and improve the frames of poorer quality, we not only raise the quality of those poor frames but also bring the quality of every frame in the sequence up to a higher, more uniform level.

It is critical that video quality be stable from frame to frame. Think of acupuncture: many needles are inserted, and if each needle applies the same force, the patient is fine with it; but if one needle suddenly goes in very hard, the person will remember that jab. The same is true when the human eye watches a video.

The first task is to identify which frames in the video are of high quality, since the original video is not available at the decoder. This work mainly relies on a method similar to no-reference image quality assessment: besides methods that compare against the original video, there is a line of research on no-reference quality assessment, and this paper draws on that work.

In the first step, 36 features are extracted from each frame, giving 180 features over 5 consecutive frames for training. A support vector machine is then trained as a classifier to identify the high-quality frames.
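A hedged sketch of this detector with scikit-learn. The feature extractor below is a random stand-in for the paper's 36 no-reference quality features; only the 5-frame windowing into a 180-dimensional vector and the SVM classification step follow the description above.

```python
import numpy as np
from sklearn.svm import SVC

def extract_features(frame):
    # Stand-in for the paper's 36 no-reference quality features per frame.
    return np.random.rand(36)

def window_features(frames, t):
    # 5 consecutive frames x 36 features = 180-dimensional vector.
    return np.concatenate([extract_features(frames[t + d]) for d in range(-2, 3)])

frames = [np.zeros((64, 64)) for _ in range(100)]
X = np.stack([window_features(frames, t) for t in range(2, 98)])
y = np.random.randint(0, 2, len(X))        # stand-in labels: 1 = PQF, 0 = not
detector = SVC(kernel="rbf").fit(X, y)
is_pqf = detector.predict(window_features(frames, 50).reshape(1, -1))[0]
```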

In the second step, after the better frames have been identified, a neural network is used to improve the quality of the poor frames. First, motion compensation is performed: for a low-quality frame, the nearest PQFs before and after it are found, and pixel-level motion estimation is performed among the three frames. After motion estimation and compensation, quality enhancement is applied.

As we all know, neural-network training needs ground truth, and ground-truth motion vectors are hard to obtain. So the goal here is not to train the motion vectors themselves, but to minimize the difference between the reconstructed frame and the original frame, and to train the network toward that. The network consists of two parts, an MC-subnet and a quality-enhancement subnet; the MSE loss they form is minimized, and the trained network is then used to reconstruct the low-quality frames, finally achieving consistent quality across the whole video.
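A much-simplified PyTorch sketch of this two-part design (tiny layer counts and my own stand-in architecture, not the paper's): the MC-subnet predicts a dense flow and warps a PQF toward the low-quality frame, the QE-subnet fuses the warped neighbours with the low-quality frame, and the only supervision is the MSE against the original frame, so no motion-vector ground truth is ever needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCSubnet(nn.Module):
    """Predicts a dense flow from a PQF to the low-quality frame and warps
    the PQF with it (flow is learned end to end, no ground truth needed)."""
    def __init__(self):
        super().__init__()
        self.flow = nn.Sequential(
            nn.Conv2d(2, 24, 3, padding=1), nn.ReLU(),
            nn.Conv2d(24, 2, 3, padding=1))

    def forward(self, pqf, lqf):
        b, _, h, w = lqf.shape
        f = self.flow(torch.cat([pqf, lqf], dim=1)).permute(0, 2, 3, 1)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0) + f
        return F.grid_sample(pqf, grid, align_corners=True)

class QESubnet(nn.Module):
    """Fuses the two motion-compensated PQFs with the low-quality frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, lqf, warped_prev, warped_next):
        return lqf + self.net(torch.cat([lqf, warped_prev, warped_next], dim=1))

mc, qe = MCSubnet(), QESubnet()
prev_pqf, lqf, next_pqf, original = (torch.rand(1, 1, 64, 64) for _ in range(4))
enhanced = qe(lqf, mc(prev_pqf, lqf), mc(next_pqf, lqf))
loss = F.mse_loss(enhanced, original)   # the only supervision signal
loss.backward()
```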

The third work also uses deep learning to improve the efficiency of video coding. The idea is that when people look at a video, they are more sensitive to some areas and less sensitive to others.

Watching football, for example, we are sensitive to the quality of the ball but much less sensitive to distortion in the grass. Since the human eye's sensitivity varies across regions, we can skip compressing and transmitting the insensitive regions: analyze at the encoder to identify them, then synthesize them at the decoder. Encoding and decoding become a process of analysis and synthesis.

Machine learning has a big advantage here because it includes analysis of the content. So this work uses learning to segment every frame into two kinds of regions: regions the human eye is not sensitive to, and regions that are preserved. The preserved regions are compressed with a traditional codec (AV1, for example), while the regions with greater fault tolerance are synthesized.

In our specific work, AV1's Global Motion concept is used to identify the global motion vectors of these regions. Finally, reference frames and the global motion vectors are used to reconstruct the regions omitted at the encoder, which amounts to replacing and compositing those regions by motion warping. The neural network in this work is mainly used for pre-processing and image segmentation; the coding itself was done with AV1 because AV1's Global Motion tool was used.
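A hedged illustration of the synthesis step with OpenCV (my own sketch; the affine parameters and helper names are illustrative, not AV1 bitstream syntax): the texture region dropped at the encoder is filled by warping the co-located area of a reference frame with a global affine motion model of the kind carried by AV1's global motion tool.

```python
import numpy as np
import cv2

def synthesize_texture(decoded, reference, mask, affine_params):
    """decoded / reference: HxW luma frames; mask: nonzero inside the texture
    region; affine_params: 2x3 global (affine) motion matrix."""
    h, w = decoded.shape
    warped = cv2.warpAffine(reference, affine_params, (w, h))
    return np.where(mask > 0, warped, decoded)

# Toy usage with an illustrative affine model (small rotation + translation).
A = np.array([[1.0,   0.01,  2.0],
              [-0.01, 1.0,  -1.5]], dtype=np.float32)
decoded   = np.zeros((360, 640), np.uint8)
reference = np.zeros((360, 640), np.uint8)
mask      = np.zeros((360, 640), np.uint8)
synth = synthesize_texture(decoded, reference, mask, A)
```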

The image on the right shows the segmentation results: which areas have more complex texture and which areas the human eye is more sensitive to. In this work we found it works relatively well to divide an image into two regions, a texture region and a non-texture region, and then warp the texture region.

Compared with images, video introduces time as a third dimension. If motion warping is applied to every frame, the effect is inconsistent from frame to frame, the human eye easily notices the difference, and the subjective quality is poor.

To take a simple example: years ago, when we were doing real-time communication (like FaceTime) on mobile phones, a phone with two CPU cores naturally suggested parallelism. H.264, for example, supports slice-level parallelism at the encoder, where each frame is divided into a top slice and a bottom slice.

For example, a frame of 22 macroblock rows can be encoded as a top half and a bottom half of 11 rows each. However, with a large QP and a low bit rate, the two slices are encoded independently and their distortion differs noticeably, so a line becomes visible across the middle of the decoded video. Looking at any single frame you cannot see that line, but when the video plays, the line shows up. This is the difference between video and images: there is a consistency problem between frames.

At the time, we concluded that the line appeared mainly because the division of the two slices was too consistent: every frame was split into the same top half and bottom half, too neatly. One solution is to make the number of rows in the upper and lower slices approximately, but not exactly, equal, so that the workload of each CPU core stays roughly balanced for parallelism, while introducing a random variation so that the slice boundary differs slightly from frame to frame. Because the boundary moves randomly between frames, the line is no longer visible, as sketched below.
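A tiny sketch of that fix (illustrative numbers, not the product code): keep the two slices roughly half the frame so the two cores stay balanced, but jitter the boundary by a couple of macroblock rows from frame to frame so the seam never sits on the same row twice.

```python
import random

def slice_boundaries(total_rows=22, jitter=2, num_frames=30, seed=0):
    """Return the top-slice height (in macroblock rows) for each frame:
    roughly half the frame, plus a small random offset per frame."""
    rng = random.Random(seed)
    half = total_rows // 2
    return [half + rng.randint(-jitter, jitter) for _ in range(num_frames)]

print(slice_boundaries())   # boundary row for each of the next 30 frames
```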

This example further illustrates that when processing video, the perceptual effect over time must be taken into account; this is a major challenge in subjective quality assessment of video compared with images. Therefore, in our method of improving compression performance through texture analysis and synthesis, we do not apply motion warping to every frame. Instead, we exploit the hierarchical structure used in the encoder and apply warping only to frames in the top layer of that structure (put simply, only to B frames). The perceptual quality of the resulting video is fairly good.

ML for Perceptual Coding

Machine learning can also be used for perceptual coding. Those of us who build video encoders rely on quality evaluation standards, and quality assessment has gone through roughly three generations of change:

In the first generation, PSNR was mostly used. Since signal is lost in compression, the evaluation mainly asks how much information was lost, and PSNR measures that.
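As a minimal reminder of what this first-generation metric computes (a standard formula, shown here in Python before moving on to the later generations):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```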

In the second generation, when measuring video quality, we recognize that video is made for people to watch: regions the human eye is insensitive to do not deserve much effort, because even large distortion there cannot be perceived. So the evaluation standard evolved toward metrics that better match human subjective vision.

In the third generation, with the development of machine learning and artificial intelligence, a lot of video, such as much surveillance footage, is no longer watched by human eyes but analyzed by machines.

For example, our coding is generally a low-pass process: whether JPEG, MPEG, or AV1, the human eye is more sensitive to low-frequency signals. When we allocate bits to blocks, the reason we apply a transform is that we hope to compact more energy into the low-frequency end, while at the high-frequency end we discard a lot of information the human eye is not sensitive to. But machine analysis is not entirely consistent with human viewing. In surveillance video, for example, the scene is stable most of the time; when a person or object suddenly appears, that is, when high-frequency information appears, it is exactly what is most useful for machine analysis.
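A small sketch of the energy-compaction point made above: after a 2-D DCT of a smooth 8×8 block, most of the energy lands in the low-frequency corner, which is exactly what a human-oriented low-pass codec exploits and what a machine-analysis target may not care about.

```python
import numpy as np
from scipy.fft import dctn

block = np.add.outer(np.arange(8.0), np.arange(8.0))   # a smooth 8x8 ramp
coeffs = dctn(block, norm="ortho")                      # 2-D DCT
energy = coeffs ** 2
low_share = energy[:4, :4].sum() / energy.sum()
print(f"{low_share:.1%} of the energy lies in the low-frequency quadrant")
```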

Once machine analysis becomes the reference, the quality metrics will have different evaluation standards.

What's Next

This is a paper from CVPR this year; the video shown was downloaded from the paper's website, and the work is on video translation. As mentioned earlier, this is a video analysis + synthesis process.

As you can imagine, with such a large amount of video data, complex HD video content can be recovered from a simple map, and you can see an almost magical transformation. This actually gives a new perspective on video compression. When we started our company, investors would ask: you are doing compression, and machine-learning AI has advanced so far; could every image and video be replaced by a very complex deep neural network, with the decoder running that network and taking the information extracted from the bitstream as its input, to reconstruct any video?



Finally, what effect does learning have on compression? It is often said that "a picture is worth a thousand words." One frame of an image already conveys that much, and a video conveys far more than "a thousand words." As for the question I just raised, can machine learning replace all the compression standards and the traditional framework of motion compensation plus transform coding? My answer today is no.

We believe that in the foreseeable five to ten years, compression will keep developing along the direction of the existing basic coding structure. Individual modules may be replaced by neural networks or machine learning, but the basic framework will not change.

In conclusion, there are two main points. First, through machine learning, video analysis, understanding, and compression will become more closely connected. For example, a lot of highway video is used for traffic surveillance and license-plate recognition; if all you care about in such a video is the license plate, you can compress a large HD video into a string of digits. Compressing different scenes therefore calls for some preliminary analysis of the video. Second, through machine learning, video reconstruction technology will play an increasingly important role in the coding process. Very complex information can be reconstructed with neural-network tools. In particular, for video encoding and decoding, the encoder has access to the original video, so it can generate information that helps the decoder reconstruct; this information can be carried in the bitstream to help the decoder rebuild higher-quality video. Such an approach could be called "encoder-guided decoding reconstruction" and should have greater potential than reconstruction done independently at the decoder. There will also be more room for machine learning there.
