In today’s world of real-time interaction, video quality is a key indicator of the end-user experience. Large-scale, real-time evaluation cannot realistically rely on manual inspection alone, so building and promoting automatic video quality evaluation systems is the clear trend.

But how should video quality be evaluated? Different concerns lead to different answers. End users of live streaming care about real-time quality monitoring, while practitioners who provide video technology services care about fine-grained improvements or regressions between versions of their video algorithms. We therefore need a set of objective metrics that evaluate the subjective video quality experience: on the one hand as a measure of the client experience and a tool for fault detection, and on the other hand as a reference for practitioners optimizing their algorithms. We call this evaluation system VQA (Video Quality Assessment).

The difficulty lies in collecting the data, that is, in quantifying people’s subjective evaluation of video quality, and in building a model that can replace manual scoring.

In the following sections, we first review the evaluation methods commonly used in the industry, then introduce how Agora built the Agora-VQA model, and finally summarize directions for future improvement.

How does the industry evaluate video quality?

Like other deep learning tasks, building a video quality assessment model can be divided into two steps: collecting VQA data and training the VQA model. The whole process trains an objective model to imitate subjective annotations, and the quality of the fit is measured by subjective-objective consistency metrics. Subjective VQA annotation collects graded feedback from end users to quantify the video experience of real users; objective VQA provides a mathematical model that mimics subjective quality grading.

Subjective: VQA data collection

In subjective evaluation, video quality is scored by human observers, and the scores fall into two types: MOS (Mean Opinion Score) and DMOS (Differential Mean Opinion Score). MOS is an absolute rating of a video; it applies to no-reference scenarios and directly quantifies the quality of large volumes of UGC videos. DMOS is a relative rating; it applies to reference scenarios and generally measures the difference between videos of the same content.

Here we mainly introduce MOS, using the procedure given in ITU-R Rec. BT.500 as an example of how to ensure the reliability and validity of subjective experiments. The subjective impression of a video is mapped to the interval [1, 5] as follows:

| Score | Experience | Description |
|-------|------------|-------------|
| 5 | Excellent | The experience is very good |
| 4 | Good | Impairment is perceptible but does not affect the experience |
| 3 | Fair | Slightly affects the experience |
| 2 | Poor | Affects the experience |
| 1 | Bad | Severely affects the experience |

Two issues need to be explained in detail:

1. How is the MOS formed?

ITU-R Rec. BT.500 recommends setting up a non-expert panel of at least 15 people. After the raters have annotated the videos, the correlation between each rater’s scores and the overall mean is computed first; raters with low correlation are removed, and the MOS is the mean of the remaining raters’ scores. With more than 15 participants, the random error of the experiment can be kept within an acceptable range.
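
As a minimal sketch of this screening procedure (the correlation threshold and the score-matrix layout are illustrative assumptions, not values from the recommendation):

```python
import numpy as np

def compute_mos(scores, min_corr=0.7):
    """scores: array of shape (n_raters, n_videos) with ratings in [1, 5]."""
    scores = np.asarray(scores, dtype=float)
    overall_mean = scores.mean(axis=0)                       # per-video mean over all raters
    # Correlation of each rater's scores with the overall mean
    corrs = np.array([np.corrcoef(r, overall_mean)[0, 1] for r in scores])
    kept = scores[corrs >= min_corr]                         # drop low-correlation raters
    return kept.mean(axis=0)                                 # per-video MOS from remaining raters

# Example: a panel of 16 raters scoring 100 videos
# mos = compute_mos(np.random.randint(1, 6, size=(16, 100)))
```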

2. How should MOS be interpreted? To what extent does MOS represent “my” opinion?

Different raters draw the absolute line between “good” and “bad” differently and differ in their sensitivity to quality impairments, but their judgments of “better” versus “worse” largely converge. In public databases such as the Waterloo QoE Database, the standard deviation of scores averages about 0.7, meaning the subjective impressions of different raters can differ by nearly a full grade.

Objective: VQA model building

There are many ways to classify VQA tools. Based on how much information from the original reference video is required, they fall into three categories:

Full Reference

Full-reference methods rely on the complete original video sequence as the reference. Pixel-wise PSNR and SSIM are the earliest comparison methods of this kind; their drawback is limited correlation with subjective quality. The VMAF metric released by Netflix also falls into this category.
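
For reference, a minimal full-reference example: pixel-wise PSNR between a reference frame and a decoded frame (the frame variables are illustrative; SSIM is available, for instance, as skimage.metrics.structural_similarity):

```python
import numpy as np

def psnr(ref, dist, max_val=255.0):
    """ref, dist: uint8 arrays of the same shape (H, W) or (H, W, C)."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                      # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)   # higher is better, in dB
```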

Reduced Reference

Reduced-reference methods compare selected features extracted from the original and the received video sequences, which suits cases where the complete original sequence is not available. They sit between full-reference and no-reference methods.

No Reference

No-reference methods (hereinafter “NR”) go further and remove the dependence on any additional information, evaluating the current video on its own. Because of the constraints of online monitoring, reference videos are usually unavailable in real scenarios. Common NR metrics include DIIVINE, BRISQUE, BLIINDS and NIQE. Because no reference video is available, the accuracy of these methods is often lower than that of full-reference and reduced-reference methods.

Subjective-objective consistency metrics

As mentioned above, pixel-based methods such as PSNR and SSIM correlate only loosely with subjective quality, so how do we judge whether a given VQA tool is good or bad?

Quality is usually defined in terms of the prediction accuracy and prediction monotonicity of the objective model. Prediction accuracy describes how well the objective model linearly predicts subjective scores, measured by the Pearson Linear Correlation Coefficient (PLCC) and the Root Mean Square Error (RMSE). Prediction monotonicity describes how consistently the relative ranking of scores is preserved, measured by the Spearman Rank-Order Correlation Coefficient (SROCC).
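
A minimal sketch of these consistency metrics, assuming per-video MOS labels and model predictions are already available:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def consistency_metrics(mos, pred):
    """mos, pred: 1-D sequences of subjective scores and objective predictions."""
    plcc, _ = pearsonr(pred, mos)        # prediction accuracy (linearity)
    srocc, _ = spearmanr(pred, mos)      # prediction monotonicity (rank order)
    rmse = float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(mos)) ** 2)))
    return {"PLCC": plcc, "SROCC": srocc, "RMSE": rmse}
```

In benchmarking practice, PLCC and RMSE are often computed after fitting a monotonic (e.g. logistic) mapping from predictions to MOS; the sketch above omits that step.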

How does Agora-VQA evaluate video quality?

However, most public datasets are not large enough, in data volume or in richness of video content, to reflect what actually happens online. To get closer to real data and cover different RTE (real-time interaction) scenarios, we built the Agora-VQA Dataset and trained the Agora-VQA Model on top of it. **This is the industry’s first deep-learning-based model for evaluating the subjective video experience (MOS) that can run on mobile devices.** It uses a deep learning algorithm to estimate the MOS of the subjective picture-quality experience at the receiving end of an RTE scenario, removing the heavy dependence of traditional subjective quality evaluation on human labor, greatly improving the efficiency of video quality evaluation, and making real-time evaluation of online video quality possible.

Subjective: Agora-VQA Dataset

We built a subjective picture-quality evaluation database: a scoring system was set up according to ITU standards to collect subjective scores, and the data was then cleaned to obtain the subjective experience score (MOS) for each video. The overall process is as follows:

In the video curation stage, we first diversify the sources of video content within the same scoring batch to avoid visual fatigue among raters, and second, try to distribute the material evenly across the picture-quality range. The following is the score distribution from one round of video collection:

In the subjective scoring stage, we built a scoring app. Each video is 4 to 8 seconds long, and each batch contains 100 videos to be scored. For each rater, the total viewing time is kept within 30 minutes to avoid fatigue.

Finally, in the data cleaning stage there are two options. The first follows the ITU standard: compute the correlation between each rater and the population mean, remove raters with low correlation, and then average the scores of the remaining raters. The second is to compute the 95% confidence interval of each sample, take the videos with the most consistent scores as a gold standard, and screen out participants whose scores deviate strongly on those samples; a sketch of this option is shown below.
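
A minimal sketch of the second option (the number of gold-standard videos and the deviation threshold are illustrative assumptions):

```python
import numpy as np

def screen_by_gold_standard(scores, n_gold=10, max_dev=1.0):
    """scores: (n_raters, n_videos) matrix of ratings in [1, 5]."""
    scores = np.asarray(scores, dtype=float)
    n_raters = scores.shape[0]
    mean = scores.mean(axis=0)
    # 95% confidence interval half-width of each video's mean score
    ci_half = 1.96 * scores.std(axis=0, ddof=1) / np.sqrt(n_raters)
    gold = np.argsort(ci_half)[:n_gold]                  # videos scored most consistently
    # Each rater's average absolute deviation from the consensus on the gold set
    dev = np.abs(scores[:, gold] - mean[gold]).mean(axis=1)
    return scores[dev <= max_dev]                        # keep raters close to consensus
```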

Objective: Agora-VQA Model

On the one hand, to get closer to users’ actual subjective perception, and on the other hand, because reference videos are unavailable in live streaming and similar scenarios, our scheme defines objective VQA as a no-reference evaluation tool operating at the decoded resolution on the receiving end, and uses deep learning to monitor video quality at the decoder side.

Training a deep learning model here can follow either an end-to-end or a non-end-to-end approach. In end-to-end training, because videos differ in spatial and temporal resolution, they must first be sampled to a uniform size. In the non-end-to-end approach, features are first extracted with a pre-trained network, and a regression model is then trained on those video features to fit the MOS; a sketch of this route is shown below.
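
As an illustration of the non-end-to-end route (a minimal sketch, not the Agora-VQA architecture; the backbone, input size, and head dimensions are assumptions, and torchvision ≥ 0.13 is assumed for the weights API):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()                  # keep the 512-d feature per frame
backbone.eval()

head = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 1))  # MOS regressor

def predict_mos(frames):
    """frames: tensor of shape (T, 3, 224, 224), normalized like ImageNet."""
    with torch.no_grad():
        feats = backbone(frames)             # (T, 512) per-frame features
    clip_feat = feats.mean(dim=0)            # temporal average pooling
    return head(clip_feat)                   # scalar MOS estimate (after training the head)
```

In this setup the backbone stays frozen and only the small regression head would be trained, typically with an L1 or MSE loss against the MOS labels.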

For feature extraction, there are different ways to sample the original video. The following figure (reproduced from paper [1]) shows the correlation between different sampling strategies and subjective scores. Spatial sampling has the greatest impact on performance, while temporal sampling preserves the highest correlation with the MOS of the original video.

Picture-quality experience is affected not only by spatial characteristics but also by temporal distortion, which exhibits a temporal hysteresis effect (see paper [2]). The effect has two sides: subjective experience drops immediately when video quality deteriorates, but recovers only slowly when video quality improves. Our modeling also takes this phenomenon into account.
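
To make the effect concrete, here is a minimal sketch of asymmetric temporal pooling in the spirit of the hysteresis model in [2] (not the exact model from the paper, and not necessarily what Agora-VQA uses; the recovery rate is an assumed illustrative value):

```python
import numpy as np

def hysteresis_pool(frame_scores, recovery_rate=0.1):
    """frame_scores: per-frame (or per-second) quality estimates in [1, 5]."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    pooled = np.empty_like(frame_scores)
    pooled[0] = frame_scores[0]
    for t in range(1, len(frame_scores)):
        s = frame_scores[t]
        if s < pooled[t - 1]:
            pooled[t] = s                                                    # quality drop: react instantly
        else:
            pooled[t] = pooled[t - 1] + recovery_rate * (s - pooled[t - 1])  # slow recovery
    return pooled.mean()                                                     # clip-level score
```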

Performance comparisons with other VQA tools

Finally, the correlation performance of different quality evaluation algorithms on KoNViD-1k and LIVE-VQC is shown below:

Comparison of model parameter counts and computational cost:

Agora-VQA has a clear computational advantage over the large deep learning models from academia. This advantage makes it possible to evaluate the video communication experience directly on the device, saving substantial computing resources while maintaining adequate accuracy.

Looking forward

Finally, Agora-VQA still has a long way to go toward the ultimate goal of QoE (Quality of Experience), namely characterizing the user’s subjective experience:

1) From decoded resolution to rendered resolution

Decoded resolution is defined relative to rendered resolution: the same video played on different devices, or stretched to different window sizes on the same device, produces different subjective experiences. At present, Agora-VQA evaluates the quality of the video stream at the decoder side. In the next phase, we plan to support different devices and different stretch sizes to get closer to the quality actually perceived by the end user and achieve “what you see is what you get”.

2) From the video clip to the entire call

The VQA datasets used to train the model consist mostly of 4 to 10 second video clips, whereas an actual call is subject to a recency effect: simply sampling, scoring and reporting clips linearly over time may not accurately fit the user’s subjective feeling. As a next step, we plan to jointly consider clarity, smoothness, interaction delay and other factors to form a time-varying experience evaluation method.

3) From experience score to fault classification

At present, Agora-VQA can predict video quality to within 0.1 on the [1, 5] scale. Automatically locating the cause of a fault when video quality is poor is another important step toward online quality investigation, so we plan to add fault detection on top of the existing model.

4) From real-time evaluation to industry standardization

At present, Agora-VQA is being iterated and polished in our internal systems and will be opened up gradually. We also plan to integrate an online evaluation capability into the SDK and release an offline evaluation tool.

The above summarizes our research and practice in VQA. You are welcome to discuss it with us in the developer community.

References

[1] Z. Ying, M. Mandal, D. Ghadiyaram and A. Bovik, “Patch-VQ: ‘Patching Up’ the Video Quality Problem,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 14014-14024.

[2] K. Seshadrinathan and A. C. Bovik, “Temporal hysteresis model of time varying subjective video quality,” 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 2011, pp. 1153-1156.

The Dev for Dev column

Dev for Dev (Developer for Developer) is an interactive innovation practice initiative jointly launched by Agora and the RTC Developer Community. Through technology sharing, exchanges, and project building from the engineer’s perspective, it brings together the strength of developers, uncovers and delivers the most valuable technical content and projects, and fully unleashes the creativity of technology.