Original: http://webrtcbydralex.com/index.php/2018/10/11/webrtc-video-quality-assessment/



How do you ensure that WebRTC video calls or streams are of good quality?

You can pull every metric exposed by the WebRTC stats API (getStats) and still not get close to an answer. The reason is simple. First, most of the statistics in the report are about the network, not about video quality. Moreover, it is well known, and anyone who has tried will confirm, that while network conditions affect the perceived quality of a call, they are not directly correlated with it, which means you cannot infer or compute video quality from these metrics. Finally, call quality is a very subjective matter, and subjective judgments are hard for a computer to compute directly.

In a controlled environment, such as a lab or a unit test, you can do full-reference video quality assessment: tag each frame with an ID on the sender side, capture the frames on the receiver side and match them by ID (to compensate for jitter, delay, or other problems introduced by the network), then measure the difference between the two images. Google's "full stack tests" cover many codec and network-impairment scenarios and can be run as part of a unit test suite. But how do you do this in production, and in real time?
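To make the lab-side idea concrete before turning to the production question, here is a minimal Python sketch. It assumes the sender embeds a numeric ID in each frame and the receiver can decode it; the capture and ID-extraction steps are outside this snippet, and PSNR stands in for whatever full-reference metric you prefer.

```python
import numpy as np

def psnr(ref, deg):
    """Peak signal-to-noise ratio between two same-sized 8-bit frames."""
    mse = np.mean((ref.astype(np.float64) - deg.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def score_call(sent_frames, received_frames):
    """Average PSNR over the frames whose IDs survived the network.

    Both arguments map the frame ID embedded by the sender to the decoded
    frame (numpy arrays); matching on the ID compensates for jitter, delay
    and dropped frames introduced by the network.
    """
    common_ids = sent_frames.keys() & received_frames.keys()
    if not common_ids:
        return 0.0
    return sum(psnr(sent_frames[i], received_frames[i]) for i in common_ids) / len(common_ids)
```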

For most WebRTC PaaS use cases, the reference frames (see https://chromium.googlesource.com/external/webrtc/+/master/video/full_stack_tests.cc) are simply not available: it would be illegal for the service provider to access customer content in any way. Of course, users of the service can record the streams on both the sender and the receiver side and compute quality scores offline. However, that does not allow acting or reacting to a sudden drop in quality; it only helps with post-mortem analysis. With a record, upload, download workflow, how do you detect a quality drop in real time and act on it?

Which WebRTC PaaS provides the best video quality, in general or for my specific use case? For most people this is an unanswerable question. How do you make a 4×4 comparison like that automatic and real-time, or reproduce this Zoom versus WebRTC comparison (https://jitsi.org/news/a-simple-congestion-test-for-zoom/) while instrumenting the network?

CoSMo R&D has developed a new AI-based video quality assessment tool that achieves this feat in conjunction with its KITE test engine and the corresponding network instrumentation module.

Introduction

CU-SeeMe, developed at Cornell University, ran the first experiments with real-time communication (RTC) over the Internet in 1992. With the launch of Skype in August 2003, RTC rapidly gained popularity on the Internet. Since 2011, WebRTC technology has made RTC available directly in web browsers and mobile applications.

According to the Cisco Visual Networking Index [1] released in June 2017, real-time video traffic (live streaming, video conferencing) is expected to grow sharply from 3% of Internet video traffic in 2016 (1.5 exabytes per month) to 13% in 2021 (24 exabytes per month).

Quality of experience (QoE) for the end user is very important for any application that processes video. The industry already has a number of tools and metrics to automatically evaluate the QoE of video applications. For example, Netflix developed the Video Multimethod Assessment Fusion (VMAF) metric [2], which measures delivery quality across different video encoders and encoding settings. This metric makes it possible to routinely and objectively assess the quality of thousands of encodes across dozens of encoding settings.

But VMAF needs the original, undistorted reference video to compute the quality score of the compressed video. This approach works well for streaming pre-recorded, distortion-free content, but not for RTC, where the raw video is generally not available.

Raw video could be recorded at the source, but then video quality could not be evaluated in real time. In addition, recording live video during a live communication raises legal and security issues. For these reasons, the entity performing the video quality assessment (a third-party platform-as-a-service, for example) may not be authorized to store the video files.

The special case of RTC therefore cannot be addressed by metrics that require a reference video. It calls for assessment methods that work without any reference. Such metrics are called no-reference video quality assessment (NR-VQA) metrics.

I. Video quality metrics

Video quality assessment techniques can be divided into three categories.

First, there are full-reference (FR) techniques, which require full access to the reference video. Among the FR methods we find the traditional video quality measures: signal-to-noise ratio (SNR), peak signal-to-noise ratio (PSNR) [3], mean squared error (MSE), structural similarity (SSIM) [4], visual information fidelity (VIF) [5], VSNR [6], and the video quality metric (VQM) [7].

These metrics are well known and easy to compute, but they are not good predictors of the quality of the user experience [8, 9].
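When the reference is at hand, computing these metrics really is a matter of a line or two. A minimal sketch, assuming scikit-image 0.19 or later and two aligned frames of identical size:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fr_scores(ref, deg):
    """Full-reference scores for two aligned HxWx3 uint8 frames."""
    return {
        "psnr": peak_signal_noise_ratio(ref, deg),
        "ssim": structural_similarity(ref, deg, channel_axis=-1),
    }
```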

Then there are reduced-reference (RR) techniques, which only require a reduced set of features extracted from the reference video.

Finally, no-reference (NR) techniques do not require any information about the reference video; in fact, they do not need a reference video at all.

A comprehensive and detailed review of NR video quality metrics was published in 2014 [10], and a more recent survey of audio and video quality assessment methods was published in 2017 [11]. The metrics fall into two groups: pixel-based methods (NR-P), computed from statistics derived from pixel-level features, and bitstream-based methods (NR-B), computed from the encoded bitstream.

II. Previous efforts for WebRTC video quality assessment

Reference [12] proposed the first measurements to evaluate the quality of video broadcast to many viewers through WebRTC. For this experiment, the authors used the SSIM index [4] as the measure of video quality. The test aimed to determine how many viewers could join a broadcast while image quality remained acceptable. When it comes to accurately evaluating the user experience, the results are inconclusive. As the number of viewers joining the broadcast increased, SSIM measurements remained surprisingly stable, with values between 0.96 and 0.97; then suddenly, when the number of clients reached about 175, SSIM dropped to a value close to zero. It is hard to believe that the user experience remained acceptable, with no loss of quality at all, while the audience grew from 1 to 175. In addition, the test used a pseudo-client that implemented only the parts of WebRTC responsible for negotiation and transport, not the WebRTC media processing pipeline, which is not realistic for evaluating the video quality of a broadcast experiment.

In reference [13], the authors evaluated various NR metrics on videos that were compressed and then transmitted over lossy networks (0 to 10% packet loss). The eight NR metrics studied were complexity (number of objects or elements present in a frame), motion, blockiness (discontinuities between adjacent blocks), jerkiness (non-smooth, non-fluid presentation of frames), average blur, blur ratio, average noise, and noise ratio. Because none of these NR metrics alone could accurately assess the quality of such impaired videos, the authors recommend using machine learning to combine several NR metrics with two network measurements (bit rate and packet loss level), in order to obtain an improved NR metric that approximates the rating the video quality metric (VQM) would give to a video; VQM is a reliable FR metric that correlates well with human perception. In this experiment, they used ten videos from the LIVE video quality database, compressed at eight different levels with H.264 and impaired by transmission over a network at twelve packet-loss levels.

They evaluated the quality of their results against the scores given by the FR metric VQM [14], not against subjective ratings from human viewers.
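The combination step described in [13] is essentially a supervised regression problem. As a rough illustration only (the feature set, the regressor, and the data below are placeholders, not the authors' exact setup), it could be sketched in Python with scikit-learn:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# X: one row per video = [blockiness, blur, noise, motion, ..., bit rate, packet loss]
# y: VQM score of the same video (the FR "ground truth" used for training)
rng = np.random.default_rng(0)
X = rng.random((80, 10))   # placeholder feature matrix, for illustration only
y = rng.random(80)         # placeholder VQM scores

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X, y)
predicted_vqm = model.predict(X[:5])   # NR prediction of an FR score
```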

In reference [15], the authors relied on a number of bitstream-based features to assess the impairments affecting the received video and how these impairments degrade the perceived video quality.

Paper [16] proposed combining audio and video metrics to evaluate audio-visual quality. The assessment was carried out on two different datasets.

First, they presented the results of combinations of FR metrics. The FR audio metrics selected by the authors were the Perceptual Evaluation of Audio Quality (PEAQ) [17] and ViSQOL [18]. For the FR video metrics, they used the video quality metric (VQM) [7], peak signal-to-noise ratio (PSNR), and SSIM [4].

They then showed the results of combinations of NR metrics. The NR audio metrics were SESQA and reduced SESQA (RSESQA) [19]. For the NR video metrics, they used a blockiness-blurriness measure [20], the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) [21], the Blind Image Quality Index (BIQI) [22], and the Natural Image Quality Evaluator (NIQE) [23]. The best combination across both datasets was RSESQA with the blockiness-blurriness measure.

A recent experiment evaluating the quality of experience of WebRTC video streaming over mobile broadband networks was published in [24]. Video calls at various resolutions (from 720×480 to 1920×1080) were set up through WebRTC between Chrome and a Kurento Media Server. The quality of the WebRTC videos was rated subjectively by 28 people on a scale from 1 (poor quality) to 5 (good quality). The authors then used several metrics, all based on computing the error between the original video and the WebRTC video, to objectively assess the quality of the WebRTC video. Unfortunately, the authors do not clearly report whether the subjective assessments correlate with the computed objective measures.

III. NARVAL: neural-network-based aggregation of no-reference video quality metrics

III.1 Methodology

There are two main parts to this work: first, extracting features from videos that are representative of video-conferencing use cases (as opposed to the pre-recorded content used by, say, Netflix), and then training a model to predict a quality score for a given video. We used six publicly available video quality datasets, containing the various distortions that may occur during video communication, to train and evaluate the performance of our models.



For feature extraction, we selected measures and features that had been published and evaluated on different image quality datasets. After computing them on the videos of our database, we stored the results so that they could be reused in the training stage. The data can then be processed for use in our training models, for example by taking the mean of a feature over a video. In the second part, we used different regression models, mainly neural networks with varying inputs and layers, as well as support vector regression (SVR).

We tested multiple parameter combinations for each model and kept only the best one for each model category. In addition to basic fully connected neural networks, convolutional, recurrent, and time-delay neural networks were also used.
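As a rough illustration of the simplest of these regressors (not NARVAL's actual architecture; the feature dimension, layer sizes, and data below are placeholders), a dense network mapping per-video feature vectors to a quality score could look like this with Keras:

```python
import numpy as np
import tensorflow as tf

n_features = 36                           # assumed size of the extracted feature vector
X = np.random.rand(500, n_features)       # placeholder features, one row per video
y = np.random.rand(500)                   # placeholder quality scores (e.g. MOS)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),             # predicted quality score
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
```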

NARVAL TRAINING: Dense deep neural network diagrams

We trained our models on the databases using 5-fold cross-validation, repeating the training many times. Since each database contains several distortions, the folds cannot be split purely at random: we choose the five folds so that every distortion appears in each fold and the distribution of distortions is the same across folds. The final score is then simply the average over the folds.

Another way to build the folds is to put all the videos sharing a given distortion into the same fold. With this approach the folds are smaller, and each validation fold contains distortions the model has never seen during training.
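Both folding strategies can be sketched with scikit-learn's splitters; the dataset size and the number of distortion types below are placeholders, not the actual databases used:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.random.rand(60, 36)                # placeholder per-video feature vectors
y = np.random.rand(60)                    # placeholder quality scores
distortion = np.repeat(np.arange(5), 12)  # 5 distortion types, 12 videos each

# Strategy 1: every fold sees the same mix of distortions (stratify on distortion).
same_mix_splits = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
                       .split(X, distortion))

# Strategy 2: each distortion type forms its own fold, so the validation fold
# always contains distortions unseen during training.
unseen_splits = list(GroupKFold(n_splits=5).split(X, y, groups=distortion))
```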

III.2 Results

First, we validate on the training set (that is, the videos whose scores are known) to check whether the video quality we compute matches the known values, as shown below.

NARVAL TRAINING: 3D convolutional network diagram

As a sanity check, SSIM and VMAF scores were computed again on the same reference videos and compared with the scores given by NARVAL. We can see that the scores exhibit the same behavior, even though they are not exactly identical. Interestingly, this also illustrates a result well known in the image processing community but apparently counterintuitive in the WebRTC community: perceived video quality does not decrease linearly with bit rate/bandwidth. As you can see in the figure below, to reduce perceived quality by 10%, you need to reduce the bandwidth by a factor of 6 to 10!

Conclusion

In practice, this means that you can now use NARVAL to compute video quality without any reference frame or video! It opens the door to simpler implementations of existing use cases, as well as to many new use cases where quality assessment can be done at any given point in the streaming pipeline.

The full research report is available from CoSMo. CoSMo also licenses two implementations: a Python implementation for research and prototyping, and a C/C++ implementation for speed and SDK embedding. Eventually, video quality assessment will be offered as a service, alongside the POLQA-based audio quality assessment (AQA) service built with Citrix.

References

[1] — Visual Networking Index, Cisco, 2017

[2] — Toward A Practical Perceptual Video Quality Metric, Netflix, 2016.

[3] — Objective video quality measurement using a peak signal-to-noise ratio (PSNR) full reference technique, American National Standards Institute, Ad Hoc Group on Video Quality Metrics, 2001.

[4] — Image Quality Assessment: From Error Visibility to Structural Similarity, Wang et al., 2004.

[5] — Image Information and Visual Quality, Sheikh et al., 2006.

[6] — VSNR: A Wavelet-Based Visual Signal-to-Noise Ratio for Natural Images, Chandler et al., 2007.

[7] — A New Standardized Method for Objectively Measuring Video Quality, Margaret H. Pinson and Stephen Wolf, 2004.

[8] — Mean Squared Error: Love It or Leave It? A new look at Signal Fidelity Measures, Zhou Wang and Alan Conrad Bovik, 2009.

[9] — Objective Video Quality Assessment Methods: A Classification, Review, and Performance Comparison, Shyamprasad Chikkerur et al., 2011.

[10] — No-reference Image and Video Quality Assessment: a classification and review of recent approaches, Muhammad Shahid et al., 2014.

[11] — Audio-visual Multimedia Quality Assessment: A Comprehensive Survey, Zahid Akhtar and Tiago H. Falk, 2017.

[12] — WebRTC Testing: Challenges and Practical Solutions, B. Garcia et al., 2017.

[13] — Predictive No-reference assessment of video Quality, Maria Torres Vega et al., 2017.

[14] — A New Standardized Method for Objectively Measuring Video Quality, Margaret H. Pinson and Stephen Wolf, 2004.

[15] — A No-reference Bitstream-based perceptual Model for video quality estimation of Videos affected by coding artifacts and packet losses, Katerina Pandremmenou et al., 2015.

[16] — Combining audio and video metrics to assess audio-visual quality, Helard A. Becerra Martinez and Mylene C. Q. Farias, 2018.

[17] — PEAQ — The ITU Standard for Objective Measurement of Perceived Audio Quality, Thilo Thiede et al., 2000.

[18] — ViSQOL: Virtual Speech Quality Objective Listener, Andrew Hines et al., 2012.

[19] — The ITU-T Standard for Single-Ended Speech Quality Assessment, Ludovic Malfait et al., 2006.

[20] — No-reference Perceptual Quality Assessment of JPEG Compressed Images, Zhou Wang et al., 2002.

[21] — Blind/Referenceless Image Spatial Quality Evaluator, Anish Mittal et al., 2011.

[22] — A Two-Step Framework for Constructing Blind Image Quality Indices, Anush Krishna Moorthy and Alan Conrad Bovik, 2010.

[23] — Making a “Completely Blind” Image Quality Analyzer, Anish Mittal et al., 2013.

[24] — Estimation of Visual And Video Streaming in Streaming Video based on Video Processing, IEEE Transactions on Video Processing, 2018.

[25] — Evolution of Real-time Communication Testing with WebRTC 1.0, Alexandre Gouaillard and Ludovic Roux, 2017.

[26] — Comparative Study of WebRTC Open Source SFUs for Video Conferencing, Emmanuel Andre et al., 2018.

This article is from the blog of Dr. Alex Gouaillard, founder of CoSMo Software, who is also active in the WebRTC, QUIC, and other standardization efforts. LiveVideoStack has translated the original text.


