As 2020 began, the epidemic threw people off their stride. Going out to shop became impossible, schools were closed, and people worked from home. Under such circumstances, live streaming, online teaching, and video conferencing all came to the forefront. But if the host’s voice or picture suddenly drops out during a live broadcast, or the audio and video quality is poor, the experience inevitably suffers. So how can a product find and fix as many problems as possible during the testing phase? This article aims to explain which indicators matter for audio and video and to give a basic understanding of the audio/video evaluation process. The article is by Ye Shaoqiu of NetEase Intelligent Enterprise R&D; please credit the author and source when reprinting.

This article covers three aspects. First, the purpose of audio and video testing: what problems such testing can solve, in general terms. Second, how to conduct testing from the audio/video perspective, including the test dimensions, the evaluation framework, its overall structure, and the specifics of each part. Third, how that framework is put together and the problems encountered while implementing it.

Before diving in, here is a brief overview of the basic flow of an audio/video call:

As the diagram shows, the whole process divides into three parts: the sending end, the network, and the receiving end.

The modules on the sending end work as follows: capture corresponds to the microphone and camera hardware, and may also involve multimedia mixing or screen recording. 3A may be less familiar: it is the module for audio and video effect processing; on the audio side, 3A refers to AEC (echo cancellation), ANS (noise suppression), and AGC (automatic gain control).

Encoding and decoding are a pair of inverse processes: after encoding, the data is packetized and sent over the network for transmission; the receiving end depacketizes and decodes it for playback.
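To make the data flow concrete, here is a minimal, illustrative sketch of the pipeline in Python. All names are hypothetical; real engines implement these stages in optimized native code, and this only models the chain of stages described above.

```python
# Illustrative model of the call pipeline described above. All names are
# hypothetical; real engines implement these stages in native code.
from typing import Callable, List

Frame = bytes  # one raw audio (PCM) or video (YUV) frame

def run_pipeline(frame: Frame, stages: List[Callable[[Frame], Frame]]) -> Frame:
    """Push a frame through an ordered chain of processing stages."""
    for stage in stages:
        frame = stage(frame)
    return frame

# Send side: capture -> 3A -> encode -> packetize -> network
send_stages: List[Callable[[Frame], Frame]] = [
    lambda f: f,  # 3A: echo cancellation, noise suppression, gain control
    lambda f: f,  # encode: e.g. Opus (audio) or H.264 (video)
    lambda f: f,  # packetize: wrap the bitstream in RTP packets
]

# The receive side mirrors it: depacketize -> jitter buffer -> decode -> render
packets = run_pipeline(b"\x00" * 960, send_stages)  # one dummy 10 ms frame
```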

The purpose of audio and video testing

With a basic understanding of the audio/video call process, let’s get back to why we test at all.

This part concerns why we run audio/video evaluations and what questions we can answer once we have. You have probably been grilled by your boss: what do you think of the quality of this feature? Is it ready to go live? You may not work in audio/video quality assurance, but you have surely faced similar questions at some point.

In short, it boils down to four questions: How are we doing? Has it improved? By how much? What is the gap with competing products? These can be summed up as “know yourself and know your enemy”. The first three answer the “know yourself” part; the last is “know your enemy”, which itself builds on knowing yourself.

“Knowing yourself” is mainly about obtaining baseline data and understanding the current state, which is the foundation. During version iteration, longitudinal comparison between versions against this baseline tells you whether a version has improved or regressed, and by how much, which is the reference for version optimization. That is understanding of one’s own capability; as Sun Tzu said, “know yourself and know your enemy, and you will never be defeated.” We should not only keep our heads down working but also look up to see how others are doing, because competitiveness ultimately shows up as the gap with competing products. Of course, the gap cuts both ways: a positive gap is an advantage, while a negative one is something to improve.

For these questions, which dimensions of data should we provide as supporting evidence?

To answer these four questions, we listed evaluation indicators across multiple dimensions for video conferencing, as shown in the table above.

Conference quality breaks down into basic effects and other enhanced effects, and is a concentrated reflection of the underlying capabilities of video conferencing.

Basic effects include stability (stutters, crashes), clarity (sound and picture are clear and intelligible), and fluency (join time not too long, low delay, no stutter). Other effects mainly include beauty filters and mirroring, background blur, practicality, ease of use, feature completeness, and so on, all of which directly affect user experience. Beyond these, characteristics such as reliability, security, maintainability, portability, runtime efficiency, and functional suitability affect performance and stability; their impact on user experience is also non-negligible and deserves attention.

This article focuses mainly on audio and video effects.

How to conduct audio and video testing

With the evaluation dimensions set, how do we measure the indicators in each dimension? This brings us to the evaluation framework, which covers how the evaluation data is classified and aggregated; the earlier questions are ultimately answered from that aggregated data.

Broken down by specialized track, the work includes audio testing, video testing, QoS testing, and performance and compatibility testing. On top of these dimensions, timely comparative testing against competing products answers the four questions above comprehensively.

Audio testing is mainly divided into three parts: subjective testing, objective testing, and POLQA testing.

Subjective testing is mainly subjective listening, aimed at the optimization and tuning of audio algorithms; it focuses on anomalies such as echo, volume problems, and noise, as well as delay and audio/video synchronization in single-talk and double-talk scenarios. Objective testing and POLQA mainly record objective indicators, such as audio parameters (bit rate, delay, volume, POLQA score, etc.). These dimensions usually cover different network and business scenarios, tuning of different algorithms, and coverage testing across different devices.
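As one concrete example of an objective audio indicator, the volume of a recorded clip can be reduced to an RMS level in dBFS. A minimal sketch, assuming a mono 16-bit WAV capture (POLQA itself requires licensed tooling, so it is not shown):

```python
# Minimal sketch: RMS level (in dBFS) of a 16-bit PCM clip, one of the
# objective volume indicators mentioned above. Assumes a mono WAV file.
import wave
import numpy as np

def rms_dbfs(wav_path: str) -> float:
    with wave.open(wav_path, "rb") as wf:
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    samples = pcm.astype(np.float64) / 32768.0   # normalize to [-1, 1)
    rms = np.sqrt(np.mean(samples ** 2))
    return float(20 * np.log10(max(rms, 1e-10)))  # 0 dBFS = full scale

print(f"RMS level: {rms_dbfs('capture.wav'):.1f} dBFS")  # hypothetical file
```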

Video evaluation is similar to audio testing and generally divides into three parts: subjective testing, objective testing, and vMOS, with an offline test added for the codec.

Subjective tests cover clarity and fluency, as well as delay and audio/video synchronization. Objective parameters mainly include video-related parameters (resolution, bit rate, frame rate, and stutter statistics) and vMOS. Codec offline testing covers PSNR and SSIM, as well as VMAF, which is popular nowadays.
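Of the offline codec metrics, PSNR is simple enough to compute directly and SSIM is available in scikit-image; VMAF normally requires Netflix’s libvmaf (for example via ffmpeg) and is omitted here. A sketch, assuming two 8-bit grayscale frames of identical shape:

```python
# Sketch: per-frame PSNR and SSIM between a reference frame and the decoded
# frame, as used in offline codec evaluation. Frames are assumed to be
# 8-bit grayscale numpy arrays of identical shape.
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref: np.ndarray, dist: np.ndarray) -> float:
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(255.0 ** 2 / mse))

def ssim(ref: np.ndarray, dist: np.ndarray) -> float:
    return structural_similarity(ref, dist, data_range=255)
```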

QoS testing is not a separate evaluation dimension; it is better described as coverage testing of user scenarios. The network carries the business, but a real user’s network is neither perfectly ideal nor hopelessly bad. The actual lever for this testing is still the audio and video evaluation indicators; on that basis, we cover various weak-network and extreme-network conditions, watching audio/video quality while also examining congestion control and bandwidth estimation, and how the video model cooperates with rate adaptation. This part outputs network-related baselines and the limits of the system’s capability.
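To keep that coverage repeatable, the weak-network conditions can be pinned down as a fixed scenario table rather than ad-hoc settings. A hypothetical example (the values are purely illustrative, not a recommended test matrix):

```python
# Hypothetical weak-network presets used to make QoS coverage repeatable.
# Tuple fields: (loss %, delay ms, jitter ms, bandwidth kbit/s).
WEAK_NET_SCENARIOS = {
    "good":          (0,  30,  5,  4000),
    "mild_loss":     (5,  50,  10, 2000),
    "heavy_loss":    (20, 100, 30, 1000),
    "high_latency":  (2,  400, 50, 2000),
    "low_bandwidth": (2,  80,  20, 300),
}
```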

The final effect of the audio/video evaluation and QoS evaluation above shows up in QoE. What is QoE? Put bluntly, it is what your ears hear and your eyes see, so it maps directly to user experience. For real-time audio/video scenarios it mainly covers real-time interaction (end-to-end delay, first-frame time), video clarity and fluency, and audio clarity and fluency (which directly determine the intelligibility of the audio).
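As a small sketch of how two of these QoE indicators reduce to timestamps, assuming sender and receiver clocks are synchronized (e.g. via NTP) and with hypothetical event names:

```python
# Sketch: deriving QoE timing indicators from event timestamps. Assumes
# synchronized sender/receiver clocks; event names are hypothetical.
from statistics import mean
from typing import List

def first_frame_time(join_ts: float, first_render_ts: float) -> float:
    """Seconds from joining the call to the first rendered remote frame."""
    return first_render_ts - join_ts

def avg_end_to_end_delay(capture_ts: List[float], render_ts: List[float]) -> float:
    """Mean capture-to-render delay over matched frames, in seconds."""
    return mean(r - c for c, r in zip(capture_ts, render_ts))
```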

People cannot always walk with their heads down; they must also look up at the sky, and products are the same. We not only need to deliver our own features but also need to see how competitors are doing, because what ultimately decides whether a product sells and whether customers buy is competitiveness.

In many cases, when concrete acceptance criteria are hard to define, a conclusion drawn from comparing indicators against competing products is a good alternative. If we beat the competitor, we are relatively successful; if we trail, we think about how to optimize. Up to this point everything seems smooth: the test framework is ready, so can’t we just go fill in the data? In fact, several problems still have to be solved.

The measurement dimensions are in place and the data indicators are defined; the next step is collecting the data, which requires a stable test environment. The test environment essentially simulates an end-to-end real-time audio/video communication system, covering capture, the network, rendering and display, and also the observer (in practice, the tester). Each of these introduces measurement error, and these are the pitfalls in the testing process.

First, capture. Video images from the same camera placed in even slightly different positions cannot be identical, and that is only one of the errors capture introduces. Moreover, stutter in camera capture strongly affects the final user-experience assessment, because it becomes impossible to tell whether a stutter comes from the network or from capture.

Camera placement, focus and lighting conditions, camera angle, and depth of field have the greatest impact on the captured image and rendering performance.

The experience degradation introduced by capture includes the following:

1. Capture stutter;

2. Color cast and blur in the captured image;

3. Abrupt image changes caused by over-exposure.

The picture above shows the difference in viewing angle between different camera models placed in nearly the same position.

Comparing the video after network impairment against the stable picture captured by the camera confirms that a stutter was not introduced by the camera, but by insufficient resilience to the network. Speaking of network resilience, different network topologies and fluctuations affect test results. Packet loss and jitter in the network itself introduce extra experience degradation, which hinders problem tracking and reproduction, so a stable, unified weak-network simulation environment is particularly important.

To address this, we introduced a weak-network simulation scheme based on TC (shown in the figure below).
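For reference, a minimal sketch of what such a TC-based impairment can look like, driven from Python. The interface name and numbers are placeholders, it must run as root on the Linux box sitting on the media path, and netem’s built-in rate limiting assumes a reasonably recent kernel:

```python
# Minimal sketch of a tc/netem weak-network setup on the Linux machine
# in the media path. Interface and values are placeholders; requires root.
import subprocess

IFACE = "eth0"  # placeholder interface name

def apply_impairment(delay_ms=100, jitter_ms=20, loss_pct=5, rate_kbit=1000):
    # netem handles delay/jitter/loss; its rate option limits bandwidth
    subprocess.run(["tc", "qdisc", "replace", "dev", IFACE, "root", "netem",
                    "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
                    "loss", f"{loss_pct}%",
                    "rate", f"{rate_kbit}kbit"], check=True)

def clear_impairment():
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)
```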

The error introduced by display mainly comes from differences between display devices. The figure makes it clear that the color temperature of the same image varies greatly across monitors.

The ultimate effect of audio and video is an end-to-end experience, which inevitably comes down to people’s subjective perception. There are a thousand Hamlets in a thousand readers’ eyes: the observer’s personal perspective, the environment at the time, and individual mood swings all influence the subjective result.

To address this, for scenarios involving subjective evaluation we record comparison audio/video data (new version vs. old, or vs. competing products), which can later be used for subjective evaluation and scoring. Our in-house scoring platform makes online scoring convenient.
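Once the recorded clips have been rated on the platform, the per-clip ratings reduce to a mean opinion score with a confidence interval. A minimal sketch of that aggregation (purely illustrative, not the platform’s actual method):

```python
# Sketch: aggregating per-clip subjective ratings (1-5 scale) into a MOS
# with a 95% confidence interval. Assumes at least two raters per clip.
import math
from statistics import mean, stdev
from typing import List, Tuple

def mos(ratings: List[float]) -> Tuple[float, float]:
    m = mean(ratings)
    ci = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return m, ci

score, ci95 = mos([4, 5, 4, 3, 4, 5, 4, 4])  # hypothetical ratings
print(f"MOS = {score:.2f} +/- {ci95:.2f}")
```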

Based on the testing framework and problems above, we built the following evaluation setup to eliminate the modules that might introduce error.

1. The first part, a camera (microphone) plus a splitter, lets the content captured by a single camera (microphone) be fed to multiple PC devices simultaneously, so that the camera input on every PC is exactly the same and synchronized;

2. The receiving end’s display is mirrored to a large 4K TV through 4K video splitting equipment, and a copy can be saved to external storage for backup and later subjective MOS scoring;

3. The network and media server in the middle are built as a private environment with a dedicated test setup, unaffected by the egress network or the path from the lab to the media server. This setup is low-cost and offers good value for money.

I believe that with all this work done, we can, along the evaluation dimensions, answer the questions the boss raised.