Introduction: With the spread of 5G networks and the impact of the pandemic, real-time audio and video technology is being applied in more and more scenarios, including meetings, phone calls, audio and video calls, online education, and telemedicine. These real-time interactive scenarios place increasingly high demands on RTC audio quality. How to test RTC audio and ensure good transmission quality by building an objective, standardized, and repeatable evaluation system has become an urgent and important topic.

Ma Jianli, senior audio and video test engineer at NetEase Yunxin

Ideal communication model

Face-to-face communication in daily life generally works well. In a quiet laboratory, where interference from the environment is reduced, it comes close to the ideal communication effect. Abstracting this model, it has the following characteristics:

Quiet environment: Background noise at the NR15 level is comparable to an extremely quiet night. The listener is not disturbed by other sounds and can concentrate on the target voice.

Suitable reverberation for listening: Reverberation usually reduces the listener’s comprehension. The greater the reverberation, the longer the tail of the speech and the lower the intelligibility. For example, a concert hall with strong reverberation flatters musical instruments and singing, but it is detrimental to spoken communication.

Speaking clearly and naturally: The speaker is in excellent mental and physical condition, articulating clearly, with balanced frequency, fluent pronunciation, and speaking at a moderate speed.

Moderate volume: Studies have shown that volume has a significant effect on perceived sound quality: all other things being equal, the higher the volume, the better the subjective listening impression. A speaker who speaks loudly improves intelligibility for the listener to some extent.

Timely response and smooth communication: In real-time communication (RTC), delay is also a very important indicator. Generally speaking, a delay within 200 ms causes no obvious obstruction or lag in subjective perception. Between 200 ms and 400 ms, communication can still proceed normally, and above 400 ms lag becomes noticeable. In face-to-face communication, the delay is only about 3 ms.
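The delay bands above, and the roughly 3 ms face-to-face figure, can be illustrated with a short sketch. The function names, the band labels, and the treatment of exactly 200 ms and 400 ms as belonging to the lower band are assumptions made for illustration. The speed of sound is taken as about 343 m/s, and a speaking distance of 1 m is assumed.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate speed of sound in air at room temperature


def acoustic_delay_ms(distance_m: float) -> float:
    """One-way propagation delay of sound travelling distance_m through air."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


def classify_delay(delay_ms: float) -> str:
    """Map an end-to-end delay to the perception bands quoted in the text.

    The exact boundary handling (<= 200, <= 400) is an assumption.
    """
    if delay_ms < 0:
        raise ValueError("delay must be non-negative")
    if delay_ms <= 200:
        return "no obvious lag"
    if delay_ms <= 400:
        return "usable"
    return "noticeable lag"


if __name__ == "__main__":
    # Two people talking about 1 m apart (assumed distance).
    print(round(acoustic_delay_ms(1.0), 1), "ms")
    for delay in (3, 150, 300, 450):
        print(delay, "ms:", classify_delay(delay))
```

At an assumed distance of 1 m the propagation delay comes out just under 3 ms, which is consistent with the face-to-face figure quoted above.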

The RTC audio link

The picture above shows two people communicating in real time through RTC. Speaker A starts to speak, and the voice passes through air transmission, microphone capture, A/D conversion, enhancement processing (noise reduction, echo cancellation, volume control, reverb), encoding, packetization and transmission, decoding at the receiver, NetEQ, and D/A conversion before downlink playback, at which point B hears the sound. This is the complete sound transmission path in the simplex state.

Compared with the ideal communication model, the actual RTC link is subject to many kinds of interference, including environmental, hardware, link, and network effects, and each stage may degrade audio quality. Together, these influences can lead to the following kinds of sound problems.

  • Volume problems: no sound, low volume, clipping and harshness caused by excessive volume, and sudden changes in volume.
  • Echo problems: leaked echo, residual echo, and speech damage such as suppressed, cut-off, or intermittent speech.
  • Noise problems: residual noise, and residual noise that is not stable.
  • System-introduced problems: noise, electrical (current) noise, and pops.
  • Sound quality problems in the narrow sense: blurry, distorted, muffled, or sharp speech, and mechanical-sounding voice.
  • Network problems: stuttering, intermittent audio, fast playback, slow playback, and mechanical-sounding voice.

Subjective test method

In the earliest subjective tests, two people talked on a call. A and B established an RTC link and reproduced real user scenarios by speaking in turn or at the same time. Three dimensions were the main concern:

Listening quality: The sound quality perceived by the listener in the simplex case. For example, the quality of the sound heard by A or by B is listening quality. Listening quality describes voice quality in most cases and is the most basic part. The objective evaluation methods currently used in the industry are essentially based on listening quality.

Talking quality: The sound quality perceived by the speaker while speaking, which is related to echo, sidetone masking, and the local environment.

Conversation quality: In addition to the listening quality and talking quality of A and B, conversation quality also depends on the duplex call. The main influencing factors include echo, double talk, and end-to-end delay.

Dimensions of subjective testing

The points that subjective testing should pay attention to are shown in the figure above. They can be divided into sound quality, timbre, volume, delay, echo, and noise reduction.

Timbre

Timbre, also known as tone color, is the perceived character of a sound. It is mainly determined by the frequency spectrum of the sound. In the RTC link, the frequency response is mainly affected by the frequency characteristics of the microphone, intermediate processing such as EQ and high-pass and low-pass filtering, the volume control algorithm (DRC/AGC), and the frequency response of the speaker or earphone. Vocal frequency distribution also differs from person to person. Generally speaking, men’s voices have more low-frequency content and sound thick or muffled, while women’s and children’s voices have more high-frequency components and sound bright or even sharp.

Sound quality

Sound quality is divided into three dimensions: clarity, fluency, and naturalness.

  • Clarity is also known in audio as intelligibility. It refers to how well the semantic content can be understood, and many factors affect it. For example, mixed-in noise makes speech unclear and reduces intelligibility, and strong reverberation gives speech a long tail that blurs it.
  • Fluency indicates the degree of continuity of speech. The direct influencing factors are: a poor network environment, which causes discontinuous speech, stuttering, and dropped words; QoS adjustment, which causes sound to be played too fast or too slow; and speech damage caused by echo cancellation and noise reduction algorithms.
  • Naturalness indicates the degree of similarity to the original speech. Typical problems affecting naturalness include distortion introduced by algorithm processing, nonlinear distortion of the loudspeaker, and clipping or overload caused by excessive amplification.

Volume

For RTC SDK providers, the biggest challenge is the diversity of devices: different platforms (Mac, Windows, Android, iOS, Web), different models, and different peripherals all differ greatly in capture and playback volume. The goal of the volume control strategy is consistency across devices and platforms, so that the user hears the sound at sufficient volume without significant damage or degradation.

Noise

The purpose of a noise reduction algorithm is to remove the noise introduced by the environment or the equipment, restore the human voice as much as possible, and improve the signal-to-noise ratio. In practice, a noise reduction algorithm inevitably damages sound quality to some degree while processing noise. Therefore, noise reduction is evaluated mainly from two aspects:

  • Noise suppression level, including convergence time, suppression strength, and stability of the residual noise.
  • Degree of speech impairment.

A good noise reduction algorithm strikes a balance between the two, suppressing noise effectively without obvious speech damage.

Echo

Echo cancellation is an important module in the RTC link that eliminates device echo and ensures a smooth call experience. Echo is also evaluated from two aspects:

  • Strength of echo suppression: is there any residual echo?
  • Damage to near-end speech.

In RTC applications, echo is also closely tied to the device, platform, model, and peripherals, so echo testing needs to cover the top device models.
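The first point, strength of echo suppression, is often quantified with echo return loss enhancement (ERLE). The article does not name this metric, so it is used here only as an illustrative stand-in. The sketch below assumes float samples, and it assumes that `mic` and `residual` are the signals before and after the echo canceller during a far-end single-talk segment. All names are hypothetical.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a block of samples."""
    if not samples:
        raise ValueError("empty signal")
    return sum(s * s for s in samples) / len(samples)


def erle_db(mic: Sequence[float], residual: Sequence[float]) -> float:
    """ERLE in dB: echo power before the canceller over power after it.

    Only meaningful on far-end single-talk segments, where the microphone
    signal contains echo and no near-end speech.
    """
    p_mic = mean_power(mic)
    p_residual = mean_power(residual)
    if p_mic == 0.0:
        raise ValueError("microphone signal is silent, ERLE is undefined")
    if p_residual == 0.0:
        return math.inf  # echo removed completely
    return 10.0 * math.log10(p_mic / p_residual)
```

A higher value means stronger suppression. It says nothing about damage to near-end speech, which has to be assessed separately on double-talk segments.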

Delay

In network transmission, audio packet-loss resilience mechanisms such as FEC, RED, and ARQ, as well as anti-jitter mechanisms such as the jitter buffer, introduce extra delay. The resulting increase in end-to-end delay harms real-time communication and degrades the experience. Especially in scenarios that require low delay, end-to-end delay is an important index of weak-network performance.

Pain points of subjective testing

At present, mainstream evaluation of RTC audio relies mainly on subjective testing by listening, which demands a high level of professional skill and is inefficient. The main pain points are as follows:

  • Poor repeatability: It is difficult to keep two subjective tests consistent. The sound field, the speaker’s pronunciation, the volume, and the distance to the device all vary. With so many uncontrollable factors, accurate comparative results cannot be obtained.
  • Low test efficiency: A subjective test requires two people throughout. Long sessions cause fatigue whether listening or speaking, and the scene has to be switched for each use case, so test efficiency is very low.
  • Low test coverage: Because of the efficiency problem, testing can only cover a limited number of scenes and link combinations, and generally only key scenes can be guaranteed. The tester’s own voice is also limited, so a wider variety of human voices cannot be covered.
  • Strong influence of subjective factors: Sound perception is very subjective, and the same sound can be perceived differently by different people. Results from a single person may lead to biased conclusions. In addition, speaking and listening depend heavily on physiological and psychological state, and the same person may reach completely different judgments at different times.

To address these pain points, NetEase Yunxin has built a set of end-to-end objective evaluation methods for audio evaluation and testing, covering laboratory construction, environment simulation, capture and playback, and evaluation methods.

Standard laboratory

The above picture shows the acoustics laboratory of NetEase Yunxin. The main equipment and hardware configuration are as follows:

  • Head and torso simulator: a human body model with a built-in mouth simulator and a high-accuracy ear simulator (compliant with IEC 60318-4 / ITU-T Rec. P.57 Type 3.3). It reproduces the acoustic characteristics of the head and torso of an average adult for accurate binaural signal capture and mouth playback.
  • 4 hi-fi loudspeakers: form a uniform diffuse sound field for simulating and playing back noise environments of different scenes and SNRs.
  • Multi-channel sound card: supports simultaneous 8-input and 8-output capture and playback, which covers a wide range of audio test scenarios.
  • 4 electrical signal interfaces: support multi-person voice tests and echo single-talk and double-talk tests.

A professional audio testing laboratory meets the needs of automated audio testing, competitive product analysis and evaluation, and rapid comparison of baseline effects between versions. It yields repeatable objective test results and supports audio algorithm simulation and prototype verification. One person can also complete the 3A subjective tests alone: noise reduction, sound quality, and echo single-talk and double-talk tests. AI algorithms are increasingly common, and data is the key to them. With the acoustic laboratory and the noise simulation system, AI data can be collected and labeled automatically by scripts, which greatly reduces the cost of purchasing and labeling data. The current networking of the NetEase Yunxin acoustics laboratory is shown in the figure above. The laboratory makes development and testing more professional, mainly in the following aspects:

  • Automated testing: objective 3A automated tests, such as echo tests and noise tests, which can also simulate multi-party calls.
  • Automatic AI data collection: open-source speech and target noise are played through the artificial head and the noise playback system respectively and recorded on the target device or platform. Labels can be created during recording, so collection and labeling are handled at the same time.
  • Subjective testing: a quantified playback environment and a quiet listening environment.
  • Others: device model coverage testing, model adaptation, and verification of algorithm prototype optimizations.

Objective test standard

The lab mainly provides an objective and repeatable testing environment, and its hardware supports custom capture and playback. In addition, the NetEase Yunxin audio lab has introduced objective testing standards as the evaluation method for the final data. Audio test standards can be classified along several dimensions.

Subjective/objective

Subjective methods are based on human evaluation. Objective methods use a model to calculate and evaluate speech quality. A typical subjective evaluation standard is ITU-T P.800, and a typical objective evaluation method is PESQ.

With/without reference

Full reference and no reference (FR/NR) describe the type of measurement algorithm used. An FR algorithm takes two signals: the original signal and the degraded signal. An NR algorithm requires only the degraded signal. PESQ is a typical FR algorithm. A typical NR measurement is P.563, and the NR method is often referred to as a “single-ended” test.

Perceptual/non-perceptual

Perceptual measurement algorithms attempt to model human perception. Perceptual modeling is not only used for quality assessment: other well-known algorithms, such as MP3 and AAC, use perceptual models to compress music. Non-perceptual indicators are general physical or technical quantities, such as level or signal-to-noise ratio.
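As an illustration of non-perceptual indicators, the following minimal sketch computes an RMS level in dBFS and a signal-to-noise ratio. It assumes float samples normalized so that full scale is 1.0, and it assumes that separate speech-only and noise-only recordings are available. The function names are hypothetical.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a block of samples."""
    if not samples:
        raise ValueError("empty signal")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale (assumed to be 1.0)."""
    value = rms(samples)
    if value == 0.0:
        return -math.inf  # digital silence
    return 20.0 * math.log10(value)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB from separate speech and noise segments."""
    return level_dbfs(speech) - level_dbfs(noise)


if __name__ == "__main__":
    # Ten full cycles of a sine at half of full scale, and a quieter copy as "noise".
    tone = [0.5 * math.sin(2 * math.pi * k / 100) for k in range(1000)]
    hiss = [0.05 * math.sin(2 * math.pi * k / 100) for k in range(1000)]
    print(round(level_dbfs(tone), 2), "dBFS")
    print(round(snr_db(tone, hiss), 2), "dB SNR")
```

Indicators like these are cheap to compute and easy to reproduce, but they do not reflect how a listener perceives the degradation, which is what the perceptual models are for.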

Objective criteria based on perceptual models

The most classic and widely used objective metrics based on a perceptual model are the P.86x series of active objective speech quality standards, namely PESQ (P.862) and POLQA (P.863), which are typical full-reference speech evaluation standards. The general idea of PESQ/POLQA is as follows: the levels of the original (reference) signal and of the signal that has passed through the system under test are adjusted to a standard listening level, and an input filter then simulates a standard telephone handset.

The two signals, after level adjustment and filtering, are aligned in time and passed through an auditory transform. This transform includes compensation and equalization for linear filtering and gain changes in the system. The difference between the two transformed signals is treated as the disturbance. The disturbance surface is analyzed to extract two distortion parameters, which are accumulated over frequency and time and mapped to a predicted subjective mean opinion score (MOS). Compared with PESQ, POLQA includes many accuracy improvements that bring objective results closer to subjective results, and it is very widely used in voice evaluation.
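The first steps of this pipeline, level adjustment and time alignment, can be sketched in a few lines. This is not PESQ or POLQA: the handset filter, the auditory transform, and the mapping to MOS are all omitted, and a plain mean squared sample difference stands in for the perceptual disturbance. The target RMS of 0.1, the brute-force cross-correlation search, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def normalize_level(samples: Sequence[float], target_rms: float = 0.1) -> List[float]:
    """Scale a signal so that its RMS equals target_rms (assumed target)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)
    gain = target_rms / rms
    return [s * gain for s in samples]


def estimate_delay(reference: Sequence[float], degraded: Sequence[float], max_lag: int) -> int:
    """Lag in samples, from 0 to max_lag, with the highest cross-correlation."""
    best_lag, best_score = 0, -math.inf
    for lag in range(max_lag + 1):
        score = sum(r * d for r, d in zip(reference, degraded[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def mean_squared_difference(
    reference: Sequence[float], degraded: Sequence[float], max_lag: int = 50
) -> float:
    """Level-align, time-align, then average the squared sample difference."""
    ref = normalize_level(reference)
    deg = normalize_level(degraded)
    lag = estimate_delay(ref, deg, max_lag)
    pairs = list(zip(ref, deg[lag:]))
    if not pairs:
        raise ValueError("signals do not overlap after alignment")
    return sum((r - d) ** 2 for r, d in pairs) / len(pairs)
```

A degraded signal that is only delayed and attenuated yields a value close to zero, while added distortion such as clipping raises it. A real full-reference metric computes the difference in a perceptual domain, which is why its scores track subjective results far better than a raw sample difference does.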

Automated testing

POLQA automated testing

To reduce the influence of hardware capture and playback and of the acoustic path, an electrical link is used in network testing. The sending and receiving devices are each connected to the sound card with a 3.5 mm audio cable. In addition, a TC system provides the network impairment environment: the two devices under test connect to the TC router, and scripts control packet loss, delay, jitter, and bandwidth for the devices at both ends.
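The article does not show its TC scripts. As a rough sketch, assuming that "TC" refers to the Linux `tc` traffic control tool with the `netem` queueing discipline running on the router, a script could build impairment commands as follows. The helper name and the interface name `eth0` are hypothetical. The helper only builds the argument list; running it (for example with `subprocess.run`) requires root privileges on the router.

```python
from typing import List, Optional


def netem_command(
    device: str,
    delay_ms: Optional[int] = None,
    jitter_ms: Optional[int] = None,
    loss_percent: Optional[float] = None,
    rate_kbit: Optional[int] = None,
) -> List[str]:
    """Build a `tc qdisc replace ... netem` command for the given impairments."""
    cmd = ["tc", "qdisc", "replace", "dev", device, "root", "netem"]
    if delay_ms is not None:
        cmd += ["delay", f"{delay_ms}ms"]
        if jitter_ms is not None:
            # netem expresses jitter as a variation around the base delay.
            cmd.append(f"{jitter_ms}ms")
    elif jitter_ms is not None:
        raise ValueError("jitter requires a base delay")
    if loss_percent is not None:
        cmd += ["loss", f"{loss_percent}%"]
    if rate_kbit is not None:
        cmd += ["rate", f"{rate_kbit}kbit"]
    return cmd


if __name__ == "__main__":
    # 100 ms delay with 20 ms jitter, 10% packet loss, 1000 kbit/s bandwidth cap.
    print(" ".join(netem_command("eth0", 100, 20, 10, 1000)))
```

Using `replace` rather than `add` lets a test script switch from one impairment profile to the next without first deleting the previous one.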

As shown in the figure above, the test host sends the signal to test device A through the sound card. After local RTC audio processing, device A sends the signal over the network to receiving device B. During this process, the weak-network system adds network impairments of different types and degrees in real time. The sound card captures the output of device B, and the performance of the RTC weak-network countermeasure module is measured by comparing the captured signal with the original signal.

  • Supports interworking tests across Android, iOS, Windows, Mac, and Web clients.
  • Uses TC scripts to control the network environment automatically.
  • Uses APIs to automatically join a meeting, switch profiles, control parameters, and leave a meeting.
  • Automatically collects bit rate, packet loss, stuttering, and other information during the test as auxiliary criteria.
  • Runs with one click and generates a version baseline report.

3A objective automation

At present, NetEase Yunxin has built end-to-end 3A automated testing on top of the laboratory. The architecture block diagram is shown in the figure above. It is divided into a use case management layer, an API/UI control layer, capture and playback, automatic calibration, analysis and calculation, and data and reporting. It is mainly used for comprehensive evaluation of echo, noise, and volume control, and is currently used in version baseline testing, version iteration comparison, competitive product comparison, and other testing stages.

About the author

Ma Jianli, senior audio and video test engineer at NetEase Yunxin and a core member of the NetEase Yunxin Audio and Video Media Lab, is responsible for building the audio test quality system and for audio and video quality assurance.