Today I attended the talk "The Past, Present and Future of Agora's Real-time Voice Quality Monitoring System". Below I share some of my own understanding, based on my previous experience in audio processing.

Audio and speech are different. Audio generally refers to all sounds in nature that humans can hear; the audible range is usually given as 20 Hz to 20,000 Hz. Speech refers to the sound of a person speaking, and most of its spectral energy lies between roughly 300 Hz and 3,400 Hz. In other words, we can hear a much wider range of sounds than we can produce: musical instruments, sounds of nature, and shrill noises are all audible, but we cannot make them with our voices.

There are several reasons why we need quality assessment. Apart from face-to-face conversation, the audio in phone calls, video, music, and similar applications is encoded and compressed so that it can be transmitted and stored at lower cost. Other processing is applied as well, such as removing noise mixed into the original sound or enhancing the original voice. Whether it is codec processing or some other kind of speech processing, the goal is to make the sound more comfortable to listen to, so quality assessment evaluates how the processed sound is perceived by listeners.

Audio evaluation methods are divided into subjective evaluation and objective evaluation.

Subjective evaluation means having people score speech based on what they hear. Common methods are MOS, CMOS, and the ABX test. I often used A/B tests in my early work: when I made a small optimization to a speech enhancement algorithm, I would pair the speech processed by the original algorithm with the speech processed by the optimized one, and friends would listen to each pair and judge whether the result was better or worse.

The International Telecommunication Union (ITU) has standardized subjective evaluation of speech quality in ITU-T P.800.1. Absolute Category Rating (ACR) is the subjective method most widely used today: participants rate the overall quality of speech on a scale of 1 to 5, with a higher score indicating better quality. The MOS scale was later adopted for objective quality assessment as well. Generally, a MOS of 4 or higher is considered relatively good voice quality, while a MOS below 3.6 is basically unacceptable.
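To make the ACR procedure concrete, here is a minimal sketch of how a MOS could be computed from a panel's 1-to-5 ratings. The function name, the normal-approximation confidence interval, and the ratings themselves are assumptions for illustration; they are not taken from the ITU text or from the talk.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` are ACR scores, one integer from 1 to 5 per listener.
    The interval uses a normal approximation (an assumption of this sketch).
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        return mos, float("nan")  # no spread estimate from a single rating
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical ratings from eight listeners for one processed clip.
ratings = [4, 5, 4, 3, 4, 4, 5, 3]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With a panel this small the interval is wide, which is one reason subjective tests need many listeners and are expensive to run.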

Objective evaluation uses an algorithm instead of human listeners to score sound quality. It is divided into reference-based (intrusive) evaluation and no-reference (non-intrusive) evaluation.

  • The intrusive method, as the name implies, compares the processed audio against the original source material, so it can only be used for offline processing, not for real-time calls. Common examples include ITU-T P.861 (MNB), ITU-T P.862 (PESQ)[2], ITU-T P.863 (POLQA)[3], STOI[4], and BSSEval[5].
  • The non-intrusive method does not require the source material. Common examples include ITU-T P.563[6], ANIQUE+[7], and ITU-T G.107 (E-Model)[8], as well as deep-learning-based methods such as AutoMOS[9], QualityNet[10], NISQA[11], and MOSNet[12].
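The reason an intrusive method cannot run on a live call shows up in its interface: it needs the clean reference and the degraded signal, time-aligned. The sketch below uses a plain signal-to-noise ratio as a toy stand-in. It is not PESQ, POLQA, or any of the standards listed above, and the test tone is made up for illustration.

```python
import math

def snr_db(reference, degraded):
    """Toy full-reference measure: signal-to-noise ratio in dB.

    Both inputs must be time-aligned sample lists of equal length.
    This only illustrates the two-signal interface of intrusive methods.
    """
    if len(reference) != len(degraded):
        raise ValueError("signals must be time-aligned and of equal length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    if signal_power == 0:
        raise ValueError("reference signal is silent")
    if noise_power == 0:
        return math.inf  # degraded signal is identical to the reference
    return 10.0 * math.log10(signal_power / noise_power)

# Hypothetical input: a 1 kHz tone sampled at 8 kHz, and a copy scaled to 90%.
fs = 8000
reference = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]
degraded = [0.9 * x for x in reference]
print(f"SNR = {snr_db(reference, degraded):.1f} dB")
```

A non-intrusive method would take only `degraded` (or network statistics) as input, which is what makes it usable during a real call.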

The following table shows MOS test scores for the major voice codecs (from the official Opus website; the later results use MOS-9, i.e., a scale whose maximum score is 9).

Here is a closer look at PESQ and POLQA.

PESQ is a reference-based objective evaluation scheme that takes two audio signals as input: a reference signal, provided by the ITU, and the output signal produced when that reference passes through the VoIP system under test. The PESQ algorithm extracts characteristic parameters from both signals in the time-frequency or transform domain, computes the differences between them, and maps those differences through a cognitive model to obtain an objective sound quality score. The PESQ score is in effect a mapping onto the MOS scale.
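The last step, mapping onto the MOS scale, can be sketched as follows. To my knowledge ITU-T P.862.1 defines a logistic function that converts the raw P.862 score (roughly -0.5 to 4.5) into MOS-LQO; the constants below are quoted from memory and should be checked against the recommendation before being relied on. The function name is my own.

```python
import math

def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 score to MOS-LQO with a logistic curve.

    Constants are believed to match ITU-T P.862.1 (an assumption of this
    sketch; verify against the recommendation).
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))

# Sample raw scores spanning the nominal PESQ output range.
for raw in (-0.5, 1.0, 2.5, 4.5):
    print(f"raw PESQ {raw:+.1f} -> MOS-LQO {pesq_raw_to_mos_lqo(raw):.2f}")
```

The curve is monotonic, so ranking systems by raw score or by MOS-LQO gives the same order; the mapping only makes the numbers comparable with subjective MOS values.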

POLQA is a new-generation speech quality evaluation standard, suitable for fixed networks, mobile networks, and IP networks. It was adopted by the ITU-T (the Telecommunication Standardization Sector of the International Telecommunication Union) as Recommendation P.863 and can be used to assess voice quality for HD voice and for 3G, 4G, and 5G networks. It replaces and upgrades PESQ (ITU-T Recommendation P.862), which was released in 2001.

Compared with the older PESQ, POLQA has the following advantages:

  • It adds the ability to evaluate wideband and super-wideband voice quality, supporting sampling rates up to 48 kHz.

  • It supports the latest voice coding and VoIP transmission technologies and is specifically optimized for current codecs such as Opus and SILK.

  • It supports multilingual environments, including national languages. The ITU provides standard test corpora for targeted testing.

Of course, audio quality evaluation covers more than codecs. Other factors matter as well, such as VAD-driven transmission, packet loss concealment, changes in network quality (delay, jitter, packet loss), and even audio capture on the device.
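One way to see how delay and packet loss translate into a quality score is the E-Model (ITU-T G.107) mentioned earlier. The sketch below is a heavily simplified version under stated assumptions: the R-to-MOS conversion and the effective equipment impairment formula follow G.107 as I understand it, the delay term is a commonly cited approximation rather than the full G.107 expression, and the default values of `r0`, `ie`, and `bpl` are placeholders (real values are codec-specific). All function names are my own.

```python
def r_to_mos(r):
    """Convert an E-Model rating factor R into an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
    return max(1.0, mos)  # clamp the small dip below 1 at very low R

def delay_impairment(one_way_delay_ms):
    """Approximate delay impairment Id (simplified, not the full G.107 form)."""
    d = one_way_delay_ms
    extra = 0.11 * (d - 177.3) if d > 177.3 else 0.0
    return 0.024 * d + extra

def loss_impairment(loss_percent, ie=0.0, bpl=10.0, burst_r=1.0):
    """Effective equipment impairment Ie-eff for a given packet loss rate.

    ie and bpl depend on the codec; the defaults here are placeholders.
    """
    return ie + (95.0 - ie) * loss_percent / (loss_percent / burst_r + bpl)

def estimate_mos(one_way_delay_ms, loss_percent, r0=93.2):
    """Estimate MOS from delay and loss, starting from an assumed base R."""
    r = r0 - delay_impairment(one_way_delay_ms) - loss_impairment(loss_percent)
    return r_to_mos(r)

# Hypothetical network conditions: (one-way delay in ms, packet loss in %).
for delay_ms, loss in ((50, 0.0), (150, 1.0), (300, 5.0)):
    print(f"{delay_ms} ms, {loss}% loss -> MOS ~ {estimate_mos(delay_ms, loss):.2f}")
```

Because it needs only network statistics, this kind of model is non-intrusive and cheap, but it says nothing about capture or playback problems on the device.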

Both the reference-based and the no-reference methods above have limitations in practice, including narrow applicable scenarios, poor robustness, and high complexity. Overcoming these problems requires a quality assessment algorithm and system that covers many scenarios and runs with almost no noticeable performance cost, so Agora developed its own audio quality assessment method, which consists of uplink quality assessment and downlink quality assessment.

On the uplink, sound goes through capture, AEC (acoustic echo cancellation), NS (noise suppression), and AGC (automatic gain control). Uplink quality assessment therefore covers the device's capture stability and the effectiveness of echo cancellation, noise suppression, and gain control.

The downlink is what is ultimately played to the listener through the device. The path is codec, network transmission, weak-network countermeasures (which I understand to be VAD, PLC, error correction, and so on), and device playback. In tests across multiple weak-network conditions, devices, and modes, the error between the algorithm's score and POLQA is less than 0.15, which is a good result.
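The talk did not say how that error was measured. A minimal sketch, assuming it is a mean absolute error over a set of test clips, might look like this; the scores below are invented purely to show the calculation and are not Agora's results.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between two equal-length score lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)

# Made-up scores for five test clips (illustration only).
polqa_scores = [4.2, 3.8, 3.1, 2.5, 1.9]   # reference-based POLQA results
model_scores = [4.1, 3.9, 3.2, 2.4, 2.0]   # hypothetical no-reference estimates

mae = mean_absolute_error(model_scores, polqa_scores)
worst = max(abs(m - p) for m, p in zip(model_scores, polqa_scores))
print(f"mean absolute error = {mae:.3f}, worst case = {worst:.3f}")
```

Reporting the worst-case difference alongside the mean is a design choice of this sketch: an average can hide a few clips where the estimate is far off.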

As for the future of audio quality assessment, I personally believe it will develop toward more specialized fields. One dimension is content: speech assessment and music assessment should differ. Another is scenario: real-time online processing demands low latency and low resource consumption, while offline evaluation has looser constraints but needs higher accuracy, so it can make fuller use of AI and of optimizations to the algorithm system.