Voice networking has been around for decades, but the recent “interactive podcasting” scene has put audio interaction back in the spotlight. How do you provide a good audio interactive experience? How do you optimize sound quality? How do you deal with the network challenges of global transmission? How do you make the sound more pleasant once high sound quality is already achieved? Starting today, we will answer these questions one by one, at multiple levels, in a series called “Low Latency, High Sound Quality in Detail.”

Bill Gates has followed in Elon Musk’s footsteps with an “interactive podcast”, and many teams are now adding audio social scenarios to their products. The scenario may look simple to implement, but delivering the same high-quality experience to users in different countries is not so easy.

In this series we will explain the technical principles and optimization ideas behind high sound quality and low latency, covering encoding and decoding, noise-reduction and echo-cancellation algorithms, network transmission, and sound-quality optimization.

Today we are going to talk about voice codecs. Before we get into them, it helps to understand how audio codecs work, so it is easier to see what affects the listening experience.

Speech coding and music coding

Audio encoding refers to the process of converting an audio signal into a compressed code stream (as shown below). In this process, the audio signal is analyzed to produce specific parameters, which are then written into the stream according to agreed-upon rules; this stream of bits is also known as a bitstream. After receiving the bitstream, the decoder restores the parameters according to the same rules and then uses them to reconstruct the audio signal.

Image from: Earlham College
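
To make the encoder/decoder contract more concrete, here is a minimal Python sketch of the “write parameters according to agreed rules, read them back the same way” idea. The parameter names, layout, and values are invented purely for illustration; no real codec uses this format.

```python
import struct

# Hypothetical per-frame parameters an analysis stage might produce (illustration only):
# one gain index, one pitch lag, and eight quantized spectral-envelope values.
FRAME_FORMAT = "<Bh8b"

def encode_frame(gain_idx, pitch_lag, envelope):
    """Write the parameters into the bitstream in the agreed order."""
    return struct.pack(FRAME_FORMAT, gain_idx, pitch_lag, *envelope)

def decode_frame(payload):
    """Read the parameters back using the same rules, ready to reconstruct the signal."""
    gain_idx, pitch_lag, *envelope = struct.unpack(FRAME_FORMAT, payload)
    return gain_idx, pitch_lag, envelope

frame = encode_frame(gain_idx=12, pitch_lag=160, envelope=[3, 5, 2, 0, -1, -2, -4, -6])
print(len(frame), "bytes per frame ->", decode_frame(frame))
```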

Audio codecs have a long history of development. Early codecs were built around nonlinear quantization, a relatively simple algorithm by today’s standards. Its compression efficiency is not high, but it suits the vast majority of audio types, including both voice and music. Later, as technology advanced and the division of labor among codecs became more refined, codec evolution split into two paths: speech codecs and music codecs.

Speech codecs, which mainly encode speech signals, gradually evolved toward a time-domain linear-prediction framework. Drawing on the way the vocal tract produces sound, the speech signal is decomposed into primary linear prediction coefficients and a secondary residual signal. **The linear prediction coefficients need very few bits to encode, yet efficiently build the “skeleton” of the speech signal; the residual signal is the “flesh and blood” that fills in the details.** This design greatly improves compression efficiency for speech, but within a limited complexity budget such a time-domain linear-prediction framework cannot encode music signals well.
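
To see the “skeleton plus flesh and blood” idea in numbers, here is a minimal NumPy/SciPy sketch that fits linear prediction coefficients to one synthetic 20 ms frame and measures how much energy is left in the residual. It illustrates the general principle only, not the analysis used by any particular codec.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(signal, order=10):
    """Solve the Yule-Walker equations R a = r for the prediction coefficients."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

# A crude stand-in for one voiced 20 ms frame: a decaying tone plus a weaker overtone.
fs = 16000
t = np.arange(fs // 50) / fs
frame = np.sin(2 * np.pi * 220 * t) * np.exp(-30 * t) + 0.3 * np.sin(2 * np.pi * 700 * t)

a = lpc(frame, order=10)                                       # the "skeleton"
residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)  # the "flesh and blood"

print("prediction gain (dB):",
      10 * np.log10(np.sum(frame ** 2) / np.sum(residual ** 2)))
```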

Music codecs, which encode music signals, took a different evolutionary path. Compared with the time-domain signal, the frequency-domain representation concentrates the information in a small number of frequency points, which makes it easier for the encoder to analyze and compress. So music codecs basically choose to encode the signal in the frequency domain.
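
A quick way to see why the frequency domain suits music is to check how concentrated the spectrum is. The toy example below (a synthetic three-note chord; the signal is ours, not from any codec) measures how much of the total energy falls into just a handful of FFT bins.

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs                       # one second of a toy "music" signal
chord = sum(np.sin(2 * np.pi * f * t) for f in (262.0, 330.0, 392.0))  # C major triad

spectrum = np.abs(np.fft.rfft(chord)) ** 2   # energy per frequency bin
strongest = np.sort(spectrum)[::-1]
share = strongest[:10].sum() / spectrum.sum()
print(f"{len(spectrum)} bins, {share:.1%} of the energy sits in the 10 strongest bins")
```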

Later, as the technology matured, the two codec architectures came together again in hybrid speech/music codecs, such as Opus, the default codec in WebRTC. This kind of codec fuses the two encoding frameworks and automatically switches to the framework that suits the signal type. Opus is used in well-known products at home and abroad, such as Discord.

What affects the interactive experience in speech coding?

Discussions of voice codecs generally mention technical indicators such as sampling rate, bit rate, complexity, and anti-packet-loss capability. What are these indicators, and how do they affect the audio experience?

You may have seen statements like “the higher the sampling rate, the better the sound” and “the higher the coding complexity, the better”, but that’s not true!

1. Sampling rate

Converting an analog signal that the human ear can hear into a digital signal that a computer can process requires sampling. Sound can be decomposed into a superposition of sine waves of different frequencies and intensities, and sampling can be thought of as collecting points along the sound wave. The sampling rate is the number of points collected per second; the higher the sampling rate, the less information is lost in the conversion, that is, the closer the result is to the original sound.

The sampling rate determines the resolution of the audio signal. Within the range of human hearing, the higher the sampling rate, the more high-frequency components are retained and the clearer and brighter the signal sounds. For example, when we make a traditional phone call, the other party’s voice often sounds dull. This is because the traditional telephone samples at 8 kHz and keeps only the low-frequency information needed for intelligibility; many high-frequency components are lost. Therefore, for a better audio interactive experience, we want to raise the sampling rate as much as is useful within the range of human hearing.
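
The effect can be illustrated with a small experiment: mix a low-frequency tone with a 6 kHz component, then resample to the 8 kHz telephone rate. The anti-aliasing filter has to discard everything above the new 4 kHz Nyquist limit, so the “bright” component simply disappears. (The signal and numbers below are illustrative only.)

```python
import numpy as np
from scipy.signal import decimate

fs_high = 32000
t = np.arange(fs_high) / fs_high
# A toy "voice": a 300 Hz fundamental plus a 6 kHz component carrying brightness.
signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)

narrowband = decimate(signal, 4)   # 32 kHz -> 8 kHz; anything above 4 kHz is filtered out

def band_energy(x, fs, lo, hi):
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return spec[(freqs >= lo) & (freqs < hi)].sum()

print(f"energy near 6 kHz at 32 kHz: {band_energy(signal, fs_high, 5500, 6500):.0f}")
print(f"energy near 6 kHz at  8 kHz: {band_energy(narrowband, 8000, 5500, 6500):.0f}")
```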

2. Bit rate

After sampling, the sound has been converted from an analog signal into a digital signal. The bit rate is the amount of data per unit time in that digital signal.

**Bit rate determines how faithfully the details of the audio signal are reproduced after encoding and decoding.** The codec assigns the available bits to the parameters output by each analysis module in order of priority. When the bit rate is limited, the codec guarantees the encoding of the parameters that have a greater impact on speech quality and abandons some parameters that matter less. At the decoding end, because the available parameters are incomplete, the reconstructed speech signal inevitably suffers some damage. In general, for the same codec, the higher the bit rate, the less damage after encoding and decoding. However, a higher bit rate is not always better: on the one hand, the relationship between bit rate and quality is not linear, and beyond the “quality sweet spot” further increases in bit rate bring little improvement; on the other hand, in real-time interaction an excessively high bit rate may occupy too much bandwidth and cause network congestion, which in turn leads to packet loss and damages the user experience.

Quality sweet spot: in video, the quality sweet spot refers to the best subjective quality experience achieved by choosing a reasonable resolution and frame rate for a given bit rate and screen size. It is a similar story in audio.
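
Some back-of-the-envelope arithmetic makes these numbers tangible. Assuming 20 ms frames (a common choice for real-time voice codecs, used here only as an illustration), the payload per frame and the compression ratio relative to raw 32 kHz / 16-bit PCM work out as follows:

```python
# Illustrative bit-rate arithmetic; the 20 ms frame length is an assumption, not a spec.
frame_ms = 20
raw_pcm_kbps = 32000 * 16 / 1000          # 32 kHz, 16-bit mono PCM = 512 kbps

for coded_kbps in (8, 16, 24, 32, 64):
    bits_per_frame = coded_kbps * frame_ms  # kbps * ms = bits
    print(f"{coded_kbps:>3} kbps -> {bits_per_frame / 8:>4.0f} bytes per 20 ms frame, "
          f"{raw_pcm_kbps / coded_kbps:>5.1f}x smaller than raw PCM")
```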

3. Coding complexity

Coding complexity is generally concentrated in the signal-analysis module at the encoding end. Generally speaking, the more detailed the analysis of the speech signal, the higher the potential compression rate, so coding efficiency and complexity are correlated to some extent. As with bit rate, the relationship between coding complexity and codec quality is not linear; there is also a “quality sweet spot” between the two. Whether a high-quality codec algorithm can be designed under a limited complexity budget often directly determines whether the codec is usable in practice.

4. Anti-packet-loss capability

First, what is the principle behind packet-loss resilience? When we transmit audio data, packets will be lost. If the current packet is lost, we hope to estimate the rough content of the lost frame by some means and then decode a speech frame similar to the original signal from that incomplete information. Of course, guessing alone usually does not give good results; if the previous or the next packet can tell the decoder some key information about the lost packet, then the more such information there is, the easier it is for the decoder to recover the lost speech frame. This “key information” carried in the preceding or following packet is what we will refer to below as “inter-frame redundancy”. (We have discussed countermeasures against packet loss in more detail before.)

Anti-packet-loss capability and coding efficiency are therefore at odds: improving coding efficiency usually means minimizing inter-frame redundancy, while packet-loss resilience relies on a certain amount of inter-frame redundancy so that the current speech frame can be recovered from the preceding and following frames when its packet is lost. In real-time interactive scenarios the user’s network is unreliable (the user may walk into an elevator or ride in a speeding car), so packet loss and delay jitter are ubiquitous, and a codec’s anti-packet-loss capability is indispensable. How to balance coding efficiency and anti-packet-loss capability therefore also requires careful algorithm design and extensive polishing and verification.
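
As a toy illustration of inter-frame redundancy (not Agora’s or Opus’s actual scheme), the sketch below lets each packet carry a coarse copy of the previous frame, so the receiver can patch a lost frame from its surviving neighbor.

```python
import random

def sender(frames):
    """Each packet carries its own frame plus a coarse, low-bit-rate copy of the previous one."""
    packets, prev_coarse = [], None
    for seq, frame in enumerate(frames):
        packets.append({"seq": seq, "frame": frame, "redundant_prev": prev_coarse})
        prev_coarse = f"coarse({frame})"   # stand-in for a low-bit-rate re-encoding
    return packets

def receiver(packets, loss_rate=0.3):
    delivered = [p for p in packets if random.random() > loss_rate]   # simulate packet loss
    decoded = {p["seq"]: p["frame"] for p in delivered}               # normal decoding
    for p in delivered:                                               # patch holes from redundancy
        if p["seq"] - 1 not in decoded and p["redundant_prev"] is not None:
            decoded[p["seq"] - 1] = p["redundant_prev"]
    return dict(sorted(decoded.items()))

random.seed(1)
print(receiver(sender([f"frame{i}" for i in range(10)])))
```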

How to balance audio experience with technical metrics?

So what does Agora do? Our engineers combined these considerations to create Agora Nova (Nova), an HD voice codec for real-time communication.

32 kHz sampling rate

First of all, Nova uses a higher sampling rate of 32 kHz, rather than the 8 kHz or 16 kHz used by other voice codecs, which gives Nova a big head start in call quality. Although the 16 kHz sampling rate commonly used in the industry (for example, by WeChat) meets the basic requirements of speech intelligibility, some speech details still need a higher sampling rate to capture. We wanted to provide higher-resolution voice calls that not only ensure intelligibility but also improve clarity, which is why we chose 32 kHz.

Optimizing coding complexity

The higher the sampling rate, the clearer the speech, but also the more sample points must be analyzed, encoded, and transmitted per unit time, which drives up the coding bit rate and complexity. That increase inevitably puts pressure on user bandwidth and on device performance and power consumption, which is not what we want. Therefore, through theoretical derivation and a large number of experiments, we designed a simplified coding scheme for the high-frequency speech components. With only a small increase in analysis complexity, it can encode the high-frequency signal with as little as 0.8 kbps (depending on the technique, representing the high-frequency signal has traditionally cost an extra 1~2 kbps), greatly increasing the clarity of the speech signal.
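
To get a feel for what 0.8 kbps can carry: at 50 frames per second it amounts to just 16 bits per 20 ms frame. The toy sketch below spends those 16 bits on a very coarse high-band energy envelope (four sub-bands, four bits each). This is purely our illustration of the general bandwidth-extension idea, not Nova’s actual algorithm.

```python
import numpy as np

def encode_highband_envelope(frame, fs=32000, bands=4):
    """Quantize the log energy of a few 8-16 kHz sub-bands to 4 bits each:
    4 bands * 4 bits = 16 bits per 20 ms frame, i.e. about 0.8 kbps."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    edges = np.linspace(8000, 16000, bands + 1)
    codes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        energy = spec[(freqs >= lo) & (freqs < hi)].sum() + 1e-12
        codes.append(int(np.clip(np.round(np.log2(energy)), 0, 15)))  # 4-bit index
    return codes

frame = 0.05 * np.random.randn(640)      # one toy 20 ms frame at 32 kHz
print(encode_highband_envelope(frame))   # four 4-bit indices describing the high band
```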

Balancing packet-loss resilience and coding efficiency

To guarantee anti-packet-loss capability, we likewise chose the most balanced scheme under the premise of preserving coding efficiency. Experiments show that this scheme ensures compression efficiency while also guaranteeing the recovery rate when packets are lost. In addition to Nova, we also developed and launched voice codecs such as Solo and SoloX, which offer stronger packet-loss resistance for unstable network environments.

Agora Nova vs. Opus

Nova offers a variety of modes for different scenarios, such as adaptive mode, high-quality mode, low-power high-quality mode, ultra-high-frequency mode, and ultra-low-bit-rate mode.

Compared with the advanced open-source codec Opus at the same common speech-coding bit rates, Nova retains 30% more usable spectrum information, thanks to its efficient signal-processing algorithms. Under both subjective and objective evaluation, Nova’s voice coding quality is higher than Opus’s:

  • At the objective level, the encoded and decoded corpora of the two codecs were scored with the objective quality-evaluation algorithm defined in the ITU-T P.863 standard, and Nova consistently scored slightly higher than Opus.
  • At the subjective level, speech encoded and decoded with Nova is closer to the original than speech processed with Opus, which comes across as a more transparent sound with less quantization noise.

Thanks to this HD voice codec, the Agora SDK provides users worldwide with a consistent, high-quality audio interactive experience. In fact, the quality of a voice call is not determined by the codec alone; it is also greatly affected by other modules, such as echo cancellation, noise reduction, and network transmission. In the next issue, we will cover Agora’s best practices in echo cancellation and noise reduction algorithms.