Last time we introduced the image formats commonly used in RTC communication; this time we will look at the audio formats commonly used in RTC communication.

1. Overview

What exactly are audio formats? To answer that, let's start with Baidu Encyclopedia's definition: an audio format describes how sound is digitized (and converted back to analog) so that it can be played or processed on a computer. Typical audio formats cover a bandwidth of up to 20 kHz, with sampling rates in the 40-50 kHz range, using linear pulse code modulation (PCM) in which every quantization step has the same size. The energy of human speech is mostly concentrated between 300 Hz and 3400 Hz, while the human ear can generally hear sounds from 20 Hz to 20,000 Hz, so besides speech we can hear many other sounds in nature, such as musical instruments and the sounds of the natural world.
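To make the relationship between those two frequency ranges and the sampling rate concrete, recall the Nyquist criterion: the sampling rate must be at least twice the highest frequency you want to capture. A quick back-of-the-envelope check in Python, using only the numbers quoted above:

```python
# Nyquist: to capture frequencies up to f, sample at >= 2 * f
speech_band_hz = 3_400     # upper edge of the main speech energy range (300-3400 Hz)
hearing_band_hz = 20_000   # upper edge of normal human hearing (20 Hz - 20 kHz)

print(2 * speech_band_hz)    # 6800  -> an 8 kHz "narrowband" rate comfortably covers speech
print(2 * hearing_band_hz)   # 40000 -> hence full-range rates of 44.1-48 kHz for music
```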

The development of communication has gone through several stages: carrier pigeons, beacon towers, written messages (the telegraph), voice calls, video calls, and now AR/VR. We have moved from text-only communication to today's audio and video communication, and as times change people are no longer satisfied with merely hearing each other's voices; there is now strong demand for better sound quality, stereo, and even spatial surround sound. A variety of audio formats is therefore needed to meet these real-world requirements.

2. Common audio formats

Looking at today's audio market, there are two main families of audio formats: lossless compression and lossy compression, and listening to the same material in different formats can reveal a large difference in sound quality. Lossless compression shrinks the file while preserving 100% of the data in the source, so the compressed file can be restored to exactly the same size and bit rate as the original. Lossy compression, by contrast, reduces the sampling rate and bit rate, trading some fidelity for an output file that is much smaller than the source.

1. MP3 — the most familiar name of all: MP3 (MPEG Audio Layer III) is a lossy compression format with a compression ratio of roughly 10:1 to 12:1. It keeps the low-frequency part of the signal largely intact while sacrificing quality in the 12 kHz to 16 kHz high-frequency range in exchange for a much smaller file. A piece of music stored as *.mp3 is generally only about 1/10 the size of the same piece stored as *.wav, which is also why its sound quality cannot match a CD or WAV file.
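The "about 1/10 the size" figure is easy to sanity-check. CD-quality PCM in a WAV file runs at 44.1 kHz with 16 bits per sample in stereo, while a typical MP3 might be encoded at 128 kbps (the 128 kbps figure is a common choice assumed here for illustration, not stated above):

```python
# Uncompressed CD-quality PCM vs. a typical 128 kbps MP3
pcm_bps = 44_100 * 16 * 2           # sample rate * bits per sample * channels
mp3_bps = 128_000                   # a common MP3 bit rate, assumed for illustration

print(pcm_bps)                      # 1,411,200 bps (~1411 kbps) for the WAV
print(round(pcm_bps / mp3_bps, 1))  # 11.0 -> in line with the 10:1-12:1 ratio above
```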

2. PCM — the most commonly used audio format: PCM stands for Pulse Code Modulation. The CD, developed in the late 1970s and jointly launched by Philips and Sony in the early 1980s, was one of the first recording media to carry PCM audio. The format was later also adopted by DVD-Audio, released by the DVD Forum in 1999, which supports stereo and 5.1 surround sound. Over the years the PCM bit depth has grown from 14-bit to 16-bit, 18-bit, 20-bit and finally 24-bit, and the sampling rate has risen from 44.1 kHz to 192 kHz, so the room left for further improvement in PCM itself has become smaller and smaller.
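Since PCM is nothing more than a stream of quantized samples, it is easy to produce by hand. Below is a minimal sketch using only Python's standard library; the 440 Hz test tone, 16-bit depth, 44.1 kHz rate and output file name are arbitrary choices for illustration:

```python
import math
import struct
import wave

SAMPLE_RATE = 44_100   # samples per second
BIT_DEPTH = 16         # bits per sample (signed)
FREQ_HZ = 440          # test tone: A4
DURATION_S = 1.0

# Generate one second of a sine wave as 16-bit signed PCM samples.
amplitude = 2 ** (BIT_DEPTH - 1) - 1
samples = [
    int(amplitude * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE))
    for n in range(int(SAMPLE_RATE * DURATION_S))
]

# Write the samples into a mono WAV container (WAV is little more than a PCM wrapper).
with wave.open("tone.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(BIT_DEPTH // 8)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```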

3. AMR — short for Adaptive Multi-Rate: an adaptive multi-rate speech codec used mainly for audio on mobile devices. Its compression ratio is high, but its quality is poorer than other compressed formats, because it is designed primarily for voice in phone calls.

4. Opus — WebRTC's preferred audio codec: Opus integrates two coding technologies, the speech-oriented SILK and the low-latency CELT. It can adjust seamlessly between high and low bit rates; internally, the encoder uses linear predictive coding at low bit rates and transform coding at high bit rates (with a hybrid of the two around the crossover point). Opus has very low algorithmic delay (26.5 ms by default), which makes it ideal for low-latency voice calls such as real-time audio streaming over the Web and real-time synchronized voice-over, and by trading off coding quality the delay can be pushed down to as little as 5 ms. In multiple blind listening tests, Opus has shown lower latency and better compression than MP3, AAC, HE-AAC and other common formats.
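One way to see the latency/bitrate trade-off is to look at how large a single encoded frame is. A sketch of the arithmetic follows; the 20 ms frame duration and the bit rates are illustrative values within a VoIP codec's normal operating range, not figures taken from the paragraph above:

```python
# Encoded payload per frame = bit_rate * frame_duration / 8
def frame_bytes(bit_rate_bps: int, frame_ms: float) -> float:
    return bit_rate_bps * (frame_ms / 1000) / 8

# A 20 ms frame at a few bit rates typical of VoIP
for bit_rate in (16_000, 32_000, 64_000):
    print(bit_rate, frame_bytes(bit_rate, 20))   # 40.0, 80.0, 160.0 bytes of payload

# A 5 ms frame at 32 kbps: lower delay per frame, but 4x as many packets,
# so the fixed per-packet RTP/UDP/IP header overhead weighs more heavily.
print(frame_bytes(32_000, 5))                    # 20.0 bytes of payload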

5. AAC — the perennial winner in live video: AAC stands for Advanced Audio Coding; Apple iPods and Nokia phones were among the early devices that supported AAC audio files. It was developed by Fraunhofer IIS, Dolby, AT&T and others as part of the MPEG-2 specification. AAC's algorithm differs from MP3's and adds features that improve coding efficiency, making it far superior to earlier compression algorithms such as MP3. It also supports up to 48 full-bandwidth audio channels plus 16 low-frequency effects channels, more sampling rates and bit rates, multi-language compatibility, and more efficient decoding. Overall, AAC can deliver better sound quality at a bit rate roughly 30% lower than MP3.

6. Lyra — a new product of artificial intelligence: Lyra is a deep-learning-based, very-low-bit-rate voice codec used with Google Duo that enables clear voice chat at about 3 kbps. The input is split into 40 ms frames, from which compressed features (log mel spectrograms) are extracted; in the decoder, a generative model converts those features back into a speech signal. The structure resembles traditional parametric codecs such as MELP (mixed-excitation linear prediction, which gains efficiency by computing and transmitting linear prediction coefficients), but traditional parametric codecs produce fairly poor sound quality, whereas generative models such as WaveNet have been shown to synthesize natural speech samples from such features; WaveNetEQ, for example, uses a generative model for packet loss concealment. Another point of comparison is the Griffin-Lim algorithm used to decode MBE (multi-band excitation, which divides the spectrum into equal bands and transmits each band's energy plus voiced/unvoiced information): it reconstructs the signal from energy alone, without phase information, and its sound quality is far worse than WaveNet-family methods. WaveNet-style models, however, are computationally intensive, so Lyra trains a variant of WaveRNN on thousands of hours of data in more than 70 languages to achieve higher sound quality at lower complexity.
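The 3 kbps figure is striking once you work out what it means per frame. A rough calculation, using the 40 ms frame length described above; the 16 kHz / 16-bit wideband PCM reference is an assumption added for comparison, and container or transport overhead is ignored:

```python
# Bit budget of a single 40 ms Lyra frame at 3 kbps
bit_rate_bps = 3_000
frame_ms = 40

bits_per_frame = bit_rate_bps * frame_ms / 1000
print(bits_per_frame)                        # 120.0 bits (15 bytes) for 40 ms of speech

# Compare with raw 16 kHz / 16-bit mono PCM over the same 40 ms window
raw_bits = 16_000 * 16 * frame_ms / 1000
print(raw_bits, raw_bits / bits_per_frame)   # 10240.0 bits -> roughly 85x larger
```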

3. How sound is transmitted in RTC

Take a look at the overall flow of audio encoding and decoding:

After digital sampling, human speech becomes raw PCM sample data. As the figure above shows, whatever the codec, the flow is the same: the PCM data is compressed for efficient transmission, then decoded on the receiving side and restored back to PCM.

First, in the era of fixed-line telephony, the main codecs were G.711 A-law/μ-law, G.729, G.722, G.723 and G.726. These codecs mostly use 8 kHz sampling, because communication at the time was chiefly person-to-person speech, and an 8 kHz sampling rate is enough to cover the most important part of the energy range of the human voice. The original G.711 A-law/μ-law applies only simple logarithmic companding, so its speech quality is essentially transparent, but it costs 64 kbps — and an ADSL-era phone line offered just 64 kbps of bandwidth.
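The 64 kbps figure falls directly out of the sampling parameters: G.711 sends one 8-bit companded sample every sampling period at 8 kHz. A quick check, with G.729's standard 8 kbps rate added for comparison:

```python
# G.711: 8,000 samples per second, 8 bits per A-law/mu-law companded sample
sample_rate_hz = 8_000
bits_per_sample = 8
print(sample_rate_hz * bits_per_sample)             # 64,000 bps -> the 64 kbps above

# G.729 runs at its standard rate of 8 kbps, one eighth of G.711,
# which is why it freed up most of the old 64 kbps line for data traffic.
print(8_000 / (sample_rate_hz * bits_per_sample))   # 0.125
```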

Some readers may remember ADSL Internet access, which initially ran over these 64 kbps telephone lines. If G.711 consumed all of that bandwidth for voice, there was nothing left for data, so G.711 was gradually replaced by codecs such as G.729 and G.726, which compress much harder while sounding nearly as good. The G.722 family is also worth knowing: G.722.1 is a codec developed by Polycom, and G.722.2 is the same codec as the AMR-WB discussed below.

Then came the era of mobile communication (2G/3G). Since the content being carried was still speech between people, the codecs were still speech codecs, and the mobile side mainly used AMR-NB (Adaptive Multi-Rate Narrowband) and AMR-WB (Adaptive Multi-Rate Wideband) — narrowband AMR and wideband AMR respectively. Although narrowband AMR still uses 8 kHz sampling, as its full name suggests the codec itself is multi-rate (8 rate modes) and can switch between rates on the fly. The main purpose of this is to adapt to the condition of the wireless channel: imagine a base station serving 10 calls versus 100 calls — the channel bandwidth allocated to each handset must differ, and switching rates flexibly according to channel conditions lets more people talk at once, as the sketch below illustrates.
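To illustrate the idea, one could simply pick the highest AMR-NB mode that fits the bandwidth the network is currently willing to grant. This is a hypothetical selection rule, not the actual 3GPP link-adaptation algorithm; the eight bit rates listed are the standard AMR-NB modes:

```python
# The eight standard AMR-NB rate modes, in bits per second.
AMR_NB_MODES_BPS = [4_750, 5_150, 5_900, 6_700, 7_400, 7_950, 10_200, 12_200]

def pick_amr_mode(available_bps: int) -> int:
    """Hypothetical link adaptation: choose the highest mode that still fits.

    Falls back to the lowest mode when the channel is very constrained.
    """
    candidates = [m for m in AMR_NB_MODES_BPS if m <= available_bps]
    return max(candidates) if candidates else AMR_NB_MODES_BPS[0]

print(pick_amr_mode(13_000))   # 12200 -> lightly loaded cell, best quality
print(pick_amr_mode(8_000))    # 7950  -> medium load
print(pick_amr_mode(5_000))    # 4750  -> heavily loaded cell, lowest rate
```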

Then came VoLTE (4G), which is what you are most likely using now, with AMR-WB (Adaptive Multi-Rate Wideband Speech Codec). This codec uses 16 kHz sampling, twice the previous rate; in the time domain that means 8,000 additional samples per second, and in the frequency domain a wider high-frequency range and richer sound detail. For the average consumer, though, the improvement does not feel dramatic.

In the 4G era, however, with higher bandwidth and richer services, several major manufacturers introduced the EVS HD codec to improve voice clarity and call experience, and it was adopted as the standard by 3GPP. EVS is backward compatible with AMR-NB and AMR-WB, and it also supports SWB (super-wideband) and FB (fullband) sampling up to 48 kHz, which covers the entire spectrum the human ear can hear. The HD tag you sometimes see on your phone during a call is actually EVS at work. With the rollout of EVS and the promotion of new services (such as the recent video ring-back tones), you should be able to enjoy a much richer sound experience.

Of course, in the 3G/4G era, Internet-based VoIP technology also developed rapidly alongside the Internet itself. Internet VoIP, however, faces far more complex network conditions than carrier voice calls — it does not run on a private network, so it suffers from much more serious delay and bandwidth problems. VoIP audio codecs went through a similar evolution. The early ones were speech codecs such as iLBC and iSAC, both developed by GIPS; after Google acquired GIPS, both were built into WebRTC and open-sourced. iLBC is characterized by removing the dependency between encoded frames so that every frame can be decoded independently, which gives it excellent resilience to packet loss (see the small illustration below). iSAC inherits iLBC's capabilities and adds bandwidth estimation. Skype's well-known codec is SILK, whose encoding of speech is said to be so good that it sounds as if the two speakers are in the same room.
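Here is a toy illustration of why frame independence matters under packet loss. This is a simplified model, not iLBC's actual bitstream: if each frame needs the previous one to decode, a single lost packet corrupts everything after it until some resynchronization point, whereas independently decodable frames lose only themselves:

```python
# Toy model: which frames are still decodable after losing frame 3 of frames 0..9?
lost = {3}
frames = range(10)

# Independent frames (iLBC-style): only the lost frame itself is unplayable.
independent_ok = [f for f in frames if f not in lost]

# Chained frames (each depends on the previous one): everything after the
# loss is undecodable until a resynchronization point arrives.
chained_ok = []
broken = False
for f in frames:
    if f in lost:
        broken = True
    if not broken:
        chained_ok.append(f)

print(independent_ok)  # [0, 1, 2, 4, 5, 6, 7, 8, 9]
print(chained_ok)      # [0, 1, 2]
```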

To improve the voice experience, the default codec used in WebRTC today is Opus (a combination of the SILK and CELT codecs). A music detector inside the codec determines whether the current frame is speech or music: speech frames are routed to SILK and music frames to CELT — a sketch of this decision follows below. Opus also supports PLC (packet loss concealment), giving it good resilience to packet loss on the network.
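A hypothetical sketch of that decision logic is shown below. The classifier output and the 64 kbps threshold are placeholders added for illustration; the real Opus encoder makes this decision internally based on its own signal analysis, bit rate and frame size:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    samples: list[float]   # one frame of PCM samples
    is_music: bool         # pretend output of a music/speech classifier

def choose_mode(frame: Frame, bit_rate_bps: int) -> str:
    """Placeholder for Opus's internal mode decision.

    Speech-like content at moderate rates favors the SILK (LPC) path;
    music, or very high bit rates, favor the CELT (transform) path.
    """
    if frame.is_music or bit_rate_bps >= 64_000:
        return "CELT"
    return "SILK"

print(choose_mode(Frame([0.0] * 960, is_music=False), 24_000))  # SILK
print(choose_mode(Frame([0.0] * 960, is_music=True), 24_000))   # CELT
```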

Audio is not used only in communications, either. AAC (Advanced Audio Coding) is a lossy audio compression format defined in the MPEG-4 standard, developed by Fraunhofer with Dolby, Sony and AT&T as major contributors. It is the natural successor to MPEG Layer III / MP3 within the MPEG-4 multimedia standard, which uses MP4 as the container format for all kinds of content. Like the MPEG-4 video codecs, AAC is divided into multiple profiles, such as LC-AAC (low complexity, standard quality) and HE-AAC (high efficiency, better quality at low bit rates). Meanwhile, AAC continues to shine in the field of live streaming.

4. Summary

Some readers will ask: with so many audio encoding formats, which one should I choose? As this article shows, no single audio format covers every application scenario, so the wise approach is to pick the encoder that suits your own needs in each scenario.