This is the third technical sharing in our series on low latency and high sound quality. This time we will zoom in on the trade-offs between sound quality and real-time interaction for different application scenarios on a time-varying network, from the perspective of the entire audio engine pipeline.

When we talk about low latency and high sound quality in real-time interactive scenarios, we are really dealing with the end-to-end sound quality of the entire audio engine pipeline. In the first article, we briefly described the process of transmitting audio. Elaborating on that basis, the whole audio engine pipeline includes the following steps (a minimal sketch in code follows the list):

1. Capture devices sample the acoustic signal to form a discrete audio signal that a computer can process;

2. Because the audio signal exhibits short-term correlation, it is divided into frames, and a 3A pipeline (acoustic echo cancellation, noise suppression, and automatic gain control) handles echo, environmental noise, gain, and related problems;

3. An encoder compresses the audio signal in real time to form an audio bitstream;

4. The sender packetizes the audio payload according to the IP/UDP/RTP format and sends the packets over the network;

5. A de-jitter buffer and the decoder reconstruct a continuous audio stream, which the player then plays back.
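To make these stages concrete, here is a minimal, runnable sketch of the pipeline. Every stage is a trivial stand-in (silence for capture, identity for 3A, fake compression, a tuple for the RTP packet); none of the names reflect a real engine API.

```python
# A minimal, runnable sketch of the five stages described above.
# Every stage is a trivial stand-in, not a real 3A module, codec, or RTP stack.

FRAME_MS = 20                     # short frames exploit short-term correlation
SAMPLE_RATE = 48000               # full-band sampling rate
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000

def capture_frame():
    return [0] * SAMPLES_PER_FRAME            # 1. sample the acoustic signal (here: silence)

def apply_3a(frame):
    return frame                               # 2. AEC / noise suppression / AGC would run here

def encode(frame):
    return bytes(len(frame) // 100)            # 3. a real codec compresses; we just shrink

def packetize(seq, payload):
    return (seq, payload)                      # 4. RTP-style packet: sequence number + payload

def jitter_buffer_pop(buffer):
    buffer.sort(key=lambda p: p[0])            # 5. reorder before decode and playback
    return buffer.pop(0) if buffer else None

buffer = []
for seq in range(5):
    frame = apply_3a(capture_frame())
    buffer.append(packetize(seq, encode(frame)))

while (pkt := jitter_buffer_pop(buffer)) is not None:
    print("play packet", pkt[0], "payload bytes:", len(pkt[1]))
```

In a real engine each stage runs on its own clock and the jitter buffer also paces playout; the sketch only preserves the data flow.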

Each of these processing stages inevitably introduces some impairment to the audio signal. We call the impairment introduced by capture and playback "device impairment", the impairment introduced in step 2 "signal processing impairment", the impairment introduced during encoding and decoding "coding impairment", and the impairment introduced in step 4 "network impairment".

To provide users with a high-quality, full-band interactive audio experience, the audio engine pipeline described above must support full-band processing and minimize the impairment introduced by each stage under constraints such as the device, network bandwidth, and acoustic environment.

When audio data enters the network, it encounters…

If you think of the network as a highway of information, then audio packets are like cars on the highway. Each car drives from Beijing to Shanghai, sometimes on the expressway trunk network and sometimes through the weak-network environment of rugged mountain roads. Suppose a car sets off from Beijing every minute; the fleet will encounter the three common problems of real-time transmission: packet loss, delay, and jitter.

Packet loss

"Packet loss" refers to a car that does not reach the finish line in the available time, or perhaps never reaches it at all. Some cars may be stuck on Beijing's Third Ring Road forever, and some may have an accident along the way. If 5 of our 100 cars fail to arrive in Shanghai on time for various reasons, the "packet loss rate" of this trip is 5%. Internet transmission works the same way: it is not 100% reliable, and there is always some data that does not reach its destination on time.

Delay

"Delay" refers to the average time it takes each car to travel from Beijing's Bird's Nest to Shanghai. Obviously, the expressway is much faster than various small roads, and the route from the Bird's Nest to the expressway also matters a great deal: getting stuck on the Third Ring Road can cost many hours. So this value depends on the route the fleet chooses. The same is true for Internet transmission: there are often many alternative paths between two endpoints that need to exchange data, and the latency of these paths often varies greatly.

Jitter

"Jitter" refers to changes in the order and spacing of the cars relative to how they departed. Although our 100 cars left Beijing at equal one-minute intervals, they do not arrive in Shanghai exactly one minute apart in the original order; a car that set off late may even arrive before one that set off early. The same goes for Internet transmission: if audio and video data were simply played in the order received, distortion would occur.

To sum up:

1. Real-time audio interaction is carried out over the network: the encoded audio bitstream is assembled into packets according to a real-time transport protocol such as RTP, and each packet follows its own route from sender to receiver.

2. The quality of users' network connections can be very poor and unreliable, varying across regions and time periods around the world.

For these reasons, packets often arrive at the receiver out of order or at the wrong time, or are lost altogether, which leads to the problems commonly discussed in real-time transmission: network jitter, packet loss, and latency.

Packet loss, delay, and jitter are three inevitable problems in real-time transmission over the Internet, whether on a LAN, within a single country or region, or across borders and regions.

These network problems are distributed differently across regions. According to Agora's monitoring of the live network, even in China, where networks are relatively good, 99% of audio interactions need to deal with packet loss, jitter, and network delay. Of these audio sessions, 20% had more than 3% packet loss due to network problems, and 10% had more than 8%. In India, performance varies far more: of the roughly 80% of audio interactions affected by network problems, about 40% experienced packet loss. Optimizing quality of service over India's 2G/3G networks therefore remains a focus of audio service delivery.

Jitter, delay, and bandwidth constraints are just as common. These network problems cause a sharp decline in audio quality and can even affect the intelligibility of the audio signal, failing to meet the basic need of communicating information. Repairing the damage done to the audio signal along the way is therefore a required topic, whether for teams building on WebRTC themselves or for SDK providers offering real-time services.

Packet loss control

To ensure reliable real-time interaction, handling packet loss is a must. If continuous audio data is not delivered, users will hear glitches and gaps that degrade call quality and the user experience.

The packet loss problem can be abstracted as achieving reliable transmission over an unreliable network. Two error-correction mechanisms, Forward Error Correction (FEC) and Automatic Repeat reQuest (ARQ), are usually applied, with the choice of policy driven by accurate channel state estimation.

With FEC, the sender encodes and sends redundant information over the channel. The receiver detects packet loss and recovers most lost packets from the redundancy without retransmission; in other words, extra channel bandwidth is the price paid for recovering lost packets. Compared with ARQ, FEC recovers lost packets with lower delay but consumes more channel bandwidth because of the redundant packets it sends.
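As a simple illustration of the principle (not any particular codec's or Agora's FEC scheme), the sender can add one XOR parity packet per group of equal-length packets, and the receiver can then rebuild any single lost packet in the group without retransmission:

```python
# Toy XOR-based FEC: one parity packet per group lets the receiver rebuild any
# single lost packet in that group. This illustrates the principle only; it is
# not a production FEC scheme.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(group):
    """Sender side: append an XOR parity packet to a group of equal-length packets."""
    parity = group[0]
    for pkt in group[1:]:
        parity = xor_bytes(parity, pkt)
    return group + [parity]

def recover(received):
    """Receiver side: rebuild the single missing packet from the survivors + parity."""
    survivors = [p for p in received if p is not None]
    rebuilt = survivors[0]
    for pkt in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, pkt)
    return rebuilt

group = [b"aaaa", b"bbbb", b"cccc"]
sent = add_parity(group)                 # 3 media packets + 1 parity packet
received = sent.copy()
received[1] = None                       # packet 1 is lost in the network
print(recover(received))                 # b'bbbb' is recovered without retransmission
```

The cost is visible in the example: one extra packet per group is always sent, whether or not anything is lost.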

ARQ uses acknowledgment signals (ACKs) and timeouts: if the sender does not receive an ACK before the timeout expires, a sliding-window protocol decides whether to retransmit the packet, repeating until an ACK arrives or a predefined retransmission limit is exceeded. Compared with FEC, ARQ recovers packets with higher delay (waiting for ACKs or repeated retransmissions) but lower bandwidth overhead.
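A toy stop-and-wait version of that logic is sketched below; the timeout, retransmission limit, and lossy-channel stub are invented for illustration, and a real sliding-window ARQ keeps several packets in flight rather than blocking on each one:

```python
import random
import time

# Toy stop-and-wait ARQ: retransmit on timeout until an ACK arrives or a
# retransmission limit is exceeded. Only the ACK/timeout/limit logic is shown.

MAX_RETRANSMISSIONS = 3
ACK_TIMEOUT_MS = 100

def send_over_lossy_channel(packet, loss_rate=0.3):
    """Pretend to send; return True if an ACK came back (channel did not drop it)."""
    return random.random() > loss_rate

def arq_send(packet):
    for attempt in range(1 + MAX_RETRANSMISSIONS):
        if send_over_lossy_channel(packet):
            return True                        # ACK received, we are done
        time.sleep(ACK_TIMEOUT_MS / 1000.0)    # wait out the timeout, then retransmit
    return False                               # give up: too late to be useful anyway

delivered = sum(arq_send(seq) for seq in range(20))
print(f"delivered {delivered}/20 packets within the retransmission limit")
```

The delay cost is also visible here: every retransmission adds at least one timeout plus one round trip before the packet can be played.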

Simply put, FEC and ARQ both recover lost packets at the cost of extra channel bandwidth and delay. That is the state of traditional anti-packet-loss methods, so what other approaches are feasible?

Let's take Agora SOLO as an example. In general, a codec compresses by removing redundancy, while packet loss resistance is essentially channel processing: an extension of error correction realized by adding redundancy. The strategy of Agora SOLO is to combine the two, adding redundancy for key information and removing more redundancy from non-key information, so as to achieve joint source-channel coding.
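The general idea of spending redundancy where the content matters most can be sketched as below. The frame classification rule and the bit/redundancy numbers are invented; this is a conceptual illustration of unequal protection, not SOLO's actual algorithm:

```python
# Conceptual sketch of "add redundancy to key information, compress non-key
# information harder". Classification rule and numbers are invented and are
# not Agora SOLO's actual design.

def encode_with_unequal_protection(frames, key_bits=320, non_key_bits=160):
    stream = []
    for frame in frames:
        if frame["is_key"]:        # e.g. a frame carrying a speech onset
            stream.append({"id": frame["id"], "bits": key_bits, "fec_copies": 1})
        else:                      # steady-state frame: fewer bits, no FEC
            stream.append({"id": frame["id"], "bits": non_key_bits, "fec_copies": 0})
    return stream

frames = [{"id": 0, "is_key": True}, {"id": 1, "is_key": False}, {"id": 2, "is_key": False}]
for pkt in encode_with_unequal_protection(frames):
    print(pkt)
```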

Delay and jitter control

Packet transmission and queuing introduce delay and jitter, and packets recovered by packet loss control carry additional delay and jitter of their own. An adaptive de-jitter buffer is usually used to counter them and ensure continuous playback of audio and other media streams.

As mentioned above, the variation in packet delay, called jitter, is the difference in end-to-end one-way delay between packets of an audio or other media stream. The adaptive logic makes its decisions based on delay estimates derived from the packet inter-arrival time (IAT). Stutter occurs when packets are not recovered by packet loss control, or when jitter, delay, or burst loss exceeds what the adaptive buffer can absorb. In that case, the receiver generally uses a Packet Loss Concealment (PLC) module to predict new audio data and fill in the gaps left by missing audio, whether the data was lost outright, arrived too late because of excessive jitter or delay, or disappeared in a burst of loss.
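A rough sketch of that adaptive logic: estimate the spread of packet inter-arrival times, derive a target buffer depth from it, and fall back to a PLC-style fill frame when a packet is still missing at playout time. The window size, the safety factor, and the zero-filled stand-in frame are illustrative choices, not a reference implementation:

```python
import statistics

# Toy adaptive de-jitter buffer: the target depth tracks an estimate of the
# inter-arrival-time (IAT) spread; a missing packet at playout time is replaced
# by a stand-in frame (where a real PLC module would synthesize audio).

FRAME_MS = 20

class AdaptiveJitterBuffer:
    def __init__(self):
        self.iats = []            # recent inter-arrival times, in ms
        self.last_arrival = None
        self.packets = {}         # sequence number -> payload

    def push(self, seq, payload, arrival_ms):
        if self.last_arrival is not None:
            self.iats.append(arrival_ms - self.last_arrival)
            self.iats = self.iats[-50:]          # sliding window of IAT samples
        self.last_arrival = arrival_ms
        self.packets[seq] = payload

    def target_depth_ms(self):
        # hold roughly one frame plus twice the IAT standard deviation
        if len(self.iats) < 2:
            return 2 * FRAME_MS
        return FRAME_MS + 2 * statistics.pstdev(self.iats)

    def pop(self, seq):
        if seq in self.packets:
            return self.packets.pop(seq)
        return b"\x00" * 160                     # PLC would synthesize a frame here

buf = AdaptiveJitterBuffer()
for seq, arrival in enumerate([0, 21, 39, 85, 101]):   # note the late burst at 85 ms
    buf.push(seq, f"frame{seq}".encode(), arrival)
print("target buffer depth ~", round(buf.target_depth_ms(), 1), "ms")
print("playout of a missing packet:", buf.pop(99)[:4], "...")
```

A larger target depth absorbs more jitter but adds latency, which is exactly the trade-off the adaptive logic has to manage.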

To sum up, dealing with network impairment means using packet loss, delay, and jitter control to keep packets flowing out in order as much as possible over an unreliable channel, and filling in missing audio data with PLC prediction.

To minimize network impairment and push out the boundary of what a weak network can tolerate, the following five points should be combined:

1. Estimate the network channel state accurately, and dynamically adjust and apply the packet loss control strategy;

2. Pair this with a de-jitter buffer that learns quickly and accurately enough to adapt to non-stationary network changes (a good network turning bad, a persistently bad network, sudden bursts), keeping the buffer just above, and as close as possible to, the steady-state equivalent delay, so that listeners get good instantaneous quality and low latency that gradually approaches the theoretical optimum;

3. When the weak network goes beyond the recoverable boundary, reduce the encoding bitrate so that more of the channel bandwidth can be spent on redundant data or retransmissions (this is also a common way to relieve channel congestion; see the sketch after this list);

4. Use PLC's ability to adapt to the input signal so that, across different speakers and time-varying background noise, perceptible artifacts are reduced as much as possible;

5. Under tight bandwidth, have the encoder produce high-quality speech at a low bitrate, and combine this with point 3 when network service quality is poor, to increase robustness against weak networks.
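Points 1, 3, and 5 boil down to a budget decision: estimate the channel, then split a fixed bandwidth budget between the codec bitrate and redundancy. A rough sketch with invented thresholds:

```python
# Toy policy combining points 1, 3 and 5: under a fixed bandwidth budget, spend
# more on FEC/retransmission redundancy as the measured loss rate rises, and let
# the codec bitrate drop to make room. The thresholds and shares are invented.

def split_bandwidth(budget_kbps, measured_loss_rate):
    if measured_loss_rate < 0.03:
        redundancy_share = 0.0          # good network: all budget goes to quality
    elif measured_loss_rate < 0.10:
        redundancy_share = 0.3          # moderate loss: add some protection
    else:
        redundancy_share = 0.5          # heavy loss: protect aggressively
    codec_kbps = budget_kbps * (1 - redundancy_share)
    fec_kbps = budget_kbps - codec_kbps
    return codec_kbps, fec_kbps

for loss in (0.01, 0.05, 0.20):
    codec, fec = split_bandwidth(budget_kbps=40, measured_loss_rate=loss)
    print(f"loss {loss:.0%}: codec {codec:.0f} kbps, FEC {fec:.0f} kbps")
```

This is where point 5 matters: the better the encoder sounds at low bitrates, the more bandwidth can be diverted to protection without the listener noticing.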

With the above countermeasures against packet loss, delay, and jitter, we can provide a better real-time interactive audio experience over the Internet. As noted earlier, network delay, jitter, and packet loss differ across regions, time periods, and networks. The Agora SDK provides high-quality interactive audio services globally, allowing users in all regions of the world to interact online in real time and bringing an experience as close as possible to offline acoustics through the audio engine. We therefore conducted several field tests and observed the SDK's MOS score (ITU-T P.863) and delay performance.

The following are the MOS scores and delay data of the Agora RTC SDK and competing products, tested at the same time, on the same devices, and over the same carrier network along Shanghai's Central Ring Road. Statistically, the real-time audio interaction service provided by the Agora SDK delivers higher sound quality and lower latency.

Figure: Comparison of MOS scores

Figure: Comparison of delay data

As the MOS comparison shows, the Agora SDK's MOS scores are concentrated in the high band [4.5, 4.7], while competing products are mainly distributed in [3.4, 3.8]. One more statistic may give you additional intuition: WeChat, although not in the same category as an RTC SDK, also provides voice calling, and its highest MOS score in an environment without a weak network is 4.19.

The audio quality users actually experience is shown by the color of the dots in the following audio quality maps: green indicates a MOS score greater than 4.0, yellow indicates a score in [3.0, 4.0], and red indicates a score in [1.0, 3.0).
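The color thresholds described above map directly to a small classification function (a sketch of the mapping only, not the tool that produces the maps):

```python
# Dot colors in the quality maps follow the MOS thresholds stated above.
def mos_color(mos):
    if mos > 4.0:
        return "green"    # excellent
    if mos >= 3.0:
        return "yellow"   # acceptable, MOS in [3.0, 4.0]
    return "red"          # poor, MOS in [1.0, 3.0)

print([mos_color(m) for m in (4.6, 3.5, 2.1)])   # ['green', 'yellow', 'red']
```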

Figure: Audio quality of the Agora SDK along Shanghai's Central Ring Road

Figure: Audio quality of a competing product along Shanghai's Central Ring Road

Summary

Packet loss, delay, and jitter are unavoidable problems in real-time interactive scenarios. Moreover, they change not only with the network environment, time period, user devices, and other factors, but also with the evolution of underlying technologies (such as the large-scale rollout of 5G). Our optimization strategies therefore need to be iterated as well.

Following the audio signal from the sender across the network to the receiver is not the end of optimizing the audio experience; what we have covered here is only the tip of the iceberg. We will share more details in the next post.

Further reading

Low latency and high sound quality explained: codecs

Low latency and high sound quality: echo cancellation and noise reduction