This article shares a speech delivered by Gao Zehua, chief audio craftsman of Agora, at RTC 2018, entitled SOLO X™: Anti-packet Loss Vocoder Compatible with WebRTC Standard. The talk introduces the algorithmic features of Agora SOLO™, an anti-packet-loss codec, and the new features and test results of Agora SOLO X™, the second generation of the codec. The following is a transcript of his speech.


Last year at the RTC Conference, we shared Agora SOLO™, our self-developed anti-packet-loss codec. It targets lossy networks and addresses the last-mile problem of real-time communication.

Traditional anti-packet loss thinking

There are some misconceptions about packet loss. Many people ask whether, with 4G widespread and 5G under construction, packet loss has become unimportant. I don’t think so. Under existing network conditions, a lot of packets are lost, whether from server to server or from server to the last mile.

When we talk about packet loss, we first need to define it. What is packet loss? In fact, any packet that does not arrive on time is a lost packet. If the tolerable delay is large enough, we can generally solve packet loss by retransmitting as many times as needed. But if a packet does not arrive within the specified time, it is considered lost. In addition, packet loss is not necessarily caused by the network; sometimes it is caused by system scheduling.

Let me give you an example. On Android, the playback thread, the decoding thread, and the packet-receiving thread can all be separate threads. Android is not a particularly real-time operating system. Once the system is busy, the packet-receiving thread and the decoding thread can easily contend with each other, resulting in packet loss. This packet loss is caused by the system. Even on the decoding side, after the decoder finishes decoding and hands the audio to the player, the buffer in the decoder can cause a stall.

Packet loss is also related to bandwidth limits. If bandwidth is plentiful, loss can be countered by retransmitting as often as needed. If bandwidth is very limited, however, relying entirely on unlimited retransmission does not work.

For example, China Unicom has offered a monthly data plan whose bandwidth is capped at 128KBps, enough to meet basic needs for web pages, audio, and video. Under such a plan, if you start communication with a high-performance or high-bit-rate source, the bandwidth becomes congested. Congestion first increases latency and then causes packet loss. So bandwidth is also related to packet loss.

Our network is global, and network conditions in Africa, Southeast Asia, the Middle East, and India still lag far behind those in China. Even within China, conditions in Shanghai differ from those in remote areas. So 5G will not eliminate packet loss. On the contrary, I believe packet loss will become more and more common as networks diverge.

Let me describe a situation in which packet loss occurs. The typical packet loss model is shown in the figure above. Device 1 communicates with Device 2, Device 3, and Device 4. Suppose Device 1 sends packet 1, packet 2, and packet 3 to Device 2, and packet 2 is lost. Calculated over this time window, the packet loss rate is 33%. Device 2 then sends feedback (Loss Info) to Device 1, reporting that the packet loss rate is 33%. Based on this information, Device 1 adds protection against packet loss for Device 2 by sending two copies of packet 4 and two copies of packet 5. If one copy of packet 4 is lost, the other copy still gets through. In short, the packet loss rate is calculated over packets 1, 2, and 3 in the first time window and transmitted back to Device 1, which then sends redundancy according to that rate.

This scheme has several problems. The first is the time window. With one lost packet, the packet loss rate is 33% over a window of three packets, 25% over a window of four packets, and 50% over a window of two packets. So the calculated packet loss rate is not very accurate.
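The window dependence can be checked with a few lines of Python. This is only a sketch: the function name `loss_rate` and the boolean-list representation of a window are assumptions made for illustration, not part of any real feedback protocol.

```python
def loss_rate(received_flags):
    """Fraction of packets lost in one observation window.

    received_flags holds one boolean per packet in the window:
    True if the packet arrived in time, False if it counts as lost.
    """
    if not received_flags:
        raise ValueError("window must contain at least one packet")
    lost = sum(1 for arrived in received_flags if not arrived)
    return lost / len(received_flags)


# The same single lost packet, observed through windows of 2, 3 and 4 packets.
windows = {
    2: [True, False],
    3: [True, False, True],
    4: [True, False, True, True],
}
for size, flags in windows.items():
    print(f"window of {size} packets: loss rate = {loss_rate(flags):.0%}")
```

The same loss event yields 50%, 33%, or 25% depending only on where the window boundaries fall.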

Second, Device 2 sends a Loss Info message back to Device 1 after receiving the packets. Generally, some FEC protection is added to Loss Info so that it is not lost. In practice, Loss Info packets can still be lost.

Suppose Loss Info is not lost and Device 1 sends redundant packets to Device 2, for example two copies of packet 4 and two copies of packet 5, based on the 33% packet loss rate. If one copy of packet 4 is lost, its duplicate provides protection. But if neither copy of packet 5 is lost, its duplicate provides no protection and simply wastes network bandwidth.
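The packet 4 and packet 5 example can be written out as a small bookkeeping sketch. The helper `classify_duplicates` is hypothetical, and it assumes for simplicity that the second copy of every packet always arrives.

```python
def classify_duplicates(first_copy_arrived):
    """Split duplicated packets into useful and wasted duplicates.

    first_copy_arrived maps a packet id to True if its first copy arrived.
    Simplifying assumption: the second copy always arrives, so it is
    useful only when the first copy was lost.
    """
    useful = [pid for pid, arrived in first_copy_arrived.items() if not arrived]
    wasted = [pid for pid, arrived in first_copy_arrived.items() if arrived]
    return useful, wasted


# Packets 4 and 5 are each sent twice; the first copy of packet 4 is lost.
useful, wasted = classify_duplicates({4: False, 5: True})
total_sent = 2 * 2  # two packets, two copies each
print("duplicates that helped:", useful)
print("duplicates that were wasted:", wasted)
print(f"share of sent packets that was wasted: {len(wasted) / total_sent:.0%}")
```

In this example the sender doubled its traffic, yet only one of the two extra packets did any good.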

Even assuming that the window is right, that Loss Info is not lost, and that the duplicates of packet 4 and packet 5 both help, there is still a problem: the feedback channel adds a long RTT. Device 1 sends packets to Device 2 for estimation, and Device 2 then sends Loss Info back to Device 1. If the RTT is too long, packets continue to be lost while this exchange is in progress. In other words, with a long RTT this scheme is of limited effect; you could say it has some defects.

Now assume that none of these problems exists. In a multi-party call, the network environment at each receiving device is different. Suppose the link between Device 1 and Device 2 uses 3G or lossy Wi-Fi, while Device 3 and Device 4 use 4G and 5G respectively. What strategy should Device 1 adopt when handling Loss Info? If FEC is added for all devices, the bandwidth of Device 3 and Device 4 is wasted. If no FEC is added in order to save bandwidth, Device 2 gets no protection against packet loss.
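The dilemma can be made concrete with a deliberately crude model. Everything here is an assumption for illustration: the function `evaluate_redundancy`, the rule that a redundancy ratio r covers a loss rate up to r, and the loss rates chosen for Device 3 and Device 4, which the talk does not give.

```python
def evaluate_redundancy(loss_by_device, redundancy):
    """Evaluate one sender-wide redundancy ratio against every receiver.

    Crude assumed model: a redundancy ratio r fully covers a loss rate
    up to r, and anything beyond the receiver's own loss rate is wasted.
    """
    report = {}
    for device, loss in loss_by_device.items():
        report[device] = {
            "covered": redundancy >= loss,
            "wasted": max(0.0, redundancy - loss),
        }
    return report


# Hypothetical loss rates: Device 2 on a lossy link, Devices 3 and 4 on good ones.
loss_by_device = {"device2": 0.33, "device3": 0.02, "device4": 0.0}

for redundancy in (0.0, 0.33):
    print(f"redundancy = {redundancy:.2f}")
    for device, result in evaluate_redundancy(loss_by_device, redundancy).items():
        print(f"  {device}: covered={result['covered']}, wasted={result['wasted']:.2f}")
```

With a single stream for all receivers, either setting leaves someone worse off: no redundancy fails Device 2, and redundancy sized for Device 2 is almost entirely wasted on the others.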

Therefore, in a multi-party environment, differences in network bandwidth are an important factor when designing network policy. Human and cost factors also need to be taken into account. For example, in real-time communication between China and India, a server located in India can connect directly to a server in China. However, communication costs in India are high, so this incurs charges. We can instead route through a server in Singapore, but that detour causes some packet loss.

Agora SOLO™ algorithm ideas

In a multi-party environment, traditional anti-packet-loss strategies do not solve this problem well, so last October Agora launched a new codec called Agora SOLO™.

Agora SOLO™ encodes each audio frame into two paired packets, packet 1 and packet 2, which carry the same frame and are complementary. If I receive only packet 1, I can decode an 8kbps narrowband audio signal. If I receive only packet 2, I can also decode an 8kbps narrowband audio signal. If I receive both packet 1 and packet 2, I can recover a 16kbps wideband, high-quality audio signal.
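The idea of two complementary packets per frame can be illustrated with a toy example. To be clear, this is not the SOLO algorithm, whose internals the talk does not spell out. It is a textbook even/odd sample split, and the names `encode_descriptions` and `decode_frame` are invented for this sketch. It assumes each frame has an even number of samples.

```python
def encode_descriptions(samples):
    """Split one frame into two complementary halves (even and odd samples)."""
    return samples[0::2], samples[1::2]


def decode_frame(even=None, odd=None):
    """Rebuild a frame from whichever halves arrived.

    Both halves: exact reconstruction at full resolution.
    One half:    coarse reconstruction by repeating each sample.
    No halves:   None, meaning the frame is lost.
    """
    if even is None and odd is None:
        return None
    if even is not None and odd is not None:
        frame = []
        for e, o in zip(even, odd):
            frame.extend([e, o])
        return frame
    half = even if even is not None else odd
    frame = []
    for s in half:
        frame.extend([s, s])  # crude upsampling of the surviving half
    return frame


frame = [0, 1, 2, 3, 4, 5, 6, 7]
even, odd = encode_descriptions(frame)
print("both received:", decode_frame(even, odd))
print("only first   :", decode_frame(even, None))
print("only second  :", decode_frame(None, odd))
print("none received:", decode_frame())
```

Either half alone gives a usable but lower-resolution frame, and both together restore the original, which mirrors the 8kbps and 16kbps behaviour described above without reproducing how SOLO achieves it.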

To a certain extent, this codec solves all of the policy problems that FEC runs into, as just described.

In general, a codec compresses by removing redundancy, while anti-packet-loss protection is, to some extent, an extension of channel processing: it extends error-correction algorithms and works by adding redundancy. Agora SOLO™ combines the two. It adds redundancy for key information and removes more redundancy from non-key information, achieving the effect of joint source-channel coding.

We should stress that adding redundancy inside the encoder to resist packet loss does reduce coding efficiency. We limit that reduction by adding some new algorithms, and the resulting coding efficiency is better than that of the original codec.

The figure above shows the results we obtained with the Chinese test sequences from ITU NTT. Briefly, ITU NTT is the standard codec test set and covers 26 languages; only the results for the Chinese part are presented here. On the horizontal axis, PESQ is the objective test standard for narrowband codecs, and in the last column, PESQ-WB is the objective test standard for wideband codecs. The full score is 5 points, and 4 points is usually considered perfect. The figure compares the PESQ scores of our codec with those of some traditional codecs.

When only packet 1 is received, at 8kbps, the PESQ MOS score is 3.52. When only packet 2 is received, the MOS score is 3.51. When both packet 1 and packet 2 are received, the MOS score is 3.95 at 16kbps. Those are the narrowband scores. The wideband PESQ MOS score is 3.582.

Let’s see what the picture looks like with the new codec. There is no longer any need to consider Loss Info. Device 1 can send packet 1, packet 1’, packet 2, packet 2’, packet 3, and packet 3’ directly to Devices 2, 3, and 4.

Why can we do that? Because if packet 1 or packet 1’ is lost, it does not matter much: you can still recover audio at the limited 8kbps quality. If nothing is lost and both packet 1 and packet 1’ are received, 16kbps audio can be recovered, with no redundant information carried. This is unlike conventional FEC, where the two packets are either identical or one large and one small, which wastes bandwidth. This algorithm does not waste bandwidth, and you no longer have to worry about:

  • First, delay;

  • Second, loss of the Loss Info feedback itself;

  • Third, in multi-party communication, the differing bandwidth of each network.
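To see why sending paired packets without feedback holds up, consider the three possible outcomes for each frame. The sketch below assumes that the two packets of a pair are lost independently with the same probability p. That is a simplification, since real networks often lose packets in bursts, and the function name `frame_outcomes` is hypothetical.

```python
def frame_outcomes(p):
    """Outcome probabilities for one frame sent as two paired packets.

    Assumes each packet is lost independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return {
        "both": (1 - p) ** 2,   # full 16kbps quality
        "one": 2 * p * (1 - p),  # reduced 8kbps quality
        "none": p ** 2,          # frame lost
    }


for p in (0.05, 0.10, 0.20, 0.50):
    outcome = frame_outcomes(p)
    print(
        f"packet loss {p:.0%}: "
        f"both={outcome['both']:.2%}, "
        f"one={outcome['one']:.2%}, "
        f"frame lost={outcome['none']:.2%}"
    )
```

Under this assumption a frame is lost only when both packets of its pair are lost, so the frame loss rate is p squared, well below the packet loss rate, and the sender never needs to know p.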

Overall, I would call this a near-perfect anti-packet-loss codec; I say near-perfect because no codec is perfect.

Agora SOLO™ has the following four features:

  • First, lower latency. There is no need to send packet loss information back over the channel; multiple packets are sent by default.

  • Second, higher quality. One packet alone reaches the quality of a common codec, and two packets together reach the quality of a high-quality codec.

  • Third, suitability for multi-party environments.

  • Fourth, a simpler strategy. No policy adjustment is needed, and there is no need to worry about policy granularity.

With the four traditional anti-packet-loss methods (FEC, PLC, ARC and ARQ), you have to balance them against one another carefully. With this method you do not need to think about that much; you just call it. In addition, on a limited budget you can send the two streams, packet 1 and packet 1’, over different paths: one over the bandwidth of a high-quality, high-cost server and the other over lower-cost bandwidth that is more prone to packet loss, which achieves better results.

The figure above compares Agora SOLO™ with other codecs. The MOS scores of iLBC and AMR-NB are relatively low, which is to be expected because they are codecs of an earlier era.

The MOS scores of SILK and Opus are relatively high, but the base MOS drops sharply for SILK+FEC20, because part of the bandwidth is spent on redundancy.

In Opus+FEC20, the bit rate at 5% packet loss cannot be reached, because its bit rate control is designed rather conservatively. Even so, we can see that adding FEC20 lowered the base MOS score.

The base MOS score of Agora SOLO™ is almost the same as that of SILK and Opus, within 0.2. As for resistance to packet loss, its MOS score barely decreases when packets are lost. At packet loss rates of 5%, 10%, 15%, and 20%, its MOS score stays relatively stable, sometimes leading competing products by a full point.

Anyone who has compared MOS scores of competing communication products knows that a lead of 0.3 to 0.5 is very hard to achieve, and a lead of 1 point is almost impossible. At 20% packet loss, for example, Agora SOLO™ has a MOS score of 3.2, while SILK has MOS scores of 2.2 and 2.3. SILK does reach 3.0 with FEC20, but with FEC its base MOS score is low.

This is the first sequence of the ITU NTT Chinese test set, a female voice. Female voices are relatively difficult to encode. Chinese has four tones, which makes it more complicated than other languages, and female voices carry more high-frequency information than male voices, so they are harder to encode.

As the figure shows, the MOS score of SILK16 is 3.5, that of Opus16 is 3.5, that of Opus16+20FEC is 3.3, and that of Opus20+20FEC is 2.8. Although the total bit rate is higher, bandwidth is limited and the FEC takes a large share of the bits, so the core bit rate is lower and the base MOS score becomes very low.

Agora SOLO™ has a base MOS score of 3.5 without packet loss, roughly the same as Opus16 and SILK16. With packet loss, for example 20%, its MOS score is 2.5, higher than the others.

Simulation test results

Besides the offline ITU packet loss tests, we also ran online simulation tests. As shown in the figure above, we used two iPhones to test under simulated packet loss. Device A communicates with Device B over Wi-Fi, an intranet VOS, and Wi-Fi.


Here are our test results. Red is the packet loss rate and green is the frame loss rate. As the figure shows, the packet loss rate was around 50%, but with Agora SOLO™ the frame loss rate was only 20%. NOVA is another of our codecs; without any FEC, its packet loss rate and frame loss rate are both 50%.



With the strategy of packing two frames per packet to fight packet loss, Agora SOLO™ has almost no frame loss at a 20% packet loss rate, whereas a traditional codec still loses some frames.

Anti-packet loss voice encoder compatible with WebRTC standard

Over the past year, we have ported the anti-packet-loss part of the Agora SOLO™ codec to Opus. We named the technology Packet Spectrum Complementary and developed a new codec called SOLO X™.

The SOLO X™ algorithm is the same as SOLO™, so let’s look directly at the difference in results.

At 16kbps, the base MOS of SOLO X™ drops to 3.2. At 20% packet loss, its MOS is 2.4, about the same as Agora SOLO™ and better than Opus20 and Opus16.

At 20kbps, its MOS score is 3.46. At 20% packet loss, its MOS score is 2.6, still better than Opus16+20FEC and Opus20+20FEC, and its base MOS score is also higher.

Its main advantage is that, being built on Opus, it stays compatible with Opus. When the Agora Native SDK interworks with the WebRTC SDK, a peer that supports the SOLO X™ codec can decode both streams for higher quality. Compatibility comes at some cost in quality, which is why the MOS score of SOLO X™ at 16kbps is lower than that of SOLO™. This is the main direction of our future optimization.

At 8kHz, its base MOS score also decreases for the sake of compatibility, but its resistance to packet loss does not decrease significantly.

Figure: SOLO X™ scores on the ITU NTT test

After making it compatible with WebRTC, we tested interoperability with WebRTC. googDecodingPLC reflects the following: once network loss is injected and the receiver does not get a packet, the decoder falls back to packet loss concealment (PLC) at the back end. The unit of the right vertical axis is frames, that is, how many frames were concealed during the period. If the packet loss is already handled before that point, back-end PLC does not need to be invoked as often.

As the data in the figure above show, with the Opus encoder alone, the number of compensated frames within 5 minutes is about 6000 and 111000 under simulated packet loss of 20% and 40% respectively. With SOLO X™, the number of compensated frames was just over 1K at 20% packet loss and only about 4K at 40% packet loss. This is a significant drop, which means that more packets reached the decoder and PLC did not need to be called to compensate.

Data representation in SOLO X™ and WebRTC communication

The figure above shows our PESQ-MOS scores in compatible and incompatible modes. The data given earlier are MOS scores in Opus-compatible mode; the PESQ-MOS score can be higher in incompatible mode.

Strictly speaking, this figure shows PESQ-WB MOS scores. In theory we should use POLQA for the comparison, in which case the differences in MOS score would be larger at 48kHz/48kbps and 32kHz/32kbps, because the higher the weight given to high frequencies, the lower the score once there is loss. This is also why PESQ-WB scores lower than PESQ for almost the same anti-packet-loss strategy and decoded output.

In incompatible mode, the MOS score is much higher than in compatible mode. We will keep optimizing the compatible mode, and I think there is a chance to get close to the incompatible mode.

The next step for Agora SOLO™

By 2017, we had spent two years building the SOLO™ codec. This year, we are building SOLO X™ to interoperate with Opus. So far we have focused mainly on voice. We plan to launch a codec for music in 2019, where we believe the benefit will be even more pronounced: music is typically encoded at 128K or 196K bit rates, and the cost of FEC at such bit rates is very high.

Finally, I want to say that it is a burden for a startup to put a lot of effort into codecs. Codec work is exhausting and thankless, and most companies do not take it on. So I am very grateful to Agora for giving me this opportunity, and to Tony for his encouragement. Frankly, Agora SOLO™ and SOLO X™ took a lot of sleepless nights.
