Hey, do you listen to podcasts?

According to statistics, in January of last year, before the pandemic reshaped daily life, the most popular podcasts in the United States reached a monthly audience of more than 23.7 million people. With the spread of RTC technology and changes in people's lifestyles, podcasting has since evolved into new forms. This January, "interactive podcasting" took off as a new format, fueled by RTC technology and the traffic that Elon Musk brought to it. Today, the interactive podcast scene is growing increasingly popular both at home and abroad.

Innovations and highlights of the "interactive podcast"

An "interactive podcast" is a new online-conversation scenario built around interests and topics. Anyone, whether a celebrity, a big influencer, or an ordinary user, can open or join a discussion on a wide range of topics at any time, and listeners can "raise their hands" to take part in the conversation in real time.

While "interactive podcasts" sound like chat rooms, there are many differences in terms of content, user relationships, and how information is distributed.

Many people online have already analyzed the product decisions behind this success. But how has the underlying technology changed in the shift from "podcasting" to "interactive podcasting"? And what needs to be optimized to deliver a good audio interaction experience?

Technological changes behind the new form of podcasting

I. Real-time interaction technology enables "content equality"

Many people say that a large part of the success of interactive podcasts comes from "content equality": anyone can create a podcast room on the platform to start a discussion on a topic, or join an existing room and interact at any time. However, achieving "content equality" also requires technical support. Why is that?

First of all, what sets an "interactive podcast" apart from a traditional podcast is that the host, the guests, and every listener can talk to each other in real time, giving listeners a much stronger sense of interaction and participation. Bringing listeners onto the mic requires not only ultra-low-latency transmission but also a network architecture that supports two-way transmission. A CDN, however, only supports one-way transmission, so this kind of on-mic interaction cannot be achieved with it. It requires a real-time transmission network with global coverage, such as Agora SD-RTN™, which lets users around the world interact through low-latency data transmission. This is the change at the network level in the technology behind interactive podcasting.

II. The Nova audio engine delivers a professional-equipment-grade audio experience

On the other hand, to guarantee sound quality, traditional podcast programs buy professional equipment and record in a suitable environment to avoid ambient noise and echo. With interactive podcasts, users can pick up their phones anywhere in a room to talk or listen, and get comparable sound quality without any professional equipment. This is because a set of software algorithms, such as the codec and the 3A algorithms, has taken over the role of professional equipment, so that non-professional users can still enjoy good sound quality.

On the codec side, the Agora SDK integrates Agora Nova™, a self-developed codec optimized for real-time audio interaction. To deliver better sound quality in pure-speech scenarios, Nova™ samples at 32 kHz rather than the 8 kHz or 16 kHz used by other speech codecs, capturing more detail in the voice. At the same time, through theoretical analysis and extensive experimental verification, a streamlined coding scheme for the high-frequency components of speech was designed to reduce Nova™'s coding complexity. To guarantee resilience against packet loss, we also chose the most balanced scheme that preserves coding efficiency. Experiments verified that this scheme maintains compression efficiency while also maintaining the recovery rate under packet loss. In both subjective and objective evaluations, Agora Nova™ delivers better voice coding quality than Opus.

In terms of 3A (noise suppression, echo cancellation, and automatic gain control), the Agora SDK intelligently recognizes different environments, fully eliminates echoes, and delivers excellent double-talk performance. The noise-suppression module accurately detects noise signals and dynamically adjusts the type and parameters of the noise-suppression algorithm, effectively removing all kinds of noise without degrading speech quality. Automatic gain control maximizes audio quality even in noisy environments, ensuring a clear interactive podcast experience.

In the interactive podcast scenario, users can come from all over the world, and they all need a consistent, smooth experience. Agora's multi-dimensional network estimation model intelligently identifies the condition of the network link and the user's network environment, then adapts the bit rate and frame rate to the user's network environment, device performance, and link condition. Combined with a well-tuned jitter buffer and anti-packet-loss algorithms, it keeps audio calls smooth even at 80% packet loss.

In addition, Agora offers voice beautification by coordinating the pitch, timbre, dynamics, rhythm, and spatial effects of the human voice. It also supports time-domain and spatial processing of one or more frequency bands of the voice, improving sound quality and adjusting timbre.
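As a rough illustration of the kind of control this exposes to developers (a hedged sketch, not the beautification pipeline itself: setLocalVoicePitch and setLocalVoiceEqualization are Agora RTC Java SDK calls, but the values below are arbitrary examples):

```java
import io.agora.rtc.RtcEngine;

// Hedged sketch: adjust the local voice before it is published.
// The values are illustrative, not recommended settings.
public class VoiceTuningExample {
    public static void applyVoiceTuning(RtcEngine rtcEngine) {
        // Shift the pitch slightly upward (1.0 keeps the original pitch).
        rtcEngine.setLocalVoicePitch(1.1);

        // Boost a mid-frequency band by 3 dB:
        // the first argument is the band index (0-9, low to high frequencies),
        // the second is the gain in dB (-15 to 15).
        rtcEngine.setLocalVoiceEqualization(4, 3);
    }
}
```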

Building on this complete set of technologies, including network transmission, encoding and decoding, noise suppression and echo cancellation, adaptive bit rate, and weak-network countermeasures, we launched the "interactive podcast scenario solution".

The Agora interactive podcast scenario solution

Low-latency interaction with global coverage

The transmission layer of the interactive podcast scenario is built on Agora SD-RTN™. SD-RTN™ is deployed in more than 200 countries and territories worldwide, providing "dedicated-line" quality for real-time audio and video interaction. With an intelligent routing strategy and the self-developed Agora AUT transport protocol, its high-quality network coverage exceeds 99% globally.

A globally consistent, high-quality experience

At the same bit rate, the NOVA™ voice engine delivers higher-fidelity audio, with better audio capture and a wider effective frequency range than the industry's leading codecs. At the same sampling rate, NOVA™ needs a lower bit rate, easing the pressure on the user's bandwidth and ensuring a good audio experience even under weak network conditions.

At the same time, Agora's industry-leading software 3A algorithms intelligently adapt to all kinds of environments and effectively eliminate noise and echo without damaging voice quality, ensuring the best possible audio interaction experience.

A 10x-elastic network architecture that easily handles traffic bursts

Built for real-time transmission, SD-RTN™ uses an ultra-elastic network architecture that can handle up to 10 times its normal load, calmly absorbing the sudden traffic surges of customers such as listed companies and high-traffic platforms.

Security compliance: meeting global information security and privacy protection regulatory requirements

Agora has met all the requirements of ISO 27001, ISO 27017, and ISO 27018 and has been certified worldwide by DNV. Our network architecture and infrastructure are SOC 2 compliant, ensuring that all physical and virtual access is effectively managed, monitored, and controlled. Agora has also engaged Trustwave Holdings and other global privacy and security experts to conduct third-party privacy audits as well as network penetration, application vulnerability, and compliance assessments, and fully complies with GDPR, CCPA, COPPA, and HIPAA, as well as China's Data Security Law (draft), Personal Information Protection Law (draft), and other international and domestic regulatory requirements.

In terms of privacy protection, Agora does not access or store any personally identifiable information (PII) about users. Only the operational information necessary to provide the service is collected: IP addresses (to determine the user's region for regulatory compliance and network connectivity), metering data (because the service is billed by usage time), and quality-of-experience data (to help customers monitor experience quality through the Crystal Ball analytics tool).

In terms of information security, Agora provides application developers with a number of default and configurable security options, such as authentication, data encryption, and network geo-fencing, to protect their audio and video media streams. The Agora SDK provides a built-in AES encryption algorithm that customers can use directly. The encryption key is managed by the customer's application and distributed between end-user devices outside the Agora network.
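As a minimal sketch of how an app might turn this on (using the setEncryptionMode and setEncryptionSecret calls in the Agora RTC Java SDK; the mode string and key below are placeholders the application would choose and distribute itself):

```java
import io.agora.rtc.RtcEngine;

// Hedged sketch: enable the SDK's built-in AES media-stream encryption
// before joining a channel. The key is managed and distributed by the
// application, outside the Agora network, as described above.
public class MediaEncryptionExample {
    public static void enableBuiltInEncryption(RtcEngine rtcEngine, String appManagedKey) {
        // Pick one of the built-in AES modes (placeholder choice).
        rtcEngine.setEncryptionMode("aes-128-xts");

        // Set the channel-wide encryption key supplied by the app.
        rtcEngine.setEncryptionSecret(appManagedKey);
    }
}
```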

We also work with several of the world's most trusted security organizations to ensure that vulnerabilities are discovered and reported promptly, helping customers make the necessary fixes quickly.

Interactive audio best practice support

Over the past seven years, Agora has served thousands of customers, including Changba, Werewolf Kill, Lizhi, and Momo, accumulating best practices across a wide variety of real-time audio and video interaction scenarios, with ample practical experience and contingency plans to draw on.

XLA experience quality assurance

In July 2020, based on nearly one trillion minutes of user experience data and extensive subjective experience evaluations, Agora defined and launched XLA, the real-time interaction industry's first experience-quality standard. If any indicator fails to meet the standard, Agora will pay compensation of up to 100%. This commitment reflects Agora's technical strength and service quality in real-time interaction; at present, no other service provider on the market offers a similar guarantee for real-time audio and video services.

Not just “fast implementation”

In fact, back in February a developer in our community implemented and open-sourced an application using the Agora Web SDK in just two days (click here to learn more). In his app, he says, audio interaction takes about seven lines of code to implement. The architecture of a mature interactive podcast scenario is shown in the figure below: users enter a podcast room from the room list or another first-level entry point, and inside the room there is a virtual stage (the channel) where the host and guests converse.

If you translate this process into API call logic, it looks like this:
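A minimal sketch of this call flow with the Agora Java SDK for Android might look like the following; the app ID, token, channel name, and event handling are placeholders, and error handling is omitted:

```java
import io.agora.rtc.Constants;
import io.agora.rtc.IRtcEngineEventHandler;
import io.agora.rtc.RtcEngine;

// Hedged sketch of the interactive podcast call flow:
// create the engine, enter the room as a listener, and switch roles
// when the user "raises a hand" and is approved by the host.
public class PodcastRoom {
    private RtcEngine rtcEngine;

    public void joinAsListener(android.content.Context context,
                               String appId, String token, String channelName) throws Exception {
        rtcEngine = RtcEngine.create(context, appId, new IRtcEngineEventHandler() {});

        // Interactive podcasts use the live-broadcasting profile:
        // many listeners, a few speakers.
        rtcEngine.setChannelProfile(Constants.CHANNEL_PROFILE_LIVE_BROADCASTING);
        rtcEngine.setClientRole(Constants.CLIENT_ROLE_AUDIENCE);

        // Pure-speech settings; see the gameplay examples below for other combinations.
        rtcEngine.setAudioProfile(Constants.AUDIO_PROFILE_SPEECH_STANDARD,
                                  Constants.AUDIO_SCENARIO_DEFAULT);

        rtcEngine.joinChannel(token, channelName, "", 0);
    }

    // Called after the host approves a raised hand.
    public void takeTheMic() {
        rtcEngine.setClientRole(Constants.CLIENT_ROLE_BROADCASTER);
    }

    // Called when the speaker steps down from the stage.
    public void leaveTheMic() {
        rtcEngine.setClientRole(Constants.CLIENT_ROLE_AUDIENCE);
    }
}
```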

This covers the most common way interactive podcasts are run. But as people use the format, we are seeing many other ways to play with it. Developers can combine different parameters in the Agora API to get the best sound quality for each style of use. Let's look at the parameter settings for a few typical scenarios:

Typical Elon Musk-style guest talk

In this case, the lineup of guests is relatively fixed, with no frequent coming and going on the mic, and all the guests are speaking rather than performing. So we can set AudioProfile to Speech Standard and AudioScenario to Default. The sample rate will then default to 32 kHz, and speech will be encoded at a maximum bit rate of 18 Kbps.

rtcEngine.setAudioProfile(Constants.AUDIO_PROFILE_SPEECH_STANDARD, Constants.AUDIO_SCENARIO_DEFAULT);

Open communication

Some of you may have seen a room like this: the owner runs it as an open discussion space where anyone in the audience can request the mic and come on stage. Here, people will be going on and off the mic frequently. In this case we can set AudioProfile to Music Standard and AudioScenario to ChatroomEntertainment.

rtcEngine.setAudioProfile(Constants.AUDIO_PROFILE_MUSIC_STANDARD, Constants.AUDIO_SCENARIO_CHATROOM_ENTERTAINMENT);

Online concert

Perhaps hoping for a gala made up entirely of song and dance, some users started gathering friends during the Spring Festival to hold online concerts in interactive podcast rooms. Guaranteeing sound quality here requires music-grade encoding, so we can set AudioProfile to MusicHighQuality for high-fidelity audio and AudioScenario to GameStreaming to keep the real-time interactive experience smooth even at high quality.

rtcEngine.setAudioProfile(Constants.AUDIO_PROFILE_MUSIC_HIGH_QUALITY, Constants.AUDIO_SCENARIO_GAME_STREAMING);

These are just three examples; many more ways of playing with the interactive podcast scenario are emerging. If you have new ideas but aren't yet sure how to implement them with the Agora API, let us know by posting in the RTC community. For more details on the scenario, please call 400 632 6626.