This is day 9 of my participation in the August More Text Challenge.

What is real-time audio and video

Real-time audio and video, or RTC (real-time communication), is an interactive audio and video communication technology built on IP networks.

The difference between RTC and live streaming protocols

Live streaming protocol    Playback delay
FLV                        3-5 s
RTMP                       3-5 s
HLS                        10 s+

The RTC technology we use here does far better.

It is built on IP networks, its latency is under 400 ms, and what it carries is audio and video data.

Real-time audio and video application scenarios

  • Audio and video calls

    • Product features

      One-to-one and multi-party voice and video calls

      Beauty filters, props, and so on

    • Technical characteristics

      Wide variety of supported devices

      Frequent switching of network access

Because this kind of product is aimed at end users, and different users have different devices, each kind of device needs its own optimization. That is why we say the range of supported devices is wide.

In practice, a phone switches from 4G or 5G to Wi-Fi, or moves from one base station to another. Each of these changes in the network environment requires a reconnection.
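
As a rough illustration of the reconnection step, here is a minimal Python sketch of retrying with exponential backoff. The names `backoff_delays` and `reconnect`, the delay values, and the `connect` callback are all assumptions made for this example; they do not come from any particular RTC SDK.

```python
import random


def backoff_delays(max_attempts=5, base=0.5, cap=8.0, jitter=0.0, rng=None):
    """Return the wait time (seconds) to use after each failed attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        # Double the wait each time, but never exceed the cap.
        delay = min(cap, base * (2 ** attempt))
        # Optional jitter keeps many clients from retrying in lockstep.
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


def reconnect(connect, delays, sleep):
    """Call connect() until it returns True.

    Returns the number of the attempt that succeeded, or None if every
    attempt failed. `sleep` is injected so the logic is easy to test.
    """
    for attempt, delay in enumerate(delays, start=1):
        if connect():
            return attempt
        sleep(delay)
    return None
```

In a real client, `connect` would re-establish the signaling and media connections on the new network, and `sleep` would be `time.sleep` or an async equivalent.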

Two scenarios are described below: Douyin live streaming and live co-hosting (mic linking).

  • Douyin live streaming

    • Product features

      • E-commerce live streaming
      • Game live streaming
      • Show live streaming
    • Technical characteristics

      • The host side pushes the stream
      • The viewer side pulls the stream from the CDN
  • Live co-hosting (mic linking)

    • Product features

      • Multiple hosts interact in the same frame while the audience watches the stream
      • Karaoke, game interaction, interactive communication
    • Technical characteristics

      • Stream mixing on the server or the client
      • Stream mixing and relay push
      • Real-time content moderation

Live co-hosting combines the video streams from multiple hosts into one stream and sends it to viewers. This mixing is usually done on the server, but as client performance keeps improving, it can now be done on the client, which saves cost.
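
To make the mixing step a little more concrete, here is a minimal sketch of one decision a mixer has to make: where each host's video goes on the output canvas. The grid rule, the 1280x720 canvas, and the name `grid_layout` are assumptions for illustration only; real mixers usually offer configurable layouts.

```python
import math


def grid_layout(n_streams, canvas_w=1280, canvas_h=720):
    """Return one (x, y, width, height) tile per stream on a near-square grid."""
    if n_streams <= 0:
        return []
    # Smallest grid that is roughly square and fits every stream.
    cols = math.ceil(math.sqrt(n_streams))
    rows = math.ceil(n_streams / cols)
    tile_w, tile_h = canvas_w // cols, canvas_h // rows
    # Fill the grid row by row, left to right.
    return [((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
            for i in range(n_streams)]
```

The same function works whether the mixing runs on the server or on the client; only the place where the frames are composited changes.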

As we all know, traditional live streaming has relatively high latency. It usually takes 5 to 10 seconds or more from a viewer's comment to the host's response. This causes problems for live streaming in education, e-commerce, and sports.

We mentioned earlier that RTC can realize the real-time transmission of audio and video streams with low latency, so can RTC be applied in live broadcast scenarios?

The answer is yes, as long as we switch from TCP-based transport protocols to UDP-based RTC.

So why don’t we use RTC in the first place?

First, cost: a CDN costs about one third as much as RTC, and deploying RTC is relatively resource-intensive.

Second, RTC requires a lot of network optimization, which is quite complicated.

Replacing ordinary live streaming with a low-latency scheme

Scheme Ⅰ

Replace the pull side (playback side) with RTC: large benefit.

Because the delay on the audience side is large, RTC usually replaces the audience side first.

Scheme Ⅱ

Replace the push side (host side) with RTC: moderate benefit.

Because the host's network environment is generally good, the host side is not the priority.

RTC application scenario: Online education

  • One-on-one education

    • Product features

      • 1v1 teaching
      • Whiteboard, courseware
      • Cloud recording
      • Class monitoring
    • Technical characteristics

      • Courseware synchronization
      • Similar to audio and video calls
      • May need to cross borders

    The requirements are the same as for audio and video calls: timely feedback and low latency. In cross-border one-to-one calls the two parties may be physically far apart, which raises latency.

  • Large class

    • Product features

      • Classes of ten thousand students
      • Whiteboard, courseware
      • Cloud recording
      • Class monitoring
    • Technical characteristics

      • One publisher
      • Courseware synchronization

Large classes are technically easier than 1v1 education, because usually only the teacher pushes a stream and there is little interaction. That same lack of interaction means the learning experience may not be very good.

Small classes emerged as a result. They are highly interactive but also the hardest to build, harder than 1v1 education: because everyone's network environment is different, video at different bitrates has to be sent to different users.

  • Small class

    • Product features

      • Multi-person interaction
      • Whiteboard, courseware
      • Cloud recording
      • Class monitoring
    • Technical characteristics

      • Multiple people publish and subscribe
      • Courseware synchronization

RTC usage scenario: Video conference

  • Feishu video conference

    • Product features

      • Video interaction with hundreds (thousands) of people
      • Screen sharing
      • Document sharing
      • PSTN access
      • Background blur, beauty filters…
    • Technical characteristics

      • Multi-person audio and video interaction
      • Diverse access devices
      • Audio noise reduction
      • Weak network optimization
      • AI capabilities

Overall, video conferencing is technically demanding: it places high requirements on audio noise reduction and also has to support PSTN access.

RTC usage scenario: games

  • Competitive games

    • Product features

      • Squad voice chat
      • Proximity voice chat
    • Technical characteristics

      • Low latency, low power consumption, low data usage
      • Proximity voice chat

Games already consume a lot of computing and network resources and need low latency themselves, so in-game voice has to achieve low latency, low power consumption, and low data usage.

  • Cloud gaming

    • Product features

      • The game runs on the server
      • The client handles rendering and control
    • Technical characteristics

      • Ultra-low latency
      • Massive volume of control commands

This lets users play demanding games even on low-end devices, which suits large games and game trials.

A good gaming experience needs ultra-low latency. And because RTC can carry a massive volume of control commands, it can be used for cloud gaming.

An overview of real-time audio and video technology

RTC system architecture diagram

Signaling is a set of control messages; the signaling server is used for call setup and coordination.

Tasks such as recording, stream mixing, and moderation are handled by the post-processing server.

The client is the endpoint for audio and video calls. Let's take a look at the overall technical architecture of the client.

QoS (quality of service) ensures the service remains usable even on a weak network.

Events are reported and logs are uploaded so that errors can be handled, performance optimized, and algorithms improved.

  • Full platform support

    • Device adaptation
    • Performance adaptation
  • Connection

    • Reconnection after network loss
    • Multipath transmission
  • Data operations

    • Event reporting
    • Log collection

Low-end devices use lighter-weight algorithms.

Supporting Wi-Fi and 4G at the same time requires multipath transmission.

Captured audio and video data must be encoded and compressed, transmitted over the network, and then decoded and played.
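
The capture, encode, transmit, decode, play chain can be sketched in a few lines of Python. This is only a toy under stated assumptions: `zlib` stands in for a real audio or video codec (which would be lossy and far more complex), the 1200-byte packet payload is an assumed value, and all function names are made up for this example.

```python
import zlib

PACKET_PAYLOAD = 1200  # assumed payload size per packet, in bytes


def encode(frame: bytes) -> bytes:
    # Stand-in for a real encoder: lossless zlib compression.
    return zlib.compress(frame)


def packetize(data: bytes, size: int = PACKET_PAYLOAD) -> list:
    # Split the encoded frame into (sequence number, chunk) packets.
    return [(seq, data[off:off + size])
            for seq, off in enumerate(range(0, len(data), size))]


def depacketize(packets) -> bytes:
    # Packets may arrive out of order, so sort by sequence number first.
    return b"".join(chunk for _, chunk in sorted(packets))


def decode(data: bytes) -> bytes:
    # Stand-in for a real decoder.
    return zlib.decompress(data)
```

A round trip looks like `decode(depacketize(packetize(encode(frame))))`. Packet loss is not handled here; a real pipeline would add retransmission or error correction on top.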

Signaling server

Signaling: control information transmitted between devices in a network so that they can operate in coordination.

Signaling server: a server that transmits and relays signaling.

  • Challenges

    • Global deployment
    • Signaling delivery rate
    • Connection
  • Implementation scheme

    • WebSocket
    • Custom protocol
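
To give a feel for what a signaling message and its delivery tracking might look like, here is a minimal sketch. The message fields (`kind`, `room`, `sender`, `seq`, `payload`), the JSON encoding, and the `PendingAcks` helper are all hypothetical choices for illustration; the transport (WebSocket or a custom protocol) is left out.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SignalingMessage:
    kind: str      # hypothetical message types, e.g. "join" or "leave"
    room: str
    sender: str
    seq: int       # sequence number, used to match acknowledgements
    payload: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "SignalingMessage":
        return cls(**json.loads(text))


class PendingAcks:
    """Remember sent messages until the peer acknowledges them."""

    def __init__(self):
        self._pending = {}

    def sent(self, msg: SignalingMessage) -> None:
        self._pending[msg.seq] = msg

    def acked(self, seq: int) -> None:
        self._pending.pop(seq, None)

    def to_resend(self) -> list:
        # Anything still unacknowledged, oldest first.
        return [self._pending[s] for s in sorted(self._pending)]
```

Resending whatever `to_resend()` returns after a timeout is one simple way to raise the delivery rate of signaling over an unreliable link.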

Media server

Media server: forwards audio and video streams between end users, which is what makes audio and video communication between them possible. It is usually deployed at the edge, close to the user.

Simulcast and SVC (scalable video coding) provide video at different bitrates and frame rates according to each user's network conditions.
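
As a minimal sketch of the selection side of simulcast, the function below picks the highest layer that fits a subscriber's available bandwidth. The three-layer ladder, its bitrates, the 0.8 headroom factor, and the names `Layer` and `pick_layer` are assumptions for illustration, not values from any specific media server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    bitrate_kbps: int


# Hypothetical simulcast ladder published by the sender.
LAYERS = [Layer("low", 150), Layer("mid", 500), Layer("high", 1500)]


def pick_layer(available_kbps, layers=LAYERS, headroom=0.8):
    """Pick the highest-bitrate layer that fits within the bandwidth budget."""
    # Leave some headroom for audio, retransmissions, and estimate error.
    budget = available_kbps * headroom
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    chosen = ordered[0]  # fall back to the lowest layer if nothing fits
    for layer in ordered:
        if layer.bitrate_kbps <= budget:
            chosen = layer
    return chosen
```

The media server would run this per subscriber, so two viewers of the same publisher can receive different layers.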

BWE (bandwidth estimation) and congestion control estimate each user's available bandwidth in order to decide what bitrate to send to that user.
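
One simple way to picture bandwidth estimation is a loss-based rule: back off when packet loss is high, probe upward when the link looks clean, and hold otherwise. The sketch below is a simplified illustration; the 10% and 2% thresholds, the 5% increase step, the clamping range, and the name `update_estimate` are assumed values, and production systems usually combine loss with delay-based signals.

```python
def update_estimate(current_kbps, loss_fraction,
                    min_kbps=100.0, max_kbps=5000.0):
    """Return a new bandwidth estimate from the latest packet-loss report."""
    if loss_fraction > 0.10:
        # Heavy loss: reduce the estimate in proportion to the loss.
        new_kbps = current_kbps * (1 - 0.5 * loss_fraction)
    elif loss_fraction < 0.02:
        # Almost no loss: probe for more bandwidth.
        new_kbps = current_kbps * 1.05
    else:
        # Moderate loss: keep the current estimate.
        new_kbps = current_kbps
    # Keep the estimate inside a sane range.
    return max(min_kbps, min(max_kbps, new_kbps))
```

Calling this once per feedback report gives a running estimate that the sender or media server can use to choose a bitrate.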

Here’s a look at some typical media server architectures:

Post-processing

  • Audio and video recording
  • Stream mixing and relay push
  • Screenshots and slices
  • Content moderation

What else?

  • Data operations
  • Quality assessment
  • QoS
  • Automated testing
  • Application scenario exploration

Optimization needs data, and judging whether the video is sharp and the audio is pleasant requires quality assessment.

Automated testing and quality assessment are also important.

It’s also important to explore new application scenarios.