Preface

In recent years, real-time audio and video communication applications have grown explosively. Behind these real-time communication technologies is one technology that has to be mentioned: WebRTC.

In January of this year, WebRTC was published as an official standard by the W3C and IETF. According to a report by research firm Grand View Research, the global WebRTC market is expected to reach $21.023 billion by 2025, up from $2.3 billion in 2019, a compound annual growth rate (CAGR) of 43.6%.

In this series, we’ll look at why WebRTC is popular with developers and companies, how WebRTC will develop in the future, how Agora builds on top of WebRTC, and how it will support the WebRTC NV version.

WebRTC can be thought of as a browser-native real-time communication tool that runs without installing any plug-ins or downloading any additional programs. Different clients can see and hear each other in real time simply by opening the same URL, whether or not they use the same browser. But this is only a bird’s-eye view; the technical framework and implementation details involved are far from simple.

Basic concepts

Before we get into how WebRTC works, let’s clarify a few key technical concepts.

P2P

One of the most striking features of WebRTC is that it achieves real-time, peer-to-peer audio and video (multimedia) communication. To communicate through a Web browser, each party’s browser needs to agree to start communicating, learn the other’s network location, and traverse network security and firewall protections to transmit all multimedia traffic in real time.

In browser-based peer-to-peer communication, one of the biggest challenges is locating and establishing a network connection with another computer’s Web browser so that data can be transferred efficiently.

When you want to visit a website, you usually enter the URL directly or click a link to view the page. In this process, you are actually making a request to a server, which responds by providing the web page (HTML, CSS, and JavaScript). The key point is that you make an HTTP request to a known and easily located server (via DNS) and get a response (i.e., the web page).

At first glance, this may not seem like a difficult problem, but consider an example: suppose I want to have a video conversation with a colleague. How can we make a request and actually receive the audio and video data directly from each other?

The problem in the above scenario is solved by P2P technology. WebRTC itself is built on peer-to-peer connections, and RTCPeerConnection is the API responsible for establishing a P2P connection and transferring multimedia data.
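To make this concrete, here is a minimal sketch of creating a peer connection and attaching local media, assuming a browser with WebRTC support; the element ID is illustrative and error handling is omitted:

```typescript
// Minimal sketch: create a peer connection and attach local media.
// Assumes a browser environment; error handling is omitted for brevity.
const pc = new RTCPeerConnection();

async function startLocalMedia(): Promise<void> {
  // Ask the user for camera and microphone access.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  // Hand each captured track to the peer connection for transmission.
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));
}

// Remote media arrives as tracks on the connection.
pc.ontrack = (event) => {
  // "#remote" is an illustrative <video> element ID, not part of WebRTC.
  const remoteVideo = document.querySelector<HTMLVideoElement>("#remote");
  if (remoteVideo) remoteVideo.srcObject = event.streams[0];
};
```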

Firewall and NAT penetration

In daily life, most of us access the Internet through a work or home network, where our devices usually sit behind firewalls and Network Address Translation (NAT) devices and therefore are not assigned static public IP addresses. The NAT device translates private IP addresses inside the firewall into public-facing IP addresses, both for security and because of the limited supply of public IPv4 addresses.

Returning to the previous example: given the NAT devices involved, how do I learn my colleague’s IP address so I can send audio and video data to it, and how does he learn mine so he can send audio and video data back? This is where STUN (Session Traversal Utilities for NAT) and TURN (Traversal Using Relays around NAT) servers come in.

For WebRTC to work properly, a device first requests a public-facing IP address from a STUN server. If the request succeeds and we receive a public-facing IP address and port, we can tell others how to connect to us directly. The other peer can do the same thing using STUN or TURN servers.
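In code, the STUN/TURN servers a connection should try are passed in as iceServers. The following is a hedged sketch; the server URLs and credentials are placeholders, not real services:

```typescript
// Sketch: point the peer connection at STUN/TURN servers.
// The server URLs and credentials below are placeholders.
const pc = new RTCPeerConnection({
  iceServers: [
    // STUN: discover our public-facing IP address and port.
    { urls: "stun:stun.example.com:3478" },
    // TURN: relay media when a direct connection cannot be established.
    {
      urls: "turn:turn.example.com:3478",
      username: "user",
      credential: "secret",
    },
  ],
});

// Each candidate (host, server-reflexive, or relay address) is surfaced here
// and must be forwarded to the remote peer over the signaling channel.
pc.onicecandidate = (event) => {
  if (event.candidate) {
    console.log("gathered candidate:", event.candidate.candidate);
  }
};
```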

Signaling & Session

Because of NAT, a WebRTC client usually cannot connect to its peer directly. Devices therefore need a signaling service to discover each other and negotiate before exchanging real-time audio and video. The network-information discovery described above is one part of the larger signaling topic, which in WebRTC’s case is based on the JavaScript Session Establishment Protocol (JSEP) standard. Signaling involves network discovery and NAT traversal, session creation and management, communication security and coordination, and error handling.

WebRTC deliberately does not specify how signaling must be implemented, so that developers can choose from more flexible technologies and protocols.

At present, the most common scheme in the industry is WebSocket + JSON/SDP, where WebSocket provides the signaling transport channel and JSON/SDP encapsulates the signaling content:

Built on TCP, WebSocket provides persistent connections and addresses HTTP’s inefficiencies, such as its half-duplex nature and redundant header information. WebSockets allow servers and clients to push messages at any time, independent of any previous request. One significant advantage of WebSockets is that almost every browser supports them.

JSON is a common serialization format in the Web domain, used here to encapsulate user-defined signaling content. (It is essentially a serialization tool, so a solution like Protobuf or Thrift is perfectly feasible as well.)
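As a minimal sketch of how WebSocket and JSON combine into a signaling channel; the server URL and message schema below are our own illustrative choices, since WebRTC mandates neither:

```typescript
// Sketch: a browser-side signaling channel over WebSocket.
// The server URL and message schema are illustrative; WebRTC mandates neither.
type SignalMessage =
  | { kind: "offer" | "answer"; sdp: string }
  | { kind: "candidate"; candidate: string };

const ws = new WebSocket("wss://signaling.example.com/room/42");

function send(message: SignalMessage): void {
  // JSON serves purely as the serialization format for the signaling payload.
  ws.send(JSON.stringify(message));
}

ws.onmessage = (event) => {
  const message = JSON.parse(event.data) as SignalMessage;
  // Dispatch on the message kind: apply a remote SDP or add an ICE candidate.
  console.log("received signaling message:", message.kind);
};
```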

The Session Description Protocol (SDP) encapsulates the negotiated streaming-media capabilities. Two WebRTC agents share all the state required to establish a connection through this protocol.
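To see what the SDP side looks like, here is a sketch of generating an offer; the commented lines only indicate the general shape of the text an SDP contains:

```typescript
// Sketch: generate an offer and inspect the SDP it carries.
const pc = new RTCPeerConnection();
pc.addTransceiver("audio");
pc.addTransceiver("video");

async function makeOffer(): Promise<void> {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  // offer.sdp is plain text; among other things it contains lines such as
  //   v=0                  (protocol version)
  //   m=audio / m=video    (media sections listing supported codecs)
  //   a=fingerprint:...    (DTLS fingerprint used to secure the media)
  console.log(offer.sdp);
}
```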

If these concepts feel abstract, think of them as an everyday conversation:

When you are ready to talk to a stranger, or a stranger wants to join your chat, a message is sent; whether you accept or decline, you need to exchange that message with the other person. Only after this exchange can you learn enough to know whether you can have a good chat together. What helps you quickly summarize this information is the Session Description Protocol (SDP), which includes details such as which agent is used, what hardware is supported, and what types of media each side wants to exchange.

When two people want to start a conversation, one of them has to go first 👇

Me: I am 17 years old, I am in high school, and I like playing basketball (Offer SDP).

Peer: I speak Chinese, I am 23 years old, I work, I like playing basketball, and my English is not great, so I can’t help you with that, but we can play ball together (Answer SDP).

The purpose of this exchange of information is to see whether we can take the next step, or whether there is no next step at all. It does not matter who sends the first message; what matters is that whoever receives it responds, even if only out of politeness, for the conversation to be effective.

Protocols

A protocol is a standard or convention; a protocol stack is an implementation of the protocol, and can be understood as the code and function libraries that upper-layer applications call. The WebRTC protocol stack implements the underlying code in conformance with the protocol standards and provides developers with functional modules to call. Developers only need to care about application logic: where the data goes, how it is stored, and the order of communication between devices in the system.

WebRTC utilizes a variety of standards and protocols, including data streaming, STUN/TURN servers, signaling, JSEP, ICE, SIP, SDP, and more.

WebRTC protocol stack

Signaling

  • Application layer: WebSocket/HTTP
  • Transport layer: TCP

Media stream

  • Application layer: RTP/RTCP/SRTP
  • Transport layer: SCTP/QUIC/UDP

Security

  • DTLS: negotiates the keys that protect the media streams
  • TLS: protects the signaling channel

ICE (Interactive Connectivity Establishment)

  • STUN
  • TURN

Among them, ICE (Interactive Connectivity Establishment), STUN, and TURN are necessary to establish and maintain the end-to-end connection; DTLS secures data transmission between the peers; and SCTP and SRTP run on top of UDP to provide multiplexing, congestion and flow control, and partially reliable delivery, among other services.
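From JavaScript, most of this machinery is invisible: ICE probing, the DTLS handshake, and SRTP keying all happen inside the stack, and the application only observes their progress through state-change events. A small sketch, assuming a browser environment:

```typescript
// Sketch: ICE checks, the DTLS handshake, and SRTP keying all happen inside
// the stack; the application observes them through state-change events.
const pc = new RTCPeerConnection();

pc.oniceconnectionstatechange = () => {
  // "checking": candidate pairs are being probed; "connected": a path was found.
  console.log("ICE state:", pc.iceConnectionState);
};

pc.onconnectionstatechange = () => {
  // "connected" implies the DTLS handshake succeeded, so media (SRTP) can flow.
  console.log("connection state:", pc.connectionState);
};

// Data channels ride on SCTP over DTLS; no extra encryption work is needed.
const channel = pc.createDataChannel("chat");
channel.onopen = () => channel.send("hello over SCTP/DTLS");
```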

Basic architecture

With the concepts above in place, let’s look at WebRTC’s most critical infrastructure, which is also key to understanding how WebRTC works.

Basic component architecture

WebRTC’s component architecture is divided into two layers: the application layer and the core layer. The green part of the image shows the core functionality provided by WebRTC, while the dark purple part shows the JS API provided by the browser (that is, the browser wraps the core C++ API into a JS interface).

The pale purple arrow at the top of the image represents the upper-layer application, which calls the API provided by the browser and ultimately reaches the core layer.

As for the core function layer, there are four main parts:

  • C++ API layer

This layer has few APIs, the main one being PeerConnection. The PeerConnection API covers transmission quality, transmission-quality reports, various statistics, various streams, and so on. (Design tip: the API is kept simple for the upper layer to make application development convenient; the internals are far more complex.)

  • Session Layer (context management)

When the application creates a transmission of audio, video, or non-audio-video data, the Session layer handles it and manages the related logic.

  • Engine/Transport layer (the most important, core part)

    This section is divided into three modules: Voice Engine, Video Engine, and Transport, which decouple audio processing, video processing, and transmission.

    The Voice Engine contains a series of audio features, such as audio capture, audio codecs, and audio optimization (including noise reduction and echo cancellation).

    • iSAC/iLBC codecs;

    • NetEQ (a jitter buffer): adapts to network conditions and counters network jitter and packet loss.

    • Echo cancellation: a focal point of audio and video processing that largely determines product quality. WebRTC provides a very mature algorithm; developers only need to tune its parameters. Noise reduction and automatic gain control are also part of this module.

    The Video Engine includes video capture, video codecs, dynamic adjustment of video transmission quality according to network jitter, image processing, and more.

    • VP8 and OpenH264 codecs;

    • Video jitter buffer: counters video jitter.

    • Image enhancements: processing that improves image quality.

    In WebRTC, all audio and video is sent and received through the Transport layer. It includes packet-loss detection, network link quality detection, and bandwidth estimation, and it carries audio, video, and non-audio-video data (such as files) according to the available network bandwidth. (A sketch of one way an application can influence the transport path appears after this architecture overview.)

    • UDP at the bottom, with SRTP (that is, secure, encrypted RTP) on top;

    • Multiplexing: multiple streams multiplexed over the same channel;

    • P2P layer (including STUN+TURN+ICE).

  • Hardware layer

    • Video capture and rendering;

    • Audio capture;

    • Network IO.

There is no video rendering in WebRTC’s core layer; all rendering must be handled by the browser layer itself.
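The engines themselves are not directly scriptable from the browser, but the JS API does expose a few knobs into the transport path. As one hedged example, an application can cap an outgoing video bitrate through RTCRtpSender parameters; capVideoBitrate below is our own illustrative helper, not a WebRTC-defined function:

```typescript
// Sketch: cap an outgoing video stream's bitrate. The bandwidth estimator
// itself is internal to the engine; this only sets an upper bound on sending.
async function capVideoBitrate(pc: RTCPeerConnection, maxBitrateBps: number): Promise<void> {
  // Find the sender that is transmitting the video track.
  const sender = pc.getSenders().find((s) => s.track?.kind === "video");
  if (!sender) return;

  const params = sender.getParameters();
  // Each entry in `encodings` describes one outgoing stream (several when simulcasting).
  for (const encoding of params.encodings) {
    encoding.maxBitrate = maxBitrateBps;
  }
  await sender.setParameters(params);
}
```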

How WebRTC works

WebRTC actually involves many complex technical topics, such as audio capture, video capture, and codec processing. Since we want to present a simple, easy-to-understand WebRTC workflow in this chapter, we will not go into those implementation details here. If you are interested, please see the #WebRTC# column.

As we mentioned in Part 1 of “Why WebRTC”: “WebRTC is a set of W3C JavaScript APIs for developers to support real-time audio and video conversations in web browsers.” These JavaScript APIs actually generate and transmit the multimedia data for real-time communication.

The main WebRTC APIs include navigator.mediaDevices.getUserMedia, which turns on the microphone and camera; RTCPeerConnection, which creates and negotiates peer connections; and RTCDataChannel, which represents a bidirectional data channel between peers.
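RTCDataChannel deserves a quick illustration. Below is a minimal sketch of a two-way channel between two connections; the offer/answer and ICE exchange that would actually link them is omitted, and the channel name is arbitrary:

```typescript
// Sketch: a two-way data channel. The offer/answer and ICE exchange that
// would actually connect these two peer connections is omitted here.
const caller = new RTCPeerConnection();
const callee = new RTCPeerConnection();

// One side creates the channel by name...
const channel = caller.createDataChannel("game-state");
channel.onopen = () => channel.send("ping");
channel.onmessage = (event) => console.log("callee says:", event.data);

// ...and the other side receives it via the `datachannel` event.
callee.ondatachannel = (event) => {
  const incoming = event.channel;
  incoming.onmessage = (e) => incoming.send("pong: " + e.data);
};
```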

For the WebRTC workflow, it might be more intuitive to look at the “How to implement a 1-to-1 call” scenario:

  1. Both sides first call getUserMedia to open the local camera;
  2. Each sends a request to the signaling server to join the room;
  3. Peer A creates an Offer SDP object, saves it with PeerConnection’s SetLocalDescription method, and sends it to Peer B through the signaling server. Peer B receives the Offer, saves it with SetRemoteDescription, creates an Answer SDP object, saves it with SetLocalDescription, and sends it back to Peer A through the signaling server;
  4. During this offer/answer exchange of SDP information, Peer A and Peer B create the corresponding audio and video channels according to the SDP and begin gathering candidate data (local IP addresses, public IP addresses, and addresses assigned by the relay server);
  5. When Peer A finishes gathering candidate information, it sends it to Peer B through the signaling server; Peer B does the same in the other direction.

In this way, Peer A and Peer B exchange media and network information with each other. If they can reach a consensus (find an intersection of capabilities), they can begin to communicate.
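Putting the five steps together, here is a hedged sketch of Peer A’s side of this flow. sendToSignaling and onSignalingMessage are hypothetical stand-ins for your chosen signaling transport (for example, the WebSocket channel sketched earlier); they are not WebRTC APIs:

```typescript
// Sketch of Peer A's side of the 1-to-1 call flow above.
// `sendToSignaling` / `onSignalingMessage` are hypothetical signaling helpers.
declare function sendToSignaling(msg: object): void;
declare function onSignalingMessage(handler: (msg: any) => void): void;

const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.example.com:3478" }], // placeholder server
});

async function call(): Promise<void> {
  // Step 1: open the local camera and microphone.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  stream.getTracks().forEach((t) => pc.addTrack(t, stream));

  // Step 3 (Peer A's half): create the Offer SDP, save it, send it.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToSignaling({ kind: "offer", sdp: offer.sdp });
}

// Steps 4-5: forward each gathered candidate to the other peer.
pc.onicecandidate = (event) => {
  if (event.candidate) {
    sendToSignaling({ kind: "candidate", candidate: event.candidate.toJSON() });
  }
};

onSignalingMessage(async (msg) => {
  if (msg.kind === "answer") {
    // Peer B's Answer SDP arrives: apply it as the remote description.
    await pc.setRemoteDescription({ type: "answer", sdp: msg.sdp });
  } else if (msg.kind === "candidate") {
    await pc.addIceCandidate(msg.candidate);
  }
});
```

Peer B’s side mirrors this, responding to the incoming offer with createAnswer instead of createOffer.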

To help you better understand WebRTC technology, our latest edition of “Agora Talk” features engineers from the Agora WebRTC team.

They will share more useful and interesting technical details on the topics of “RTC Hybrid Development Framework Practices Based on Web Engine Extension Technology” and “Next-Generation WebRTC: A Vision for Real-Time Communication.”

In the next section, we’ll cover some of the current challenges of WebRTC development, common development tools, and some of the improvements we’ve made in the Agora Web SDK.

Stay tuned ~