# WebRTC series

  1. WebRTC source code research (1) WebRTC architecture
  2. WebRTC source code research (2) WebRTC source directory structure
  3. WebRTC source code research (3) WebRTC operation mechanism
  4. WebRTC source code research (4) Web server working principle and common protocol basis
  5. WebRTC source code research (5) Nodejs build environment

WebRTC source code research (3) WebRTC operation mechanism

1. Tracks and streams

  • Track
  • MediaStream

What is a Track?

A Track borrows the concept of a rail track: two rails never intersect. One channel of audio is a track, and one channel of video is another track; audio and video do not intersect and are stored separately. Likewise, two channels of audio are two separate tracks that do not intersect.

What is a MediaStream?

A MediaStream, also called a media stream, borrows from the traditional notion of a media stream, which contains audio tracks, video tracks, and subtitle tracks. A media stream thus contains many tracks, forming a hierarchy of concepts; with this hierarchy in mind, what follows is easier to understand.
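The hierarchy can be pictured with a small toy model (plain JavaScript, not the real browser MediaStream API; the `ToyMediaStream` class and its fields are invented for illustration): tracks are stored separately and never mixed, and a stream is merely a container of tracks.

```javascript
// Toy model of the track/stream hierarchy (not the real browser API):
// a stream is just a container of independent tracks, keyed by kind.
class ToyMediaStream {
  constructor() { this.tracks = []; }
  addTrack(track) { this.tracks.push(track); }
  getTracks() { return this.tracks.slice(); }
  getAudioTracks() { return this.tracks.filter(t => t.kind === "audio"); }
  getVideoTracks() { return this.tracks.filter(t => t.kind === "video"); }
}

const stream = new ToyMediaStream();
stream.addTrack({ kind: "audio", label: "microphone" });
stream.addTrack({ kind: "video", label: "camera" });
console.log(stream.getAudioTracks().length, stream.getVideoTracks().length); // 1 1
```

The real browser `MediaStream` exposes the same shape of accessors (`getTracks`, `getAudioTracks`, `getVideoTracks`), which is the point of the hierarchy: the stream groups tracks, but each track stays independent.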

Let's look at a few more important classes. The first shares the name we just saw: MediaStream, which is a separate class in WebRTC; we won't say much more about it here.

The second is RTCPeerConnection, the most important class in all of WebRTC. It is a large, all-encompassing class with many functions. What is the benefit of that design? For the application layer, you simply create a PeerConnection (that is, a connection) and set the MediaStream on it; all the underlying transmission and path-finding is then performed internally by the PeerConnection itself. That is very good news for the application layer. Underneath, however, it does a great deal of work: WebRTC mainly uses P2P transmission, which involves detecting the type of NAT you are behind, checking whether the P2P hole punching can get through, and, if it cannot, falling back to relaying through a TURN server. This whole series of operations is completed inside PeerConnection.

So WebRTC's internal code is in fact very complex, but for application-layer developers, building on WebRTC is a great deal more convenient.

As long as we know this important class and its key methods, we can complete an application of our own. PeerConnection is therefore a class to master: at the application layer, we need to know what functions it offers and what we can do with them to extend our application's capabilities.

When we later study WebRTC's underlying code, we should also center on PeerConnection: it is the hub from which we can work step by step through the logic of each piece, and when we run into difficulties we can come back to it and set out again.

The third is RTCDataChannel. Our non-audio/video data is transmitted through the DataChannel. A DataChannel is in fact obtained from a PeerConnection, so the two are related. Text, files, and binary data can all be transferred through the DataChannel: once we obtain the DataChannel object, we push the data into it and the upper-layer application is done. Underneath, it again performs a great deal of logic.

Through these three classes, then, we know that PeerConnection is the core. A MediaStream contains many tracks: we add the tracks to a Stream and add the Stream to the PeerConnection, and the lower layers, without any further involvement from us, automatically transmit it to the corresponding end. The same goes for ordinary data: to send binary data, I first obtain a DataChannel from the PeerConnection and then push the binary data into it, so that our non-media, non-audio/video data can likewise be delivered normally to the peer.

Next, let's get familiar with several very important classes in WebRTC, described in detail here:

2. WebRTC important classes

2.1 MediaStream

  • MediaStream: a MediaStream represents a stream of media data (retrieved via the getUserMedia interface) and gives you access to input devices such as microphones and web cameras; through this API you can obtain media streams from them.

2.2 RTCPeerConnection

RTCPeerConnection is the most important class in WebRTC; it is the core class.

  • RTCPeerConnection: an RTCPeerConnection object allows two browsers to communicate with each other directly.
  • SDP: describes what the current connector wants to transmit, the supported protocol types, the supported codec types, and so on.
  • RTCIceCandidate: represents an ICE protocol candidate; in short, the IP and port of a destination node.
  • RTCIceServer: represents an ICE server, which is used to discover the public IP address of the current host. By communicating with the ICE server, we obtain a set of IP:port candidates that can be used for the connection; the two parties then establish the connection by exchanging ICE candidates.
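To make "the IP and port of the destination node" concrete, here is a minimal sketch that pulls those fields out of a candidate line in its standard SDP text form (plain JavaScript; the sample candidate string below is fabricated for illustration, and real parsing should use the browser's RTCIceCandidate):

```javascript
// Sketch: extract the IP, port, and type from an ICE candidate line,
// following the "candidate:<foundation> <component> <transport>
// <priority> <ip> <port> typ <type> ..." text layout.
function parseCandidate(line) {
  const parts = line.replace(/^(a=)?candidate:/, "").split(" ");
  return {
    foundation: parts[0],
    component: Number(parts[1]),
    transport: parts[2].toLowerCase(),
    priority: Number(parts[3]),
    ip: parts[4],
    port: Number(parts[5]),
    type: parts[parts.indexOf("typ") + 1], // host | srflx | relay | prflx
  };
}

const c = parseCandidate(
  "candidate:842163049 1 udp 1677729535 203.0.113.7 54321 typ srflx"
);
console.log(c.ip, c.port, c.type); // 203.0.113.7 54321 srflx
```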

2.2.1 RTCPeerConnection Call procedure

Let's look at the core class, PeerConnection; its call procedure is shown below:

The image above was taken from the WebRTC website.

  1. First is the Stream, i.e. the MediaStream. It contains many tracks (video, audio, ...), though of course there may be only one audio or one video track;
  2. Then there is the PeerConnection. Internally the connection runs two threads: a Worker thread and a Signaling thread, both created through the PeerConnectionFactoryInterface. PeerConnectionFactory is literally a connection factory: it can create many PeerConnections; beyond PeerConnections, it can also create MediaStreams, LocalVideoTracks, and LocalAudioTracks. After creation, the tracks are created first and then added to the MediaStream via AddTrack.
  3. There can be multiple media streams, all of which are ultimately added to the PeerConnection via AddStream. They reuse the same Connection, although at the bottom they may take different paths; that is what "multiple Streams" means here.
  4. Why multiple Streams, when my machine's audio and video could all go into one Stream? Because this allows multi-party communication: each party is one Stream. Think of a three-party video conference: each party contributes its own Stream.

2.2.2 RTCPeerConnection Call sequence diagram

After getting familiar with the PeerConnection relationship, let’s take a look at the call relationship of these methods. The following figure is the timing diagram of RTCPeerConnection call:

  • First, the application layer calls CreatePeerConnectionFactory, which creates the PeerConnectionFactory; this factory in turn triggers CreatePeerConnection.
  • That creates a PeerConnection connection. The factory also calls CreateLocalMediaStream, CreateLocalVideoTrack, and CreateLocalAudioTrack to create the stream and the tracks.
  • The tracks are then added to the Stream via AddTrack; after that, AddStream adds the stream to the PeerConnection. Once the stream has been committed, subsequent changes to it are committed via CommitStreamChanges.
  • When the stream changes, an event is triggered that creates an offer, an SDP description.
  • Once you have this description, the application layer sends it through signaling to the remote end, which receives the offer SDP. The SDP carries information such as which video and audio streams are present, the audio and video formats, and the transport address.
  • Based on that information, the remote end returns an answer over the same signaling channel. Note that signaling and the media stream travel along two different paths: the media data is not transmitted over TCP but over UDP. When the signaling layer receives the answer, it is passed down to the Connection, and each connection then knows the other party's media information as well as its transport port and address. With the channel established, the two sides can transmit media data to each other.
  • When remote data arrives, the Connection also delivers the far-end stream up to the APP. The APP itself is a ConnectionObserver, that is, an observer that wants to know what is happening on the connection.
  • With a clear sequence diagram like the one above, we can easily grasp the whole API call procedure.
  • That is how WebRTC works. SDP and signaling will be explained in detail in subsequent articles.
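The offer/answer exchange in the sequence diagram can be sketched as a toy simulation, with plain JavaScript objects standing in for the real RTCPeerConnection API (the `makePeer` helper and its method names are invented for illustration; real peers would also exchange ICE candidates):

```javascript
// Toy simulation of the offer/answer sequence: two "peers" exchange an
// offer and an answer through direct function calls standing in for the
// signaling channel.
function makePeer(name, codecs) {
  return {
    name, codecs,
    remote: null,
    createOffer() { return { type: "offer", from: name, codecs }; },
    createAnswer(offer) {
      // The answer advertises only the codecs both sides support.
      const shared = codecs.filter(c => offer.codecs.includes(c));
      return { type: "answer", from: name, codecs: shared };
    },
    setRemoteDescription(desc) { this.remote = desc; },
  };
}

const a = makePeer("A", ["VP8", "H264"]);
const b = makePeer("B", ["VP9", "H264"]);

const offer = a.createOffer();   // A creates the offer SDP
b.setRemoteDescription(offer);   // signaling carries it to B
const answer = b.createAnswer(offer);
a.setRemoteDescription(answer);  // signaling carries the answer back

console.log(a.remote.codecs); // [ 'H264' ]
```

After this exchange, both sides hold a description of what the other can send and receive, which is exactly the state the diagram reaches before media starts to flow.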

2.3 RTCDataChannel

Non-audio/video data is transmitted through the RTCDataChannel.

  • DataChannel: the data channel (DataChannel) interface represents a bidirectional data channel between two nodes, which can be configured for reliable or unreliable transmission.

3. Some basic concepts of WebRTC

4. WebRTC Call principle

First question to consider:

What are the difficulties in implementing point-to-point real-time audio and video calls between two clients (browsers or apps, with cameras and microphones) in different network environments? Which parts of the problem do we need to solve ourselves, and which parts has Google already solved?

  1. Exchanging audio and video codec capabilities

  2. Transmitting data across the network

  3. Discovering each other

To solve the above problems, the following four steps are required:

  1. Media negotiation

  2. Network negotiation (each client has multiple mapped addresses)

  3. Media negotiation + network negotiation data exchange channel

  4. Signaling server development: SDP/Candidate interaction, room maintenance.

4.1 Media Negotiation

To implement P2P communication, we first need to know whether the two sides support the same media capabilities. WebRTC uses the VP8 codec by default. If the peer we want to connect to does not support VP8 decoding and there is no media negotiation process, then even if the connection succeeds and video data is sent, the other party cannot play it.

For example, peer-A supports the encoding formats VP8 and H264, while peer-B supports VP9 and H264. To ensure that both ends can encode and decode correctly, the simplest way is to take their intersection: H264.

Note: there is a dedicated protocol called the Session Description Protocol (SDP) for describing this kind of information. In WebRTC, the two parties in a video call must first exchange SDP information so that each knows the other's capabilities. The process of exchanging SDP is also called "media negotiation".
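The "take the intersection" step from the example can be sketched as a one-line helper (this is an illustration of the negotiation idea only; the browser does the real work while parsing SDP):

```javascript
// Media negotiation, reduced to its core: each side lists its supported
// codecs in preference order, and the usable set is whatever both share.
function negotiateCodecs(local, remote) {
  return local.filter(codec => remote.includes(codec));
}

const peerA = ["VP8", "H264"];
const peerB = ["VP9", "H264"];
console.log(negotiateCodecs(peerA, peerB)); // [ 'H264' ]
```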

4.2 Network Negotiation

After understanding each other's media capabilities, the next thing to confirm is each other's network situation; only by knowing each other's networks can we find a link for communication.

The ideal situation would be for every computer to have its own public IP address, so browsers could connect point-to-point directly. In reality, network environments are complex, especially in China: many enterprises have multiple layers of networks, and some networks even forbid sending UDP packets. Under a symmetric NAT, hole punching cannot get through at all, and a TURN server must be used to forward packets.

As shown in the picture below, the reality is that our computers sit inside larger or smaller local area networks, which requires NAT (Network Address Translation).

What is NAT? Network Address Translation gives your devices a shared public IP address. The router has a public IP address, and all devices connected to it have private IP addresses. When a request goes out, the device's private IP is mapped to the router's public IP plus a dedicated port. In this way, devices do not each need a dedicated public IP address, yet can still be clearly identified on the network.
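The mapping just described can be pictured as a tiny translation table (a toy model in plain JavaScript; the `ToyNat` class is invented for illustration, and real NATs differ in how they allocate and reuse mappings):

```javascript
// Toy model of a NAT translation table: outbound traffic from a private
// ip:port is mapped to the router's public IP plus an allocated port,
// and replies to that public port can be routed back to the device.
class ToyNat {
  constructor(publicIp) {
    this.publicIp = publicIp;
    this.nextPort = 40000;
    this.table = new Map(); // "privateIp:privatePort" -> public port
  }
  mapOutbound(privateIp, privatePort) {
    const key = `${privateIp}:${privatePort}`;
    if (!this.table.has(key)) this.table.set(key, this.nextPort++);
    return { ip: this.publicIp, port: this.table.get(key) };
  }
}

const nat = new ToyNat("203.0.113.1");
console.log(nat.mapOutbound("192.168.0.10", 5000)); // { ip: '203.0.113.1', port: 40000 }
console.log(nat.mapOutbound("192.168.0.11", 5000)); // { ip: '203.0.113.1', port: 40001 }
```

Two devices behind the same router share one public IP but get distinct public ports, which is exactly why a peer needs STUN to learn which public ip:port the NAT has assigned to it.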

Let’s take a look at what network address translation is, as shown below:

4.2.1 STUN

STUN (Session Traversal Utilities for NAT) is a network protocol that allows a client behind one or more NATs to find its own public address (IP + port), discover what type of NAT it is behind, and learn which Internet-side port the NAT has bound to a given local port.

  • In one sentence, what STUN does is: Tell me what your public IP address + port is.

  • The problem is that STUN does not always succeed in obtaining a usable public address for a device behind NAT. Also, P2P transmits the media streams over the users' own bandwidth, so in multi-party video calls the call quality depends on each user's local bandwidth.

  • Even if a public IP address is obtained through the STUN server, the connection may still fail to be established, because different NAT types handle incoming UDP packets differently.

  • Three of the four main NAT types can be traversed with STUN: full cone NAT, restricted cone NAT, and port-restricted cone NAT. However, symmetric NAT (also called bidirectional NAT), commonly used on large enterprise networks, cannot: such routers only accept connections from nodes you have previously connected to. This type of network requires TURN.

  • So what do we do? TURN can solve this problem well.
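The rule of thumb from the list above can be written as a tiny lookup (a simplification for illustration; real ICE does not classify the NAT up front but tries every candidate pair with connectivity checks):

```javascript
// Which NAT types a STUN-discovered address can traverse, and which
// require a TURN relay instead.
const NAT_TRAVERSAL = {
  "full-cone": "stun",
  "restricted-cone": "stun",
  "port-restricted-cone": "stun",
  "symmetric": "turn", // hole punching fails; relaying is required
};

function strategyFor(natType) {
  return NAT_TRAVERSAL[natType] ?? "turn"; // unknown: fall back to relay
}

console.log(strategyFor("full-cone"));  // stun
console.log(strategyFor("symmetric")); // turn
```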

Let's first understand: what is the TURN service?

4.2.2 TURN

  • TURN, whose full name is Traversal Using Relays around NAT, is an extension of STUN/RFC 5389 that adds a relay function. If a terminal behind NAT cannot communicate directly with its peer in certain scenarios, a server on the public network is required to forward the data in both directions.

If STUN fails to produce a usable public address, you can ask the TURN server for a public IP address to act as a relay address. In this mode the bandwidth cost is borne by the server, so local bandwidth is under less pressure in multi-party video chats. STUN and TURN are the two protocols frequently used in WebRTC; we build both servers with the coturn open-source project.

In WebRTC development we often hear the term ICE (Interactive Connectivity Establishment). Unlike STUN and TURN, ICE is not a protocol but a framework, and it combines STUN and TURN. The coturn open-source project integrates both STUN (hole punching) and TURN (relaying) capabilities. The gathered network information is placed into candidates.

P2P = STUN + TURN + ICE

4.3 Switching Channel for Media negotiation and Network negotiation Data

We know that the two clients need to negotiate media information (SDP) and network information (candidates), but how do they exchange it? They need an intermediary to do the exchange: a signaling server (room server) that forwards each side's media and network information to the other.

4.4 Signaling Server

  • The signaling server architecture is shown as follows:

With the help of the signaling server, the exchange of the SDP media information and candidate network information described above can be realized.

In real development, the signaling server does more than exchange the media information (SDP) and network information (candidates); it also handles, for example: (1) room management and (2) members joining and leaving rooms.
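The room bookkeeping a signaling server does can be sketched as a minimal in-memory model (illustrative only; the `RoomServer` class and its methods are invented here, and a real server would relay these messages over WebSocket connections rather than in-process inboxes):

```javascript
// Minimal in-memory sketch of signaling-server room management: members
// join and leave rooms, and a message from one member (an SDP offer, an
// ICE candidate, ...) is forwarded to everyone else in the same room.
class RoomServer {
  constructor() { this.rooms = new Map(); } // roomId -> Map(memberId -> inbox)
  join(roomId, memberId) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Map());
    this.rooms.get(roomId).set(memberId, []);
  }
  leave(roomId, memberId) { this.rooms.get(roomId)?.delete(memberId); }
  relay(roomId, fromId, message) {
    for (const [id, inbox] of this.rooms.get(roomId) ?? []) {
      if (id !== fromId) inbox.push({ from: fromId, message });
    }
  }
  inbox(roomId, memberId) { return this.rooms.get(roomId).get(memberId); }
}

const server = new RoomServer();
server.join("room1", "A");
server.join("room1", "B");
server.relay("room1", "A", { type: "offer", sdp: "..." });
console.log(server.inbox("room1", "B").length); // 1: B received A's offer
console.log(server.inbox("room1", "A").length); // 0: sender gets nothing back
```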

5. WebRTC connection establishment process

After introducing the meaning of the individual parts of the ICE framework, let’s take a look at how the framework works:

  1. The connecting parties (Peers) exchange their SessionDescription data through a third-party (signaling) server.
  2. Each Peer obtains its own NAT structure, subnet IP address, public IP address, and port from the STUN server via the STUN protocol; these IP-and-port pairs are called ICE Candidates.
  3. The Peers exchange their ICE Candidates through the signaling server. If both parties are behind the same NAT, they can establish the connection using only the intranet candidates; if they are behind different, non-symmetric NATs, the public-network candidates discovered by the STUN server are required.
  4. If the public-network candidates discovered via the STUN server do not work, that means at least one of the two parties is behind a symmetric NAT. In this case, the client behind the symmetric NAT must request the forwarding service of a TURN server, and the resulting relay candidate is signaled to the peer.
  5. The Peers send to the destination IP and port the keys involved in the SessionDescription and a description of what they expect to transmit, and establish an encrypted long-lived connection.
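The candidate-selection decision running through steps 3 to 5 can be condensed into a small function (a simplification for illustration; real ICE assigns every candidate pair a priority and runs connectivity checks rather than branching on flags like these):

```javascript
// Which candidate type ends up carrying the media, condensed:
//   host  - both peers share a NAT, direct intranet connection
//   srflx - STUN-discovered public address, hole punching works
//   relay - symmetric NAT blocks punching, TURN forwards the media
function pickCandidate({ sameNat, symmetricNat }) {
  if (sameNat) return "host";
  if (!symmetricNat) return "srflx";
  return "relay";
}

console.log(pickCandidate({ sameNat: true,  symmetricNat: false })); // host
console.log(pickCandidate({ sameNat: false, symmetricNat: false })); // srflx
console.log(pickCandidate({ sameNat: false, symmetricNat: true  })); // relay
```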

For the specific sequence diagram of connection establishment, refer to the following diagram from Zero Sound Academy:

Reference: a zhihu article by an expert: zhuanlan.zhihu.com/p/122340268