Introduction | In the audio and video era, WebRTC has landed in a wide variety of products and business scenarios. Now that we are familiar with how to capture device audio and video data in the browser and how WebRTC transmits that data over the network, we should consolidate our knowledge of the underlying network transport protocols, which will help us learn WebRTC more deeply. Recommended reading alongside articles on front-end audio and video features.

1. Transport layer protocol: TCP vs. UDP

We all know that HTTP, which runs on top of TCP, is the foundation that makes the World Wide Web work. As front-end developers, we seem familiar with HTTP and TCP: HTTP status codes, message structure, the TCP three-way handshake, the four-way teardown, and so on have become standard interview questions. But the other protocols feel more or less foreign to us.

The following figure shows the four-layer structure of the TCP/IP protocol suite. The transport layer, built on top of the internet layer, provides data transmission services between endpoints. Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) are its two best-known protocols.

Both protocols cover a lot of ground, but when it comes to choosing between them in practice, comparing TCP and UDP directly is a good way to learn and understand them.

1.1. Comparison between TCP and UDP

Overall, there are three main differences:

  • TCP is connection-oriented and UDP is connectionless.

  • TCP is byte-stream oriented: it treats data as an unstructured stream of bytes. UDP is packet oriented.

  • TCP provides reliable transport, meaning that data transmitted over a TCP connection is not lost, does not duplicate, and arrives in sequence, while UDP provides unreliable transport.

1.1.1. UDP connectionless, TCP connection-oriented

UDP does not need to establish a connection before transmitting data; either party can send data at any time, so UDP is connectionless. TCP establishes a connection with a three-way handshake before data transmission and releases it with a four-way teardown afterwards. The details will not be repeated here.

1.1.2. UDP packet oriented, TCP byte stream oriented

With UDP, the sending and receiving applications hand complete messages down to (and receive them from) the UDP transport layer. UDP simply adds its header on send and strips it on receive, preserving the application-layer message as-is. Therefore, UDP is packet (message) oriented.

With TCP, the sender treats the data handed down by the application as an unstructured stream of bytes, stores it in a buffer, and constructs TCP segments according to its own policies before sending them; the receiving TCP collects the data payloads into its buffer and delivers the buffered byte stream to the application. TCP makes no guarantee that the sizes of the chunks written by one side match the chunks read by the other. It is therefore byte-stream oriented, and this is also the basis on which TCP implements flow control and congestion control.
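To make the byte-stream vs. message-boundary difference concrete, here is a minimal Node.js sketch in TypeScript (ports and payloads are arbitrary examples, not anything WebRTC-specific): each UDP send() arrives as exactly one message, while two TCP writes may be read as one merged chunk or split arbitrarily.

```typescript
// Minimal sketch: UDP preserves message boundaries, TCP delivers a byte stream.
// Ports 41234/41235 are arbitrary examples.
import * as dgram from "dgram";
import * as net from "net";

// --- UDP: each send() arrives as exactly one message ---
const udpServer = dgram.createSocket("udp4");
udpServer.on("message", (msg) => console.log("UDP received:", msg.toString()));
udpServer.bind(41234, () => {
  const udpClient = dgram.createSocket("udp4");
  udpClient.send("hello", 41234, "127.0.0.1");
  udpClient.send("world", 41234, "127.0.0.1");
  // Logs two separate messages: "hello" and "world".
});

// --- TCP: two write() calls may be read back as one chunk (or split) ---
const tcpServer = net.createServer((socket) => {
  socket.on("data", (chunk) => console.log("TCP received chunk:", chunk.toString()));
});
tcpServer.listen(41235, () => {
  const tcpClient = net.connect(41235, "127.0.0.1", () => {
    tcpClient.write("hello");
    tcpClient.write("world");
    // May log a single chunk "helloworld": TCP keeps no message boundaries.
  });
});
```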

1.1.3. UDP is unreliable, TCP provides reliable connections

If a UDP packet is lost during transmission, the sender does nothing about it. If the receiver detects a checksum error, it simply discards the packet and also does nothing further. UDP therefore provides a connectionless, unreliable service to the layer above.

During TCP data transmission, if a packet is lost, or the receiver detects a checksum error and discards the packet, the receiver does not send an acknowledgement, and the sender retransmits the packet after a timeout. Through these policies, TCP ensures that no matter what happens in transit, the receiver eventually receives the data correctly. TCP is therefore said to provide a reliable, connection-oriented service to the layer above.

1.2. Why UDP

Given TCP’s many advantages and features, why use UDP for real-time audio and video transmission?

The reason is that real-time audio and video are particularly sensitive to latency, and the latency of TCP-based transmission is not low enough. Consider that, under packet loss, the retransmission timeout (RTO) in TCP's timeout retransmission mechanism doubles with each retry; if seven retransmissions all fail, the theoretically accumulated delay approaches two minutes! With such high latency, normal real-time communication is obviously impossible, and TCP's reliability becomes a liability.
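A back-of-the-envelope check of that figure, assuming an initial RTO of 1 second that doubles on every retry (real TCP stacks tune and clamp these values, so this is only illustrative):

```typescript
// Rough sketch: cumulative wait after repeated TCP timeout retransmissions,
// assuming an initial RTO of 1 second that doubles each time (exponential backoff).
function cumulativeBackoff(initialRtoSec: number, retries: number): number {
  let total = 0;
  let rto = initialRtoSec;
  for (let i = 0; i < retries; i++) {
    total += rto; // wait one full RTO before each retransmission
    rto *= 2;     // back off: double the timeout for the next attempt
  }
  return total;
}

console.log(cumulativeBackoff(1, 7)); // 1+2+4+8+16+32+64 = 127 seconds, roughly 2 minutes
```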

In practice, however, losing a small number of packets during real-time audio and video transmission has little impact on the receiver. UDP is not a connection-oriented protocol; it essentially sends without caring whether the data is received, so it consumes few resources and is fast to process.

UDP is therefore excellent in real-time performance and efficiency, and it is usually chosen as the transport layer protocol for real-time audio and video transmission.

WebRTC likewise uses reliable TCP-based channels for signaling control, but uses UDP as the transport layer protocol for audio and video data transmission (as shown in the figure above, right).

2. Application layer protocols: RTP and RTCP

Is UDP alone enough for real-time audio and video communication? Clearly not! An application layer protocol on top of UDP is also needed to provide additional safeguards for audio and video communication.

2.1. RTP protocol

In audio and video transmission, a single video frame usually contains too much data for one packet, so it must be split across multiple packets, and the receiver must reassemble those packets into frames to correctly reconstruct the video signal. This requires at least two things:

  1. Detecting out-of-order packets and maintaining synchronization between sampling and playback.

  2. Detecting packet loss at the receiving end.

UDP itself has no such capability. Therefore, real-time audio and video transmission uses RTP as an application layer protocol on top of UDP, rather than relying on UDP alone.

RTP (Real-time Transport Protocol) is designed for real-time data transfer. So what capabilities does the RTP protocol provide?

It provides the following four things:

  1. End-to-end transmission of real-time data.

  2. Sequence numbers (for detecting packet loss and reordering)

  3. Timestamps (for time synchronization and distribution monitoring)

  4. Payload type definition (identifying the data encoding format)

But it does not provide:

  1. Timely delivery

  2. Quality-of-service guarantees

  3. Guaranteed delivery (packets may be lost)

  4. Guaranteed ordering (packets may arrive out of order)

Let's take a brief look at the RTP protocol specification [1].

An RTP packet consists of two parts: the header and the payload.

The RTP header is explained below. The first 12 bytes are fixed; the CSRC list that follows may contain zero or more entries.

  • V: indicates the RTP version number. The current version number is 2.

  • P: Padding flag, 1 bit. If P=1, the packet ends with one or more additional padding octets that are not part of the payload.

  • X: Extension flag, 1 bit, followed by an extension header if X=1.

  • CC: CSRC count, 4 bits, indicating the number of CSRC identifiers.

  • M: Marker flag, 1 bit. Its meaning depends on the payload: for video it marks the end of a frame; for audio it marks the beginning of a talkspurt.

  • PT (payload type): 7 bits, recording the payload type/codec of the RTP packet. In streaming media, PT distinguishes audio streams from video streams so the receiver can find the corresponding decoder.

  • Sequence number: 16 bits, identifying the sequence number of the RTP packet sent by the sender; it increases by 1 for each RTP packet sent. When the underlying transport is UDP and network conditions are poor, this field can be used to detect packet loss; when network jitter causes reordering, it can be used to restore the original order. The initial sequence number is random, and audio and video packets are numbered separately.

  • Timestamp: 32 bits. For video, a 90 kHz clock is typically used (90000 in code). The timestamp reflects the sampling instant of the first octet in the RTP packet. The receiver uses timestamps to calculate delay and delay jitter and to perform synchronization; the timing of packets can be reconstructed from the RTP timestamps.

  • Synchronization source (SSRC) identifier: 32 bits, identifying the synchronization source, i.e. the source that generates the media stream. It is independent of the network address. The receiver uses the SSRC to distinguish different sources and to group RTP packets accordingly.

  • Contributing source (CSRC) identifiers: each is 32 bits, and there may be 0 to 15 of them. Each CSRC identifies a contributing source whose data is included in the payload of this RTP packet.
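To make the field layout above concrete, here is a minimal TypeScript sketch that parses the fixed 12-byte header and the CSRC list from a raw RTP packet (no header-extension or payload handling; purely illustrative):

```typescript
// Minimal RTP fixed-header parser (sketch). Follows the 12-byte layout described above;
// does not handle header extensions or payload-specific structure.
interface RtpHeader {
  version: number;        // V: 2 bits
  padding: boolean;       // P: 1 bit
  extension: boolean;     // X: 1 bit
  csrcCount: number;      // CC: 4 bits
  marker: boolean;        // M: 1 bit
  payloadType: number;    // PT: 7 bits
  sequenceNumber: number; // 16 bits
  timestamp: number;      // 32 bits
  ssrc: number;           // 32 bits
  csrcs: number[];        // 0..15 entries, 32 bits each
}

function parseRtpHeader(buf: Uint8Array): RtpHeader {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const csrcCount = buf[0] & 0x0f;
  const header: RtpHeader = {
    version: buf[0] >> 6,            // top 2 bits of byte 0
    padding: (buf[0] & 0x20) !== 0,
    extension: (buf[0] & 0x10) !== 0,
    csrcCount,
    marker: (buf[1] & 0x80) !== 0,   // top bit of byte 1
    payloadType: buf[1] & 0x7f,      // low 7 bits of byte 1
    sequenceNumber: view.getUint16(2), // network byte order (big-endian)
    timestamp: view.getUint32(4),
    ssrc: view.getUint32(8),
    csrcs: [],
  };
  for (let i = 0; i < csrcCount; i++) {
    header.csrcs.push(view.getUint32(12 + i * 4));
  }
  return header;
}
```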

2.2. RTCP protocol

As mentioned earlier, the RTP protocol itself is fairly bare-bones and does not provide on-time delivery or other quality-of-service (QoS) guarantees. Therefore, RTP needs a companion protocol, RTCP (RTP Control Protocol), to help ensure its service quality.

The RTP standard defines two sub-protocols, RTP and RTCP.

For example, packet loss, reordering, and jitter all occur during audio and video transmission, and WebRTC has corresponding handling strategies for them at the lower layers. But conveying this "network quality information" to the other party in real time is precisely RTCP's role. Compared with RTP, RTCP occupies very little bandwidth, usually only about 5%.

First, there are many types of RTCP packets:

  1. Sender Report (SR): sending and receiving statistics from currently active senders. PT=200

  2. Receiver Report (RR): receiving statistics from participants that are not active senders. PT=201

  3. Source Description (SDES): source description items, including the CNAME. PT=202

  4. BYE: a participant leaves the session. PT=203

  5. APP: application-defined packets. PT=204

  6. Extended inter-arrival jitter report (IJ). PT=195

  7. Transport layer feedback (RTPFB). PT=205

  8. Payload-specific feedback (PSFB). PT=206, and so on.

Here we focus on two important packet types, SR and RR, which are used to inform the sending and receiving ends about network quality.

The figure above shows the protocol layout of an SR packet:

  • The Header part identifies the type of the packet, for example, SR (200) or RR (201).

  • The Sender Info section describes, in the role of sender, how many packets (and bytes) have been sent.

  • The Report Block section describes, in the role of receiver, how the packets from each SSRC are being received.

By exchanging these reports, each end learns how the network transmission is performing and can adjust its sending strategy accordingly. Of course, the protocol covers far more than this short description, including how the various feedback values are calculated; there is no space to elaborate here.
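As a rough illustration of that SR layout, here is a sketch that pulls the common header and the Sender Info fields out of an SR packet (report blocks, validation, and NTP conversion are omitted; the offsets follow the RFC 3550 SR format):

```typescript
// Sketch: parse the RTCP common header and the Sender Info block of an SR packet.
// Report Blocks (one per received SSRC) follow after the Sender Info.
interface RtcpSenderReport {
  packetType: number;   // PT: 200 for SR, 201 for RR
  reportCount: number;  // number of Report Blocks that follow
  senderSsrc: number;   // SSRC of the sender issuing this report
  ntpTimestamp: bigint; // 64-bit NTP timestamp (wallclock reference)
  rtpTimestamp: number; // the same instant expressed in RTP clock units
  packetsSent: number;  // sender's cumulative packet count
  octetsSent: number;   // sender's cumulative payload byte count
}

function parseSenderReport(buf: Uint8Array): RtcpSenderReport {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  return {
    reportCount: buf[0] & 0x1f,      // low 5 bits of the first byte
    packetType: buf[1],              // expect 200 for an SR
    senderSsrc: view.getUint32(4),
    ntpTimestamp: view.getBigUint64(8),
    rtpTimestamp: view.getUint32(16),
    packetsSent: view.getUint32(20),
    octetsSent: view.getUint32(24),
  };
}
```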

2.3. RTP Session Flow Summary

Now that you understand why UDP was chosen and what RTP/RTCP do, let's briefly summarize the flow at the transport protocol level:

When an application establishes an RTP session, it determines a pair of destination transport addresses. A destination transport address consists of a network address and a pair of ports: one for RTP packets and one for RTCP packets. RTP data is sent to the even-numbered UDP port, and the corresponding RTCP control data is sent to the next odd-numbered UDP port (the even port + 1), forming a UDP port pair (a minimal sketch of this convention follows the list below). The general process is as follows:

  1. RTP receives streaming media data from the upper layer and encapsulates it into RTP packets.

  2. RTCP receives control information from the upper layer and encapsulates it into an RTCP control packet.

  3. RTP sends the RTP packets to the even-numbered port of the UDP port pair; RTCP sends the RTCP control packets to the odd-numbered port of the pair.
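A minimal sketch of the port-pair convention with Node.js dgram sockets (the port number 5004 and the sendToPeer helper are illustrative assumptions, not part of any standard API):

```typescript
// Sketch: an RTP/RTCP port pair — RTP on an even UDP port, RTCP on that port + 1.
import * as dgram from "dgram";

const RTP_PORT = 5004;          // even port: RTP media packets
const RTCP_PORT = RTP_PORT + 1; // odd port: RTCP control packets

const rtpSocket = dgram.createSocket("udp4");
const rtcpSocket = dgram.createSocket("udp4");

rtpSocket.bind(RTP_PORT, () => console.log(`RTP listening on ${RTP_PORT}`));
rtcpSocket.bind(RTCP_PORT, () => console.log(`RTCP listening on ${RTCP_PORT}`));

// Sending side: media packets go to the peer's even port,
// the matching control packets go to the peer's odd port.
function sendToPeer(rtpPacket: Buffer, rtcpPacket: Buffer, peerAddr: string) {
  rtpSocket.send(rtpPacket, RTP_PORT, peerAddr);
  rtcpSocket.send(rtcpPacket, RTCP_PORT, peerAddr);
}
```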

2.4. Quickly capture RTP and RTCP packets using Wireshark

Knowledge gained only on paper always feels shallow; to truly understand, you have to put it into practice. Let's reveal the true face of RTP and RTCP packets by actually playing WebRTC streaming media and capturing the packets.

Wireshark is a powerful piece of software for capturing and analyzing network packets: it displays the exchange of packets in detail and is commonly used to monitor network requests and locate network problems. The steps are as follows:

  1. Download and install Wireshark (very simple).

  2. Open Tencent Classroom in the browser and select a free, live course; these generally use WebRTC for live streaming. (Opening the browser's WebRTC debugging tool in another tab will show the network statistics of the real-time media streams the page is playing.)

  3. Open Wireshark and select the network interface on which to capture packets. If your machine has multiple network cards, you need to choose one; here I selected my Wi-Fi and clicked the blue button in the upper left corner to start capturing.

  4. Once capturing starts, a large number of packets of various protocols are displayed. The figure below labels the function of each area in Wireshark. After entering udp in the filter bar, only UDP packets are listed, and some of them are already parsed; their protocols are clearly shown in the Protocol column.

  5. Wireshark cannot directly recognize the RTP inside these UDP packets, so right-click a UDP packet and use "Decode As..." to decode it as RTP.

  6. Wireshark can now identify the RTP packets, and the protocol hierarchy can be read from the outside in: IP => UDP => RTP. Click the RTP layer and expand it, and you will see the fields match the RTP header described earlier: the version number, the padding flag, and so on.

Payload Type (PT): the type of the payload; here the value 122 can be confirmed from the WebRTC SDP to be an H264 video payload type.

Timestamp: the recorded sampling time is 6120, which needs to be converted using the sampling clock rate.

SSRC: the synchronization source identifier is 0x0202C729. All of the above are RTP header fields; the payload that follows is the media data itself.
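Converting that timestamp from RTP clock units into seconds is just a division by the clock rate; a tiny sketch (the 90 kHz rate is what the captured H264 video stream negotiated in its SDP):

```typescript
// Convert an RTP timestamp into seconds using the media clock rate.
// For the H264 video stream captured above, the clock rate is 90000 Hz.
function rtpTimestampToSeconds(timestamp: number, clockRateHz: number): number {
  return timestamp / clockRateHz;
}

console.log(rtpTimestampToSeconds(6120, 90000)); // 0.068 seconds
```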

RTP not only works well on top of UDP and facilitates low-latency audio and video transmission, it also allows the receiver and sender to negotiate the encapsulation and codec format of the media data through other protocols. The Payload Type field is flexible and supports a wide range of audio and video data types (see the list of RTP payload formats).

Let's look more closely at how audio and video frames map onto RTP packets:

In the capture, the packets from seq=21 to seq=24 each carry one audio frame, so their timestamps are all different. The packets in the red box, from seq=96 to seq=102, together make up one video frame with PT=122, so their timestamps are all the same. This is because a video frame contains a large amount of data and must be sent in multiple packets, while an audio frame is small enough to be sent as a single packet. The packet lengths also show that the video packets are much larger than the audio packets. In addition, seq=102 has the marker field set to true, indicating the last packet of the video frame. By combining the sequence numbers, we can tell whether the received audio and video data has been reordered or lost.
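A sketch of how a receiver could use the sequence number, timestamp, and marker bit in exactly this way to group packets into frames and notice loss or reordering (simplified: 16-bit sequence wrap-around is ignored):

```typescript
// Sketch: group RTP packets of one video stream into frames and detect loss/reorder.
// Simplified: ignores 16-bit sequence number wrap-around.
interface RtpPacketInfo {
  sequenceNumber: number;
  timestamp: number; // all packets of one video frame share the same timestamp
  marker: boolean;   // true on the last packet of a video frame
}

function analyze(packets: RtpPacketInfo[]): void {
  let expectedSeq = packets[0].sequenceNumber;
  const frames = new Map<number, RtpPacketInfo[]>(); // timestamp -> packets

  for (const pkt of packets) {
    if (pkt.sequenceNumber > expectedSeq) console.log(`loss before seq ${pkt.sequenceNumber}`);
    if (pkt.sequenceNumber < expectedSeq) console.log(`reordered packet seq ${pkt.sequenceNumber}`);
    expectedSeq = Math.max(expectedSeq, pkt.sequenceNumber) + 1;

    const frame = frames.get(pkt.timestamp) ?? [];
    frame.push(pkt);
    frames.set(pkt.timestamp, frame);
    if (pkt.marker) {
      console.log(`frame with timestamp ${pkt.timestamp} complete: ${frame.length} packets`);
    }
  }
}
```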

  1. Open the RTP streams panel via the menu bar: "Telephony" => "RTP" => "RTP Streams". You can see one audio RTP stream and one video RTP stream. The left side shows each stream's source address and port and its destination address and port; the right side shows RTP-related information: the stream's SSRC, payload type, packet loss, and so on.
![](https://p9-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/8c1384e475a74a6391e2d15c4a812943~tplv-k3u1fbpfcp-watermark.image)
  2. Finally, let's look at RTCP. Enter rtcp in the filter bar and you can see the Sender Report and Receiver Report packets. Their PT (packet type) values are 200 and 201 respectively, the reported SSRC is 0x02029DFC, and the details of packets sent and received are shown. For a deeper analysis, study these fields together with the RTCP specification.
![](https://p3-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/0c5e8a76818e48c19ebe9c4b04816c53~tplv-k3u1fbpfcp-watermark.image)
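As an aside for front-end readers: much of what SR/RR feed back (packet loss, jitter, round-trip time) also surfaces in the browser through RTCPeerConnection.getStats(). A small sketch, assuming pc is an already-established RTCPeerConnection:

```typescript
// Sketch: read RTCP-derived quality stats from an existing RTCPeerConnection `pc`.
async function logNetworkQuality(pc: RTCPeerConnection): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report) => {
    if (report.type === "inbound-rtp") {
      // Loss and jitter observed on streams we receive.
      console.log(report.kind, "packetsLost:", report.packetsLost, "jitter:", report.jitter);
    }
    if (report.type === "remote-inbound-rtp") {
      // The remote peer's view of our outgoing stream, carried back via RTCP RR.
      console.log(report.kind, "remote packetsLost:", report.packetsLost, "RTT:", report.roundTripTime);
    }
  });
}
```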

3. Summary

Many people think that, for a developer, learning how to use WebRTC is enough to get started quickly with practice and ship a product. Is it really necessary to understand these transport protocols? But knowing how to use something does not mean you can use it to its full potential. I believe the ceiling of what you can achieve with a technology often depends on how deeply you understand what lies beneath it.

This article briefly explained why real-time audio and video chose UDP as the transport layer protocol, and introduced two of the more important protocols involved in WebRTC, RTP and RTCP. WebRTC brings together many technologies (audio and video processing, transmission, security and encryption, and so on), and the protocols involved in each module could fill an article of their own; space is limited, and so is what I have mastered, so this article cannot go further. If you want to learn and practice WebRTC, this article only offers a first glimpse at its transport protocol level. Because protocols live at the lower layers, they rarely draw attention in day-to-day use, which is why this article also shows how to quickly start capturing packets to aid understanding; for further study, you will need to consult additional material.

4. Reference materials

[1] RTP protocol: tools.ietf.org/html/rfc35f…