An overview of end-to-end encryption in WebRTC

With the wide adoption of WebRTC, more and more people enjoy the convenience of real-time audio and video communication. At the same time, Internet users are more concerned about privacy than ever before, so an end-to-end (user to user) encryption solution is urgently needed to protect private communications from being monitored by third parties.

WebRTC requires that the media stream between two connected endpoints be encrypted with DTLS-SRTP, which effectively guarantees the security of a point-to-point link. At present, however, mainstream RTC service providers mostly use an SFU architecture: the two user nodes that need to communicate do not connect to each other directly over WebRTC, but each establish a WebRTC link with the RTC provider's media server, which then relays the media data between them.

The problem with this kind of architecture is that data security has to be guaranteed by the RTC service provider. The link between a user and the provider is relatively secure, but once the provider monitors traffic or the media server has security vulnerabilities, the media data exchanged between the two user nodes becomes insecure.

To solve this problem, this paper proposes an end-to-end encryption scheme that guarantees the data security of two user nodes even when an RTC provider relays the communication. The scheme can be applied to both Web and Native clients.

Common RTC architectures

(figure 1)

Mesh: Each endpoint connects directly to the desired node, and communication security is guaranteed by the two endpoints of the link. Typical scenarios include P2P live streaming.

MCU: Each endpoint connects to the server, which mixes the endpoints' media streams, merges multiple streams into one, and distributes the result to each endpoint. Traditional video conferencing equipment vendors are mostly based on the MCU architecture.

SFU: Each endpoint connects to the server, which forwards media streams to the endpoints that need them. Most Internet video conferencing providers are based on the SFU architecture.

The MCU and SFU architectures are similar. The core difference is that an MCU must mix the senders' media streams before sending them to receivers, while an SFU simply forwards streams, which greatly saves computing resources on the server side. Compared with MCU/SFU, the Mesh architecture establishes connections between users directly, without passing through a server. The advantage is that it occupies almost no server computing resources; however, in some scenarios the client-to-client link may be very poor, or a link may not be possible at all. With MCU/SFU, RTC providers usually optimize their edge nodes, which greatly improves the transmission reliability of the link.

At present, most RTC service providers use the SFU architecture, and the scheme in this paper is also designed for it. Because the Mesh architecture connects clients directly, the DTLS-SRTP required by WebRTC already meets its security requirements. The MCU architecture requires mixing, which means the server must access the stream content in plaintext; since an encrypted link by definition must not expose its content to third parties, end-to-end encryption inherently cannot support the MCU architecture.

Implementation approach

(figure 2)

Figure 2 shows the core WebRTC pipeline from capture to transmission. As the figure shows, there are in theory roughly three candidate points for inserting encryption and decryption besides the raw frame. The raw frame itself is excluded mainly because the encoder depends on the structure of the media data, so the data cannot be encrypted before encoding.

The best points for encryption and decryption are the encoded frame on the sender and the frame just before decoding on the receiver. Encrypting later, at the packet level, is problematic because encryption may change the data size: in extreme cases, when a single packet exceeds the MTU, additional fragmentation and reassembly are required, which increases processing complexity.

On the Native side it is relatively easy to insert a custom processing step, since the internal WebRTC implementation is fully under our control. On the Web side, however, the WebRTC processing flow is constrained by the browser implementation. When deciding on a concrete implementation, it is therefore necessary to understand the browser's internal processing logic and remain fully compatible with it, to ensure that end-to-end encryption is universally applicable.

Frame processing flow of Chrome

The W3C introduced the WebRTC Insertable Streams interface for end-to-end encryption and similar use cases. This interface lets Web applications process encoded frame data. Sample code for the Web side follows:

// Sender example
const pc = new RTCPeerConnection({ encodedInsertableStreams: true });
const sender = pc.addTrack(videoTrack, mediaStream);
const senderTransform = (chunk, controller) => {
    chunk.data = cryptData(chunk.data); // Encrypt the encoded frame
    controller.enqueue(chunk);
};
const { readableStream, writableStream } = sender.createEncodedStreams();
readableStream
    .pipeThrough(new TransformStream({ transform: senderTransform }))
    .pipeTo(writableStream);
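The createEncodedStreams API above is browser-only, but the pipeline shape itself (readable → transform → writable) uses standard Web Streams, which also run in Node. The stand-alone sketch below illustrates that shape; cryptData here is a placeholder XOR, not the real cipher described later:

```javascript
// Stand-alone sketch of the transform pipeline: frames flow from a readable
// stream, through a transform that "encrypts" each chunk, into a writable sink.
// cryptData is a placeholder XOR for illustration, not the real AES cipher.
const cryptData = (data) => data.map((b) => b ^ 0x5a);

async function pipeFrames(frames) {
  const out = [];
  const readable = new ReadableStream({
    start(controller) {
      frames.forEach((f) => controller.enqueue(f));
      controller.close();
    },
  });
  const transform = new TransformStream({
    transform(chunk, controller) {
      controller.enqueue(cryptData(chunk)); // same shape as senderTransform
    },
  });
  const writable = new WritableStream({
    write(chunk) { out.push(chunk); },
  });
  await readable.pipeThrough(transform).pipeTo(writable);
  return out;
}
```

The receiver side mirrors this pipeline, with a decrypting transform inserted before the frame reaches the decoder.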

When WebRTC Insertable Streams is enabled, an additional transform frame is added to the browser’s frame processing flow, as shown in Figure 3.

(figure 3)

A problem

If you modify the encoded frame inside the transform, you will find that the receiver sometimes has normal audio but abnormal video: a corrupted picture, or decoding stops entirely. In the case of H264, this is because the binary stream of the encoded frame contains header information describing the frame, and this information is used during packetization and depacketization. Once encryption alters it, the browser mishandles the frame. Therefore, the header information must be left untouched when processing the encoded frame.

Chrome's H264 packetization and depacketization logic

In Chrome, the raw frame captured by the sender is encoded into a continuous stream of Network Abstraction Layer Units (NALUs). During H264 packetization, the coded stream is split at the NALU start codes (0x000001 or 0x00000001) to determine NALU boundaries, and each NALU is passed to the RTP packetization module, which produces one of two RTP packet formats: STAP-A (Single-Time Aggregation Packet) or FU-A (Fragmentation Unit A). Generally, PPS, SPS, and SEI NALUs are packed together into a STAP-A packet, while large video frames are split into several FU-A packets. Therefore, after a frame is encrypted, the ciphertext must not contain a start code; otherwise the NALUs of the encoded frame will be separated incorrectly and the receiver will be unable to decode the video frame.
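One way to enforce the no-start-code constraint is to scan the ciphertext before sending it; what to do on a hit (re-randomize the IV and re-encrypt, or apply an escaping scheme) is a design choice left open here. A minimal detector might look like:

```javascript
// Hedged sketch: scan a byte buffer for an H264 NALU start code. Checking for
// the 3-byte form 0x000001 is sufficient, because the last three bytes of the
// 4-byte form 0x00000001 are themselves 0x000001.
function containsStartCode(bytes) {
  for (let i = 0; i + 2 < bytes.length; i++) {
    if (bytes[i] === 0 && bytes[i + 1] === 0 && bytes[i + 2] === 1) {
      return true;
    }
  }
  return false;
}
```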

When Chrome receives all the RTP packets of a video frame, it first reassembles the RTP payloads into a video frame, then reads the frame's PPS_ID, and only then passes the frame to the transform function. This means PPS_ID must not change when the encoded frame is edited. According to the H264 standard, the PPS_ID of a video frame is usually located in the third Exp-Golomb coded field of the slice header, as shown in Figure 4. Therefore, frame editing must skip past PPS_ID and process only the data that follows it.

(figure 4)
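As a hedged illustration of how the skip offset could be located, the sketch below reads the first three unsigned Exp-Golomb (ue(v)) fields of a slice header, which per the H264 spec are first_mb_in_slice, slice_type, and pic_parameter_set_id, and returns the bit offset just past PPS_ID. The NALU header byte and emulation-prevention bytes are assumed to be handled by the caller:

```javascript
// Hedged sketch: bit-level Exp-Golomb reader for locating the end of the
// third ue(v) field (pic_parameter_set_id) in a slice header.
class BitReader {
  constructor(bytes) { this.bytes = bytes; this.pos = 0; }
  readBit() {
    const byte = this.bytes[this.pos >> 3];
    const bit = (byte >> (7 - (this.pos & 7))) & 1;
    this.pos++;
    return bit;
  }
  readUE() { // ue(v): count leading zero bits, then read that many value bits
    let zeros = 0;
    while (this.readBit() === 0) zeros++;
    let value = 0;
    for (let i = 0; i < zeros; i++) value = (value << 1) | this.readBit();
    return (1 << zeros) - 1 + value;
  }
}

// Returns the bit offset just past the third ue(v) field of a slice header.
function offsetAfterPpsId(sliceBytes) {
  const r = new BitReader(sliceBytes);
  r.readUE(); // first_mb_in_slice
  r.readUE(); // slice_type
  r.readUE(); // pic_parameter_set_id
  return r.pos;
}
```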

H264 frame editing details

Constrained by Chrome's packetization, two problems must be considered when editing an encoded frame: first, no NALU start code (0x000001 or 0x00000001) may be generated; second, the data preceding the video frame's PPS_ID must not be modified. Meanwhile, since non-video NALUs such as PPS, SPS, and SEI usually carry no sensitive information, we encrypt only video frames to simplify the processing logic, and design the following frame-editing process.

(figure 5)

AES encryption

Having identified the modifiable portion of the media frame, we need to select a reliable encryption scheme.

(figure 6)

The AES-256-CBC symmetric encryption algorithm was selected so that Web and Native clients can exchange encrypted data; it has good compatibility across clients. Symmetric encryption means encryption and decryption use the same key. During AES encryption the data is divided into plaintext blocks of 16 bytes (128 bits) each; if the last block is shorter than 128 bits, it is padded to 128 bits according to the padding policy. In CBC mode, each ciphertext block participates in the encryption of the next plaintext block. To improve reliability, a randomly generated Initialization Vector (IV) is used when computing the first ciphertext block. Decryption is the reverse process and is completed by executing the encryption steps in reverse order.

(figure 7)

In RTC communication, the participants in a room can usually share the same key. A random IV is then generated when encrypting each frame, and the frame data is encrypted with the IV and the key. On the wire, the IV occupies the first 16 bytes of the payload, followed by the AES-encrypted data. To decrypt, the receiver uses the same key, the 16-byte IV header, and the ciphertext as input to recover the encoded frame data.

Since the IV occupies an additional 16 bytes, the impact on video frames, which usually range from a few KB to tens of KB, is relatively small; audio frames, however, are only tens to hundreds of bytes each, so the inserted IV has a comparatively large impact. A possible optimization is to use the timestamp of the audio frame in the chunk object as the random variable. However, the timestamp occupies only 4 bytes, less than 16, so the remaining bytes can be filled according to a custom rule, or the timestamp can simply be tiled four times to form the IV.
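The tiling variant can be sketched as follows; the big-endian byte order is an assumption for illustration, and any fixed rule works as long as both ends agree:

```javascript
// Hedged sketch of the audio-frame optimization: derive the 16-byte IV by
// repeating the chunk's 4-byte timestamp four times, instead of shipping a
// random IV with every small audio frame.
function ivFromTimestamp(timestamp) {
  const iv = Buffer.alloc(16);
  for (let i = 0; i < 4; i++) {
    iv.writeUInt32BE(timestamp >>> 0, i * 4); // tile the 4-byte timestamp
  }
  return iv;
}
```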

Validation

(figure 8)

The scheme was implemented on Web, iOS, and Android. After several hours of testing, including weak-network testing, no abnormality was found in audio or video sending and receiving. When an endpoint without the decryption capability plays an encrypted stream, it shows a green screen, and the RTC statistics report decryption failures.