Original text: signal.org/blog/how-to…

Preface

This article is not a complete translation of the original post, but a summary of its key points.

Signal is a cross-platform encrypted messaging service developed by the Signal Technology Foundation and Signal Messenger LLC. Signal sends one-to-one and group messages over the Internet, which can include images and video, and it also supports one-to-one and group voice calls. The nonprofit Signal Foundation was launched in February 2018, with Brian Acton providing an initial $50 million in funding. Signal has racked up more than 150 million downloads since January 2021, the software has about 40 million monthly active users, and it had been installed on over 50 million Android devices as of February 14, 2021. Elon Musk called on the public to use Signal in 2021 after WhatsApp changed its privacy policy.

Signal relies on centralized servers maintained by Signal Messenger. In addition to routing Signal’s messages, the servers help discover contacts who are also registered Signal users and automatically exchange users’ public keys. By default, Signal’s voice and video calls are direct connections between the two parties; if the caller is not in the recipient’s address book, the call is routed through a server to hide the users’ IP addresses.

Signal also makes extensive use of the Rust language internally, and has open-sourced several of its services and component libraries. These include signal-calling-service, which lets Signal support end-to-end encrypted group calls with up to 40 participants. This article explains how signal-calling-service works.

How the open-source signal-calling-service works

Selective Forwarding Units (SFU)

In a group call, each party needs to send their audio and video to all other participants in the call. There are three possible generic architectures that can do this:

  • Full Mesh: Each participant sends their media data (audio and video) directly to every other participant. This only works for small calls, not for large numbers of participants: most people’s Internet connections aren’t fast enough to send 40 copies of the same video at once.
  • Server Mixing: Each call participant sends their media data to the server. The server “mixes” the media together and sends it to each participant. This works for many participants, but is incompatible with end-to-end encryption because it requires the server to view and change the media data.
  • Selective Forwarding: Each participant sends their media to the server. The server “forwards” the media to the other participants without viewing or changing it. This works for many participants and is compatible with end-to-end encryption.

Because Signal must have end-to-end encryption and scale to many participants, it uses selective forwarding. The server that performs selective forwarding is commonly called a selective forwarding unit, or SFU.

A simplified version of the Rust code for SFU’s main loop logic reads:

let socket = std::net::UdpSocket::bind(config.server_addr);
let mut clients = ...;  // changes over time as clients join and leave
loop {
  let mut incoming_buffer = [0u8; 1500];
  let (incoming_size, sender_addr) = socket.recv_from(&mut incoming_buffer);
  let incoming_packet = &incoming_buffer[..incoming_size];

  for receiver in &clients {
    // Don't send to yourself
    if sender_addr != receiver.addr {
      // Rewriting the packet is needed for reasons we'll describe later.
      let outgoing_packet = rewrite_packet(incoming_packet, receiver);
      socket.send_to(&outgoing_packet, receiver.addr);
    }
  }
}

In fact, the Signal team looked at a number of open-source SFU implementations, but only two had congestion control close to what was needed, and even those would have required many changes to support more than eight participants. The team therefore decided to implement a new SFU in Rust. It has now been running in production for Signal for nine months, easily scales to 40 participants (with more to come), and serves as a highly readable reference implementation of a WebRTC-based SFU. Its congestion control draws on the googcc algorithm.

The hardest part of SFU

The most difficult part of an SFU is forwarding the right video resolutions to each call participant while network conditions are constantly changing.

The difficulty is a combination of the following basic problems:

  • The capacity of each participant’s network connection is constantly changing and hard to know. If the SFU sends too much, it causes extra latency; if it sends too little, quality suffers. The SFU must therefore constantly and carefully adjust how much it sends to each participant to keep it “just right.”
  • The SFU cannot modify the media data it forwards. To adjust how much it sends, it must select from the media sent to it. If the available “options” were limited to sending the highest resolution or sending nothing at all, it would be hard to adapt to a wide range of network conditions. So each participant must send multiple resolutions of video to the SFU, and the SFU must constantly and carefully switch between them.

The solution is to combine several techniques that we will discuss separately:

  • Simulcast and Packet Rewriting allow for switching between different video resolutions.
  • Congestion Control determines the correct amount to send.
  • Rate Allocation determines what to send within that budget.

Simulcast and Packet Rewriting

In order for the SFU to switch between different resolutions, each participant must send multiple layers (resolutions) to the SFU at the same time. This is called simulcast, commonly known as “large and small streams,” and is a standard concept in WebRTC.

  • The upstream generally consists of three streams, divided by resolution and bitrate into F (full, large), H (half, medium), and Q (quarter, small) layers.
  • Downstream, different layers can be assigned to different users: for example, a viewer on a poor network receives the small Q stream, and is switched back to the F stream when the network improves.

Unlike a video streaming server, an SFU stores nothing; its forwarding must be instantaneous, via a process called packet rewriting.

Packet rewriting is the process of changing the timestamps, sequence numbers, and similar IDs in a media packet that indicate its position on a media timeline. It converts packets from many independent media timelines (one per layer) into one unified media timeline (a single layer).

Packet rewriting is compatible with end-to-end encryption because the sending participant adds the IDs and timestamps to be rewritten to the packet after end-to-end encryption has been applied to the media data. This is similar to how TCP sequence numbers and timestamps are added to packets after encryption when TLS is used.

Congestion control

Congestion control is a mechanism for determining how much to send over a network: not too much, not too little. It has a long history, mostly in the form of TCP congestion control. Unfortunately, TCP’s congestion control algorithms generally don’t work well for video calls: they tend to increase latency, which leads to a poor call experience (sometimes referred to as “lag”). To provide good congestion control for video calls, the WebRTC team created googcc, a congestion control algorithm that determines the right amount to send without causing a large increase in latency.

Congestion control mechanisms typically rely on some kind of feedback sent from the packet receiver to the packet sender. googcc is designed to work with transport-cc, in which the receiver periodically sends messages back to the sender such as “I received packet X1 at time Z1; packet X2 at time Z2; …”. The sender then combines this information with its own timestamps: “I sent packet X1 at time Y1 and it was received at Z1; I sent packet X2 at time Y2 and it was received at Z2; …”.

googcc and transport-cc are implemented as stream processing in the Signal Calling Service. The input to the pipeline is the data described above about when each packet was sent and received, which we call acks. The output of the pipeline is a changing value for how much should be sent over the network, called the target send rate.

The first steps of the pipeline plot delay against time and compute the slope to determine whether delay is increasing, decreasing, or steady. The last step decides what to do based on the current slope. A simplified version of the code looks like this:

// Simplified: `yield` stands in for emitting a value from the stream pipeline.
let mut target_send_rate = config.initial_target_send_rate;
for direction in delay_directions {
  match direction {
    DelayDirection::Decreasing => {
      // While the delay is decreasing, hold the target rate to let the queues drain.
    }
    DelayDirection::Steady => {
      // While delay is steady, increase the target rate.
      let increase = ...;
      target_send_rate += increase;
      yield target_send_rate;
    }
    DelayDirection::Increasing => {
      // If the delay is increasing, decrease the rate.
      let decrease = ...;
      target_send_rate -= decrease;
      yield target_send_rate;
    }
  }
}

Congestion control is difficult, but for video calls it now mostly boils down to the following loop:

  • The sender selects an initial rate and starts sending packets.
  • The receiver sends back feedback about when the packet was received.
  • The sender uses this feedback to adjust the sending rate according to the above rules.

Rate allocation

Now that the SFU knows how much to send, it must decide what to send (which layers to forward), a process known as rate allocation.

This process is like the SFU choosing from a menu of layers constrained by a send-rate budget. For example, if each of the three other participants sends two layers, there are six layers on the menu in total.

If the budget is large enough, we can send everything we want (up to the largest layer for each participant). But if not, we must prioritize. To help prioritize, each participant tells the server which resolutions it needs by requesting a maximum resolution for each video. Using this information, we allocate rates with the following rules:

  • Layers larger than the requested maximum are excluded. For example, if you are viewing only a grid of small videos, there is no need to send a high resolution for each video.
  • Smaller layers take precedence over larger ones. For example, it is better to view everyone at low resolution than to view some people at high resolution and others not at all.
  • Larger requested resolutions take precedence over smaller ones. For example, once everyone is visible, the video viewed largest is filled in at higher quality before the others.

A simplified version of the code looks like this:

use std::cmp::Reverse;
use std::collections::HashMap;

// The input: a menu of video options.
// Each has a set of layers to choose from and a requested maximum resolution.
let mut videos = ...;
// The output: for each video above, which layer to forward, if any.
let mut allocated_by_id = HashMap::new();
let mut allocated_rate = 0;

// Biggest first
videos.sort_by_key(|video| Reverse(video.requested_height));

// Lowest layers for each before the higher layer for any
for layer_index in 0..=2 {
  for video in &videos {
    if video.requested_height > 0 {
      // The first layer which is "big enough", or the biggest layer if none are.
      let requested_layer_index = video.layers.iter().position(
          |layer| layer.height >= video.requested_height).unwrap_or(video.layers.len() - 1);
      if layer_index <= requested_layer_index {
        let layer = &video.layers[layer_index];
        let (_, allocated_layer_rate) =
            allocated_by_id.get(&video.id).copied().unwrap_or_default();
        let increased_rate = allocated_rate + layer.rate - allocated_layer_rate;
        if increased_rate < target_send_rate {
          allocated_by_id.insert(video.id, (layer_index, layer.rate));
          allocated_rate = increased_rate;
        }
      }
    }
  }
}

Integration

By combining these three technologies, there is a complete solution:

  • The SFU uses googcc and transport-cc to determine how much it should send to each participant.
  • The SFU uses rate allocation to choose which video resolutions (layers) to forward within that budget.
  • The SFU rewrites packets from multiple layers into a single layer per video stream.

The result is that each participant views all the other participants in the best way possible given current network conditions, in a manner compatible with end-to-end encryption.

End-to-end encryption

Since we are discussing end-to-end encryption, it is worth briefly describing how it works. Because it is completely opaque to the server, its code lives not in the server but in the clients. In particular, the implementation lives in RingRTC, an open-source video calling library written in Rust.

The contents of each frame are encrypted before being split into packets, similar to SFrame. The interesting part is the key distribution and rotation mechanism, which must be robust in the following scenarios:

  • Someone who has not joined the call must be unable to decrypt media from before they joined. Otherwise, anyone with access to the encrypted media (for example, by compromising the SFU) could learn what happened in the call before they joined, or worse, in a call they never joined at all.
  • Someone who leaves the call must be unable to decrypt media from after they left. Otherwise, anyone with access to the encrypted media could learn what happened in the call after they left.

To ensure these attributes, we use the following rules:

  • When a client joins a call, it generates a key, sends it to all the other clients in the call via Signal messages (which are themselves end-to-end encrypted), and uses that key to encrypt its media before sending it to the SFU.
  • Whenever any user joins or leaves the call, each client in the call generates a new key and sends it to all the clients in the call. It starts using that key 3 seconds later, giving the other clients time to receive it.

With these rules, each client controls its own key distribution and rotation, rotating keys based on who is in the call rather than who was invited to it. This means each client can verify that the security properties above are guaranteed.

Summary

In a Reddit discussion thread about the article, the library’s author said:

  1. I enjoyed writing this project in Rust and think Rust is the best language for implementing such projects.
  2. There are two ongoing efforts to optimize performance:
    • Making packet reading and writing faster and more concurrent by using epoll and multithreading.
    • Making the locking more granular.
  3. Almost all of the server’s main logic matters less to performance than the generic work of “pushing a lot of packets through the server.”
  4. RingRTC integrates the Rust code into the Android platform via JNI.