This article was originally contributed by the Rongyun technical team. The author is Su Dao, a senior WebRTC engineer at Rongyun. Please credit the source when reprinting.

1. Introduction

In a typical IM application, the display of the first frame of a video is an important user experience metric when using real-time audio and video chat.

This article analyzes the audio and video processing flow on the WebRTC receiving side in order to understand and optimize the display time of the first video frame, and shares a summary of the findings.

2. What is WebRTC?

For readers who have never worked with real-time audio and video technology, here is a brief introduction to what WebRTC is.

Speaking of WebRTC, we have to mention Global IP Solutions, or GIPS: a VoIP software company founded in Stockholm, Sweden in 1990 that offered what was arguably the best voice engine in the world. See “Interview with the Father of WebRTC Standards: WebRTC Past, Present and Future”.

Skype, Tencent QQ, WebEx, Vidyo, and others use its audio processing engine, which includes patented echo cancellation algorithms, low-latency algorithms to accommodate network jitter and packet loss, and advanced audio codecs.

Google also licensed GIPS technology for Google Talk. Google acquired GIPS in 2011 for $68.2 million and open-sourced its code. Combined with the VPx series of video codecs obtained through the 2010 acquisition of On2 (see “Instant Messaging Audio and Video Development (XVII): Video Coding H.264 and VP8 Past and Present”), the WebRTC open source project was born: the GIPS audio and video engine plus the VPx video codecs as a replacement for H.264.

Since then, Google has combined WebRTC with libjingle, the open source P2P hole-punching project used in Gtalk. Currently WebRTC supports all major platforms, including Web, iOS, Android, Mac, Windows and Linux.

(The above introduction is quoted from “The Great WebRTC: Perfecting the Ecosystem, or Revolutionizing real-time audio and video Technology”)

Although WebRTC was designed for cross-platform real-time audio and video communication on the Web, its core layer code is native, high quality, and highly cohesive, so developers can easily port it and apply it outside the Web platform. So far, WebRTC is almost the only high-quality real-time audio and video communication technology available for free in the industry.

3. Process introduction

A typical real-time audio and video processing flow might look like this:

  • 1) The sender captures audio and video data and generates frame data through the encoder;
  • 2) The frame data is packaged into RTP packets and sent to the receiver through the ICE channel;
  • 3) The receiver receives the RTP packets, extracts the RTP payload, and reassembles it into frames;
  • 4) After that, the audio and video decoders decode the frame data into video images or audio PCM data.

As shown below:

The parameter tuning discussed in this article concerns step 4 in the figure above.

Since this is the receiving end, it receives the Offer request from the peer and therefore calls SetRemoteDescription first and then SetLocalDescription.

This corresponds to the blue part of the figure below:

4. Parameter adjustment

4.1 Video Parameter adjustment

When SetRemoteDescription is called from the Signal thread, the VideoReceiveStream object is created on the Worker thread. The call path is SetRemoteDescription -> VideoChannel::SetRemoteContent_w -> create WebRtcVideoReceiveStream.

WebRtcVideoReceiveStream contains a stream_ member of type VideoReceiveStream, which is created through webrtc::VideoReceiveStream* Call::CreateVideoReceiveStream.

Immediately after creation, the VideoReceiveStream is started by calling its Start() method.

At this point VideoReceiveStream contains an RtpVideoStreamReceiver object ready to start processing video RTP packets.

After createAnswer completes, the receiver sets the local description via setLocalDescription.

On the Worker thread, the corresponding SetLocalContent_w method sets the channel's receive parameters according to the SDP, which eventually calls WebRtcVideoReceiveStream::SetRecvParameters.

The implementation of WebRtcVideoReceiveStream::SetRecvParameters is as follows:

void WebRtcVideoChannel::WebRtcVideoReceiveStream::SetRecvParameters(
    const ChangedRecvParameters& params) {
  bool video_needs_recreation = false;
  bool flexfec_needs_recreation = false;
  if (params.codec_settings) {
    ConfigureCodecs(*params.codec_settings);
    video_needs_recreation = true;
  }
  if (params.rtp_header_extensions) {
    config_.rtp.extensions = *params.rtp_header_extensions;
    flexfec_config_.rtp_header_extensions = *params.rtp_header_extensions;
    video_needs_recreation = true;
    flexfec_needs_recreation = true;
  }
  if (params.flexfec_payload_type) {
    ConfigureFlexfecCodec(*params.flexfec_payload_type);
    flexfec_needs_recreation = true;
  }
  if (flexfec_needs_recreation) {
    RTC_LOG(LS_INFO) << "MaybeRecreateWebRtcFlexfecStream (recv) because of "
                        "SetRecvParameters";
    MaybeRecreateWebRtcFlexfecStream();
  }
  if (video_needs_recreation) {
    RTC_LOG(LS_INFO)
        << "RecreateWebRtcVideoStream (recv) because of SetRecvParameters";
    RecreateWebRtcVideoStream();
  }
}

According to the SetRecvParameters code above, if codec_settings or rtp_header_extensions is non-empty, the VideoReceiveStream is recreated (and if rtp_header_extensions or flexfec_payload_type is non-empty, the FlexFEC stream is recreated as well).

video_needs_recreation indicates whether the VideoReceiveStream needs to be restarted.

The restart process: release the previously created VideoReceiveStream and build a new one.

Take codec_settings as an example. Suppose the initial local video codec list supports H264 and VP8. If the peer supports only H264, the negotiated codec list contains only H264, so codec_settings in SetRecvParameters is non-null (H264). In fact, the VideoReceiveStream supports H264 both before and after the negotiation, so rebuilding it is unnecessary. The optimization is to configure the locally supported initial video codec list and RTP extensions so that the receive parameters generated from the local SDP and the remote SDP stay consistent, and to check whether codec_settings actually changed: only when they differ is video_needs_recreation set to true.

This prevents SetRecvParameters from triggering the VideoReceiveStream restart logic.

In debug mode, after this modification, verifying that the log "RecreateWebRtcVideoStream (recv) because of SetRecvParameters" is no longer printed confirms that the VideoReceiveStream was not restarted.
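The idea can be illustrated with a small, self-contained sketch. This is not the actual WebRTC code; the class and member names below are invented for illustration. The point is that the receive stream is only recreated when the newly negotiated codec list actually differs from the one already in use:

#include <iostream>
#include <string>
#include <vector>

// Minimal stand-in for a negotiated codec entry.
struct CodecSetting {
  std::string name;
  int payload_type;
  bool operator==(const CodecSetting& other) const {
    return name == other.name && payload_type == other.payload_type;
  }
};

class SimpleVideoReceiveStream {
 public:
  // Returns true only when a recreation was actually necessary.
  bool SetRecvCodecs(const std::vector<CodecSetting>& negotiated) {
    if (negotiated == current_codecs_) {
      // Same codecs as before: keep the running stream, no extra first-frame delay.
      return false;
    }
    current_codecs_ = negotiated;
    Recreate();
    return true;
  }

 private:
  void Recreate() { std::cout << "recreating receive stream\n"; }
  std::vector<CodecSetting> current_codecs_;
};

int main() {
  SimpleVideoReceiveStream stream;
  std::vector<CodecSetting> h264_only = {{"H264", 102}};
  stream.SetRecvCodecs(h264_only);  // first negotiation: recreation happens
  stream.SetRecvCodecs(h264_only);  // same parameters again: no recreation
  return 0;
}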

4.2 Audio Parameter adjustment

Similar to the video adjustment above, the audio stream will be recreated when the RTP extensions are inconsistent: the original AudioReceiveStream is released and a new one is created.

Reference code:

bool WebRtcVoiceMediaChannel::SetRecvParameters(
    const AudioRecvParameters& params) {
  TRACE_EVENT0("webrtc", "WebRtcVoiceMediaChannel::SetRecvParameters");
  RTC_DCHECK(worker_thread_checker_.CalledOnValidThread());
  RTC_LOG(LS_INFO) << "WebRtcVoiceMediaChannel::SetRecvParameters: "
                   << params.ToString();
  // TODO(pthatcher): Refactor this to be more clean now that we have
  // all the information at once.

  if (!SetRecvCodecs(params.codecs)) {
    return false;
  }

  if (!ValidateRtpExtensions(params.extensions)) {
    return false;
  }
  std::vector<webrtc::RtpExtension> filtered_extensions = FilterRtpExtensions(
      params.extensions, webrtc::RtpExtension::IsSupportedForAudio, false);
  if (recv_rtp_extensions_ != filtered_extensions) {
    recv_rtp_extensions_.swap(filtered_extensions);
    for (auto& it : recv_streams_) {
      it.second->SetRtpExtensionsAndRecreateStream(recv_rtp_extensions_);
    }
  }
  return true;
}

The constructor of the AudioReceiveStream starts the audio device by calling StartPlayout of the AudioDeviceModule.

AudioReceiveStream’s destructor stops the audio device by calling AudioDeviceModule’s StopPlayout.

Restarting the AudioReceiveStream triggers StartPlayout/StopPlayout multiple times.

In tests, these unnecessary operations resulted in a small interruption in the audio when entering the video conference room.

Solution: configure the initial audio codec list and RTP extensions so that the receive parameters generated from the local SDP and the remote SDP are consistent, which avoids the AudioReceiveStream restart logic.
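A hedged sketch of the same idea for RTP extensions (simplified types, not the real webrtc::RtpExtension or FilterRtpExtensions): if both the locally configured list and the negotiated list are filtered and ordered the same way, the equality check that guards SetRtpExtensionsAndRecreateStream passes and the restart is skipped.

#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Simplified stand-in for an RTP header extension (URI + negotiated id).
struct RtpExt {
  std::string uri;
  int id;
  bool operator==(const RtpExt& other) const {
    return uri == other.uri && id == other.id;
  }
  bool operator<(const RtpExt& other) const {
    return std::tie(uri, id) < std::tie(other.uri, other.id);
  }
};

// Keep only extensions supported for audio and sort them, so that two lists
// describing the same negotiation always compare equal.
std::vector<RtpExt> NormalizeAudioExtensions(
    std::vector<RtpExt> extensions,
    const std::vector<std::string>& supported_uris) {
  extensions.erase(
      std::remove_if(extensions.begin(), extensions.end(),
                     [&](const RtpExt& e) {
                       return std::find(supported_uris.begin(),
                                        supported_uris.end(),
                                        e.uri) == supported_uris.end();
                     }),
      extensions.end());
  std::sort(extensions.begin(), extensions.end());
  return extensions;
}

// The stream only needs to be recreated when the normalized lists differ.
bool NeedsStreamRecreation(const std::vector<RtpExt>& current,
                           const std::vector<RtpExt>& negotiated,
                           const std::vector<std::string>& supported_uris) {
  return NormalizeAudioExtensions(current, supported_uris) !=
         NormalizeAudioExtensions(negotiated, supported_uris);
}

int main() {
  std::vector<std::string> supported = {"urn:ietf:params:rtp-hdrext:ssrc-audio-level"};
  std::vector<RtpExt> local = {{"urn:ietf:params:rtp-hdrext:ssrc-audio-level", 1}};
  std::vector<RtpExt> negotiated = local;  // consistent configuration
  return NeedsStreamRecreation(local, negotiated, supported) ? 1 : 0;  // 0: no restart
}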

In addition, most audio codecs are implemented inside WebRTC itself. Removing unused audio codecs can also reduce the size of the compiled WebRTC library.

4.3 Mutual influence of audio and video

There are three very important threads inside WebRTC:

  • 1) Worker thread;
  • 2) Signal thread;
  • 3) Network thread.

Calls to the PeerConnection API enter on the signal thread and are then dispatched to the worker thread.

The worker thread processes media data, while the network thread handles network-related work. In channel.h, methods ending with _w are worker thread methods, and calls from the signal thread to the worker thread are synchronous operations.

For example, InvokeOnWorker in the code below is a synchronous operation, and SetLocalContent_w and SetRemoteContent_w are worker-thread methods.

bool BaseChannel::SetLocalContent(const MediaContentDescription* content,
                                  SdpType type,
                                  std::string* error_desc) {
  TRACE_EVENT0("webrtc", "BaseChannel::SetLocalContent");
  return InvokeOnWorker(
      RTC_FROM_HERE,
      Bind(&BaseChannel::SetLocalContent_w, this, content, type, error_desc));
}

bool BaseChannel::SetRemoteContent(const MediaContentDescription* content,
                                   SdpType type,
                                   std::string* error_desc) {
  TRACE_EVENT0("webrtc", "BaseChannel::SetRemoteContent");
  return InvokeOnWorker(
      RTC_FROM_HERE,
      Bind(&BaseChannel::SetRemoteContent_w, this, content, type, error_desc));
}

The SDP information in setLocalDescription and setRemoteDescription is applied to the audio/video RtpTransceivers through PeerConnection's PushdownMediaDescription method, which sets the SDP information on each transceiver in turn.

Example: if the audio SetRemoteContent_w takes a long time to execute (for instance because InitPlayout of the AudioDeviceModule is slow), it delays the video SetRemoteContent_w that runs after it.

PushdownMediaDescription code:

RTCError PeerConnection::PushdownMediaDescription(
    SdpType type,
    cricket::ContentSource source) {
  const SessionDescriptionInterface* sdesc =
      (source == cricket::CS_LOCAL ? local_description()
                                   : remote_description());
  RTC_DCHECK(sdesc);

  // Push down the new SDP media section for each audio/video transceiver.
  for (const auto& transceiver : transceivers_) {
    const ContentInfo* content_info =
        FindMediaSectionForTransceiver(transceiver, sdesc);
    cricket::ChannelInterface* channel = transceiver->internal()->channel();
    if (!channel || !content_info || content_info->rejected) {
      continue;
    }
    const MediaContentDescription* content_desc =
        content_info->media_description();
    if (!content_desc) {
      continue;
    }
    std::string error;
    bool success = (source == cricket::CS_LOCAL)
                       ? channel->SetLocalContent(content_desc, type, &error)
                       : channel->SetRemoteContent(content_desc, type, &error);
    if (!success) {
      LOG_AND_RETURN_ERROR(RTCErrorType::INVALID_PARAMETER, error);
    }
  }

  ...
}
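The timing consequence can be shown with a minimal, hypothetical sketch (invented names, not WebRTC code): because the loop above pushes the audio and video sections down one after another on the worker thread, any delay in the audio step adds directly to the moment the video receive stream gets configured.

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>
#include <utility>
#include <vector>

int main() {
  using Clock = std::chrono::steady_clock;
  const auto start = Clock::now();
  auto elapsed_ms = [&] {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               Clock::now() - start).count();
  };

  // Stand-ins for the per-transceiver SetRemoteContent_w work.
  std::vector<std::pair<const char*, std::function<void()>>> sections = {
      {"audio", [] {  // e.g. a slow InitPlayout inside the AudioDeviceModule
         std::this_thread::sleep_for(std::chrono::milliseconds(120));
       }},
      {"video", [] {
         std::this_thread::sleep_for(std::chrono::milliseconds(10));
       }},
  };

  // The push-down is sequential, so video configuration waits for audio.
  for (const auto& [name, apply] : sections) {
    apply();
    std::cout << name << " configured at " << elapsed_ms() << " ms\n";
  }
  return 0;
}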

5. Other problems affecting the display of the first frame

5.1 Android image width and height 16-byte alignment

AndroidVideoDecoder is the hardware video decoding implementation for WebRTC on the Android platform. It uses the MediaCodec API to drive the hardware decoder.

MediaCodec has the following decoding-related APIs:

  • 1) dequeueInputBuffer: if the return value is greater than or equal to 0, it is the index of an input buffer to fill with encoded data. This is a synchronous operation;
  • 2) getInputBuffer: the array of ByteBuffers to fill with encoded data. Combined with the return value of dequeueInputBuffer, it yields a ByteBuffer that can be filled with encoded data;
  • 3) queueInputBuffer: after the application copies encoded data into the ByteBuffer, it informs MediaCodec of the index of the input buffer that has been filled;
  • 4) dequeueOutputBuffer: if the return value is greater than or equal to 0, it is the index of an output buffer holding decoded data. This is a synchronous operation;
  • 5) getOutputBuffer: the array of ByteBuffers holding decoded data. Combined with the return value of dequeueOutputBuffer, it yields a ByteBuffer containing decoded data;
  • 6) releaseOutputBuffer: tells the decoder that the data has been processed and releases the ByteBuffer.

In practice, it has been found that the width and height of the video sent by the sender need to be aligned to a multiple of 16, because decoders on some Android phones require this alignment.

The basic principle: video decoding on Android first passes the data to be decoded to MediaCodec via queueInputBuffer, then repeatedly calls dequeueOutputBuffer to check whether a decoded frame is available. If the width and height are not 16-aligned, dequeueOutputBuffer first returns MediaCodec.INFO_OUTPUT_BUFFERS_CHANGED once instead of delivering a decoded frame right away.

Tests found that frames whose width and height are not 16-aligned take about 100 ms longer to decode than aligned frames.
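A small sketch of the corresponding mitigation on the sending side (an illustration of the advice above, not a WebRTC API): align the width and height to a multiple of 16 before encoding; whether to pad, scale, or crop to reach the aligned size is a product decision.

#include <cstdint>
#include <iostream>

// Round a dimension up to the next multiple of 16 (the macroblock size many
// hardware decoders are built around).
constexpr int32_t AlignUpTo16(int32_t value) {
  return (value + 15) & ~15;
}

int main() {
  std::cout << AlignUpTo16(540) << "\n";   // 544
  std::cout << AlignUpTo16(720) << "\n";   // 720 (already aligned)
  std::cout << AlignUpTo16(1080) << "\n";  // 1088
  return 0;
}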

5.2 The Server forwards keyframe requests

On iOS devices, after the app goes into the background, VTDecompressionSessionDecodeFrame returns kVTInvalidSessionErr, indicating that the decoding session is invalid. This triggers a keyframe request from the receiver to the server.

The server must forward the keyframe request from the receiver to the sender. If the server does not forward the keyframe request to the sender, the receiver has no decodable image to render for a long time, resulting in a black screen.

In this case, only after the sender generates a new keyframe and it reaches the receiver can the black screen at the receiver recover.
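A hypothetical sketch of the forwarding rule on the server (invented class and names; real forwarding servers handle this through RTCP PLI/FIR messages): the server keeps a mapping from each subscriber to the publisher of the stream it watches, and relays any keyframe request to that publisher.

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical model of the forwarding rule: a subscriber's keyframe request
// must reach the publisher of the stream, otherwise the subscriber stays on a
// black screen until the publisher happens to send a keyframe on its own.
class KeyframeRequestRouter {
 public:
  void MapSubscriberToPublisher(const std::string& subscriber,
                                const std::string& publisher) {
    publisher_of_[subscriber] = publisher;
  }

  // send_request stands in for however the server reaches a client.
  bool OnKeyframeRequest(
      const std::string& subscriber,
      const std::function<void(const std::string&)>& send_request) {
    auto it = publisher_of_.find(subscriber);
    if (it == publisher_of_.end()) {
      return false;
    }
    send_request(it->second);  // forward the request to the sender
    return true;
  }

 private:
  std::unordered_map<std::string, std::string> publisher_of_;
};

int main() {
  KeyframeRequestRouter router;
  router.MapSubscriberToPublisher("viewer-1", "broadcaster-A");
  router.OnKeyframeRequest("viewer-1", [](const std::string& publisher) {
    std::cout << "ask " << publisher << " for a keyframe\n";
  });
  return 0;
}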

5.3 Examples of WebRTC's internal data-discarding logic

WebRTC also performs many correctness checks between receiving packet data and feeding it to the decoder.

Example 1:

The PacketBuffer records the smallest sequence number currently cached, first_seq_num_ (this value is updated over time). When a packet is inserted via PacketBuffer::InsertPacket, it is discarded if its sequence number is smaller than first_seq_num_. If packets are discarded continuously, the video will not display or will freeze.
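A simplified, self-contained model of that check (not the real PacketBuffer; the wrap-around comparison mirrors the spirit of WebRTC's AheadOf helper):

#include <cstdint>
#include <iostream>

// Wrap-around-aware "a is ahead of b" test for 16-bit RTP sequence numbers.
bool AheadOf(uint16_t a, uint16_t b) {
  return a != b && static_cast<uint16_t>(a - b) < 0x8000;
}

// Minimal model of the discard rule described above.
class SimplePacketBuffer {
 public:
  // Returns false when the packet is older than first_seq_num_ and is dropped.
  bool InsertPacket(uint16_t seq_num) {
    if (!initialized_) {
      first_seq_num_ = seq_num;
      initialized_ = true;
      return true;
    }
    if (AheadOf(first_seq_num_, seq_num)) {
      return false;  // seq_num is before the oldest buffered packet: discard
    }
    return true;  // the real buffer would store it and try to assemble frames
  }

 private:
  bool initialized_ = false;
  uint16_t first_seq_num_ = 0;
};

int main() {
  SimplePacketBuffer buffer;
  buffer.InsertPacket(100);
  std::cout << buffer.InsertPacket(99) << "\n";   // 0: dropped as too old
  std::cout << buffer.InsertPacket(101) << "\n";  // 1: accepted
  return 0;
}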

Example 2:

Under normal circumstances, the picture ID and timestamp of the frames in the FrameBuffer are always increasing.

If the FrameBuffer receives a frame whose picture ID is no greater than that of the last decoded frame, there are two cases:

  • 1) If its timestamp is newer than that of the last decoded frame and it is a keyframe, it is kept: the buffer is cleared and decoding continues from this frame;
  • 2) All frames other than case 1 are discarded.

The code is as follows:

auto last_decoded_frame = decoded_frames_history_.GetLastDecodedFrameId();
auto last_decoded_frame_timestamp =
    decoded_frames_history_.GetLastDecodedFrameTimestamp();
if (last_decoded_frame && id <= *last_decoded_frame) {
  if (AheadOf(frame->Timestamp(), *last_decoded_frame_timestamp) &&
      frame->is_keyframe()) {
    // If this frame has a newer timestamp but an earlier picture id then we
    // assume there has been a jump in the picture id due to some encoder
    // reconfiguration or some other reason. Even though this is not according
    // to spec we can still continue to decode from this frame if it is a
    // keyframe.
    RTC_LOG(LS_WARNING)
        << "A jump in picture id was detected, clearing buffer.";
    ClearFramesAndHistory();
    last_continuous_picture_id = -1;
  } else {
    RTC_LOG(LS_WARNING) << "Frame with (picture_id:spatial_id) ("
                        << id.picture_id << ":"
                        << static_cast<int>(id.spatial_layer)
                        << ") inserted after frame ("
                        << last_decoded_frame->picture_id << ":"
                        << static_cast<int>(last_decoded_frame->spatial_layer)
                        << ") was handed off for decoding, dropping frame.";
    return last_continuous_picture_id;
  }
}

Therefore, for the received stream to play smoothly, the sender and the forwarding server must ensure that the picture_id and timestamp of each video frame are correct.

WebRTC has many other frame-dropping rules. If the network is normal and data is being received continuously, but the video freezes or stays black, the problem usually lies in the stream itself.

6. Summary of this paper

By analyzing the processing logic of the WebRTC audio and video receiving side, this article lists some points that can optimize first-frame display, such as adjusting the parts of the local SDP and remote SDP that affect receiver-side processing so that the Audio/Video ReceiveStream is not restarted.

In addition, it covers the Android decoder's requirements on video width and height, the server's handling of keyframe requests, and some of WebRTC's internal frame-dropping logic that affects video display. Addressing these points has improved the first-frame display time of the Rongyun SDK's video and the user experience.

Due to the author's limited level, this article may have shortcomings; replies and discussion are welcome. (This article is simultaneously published at: www.52im.net/thread-3169…)
