Video development has long been one of the most fragmented and incompatible parts of the Android ecosystem and its APIs. Google exercises weak control over the camera and video-encoding APIs, so different vendors implement these two APIs in noticeably different ways. On top of that, the API design itself has seen little refinement, and some people go as far as calling it “one of the hardest APIs on Android to work with”.

Taking WeChat as an example, suppose we want to record a 540p MP4 file. On Android, the overall flow generally looks like this:



Basically, the YUV frames output by the camera are preprocessed and then fed into the encoder to obtain the encoded H.264 video stream.

That only covers encoding of the video stream; the audio stream has to be recorded separately, and the two streams are then muxed together to produce the final video.

This article mainly analyzes two common problems in encoding the video stream:

  1. Which video encoder to choose (hardware or software)?
  2. How to quickly preprocess (mirror, scale, rotate) the YUV frames output by the camera?

Video encoder selection

When recording video, many apps need to process every frame individually, so they rarely use MediaRecorder to record video directly. Generally speaking, there are two options:

  • MediaCodec
  • FFMpeg+x264/openh264

Let’s break it down one by one


FFMpeg+x264/openh264

Besides MediaCodec, the other popular option is software encoding with FFmpeg + x264/openh264: FFmpeg handles some of the video-frame preprocessing, while x264 or openh264 does the actual video encoding.

x264 is generally regarded as the fastest production-quality video encoder available today, and it supports essentially all H.264 features; with sensible parameter configuration you can get a good balance of compression ratio and encoding speed. Space does not allow a full discussion of H.264 parameter tuning, so take a look at the x264 encoding-parameter tuning guides here and here.
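
As a reference point, here is a minimal sketch of how an x264 encoder might be opened for this kind of mobile recording; the preset, tune, and rate-control values are illustrative assumptions, not the settings actually used in WeChat:

    // Minimal sketch of opening an x264 encoder for mobile recording.
    // The preset/tune/bitrate values are illustrative, not WeChat's real settings.
    #include <x264.h>

    x264_t* open_x264_encoder(int width, int height, int fps, int bitrate_kbps) {
        x264_param_t param;
        x264_param_default_preset(&param, "veryfast", "zerolatency");  // favour speed on mobile CPUs
        param.i_csp            = X264_CSP_I420;
        param.i_width          = width;
        param.i_height         = height;
        param.i_fps_num        = fps;
        param.i_fps_den        = 1;
        param.rc.i_rc_method   = X264_RC_ABR;          // average-bitrate rate control
        param.rc.i_bitrate     = bitrate_kbps;         // in kbit/s
        param.b_repeat_headers = 1;                    // emit SPS/PPS before keyframes
        x264_param_apply_profile(&param, "baseline");  // baseline profile for broad device support
        return x264_encoder_open(&param);
    }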

openh264 is another H.264 encoder, open-sourced by Cisco in 2013. Compared with x264 it is still rather young, but because Cisco pays the full annual H.264 patent fee, external users can use it for free. Firefox ships openh264 directly as its video codec for WebRTC.

However, compared with x264, openh264’s support for advanced H.264 features is weaker:

  • Only the baseline profile is supported, up to level 5.2
  • Multi-threaded encoding only supports slice-based encoding, not frame-based multi-threaded encoding

In terms of encoding performance, openh264 is no faster than x264; its main attraction is that it is free to use.

YUV frame preprocessing

Following the flow given at the beginning, we need to apply some preprocessing to the YUV frames output by the camera before feeding them into the encoder.

1. Scaling

If the camera preview size is set to 1080p, the YUV frames delivered in onPreviewFrame are 1920×1080. If we want to encode video at a different size, we have to scale each YUV frame in real time during recording.

In WeChat, for example, the camera previews 1080p data, from which we need to encode 960×540 video.

The most straightforward approach is to scale directly with FFmpeg’s sws_scale function; for a reasonable trade-off between quality and performance, the SWS_FAST_BILINEAR algorithm is usually chosen:

mScaleYuvCtxPtr = sws_getContext(
		           srcWidth,
		           srcHeight,
		           AV_PIX_FMT_NV21,
		           dstWidth,
		           dstHeight,
		           AV_PIX_FMT_NV21,
		           SWS_FAST_BILINEAR, NULL, NULL, NULL);
sws_scale(mScaleYuvCtxPtr,
          (const uint8_t* const *) srcAvPicture->data,
          srcAvPicture->linesize, 0, srcHeight,
          dstAvPicture->data, dstAvPicture->linesize);

On a Nexus 6P, scaling directly with FFmpeg takes 40ms+ per frame, while a 30fps recording leaves a budget of only about 30ms per frame. If scaling alone eats that much time, we can only record video at around 15fps.

Clearly, scaling directly with FFmpeg is too slow; swscale really is one of FFmpeg’s weaker components. After comparing several algorithms commonly used in the industry, we finally settled on the fast scaling approach below.

We use a local-mean algorithm: each pixel of the output image is computed from four neighbouring source pixels in two adjacent rows. This can be implemented directly with NEON; taking the Y plane as an example:

	const uint8* src_next = src_ptr + src_stride;
    asm volatile (
      "1:                                          \n"
        "vld4.8     {d0, d1, d2, d3}, [%0]!        \n"
        "vld4.8     {d4, d5, d6, d7}, [%1]!        \n"
        "subs       %3, %3, #16                    \n"  // 16 output pixels per loop

        "vrhadd.u8  d0, d0, d1                     \n"
        "vrhadd.u8  d4, d4, d5                     \n"
        "vrhadd.u8  d0, d0, d4                     \n"

        "vrhadd.u8  d2, d2, d3                     \n"
        "vrhadd.u8  d6, d6, d7                     \n"
        "vrhadd.u8  d2, d2, d6                     \n"

        "vst2.8     {d0, d2}, [%2]!                \n"  // store 16 output pixels

        "bgt        1b                             \n"
      : "+r"(src_ptr),    // %0
        "+r"(src_next),   // %1
        "+r"(dst),        // %2
        "+r"(dst_width)   // %3
      :
      : "q0", "q1", "q2", "q3"  // clobber list
    );

The NEON instructions above can only load and store 8 or 16 pixels of data at a time; the leftover data has to be processed with the same algorithm implemented in plain C.
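
For illustration, the plain-C version of the same local-mean downscale for the Y plane looks roughly like this (a sketch with hypothetical names, used for whatever the NEON loop doesn’t cover):

    // Plain-C sketch of the 2x2 local-mean (box) downscale for one output row of the Y plane.
    // src points at the current source row, src + src_stride at the row below;
    // the names here are illustrative, not the actual production code.
    #include <stdint.h>

    static void scale_row_down2_box_c(const uint8_t* src, int src_stride,
                                      uint8_t* dst, int dst_width) {
        const uint8_t* next = src + src_stride;  // the row below
        for (int x = 0; x < dst_width; x++) {
            // Rounded average of the 2x2 source block (close to what the vrhadd sequence computes).
            dst[x] = (uint8_t)((src[0] + src[1] + next[0] + next[1] + 2) >> 2);
            src  += 2;
            next += 2;
        }
    }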

On the Nexus 6P, each frame can now be scaled in less than 5ms. As for quality, comparing the output of this algorithm against FFmpeg’s SWS_FAST_BILINEAR, the peak signal-to-noise ratio (PSNR) is around 38–40 in most scenes, which is good enough.
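
For reference, the PSNR figure quoted above can be computed over the Y plane roughly as follows (a small illustrative sketch, not the exact measurement code used here):

    // Sketch of computing PSNR between two Y planes of equal size (8-bit samples).
    #include <math.h>
    #include <stdint.h>

    double psnr_y(const uint8_t* a, const uint8_t* b, int num_pixels) {
        double mse = 0.0;
        for (int i = 0; i < num_pixels; i++) {
            double d = (double)a[i] - (double)b[i];
            mse += d * d;
        }
        mse /= num_pixels;
        return mse == 0.0 ? INFINITY : 10.0 * log10(255.0 * 255.0 / mse);  // 255 is the peak for 8-bit samples
    }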

2. Rotation

On Android, because of the camera mounting angle, the YUV frames produced by onPreviewFrame are usually rotated by 90 or 270 degrees. If the final video is meant to be portrait, the YUV frames generally need to be rotated.

A rotation implemented in pure C is generally an O(n^2) algorithm; rotating a 960×540 YUV frame on a Nexus 6P takes 30ms+ per frame, which is clearly unacceptable.
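
To make the cost concrete, a naive 90-degree rotation of the Y plane in plain C looks roughly like this; every pixel is copied individually, and the scattered destination writes are unfriendly to the cache (names are illustrative):

    // Naive 90-degree clockwise rotation of a w x h Y plane into an h x w destination.
    // One copy per pixel; the destination writes jump around in memory, which is slow.
    #include <stdint.h>

    static void rotate_y_plane_90cw(const uint8_t* src, uint8_t* dst, int w, int h) {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                dst[x * h + (h - 1 - y)] = src[y * w + x];
            }
        }
    }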

So let’s think about it differently: can we avoid rotating the YUV frames altogether?

In fact, the MP4 file format lets us specify a rotation matrix in the file header, specifically in the moov.trak.tkhd box. The video player reads this matrix during playback and applies the corresponding rotation, translation, scaling, and so on. For details, please refer to the Apple documentation.

Using FFmpeg, we can easily write this rotation angle into the muxed MP4 file:

char rotateStr[1024];
sprintf(rotateStr, "%d", rotate);
av_dict_set(&out_stream->metadata, "rotate", rotateStr, 0);

This way the cost of rotating every frame is avoided entirely during recording, which is a nice win.

3. Mirroring

When shooting with the front camera, if the YUV frames are not processed, the recorded video comes out mirrored. The principle is the same as looking into a mirror: the YUV frames from the front camera are flipped relative to the actual scene. Sometimes this mirrored video is not what we want, so we need to mirror-flip the YUV frames ourselves.

However, since the camera is usually mounted at 90 or 270 degrees, the mirror flip can be achieved simply by swapping rows of the frame, using the middle row as the axis. Note that the Y and UV planes have to be processed separately.

asm volatile (
      "1:                                          \n"
        "vld4.8     {d0, d1, d2, d3}, [%2]!        \n"  // load 32 bytes from the src row
        "vld4.8     {d4, d5, d6, d7}, [%3]!        \n"  // load 32 bytes from the dst row
        "subs       %4, %4, #32                    \n"  // 32 bytes processed per loop
        "vst4.8     {d0, d1, d2, d3}, [%1]!        \n"  // store 32 bytes to the dst row
        "vst4.8     {d4, d5, d6, d7}, [%0]!        \n"  // store 32 bytes to the src row
        "bgt        1b                             \n"
      : "+r"(src),      // %0
        "+r"(dst),      // %1
        "+r"(srcdata),  // %2
        "+r"(dstdata),  // %3
        "+r"(count)     // %4  output operands
      :                 // input operands
      : "cc", "memory", "q0", "q1", "q2", "q3"  // clobber list
    );

Again, the leftover data is handled in plain C. On the Nexus 6P, mirroring a 1080×1920 YUV frame this way takes less than 5ms per frame.
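
That plain-C part is just a per-byte row swap, along these lines (an illustrative sketch with hypothetical names):

    // Plain-C sketch of swapping one pair of mirrored rows, for the bytes the NEON loop doesn't cover.
    #include <stdint.h>

    static void swap_row_tail_c(uint8_t* top, uint8_t* bottom, int count) {
        for (int i = 0; i < count; i++) {
            uint8_t tmp = top[i];
            top[i]      = bottom[i];
            bottom[i]   = tmp;
        }
    }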


After the H.264 video stream has been encoded, the final step is to mux it with the audio stream and package everything into an MP4 file. This can be done with the system’s MediaMuxer, with mp4v2, or with FFmpeg. It is relatively straightforward and won’t be covered here.


References

  1. Lei Xiaohua’s (leixiaohua1020) column, the famous “Leishen” blog, with a large amount of learning material on audio/video coding and FFmpeg; essential reading for getting started. May he rest in peace.
  2. Android MediaCodec stuff, which contains some MediaCodec sample code; a useful first reference when working with MediaCodec.
  3. Coding for NEON, a tutorial series describing how to use common NEON instructions. NEON was used above for scaling; in fact most audio/video processing, including the scaling, rotation, and mirror-flipping of YUV frames, can be optimized with NEON.
  4. libyuv, Google’s YUV processing library. The scaling described in this article only targets the 1080p→540p case; for general-purpose scaling of video frames, libyuv can be used directly.