This article was originally written by Ahab under the title "Theoretical Knowledge and Fundamental Concepts Related to Video"; it has been revised since being collected here.

1. Introduction

With the spread of the mobile Internet, real-time audio and video technology plays an important role in more and more scenarios. It is no longer limited to real-time video chat and video conferencing in IM, but is also common in telemedicine, distance education, smart home and other scenarios.

Although real-time audio and video technology is used more and more widely, the technical barrier for programmers remains high, and it is very difficult to fully master it in a short period of time.

Taking real-time audio and video chat in IM as an example, even a simplified video chat stack is essentially a combination of audio/video technology and network technology, as shown in the figure below: everything above the network module belongs to audio and video technology.

▲ Picture quoted from "The Story Behind WeChat Mini Program Audio and Video Technology"

So, to learn real-time audio and video development, you generally start with audio and video technology itself; network technology can be studied entirely separately.

From a beginner's point of view, however, audio and video technology cannot be mastered in a short time. What you can do quickly is understand the relevant concepts and build a knowledge map in your head, which will support deeper study and research later. That is an efficient way to learn the technology.

Through plain language, this article briefly explains 11 very important basic concepts of real-time audio and video technology, in the hope that it will help you in future work in this field.

Learning and communication:

Open source IM framework source: github.com/JackJiang20…

2. About the author

Wang Yinghao: Now lives in Guangzhou.

Github : github.com/yhaolpz

CSDN: blog.csdn.net/yhaolpz

Personal blog: yhaowa.giee.io

3. Reference materials

[1] Zero-Basics Introduction: The Most Popular Video Coding Technology in History

[2] Zero-Basics Introduction: A Comprehensive Overview of Real-Time Audio and Video Technology Fundamentals

[3] One Article Is Enough to Understand the Latency Problem in Real-Time Audio and Video Chat

4. What is video?

According to the principle of persistence of vision, images changing at more than 24 frames per second appear smooth and continuous to the human eye. Such a continuous sequence of pictures is called video.

In layman’s terms, a video is equivalent to showing multiple images in a row, and it works like this:

▲ Picture quoted from "Zero-Basics Introduction: The Most Popular Video Coding Technology in History"

5. What is resolution?

5.1 Basics

Resolution is measured by the number of pixels in the horizontal and vertical directions and indicates how fine a flat image is. How fine a video looks depends not only on the video resolution but also on the screen resolution.

The "P" in 1080P stands for progressive scan, and the number counts the vertical pixels, i.e. the "height" of the image; that is why a 1920 × 1080 video is called 1080P rather than 1920P.

5.2 Upsampling

When a 720P video is played on a 1080P screen, the image needs to be enlarged; this is called upsampling.

"Upsampling" is almost always done by interpolation: new pixels are inserted between the pixels of the original image using a suitable interpolation algorithm, so image enlargement is also called image interpolation.

A brief note on interpolation algorithms.

Technical principles of the common interpolation algorithms:

  • 1) Nearest-neighbor interpolation: fills the target pixels (e.g. four pixels at 2× magnification) with the color of a single pixel of the original image. It is simple and easy to implement and was widely used in the early days, but it produces obvious jagged edges and mosaic artifacts;
  • 2) Bilinear interpolation: an improvement on nearest-neighbor interpolation. It first performs first-order linear interpolation in the two horizontal directions and then in the vertical direction. It effectively compensates for the shortcomings of nearest-neighbor interpolation, but some aliasing remains and some details become unexpectedly soft;
  • 3) Bicubic interpolation: an improvement on bilinear interpolation. It considers not only the gray values of the four directly adjacent pixels but also the rate at which those gray values change, so the interpolated pixels preserve the continuity of the gray-level changes in the original image and the enlarged image shows natural, smooth tonal transitions.

In addition, there are more complex algorithms with better results, such as wavelet interpolation and fractal interpolation. A minimal code sketch of simple image scaling follows below.
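As a rough illustration (not from the original article), the following Python sketch scales an image up with the three methods described above, using the Pillow library (assuming a recent Pillow version; the file names are placeholders):

```python
from PIL import Image

# Hypothetical input: a single 720P frame saved as an image file.
img = Image.open("frame_720p.png")                         # e.g. 1280 x 720
target = (1920, 1080)                                      # enlarge to 1080P

nearest = img.resize(target, Image.Resampling.NEAREST)    # nearest-neighbor: fast, jagged edges
bilinear = img.resize(target, Image.Resampling.BILINEAR)  # bilinear: smoother, slightly soft
bicubic = img.resize(target, Image.Resampling.BICUBIC)    # bicubic: natural, smooth tonal transitions

bicubic.save("frame_1080p_bicubic.png")
```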

5.3 Downsampling

When a 1080P video is played on a 720P screen, the image needs to be reduced; this is called downsampling.

**Downsampling is defined as:** for a sequence of samples, taking one sample every few samples to obtain a new sequence.

For an image with resolution M × N, downsampling by a factor of s produces an image with resolution (M/s) × (N/s) (s should be a common divisor of M and N). In other words, each s × s window of the original image is turned into a single pixel whose value is the mean of all pixels in that window.
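A minimal sketch of this window-averaging idea in Python with NumPy (assuming s divides both dimensions exactly, grayscale only; not taken from the original article):

```python
import numpy as np

def downsample_mean(image: np.ndarray, s: int) -> np.ndarray:
    """Reduce an (M, N) grayscale image by factor s using s x s window means."""
    m, n = image.shape
    assert m % s == 0 and n % s == 0, "s must divide both dimensions"
    # Group pixels into s x s blocks, then average each block.
    blocks = image.reshape(m // s, s, n // s, s)
    return blocks.mean(axis=(1, 3))

# Example: shrink a 1080P luminance plane to 540P (factor 2).
frame = np.random.randint(0, 256, size=(1080, 1920), dtype=np.uint8)
small = downsample_mean(frame, 2)   # shape becomes (540, 960)
```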

The best experience is when the screen resolution matches the video resolution and the video plays full screen. If the video resolution is higher than the screen's, the extra detail cannot be shown; if it is lower, the screen cannot make the picture any finer.

6. What is bitrate?

6.1 Basics

Bit rate, also called code rate, has different meanings in different fields. In multimedia it refers to the number of bits of audio or video played back per unit of time, and it can be understood as throughput or bandwidth.

Its unit is bits per second (bps), i.e. the amount of data transferred per second; Kbps and Mbps are the commonly used multiples.

Calculation formula: Bit rate (Kbps) = File size (KB) × 8 / Duration (s)
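For instance, a tiny helper that applies this formula (the file size and duration below are made-up numbers):

```python
def bitrate_kbps(file_size_kb: float, duration_s: float) -> float:
    """Average bit rate in Kbps: kilobytes are converted to kilobits (x 8)."""
    return file_size_kb * 8 / duration_s

# A 90,000 KB (~88 MB) video lasting 600 s averages about 1200 Kbps.
print(bitrate_kbps(90_000, 600))   # -> 1200.0
```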

Put simply, the bit rate reflects the sampling precision: the higher it is, the higher the precision and the better the picture quality, but the larger the amount of data. So a balance has to be found: achieve the least distortion at the lowest possible bit rate.

Within a video, image complexity varies over time, for example between fast-moving scenes and nearly static ones, so the amount of data required also varies. Using a single bit rate throughout is not ideal, which is why dynamic bit rate was introduced.

6.2 Dynamic Bit Rate

VBR is short for Variable Bit Rate: the bit rate changes with the complexity of the image.

A lower bit rate is used for segments with simple image content and a higher bit rate for segments with complex content, which maintains playback quality while keeping the amount of data under control.

For example, in RMVB video files the "VB" stands for VBR, meaning the file is encoded with a variable bit rate to balance quality and size.

6.3 Static Bit Rate

CBR stands for Constant Bit Rate.

With CBR, segments with complex image content have unstable quality, while segments with simple content look better. The calculation formula listed above obviously applies to CBR. Besides VBR and CBR there are also CVBR (Constrained Variable Bit Rate), ABR (Average Bit Rate) and so on.
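To make the two modes concrete, here is a hedged sketch of how they are commonly requested when encoding with the ffmpeg command-line tool and the libx264 encoder (assuming ffmpeg is installed; the file names and parameter values are arbitrary examples, not recommendations):

```python
import subprocess

# Roughly constant bit rate: pin the target rate, maximum rate and buffer size.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-c:v", "libx264",
    "-b:v", "2M", "-maxrate", "2M", "-bufsize", "4M",
    "cbr_out.mp4",
])

# Quality-targeted variable bit rate: CRF lets the rate float with scene complexity.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-c:v", "libx264",
    "-crf", "23",
    "vbr_out.mp4",
])
```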

7. What is sampling rate?

**Definition:** the number of samples per second, in Hertz (Hz), taken from a continuous signal to form a discrete signal. There is no need to distinguish between "sampling rate", "sampling frequency" and "sample rate": they are synonyms.

Video files generally do not expose a "sampling rate" attribute directly.

Sampling rate is itself a broad concept. For video, instead of a single sampling rate, it is better divided into two levels: frame frequency and field frequency.

  • 1) At the frame level: the sampling rate is the frame rate, i.e. how many frames of image are displayed per second;
  • 2) At the field level: the sampling rate is the pixel frequency, i.e. how many pixels are displayed per second.

Pixel frequency is a property of the display. It can be understood as the display's maximum bandwidth, and it limits the combinations of resolution and refresh rate that can be achieved.

According to its meaning, a formula can be obtained:

Pixel frequency = Frame rate × Number of pixels in a frame

For example, for a 1920 × 1080 display with a pixel frequency of 138.5 M (taken here as 138.5 × 1024 × 1024 pixels per second), rearranging the formula gives:

Frame rate = 138.5 × 1024 × 1024 / (1920 × 1080) ≈ 70.04

The resulting 70 Hz is a normal frame-rate value, which also confirms that this understanding of pixel frequency is correct.
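The same arithmetic as a tiny script, using the numbers from the example above:

```python
pixel_frequency = 138.5 * 1024 * 1024      # pixels per second
pixels_per_frame = 1920 * 1080             # one 1080P frame

frame_rate = pixel_frequency / pixels_per_frame
print(round(frame_rate, 2))                # -> 70.04
```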

8. What is frame rate?

**Definition:** the frame rate measures the number of frames displayed per unit of time, in FPS (frames per second) or Hertz (Hz).

The higher the frame rate, the smoother and more realistic the picture, but also the higher the demand on the graphics card and the larger the amount of data.

At the beginning of the article we mentioned that images changing at more than 24 frames per second look smooth and continuous. That holds for movies and similar video, but 24 frames is not necessarily smooth for games.

Why does a 24fps movie feel smooth, and a 24fps game feel lousy?

**The first reason: the images are generated differently**

A movie frame is exposed over a period of time, and each frame contains information for a period of time, whereas a game’s image is calculated by the graphics card, and each frame contains information for a single moment.

Let’s say a circle moves from the top left to the bottom right:

The former is a movie frame, the latter a game frame. You can see that motion in the movie leaves a blur trail, which gives a sense of movement and makes the sequence feel coherent rather than choppy.

**The second reason: a movie's FPS is stable, while a game's is not**

If the movie is 24 FPS, that means the screen is refreshed every 1/24 of a second at a fixed frame interval.

If a game runs at 60 FPS, the picture refreshes roughly every 1/60 second, but the frame interval is not stable: even if 60 frames are shown within one second, 59 of them might fall in the first half second and only 1 in the second half.

9. What is video coding?

9.1 Basics

Definition: converting a file from one video format into another using a particular compression technique. Video data is strongly correlated in both the time domain and the spatial domain, which means there is a great deal of temporal redundancy and spatial redundancy; compression removes this redundant information from the data.

9.2 Lossless Compression

Lossless compression is also known as reversible coding: the reconstructed data is identical to the original, which makes it suitable for compressing disk files. Lossless compression mainly uses entropy coding, including Shannon coding, Huffman coding and arithmetic coding.

9.2.1) Shannon coding:

Shannon coding assigns code words based on the cumulative probability distribution function of the source symbols. It is neither efficient nor very practical, but it provides good theoretical guidance for other coding methods.

9.2.2) Huffman coding:

Based on the probability with which each symbol occurs, Huffman coding constructs prefix codes (no code word is a prefix of another) with the shortest average length.

**The basic method is:** scan the image data once and compute the probability of each pixel value, assign unique code words of different lengths according to these probabilities, and thus obtain a Huffman code table for the image.

The encoded image data records the code word of each pixel, while the mapping between code words and actual pixel values is recorded in the code table. A small sketch of the code-table construction follows below.
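To illustrate how such a code table can be built, here is a minimal Python sketch of Huffman code construction (a toy version, not the exact procedure used by any particular image codec):

```python
import heapq
from collections import Counter

def huffman_table(symbols):
    """Build a prefix code table {symbol: bitstring} from a list of symbols."""
    freq = Counter(symbols)
    # Each heap entry: (frequency, tie-breaker, [(symbol, code), ...])
    heap = [(f, i, [(s, "")]) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate case: one distinct symbol
        return {heap[0][2][0][0]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend a bit: 0 for the left subtree, 1 for the right subtree.
        merged = [(s, "0" + c) for s, c in left] + [(s, "1" + c) for s, c in right]
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return dict(heap[0][2])

# Frequent pixel values get short code words, rare ones get longer code words.
pixels = [0, 0, 0, 0, 255, 255, 128, 64]
table = huffman_table(pixels)
encoded = "".join(table[p] for p in pixels)
print(table, encoded)   # 14 bits instead of 8 x 8 = 64 bits
```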

9.2.3) Arithmetic coding:

Arithmetic coding is described by two basic parameters: symbol probability and coding interval. In the case of given symbol set and symbol probability, arithmetic coding can give a nearly optimal coding result.

Compression algorithms using arithmetic coding usually estimate the probability of the input symbol first, and then encode it. The more accurate the estimate, the closer the encoding result will be to the optimal result.

9.3 Lossy Compression

Lossy compression is also called irreversible coding: the reconstructed data differs from the original. It is suitable for any scenario that tolerates some loss of fidelity, such as video conferencing, video telephony, video broadcasting and video surveillance.

Encoding methods include predictive coding, transform coding, quantization coding, hybrid coding and so on.

10. What are coding standards?

10.1 Basics

**Definition:** for encoders and decoders to interoperate correctly, encoding must be standardized and normalized, hence coding standards.

There are two formal organizations that develop video coding standards:

1) ISO/IEC (International Organization for Standardization / International Electrotechnical Commission);

2) ITU-T (the Telecommunication Standardization Sector of the International Telecommunication Union).

**ISO/IEC coding standards include:** MPEG-1, MPEG-2, MPEG-4, MPEG-7, MPEG-21, MPEG-H, etc.

**Coding standards developed by ITU-T include:** H.261, H.262, H.263, H.264, H.265, etc.

Both the MPEG-x and H.26x standards use lossy, hybrid video coding; the main differences lie in parameters such as the image resolution handled, prediction precision, search range and quantization step, so they target different applications.

10.2 The MPEG-x series

**10.2.1) MPEG-1:**

MPEG-1 consists of five parts.

Part 2 is the video coding scheme, which specifies the coding of progressive-scan video.

Part 3 is the audio coding scheme. Audio stream compression is divided into three layers with successively higher compression ratios; the popular MP3 (MPEG-1 Layer 3) is the file format produced by compressing audio according to this part of the standard.

**10.2.2) MPEG-2:**

MPEG-2 has a total of 11 parts and improves bit rate and quality on the basis of MPEG-1.

Part 2 is the video coding scheme, which specifies the coding of interlaced video; it was developed jointly with ITU-T, which calls it H.262.

Part 3 is the audio coding scheme. It continues MPEG-1's three-layer compression scheme, so the compressed file format is still MP3, but the compression algorithm was improved.

Part 7 introduced AAC (MPEG Advanced Audio Coding) for the first time, intended to replace MP3 with smaller files and better sound quality.

**10.2.3) MPEG-4:**

MPEG-4 has a total of 27 parts and focuses more on the interactivity and flexibility of multimedia systems.

Part 3 is the audio coding scheme. It optimizes the AAC coding algorithm, which has gradually replaced MP3 since its introduction; for example, the audio packaged together with video is usually AAC, while MP3 remains more common for consumer music.

Part 10 defines AVC (Advanced Video Coding), developed jointly with ITU-T, which calls it H.264.

Part 14 defines the MP4 container format, whose official file suffix is ".mp4"; other formats are extended or reduced versions based on MP4, including M4V, 3GP, F4V, etc.

**10.2.4) MPEG-7:**

MPEG-7 differs from MPEG-1, MPEG-2 and MPEG-4 in that it is not an audio/video compression standard.

MPEG-7 is called the Multimedia Content Description Interface. Its purpose is to provide a standard for describing multimedia information and associating that description with the described content, enabling fast and effective retrieval.

**10.2.5) MPEG-21:**

MPEG-21 is in fact an integration of several key technologies. Through this integrated environment it manages global digital media resources and provides functions such as content description, creation, publication, use, identification, charging management and copyright protection.

**10.2.6) MPEG-H:**

MPEG-H includes one digital container standard, one video compression standard, one audio compression standard and two conformance test standards.

The video compression standard is High Efficiency Video Coding (HEVC), developed jointly with ITU-T; compared with H.264/MPEG-4 AVC, its data compression ratio is roughly doubled.

10.3 The H.26x series

**10.3.1) H.261:**

H.261 was the first practical digital video coding standard. It uses a hybrid coding framework, including motion-compensated inter-frame prediction, spatial transform coding based on the discrete cosine transform, quantization, Zig-Zag scanning and entropy coding.

The design of H.261 was so successful that subsequent international video coding standards are all based on its design framework, including MPEG-1, MPEG-2/H.262, H.263 and even H.264.

**10.3.2) H.262:**

H.262 is an extension of MPEG-1 that supports interlaced scanning. It is technically identical to the MPEG-2 video standard, which is used in DVDs.

**10.3.3) H.263:**

H.263 is a low-bit-rate video coding standard for video conferencing, based on H.261.

Compared with H.261, it adopts half-pixel motion compensation and adds four efficient compression coding modes, so it can deliver better image quality than H.261 at lower bit rates.

The first edition of H.263 was released in 1995, followed by the second edition (H.263+) in 1998 and the third edition in 2000.

**10.3.4) H.264:**

H.264 is also known as MPEG-4 Part 10, or MPEG-4 AVC; it is a block-oriented, motion-compensated video coding standard.

Officially released in 2003, it has become one of the most commonly used formats for high-precision video recording, compression and distribution.

H.264 can provide high-quality video at low bit rates, saving about 50% of the bit rate compared with H.263.

Compared with H.263, H.264 does not require a large number of coding options, which reduces coding complexity.

H.264 can adapt its transmission and playback rates to different environments and provides rich error-handling tools to control or conceal packet loss and errors.

The performance improvement of H.264 comes at the cost of increased complexity: encoding is roughly three times as complex as with H.263, and decoding roughly twice as complex.

The H.264 standard defines three types of frames: I frames, P frames and B frames:

  • 1) I frame: an intra-coded frame, also called a key frame. It can be understood as a complete snapshot of one frame; it can be decoded from its own data alone, without reference to other frames, and its data volume is relatively large;
  • 2) P frame: a forward-predicted frame. It records the difference between the current frame and a preceding key frame (or P frame). To decode it, the previously cached picture is combined with the difference recorded in this frame to produce the final picture. Its data volume is much smaller than an I frame's;
  • 3) B frame: a bidirectionally predicted frame. It records the differences between the current frame and both the preceding and following frames. Decoding it requires the preceding I frame (or P frame) and the following P frame, and its data volume is much smaller than that of I and P frames.

The data compression ratios are roughly I frame : P frame : B frame = 7 : 20 : 50. P and B frames clearly save a great deal of data, and the saved space can be used to store more I frames, achieving better picture quality at the same bit rate. A toy sketch of this key-frame-plus-difference idea follows below.
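The following Python/NumPy sketch only illustrates the general key-frame-plus-difference idea; real codecs use motion-compensated prediction, transforms and entropy coding rather than raw pixel subtraction:

```python
import numpy as np

def encode_sequence(frames):
    """Store the first frame whole ("I frame") and the rest as differences ("P frames")."""
    key = frames[0].astype(np.int16)
    deltas = [f.astype(np.int16) - p.astype(np.int16)
              for p, f in zip(frames, frames[1:])]
    return key, deltas

def decode_sequence(key, deltas):
    """Rebuild each frame by adding its difference onto the previously decoded frame."""
    frames = [key]
    for d in deltas:
        frames.append(frames[-1] + d)
    return [f.astype(np.uint8) for f in frames]

# Mostly static frames: the differences are sparse and compress far better than full frames.
f0 = np.zeros((4, 4), dtype=np.uint8)
f1 = f0.copy(); f1[1, 1] = 200          # only one pixel changes
f2 = f1.copy(); f2[2, 2] = 100
key, deltas = encode_sequence([f0, f1, f2])
restored = decode_sequence(key, deltas)
assert all((a == b).all() for a, b in zip(restored, [f0, f1, f2]))
```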

**10.3.5) H.265:**

H.265, or High Efficiency Video Coding (HEVC), was officially launched in 2013.

The H.265 coding architecture is similar to that of H.264, mainly comprising modules such as intra-frame prediction, inter-frame prediction, transform, quantization, deblocking filter and entropy coding.

The H.265 coding architecture as a whole is divided into coding units, prediction units and transform units.

H.265 builds on H.264 and uses improved techniques to optimize the relationships among bit stream, coding quality, latency and algorithm complexity.

At a bit rate 51% to 74% lower, video encoded with H.265 can reach quality similar to or better than video encoded with H.264.

H.265 makes it possible to deliver higher-quality Internet video over limited bandwidth: smartphones, tablets and other mobile devices can play full 1080P video directly online, keeping Internet video in step with "high-resolution" screens.

The following comparison picture gives an intuitive sense of the difference:

Besides the MPEG-x and H.26x series, there are other coding standards, such as Google's VP series. The video coding standards are summarized in the figure below:

11. What is video encapsulation format?

Video encapsulation formats such as MP4 and MKV are used to store or transmit encoded data; they can be understood as containers.

Encapsulation organizes audio, video, subtitles and other data according to certain rules, including common information such as the encoding type; based on this information the player can select the right decoders and synchronize audio and video.

Different container formats support different video and audio encodings. For example, MKV supports a wide range of video and audio encodings, while RMVB mainly supports RealVideo and RealAudio encodings.

Wikipedia lists common video encapsulation formats. You can view the audio and video encoding formats supported by each encapsulation format.

12. What is video decoding?

**Definition:** decompressing encoded video data back into raw video data; it is the reverse process of video encoding.

For a player, a very important indicator is how many video coding formats it can decode.

13. What is the principle of video playback?

To play a local video file, you need to demultiplex (decapsulate) it, decode the audio and video, and synchronize them.

**Demultiplexing (decapsulation):**

This separates the input encapsulated data into compressed audio data and compressed video data. For example, demultiplexing FLV data outputs an H.264-encoded video stream and an AAC-encoded audio stream.

**Decoding:**

This converts the compressed, encoded video/audio data into uncompressed raw video/audio data.

Audio compression standards include AAC, MP3, AC-3 and so on; video compression standards include H.264, MPEG-2, VC-1 and so on.

Decoding is the most important and complex part of the whole system.

Through decoding, compressed video data is output as uncompressed color data such as YUV420P or RGB, and compressed audio data is output as uncompressed audio samples such as PCM data.

**Audio and video synchronization:**

Based on the parameter information obtained during demultiplexing, the decoded video and audio data are synchronized and then sent to the system's graphics card and sound card for playback. A minimal sketch of this pipeline is shown below.
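As a rough sketch of the demultiplex-and-decode part of this pipeline (assuming the PyAV library, a Python binding for FFmpeg, is installed; the file name is a placeholder, and a real player adds buffering and clock-based synchronization on top):

```python
import av  # PyAV: Python bindings for FFmpeg

# Demultiplexing: open the container and pick the compressed streams inside it.
container = av.open("movie.mp4")
video_stream = container.streams.video[0]   # e.g. an H.264 stream
audio_stream = container.streams.audio[0]   # e.g. an AAC stream

# Decoding: turn compressed video packets into raw frames. Audio decoding works
# the same way (container.decode(audio=0)) and yields raw PCM sample frames.
for frame in container.decode(video_stream):
    rgb = frame.to_ndarray(format="rgb24")  # uncompressed color data for display
    # frame.time is the presentation timestamp in seconds; a real player uses it
    # to schedule each picture against the audio clock (A/V synchronization).
```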

14. What is the relationship between real-time audio and video and network?

The following is a detailed flow diagram of a typical real-time audio and video data:

▲ Picture quoted from "The Story Behind WeChat Mini Program Audio and Video Technology"

As shown in the figure above, real-time audio and video technology has one more step of network transmission than ordinary audio and video local playback. In other words: real-time audio and video technology = audio and video technology + network technology.

Due to space constraints, I won’t discuss the technical details in this article, but for those who are interested, please continue to read the Real-time Audio and Video Development Technology Album on Instant Messenger.

15. Further Study

If you’re a beginner and want to learn about real-time audio and video technology in an easy-to-understand way, read on:

Instant Messaging Audio and Video Development (XIX) : A Zero-based Introduction to the Most Popular Video Coding Techniques ever (* highly recommended)

Basics of Real-time Audio and Video Technology

This article has been simultaneously published on the "Instant Messaging Technology Circle" public account:

The link for the simultaneous release is: www.52im.net/thread-3194…