Today I will go over the basics of audio and video. I come into contact with audio and video development in my daily work: TSPlayer, IjkPlayer, and MediaPlayer, for example, all provide playback capability, but their concrete implementations and the features they support differ. To go deeper I have to study audio and video properly. The main directions in Android development are applications, the Framework, audio and video, the NDK, and so on; anyone continuing in the Android field must fill these gaps sooner or later. The main contents are as follows:

  1. Video coding
  2. Audio coding
  3. Multimedia player components
  4. Frame rate
  5. Resolution
  6. Coding format
  7. Encapsulation format
  8. Bit rate
  9. Color space
  10. Sampling rate
  11. Quantization precision
  12. Channels

Video coding

Video encoding is the conversion of one video file format into another through a specific compression technique. The main codec standards in video transmission are as follows:

  • Motion JPEG (M-JPEG)

    • M-JPEG, short for Motion JPEG, is an image compression coding standard. The JPEG standard is mainly for still images, while M-JPEG treats a video sequence as a series of continuous still images. Because each frame is compressed independently, any frame can be stored at random during editing and edits can be frame-accurate. However, M-JPEG only compresses the spatial redundancy within a frame, not the temporal redundancy between frames, so its compression efficiency is not high.
  • The MPEG series of standards from ISO's Moving Picture Experts Group (MPEG)

    • There are five main MPEG standards: MPEG-1, MPEG-2, MPEG-4, MPEG-7, and MPEG-21. MPEG video compression mainly uses motion-compensated inter-frame coding to reduce temporal redundancy, DCT techniques to reduce the spatial redundancy within an image, and entropy coding to reduce the statistical redundancy of the representation. Applied together, these techniques greatly improve compression performance.
  • ITU-T's H.261, H.263, H.264, etc.

    • H.261: the first practical digital video coding standard. Its compression algorithm is a hybrid of motion-compensated inter-frame prediction and block-based DCT coding; motion compensation uses full-pixel accuracy with loop filtering, and both CIF and QCIF resolutions are supported.
    • H.263: uses the same basic coding algorithm as H.261 with modest improvements, so it can deliver better image quality than H.261 at low bit rates. Its motion compensation uses half-pixel accuracy, and it supports five resolutions: SQCIF, QCIF, CIF, 4CIF, and 16CIF.
    • H.264: a digital video coding standard developed by the Joint Video Team (JVT) formed jointly by ISO/IEC and ITU-T. It is therefore both ITU-T's H.264 and Part 10 of ISO/IEC's MPEG-4, Advanced Video Coding (AVC); MPEG-4 AVC, MPEG-4 Part 10, and ISO/IEC 14496-10 all name the same standard. H.264 is a hybrid coding system based on the traditional framework, locally optimized with an emphasis on coding efficiency and reliability. It achieves a high compression ratio with high-quality, smooth images, so H.264-compressed video needs less bandwidth during network transmission; of the standards here it has the highest compression rate. A quick device-capability check is sketched right after this list.
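On Android, that check amounts to asking MediaCodecList whether any installed decoder handles the AVC MIME type. A minimal Kotlin sketch, assuming API level 21+:

```kotlin
import android.media.MediaCodecList
import android.media.MediaFormat

// Returns true if the device exposes at least one H.264 (AVC) decoder,
// hardware or software, via the platform codec registry.
fun hasAvcDecoder(): Boolean {
    val codecs = MediaCodecList(MediaCodecList.REGULAR_CODECS)
    return codecs.codecInfos.any { info ->
        !info.isEncoder && info.supportedTypes.any { type ->
            type.equals(MediaFormat.MIMETYPE_VIDEO_AVC, ignoreCase = true)
        }
    }
}
```

In practice nearly every Android device can decode H.264, which is one reason it remains the default choice for network video.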

Audio coding

Common audio codec standards are as follows:

  • ITU: G.711, G.729, etc.
  • MPEG: MP3, AAC, etc.
  • 3GPP: AMR, AMR-WB, AMR-WB+, etc.
  • Standards developed by companies, such as Dolby AC-3, DTS, and WMA

The common ones are introduced below:

  • MP3 (MPEG-1 Audio Layer 3): an audio compression technology designed to drastically reduce the amount of audio data. Using MPEG Audio Layer 3, music is compressed at a ratio of 1:10 or even 1:12 into much smaller files, yet for most listeners the playback quality shows no significant drop from the uncompressed original. It exploits the human ear's insensitivity to high-frequency sound: the time-domain waveform is transformed into the frequency domain and split into multiple bands, and different bands get different compression ratios, higher for high frequencies (whose signals may even be discarded) and lower for low frequencies so the signal stays undistorted. In effect it throws away high-frequency sound the ear can barely hear and keeps the audible low-frequency part, compressing the audio considerably. MP3 is a lossy compression format.

  • AAC (Advanced Audio Coding): originally based on MPEG-2 audio coding technology. After MPEG-4 appeared, AAC reintegrated its features and added SBR and PS technology; to distinguish it from traditional MPEG-2 AAC it is also called MPEG-4 AAC. AAC is a compression format designed specifically for audio data. Compared with MP3, it offers better sound quality in smaller files. It is still a lossy format, however, and its size advantage will matter less as high-capacity devices become common.

  • WMA (Windows Media Audio): a family of audio codecs developed by Microsoft, together with the corresponding digital audio encoding formats. WMA includes four distinct codecs: WMA, the original codec, a competitor to MP3 and RealAudio; WMA Pro, supporting more channels and higher-quality audio; WMA Lossless, a lossless codec; and WMA Voice, for storing speech. Some audio-only ASF files whose content is entirely encoded with Windows Media Audio also use WMA as their extension. The format supports encryption, so protected files cannot be played back illegally on unauthorized machines. WMA is likewise a lossy file format.

For more audio and video codec standards, see: Audio codec standards

Multimedia player components

Android's multimedia player components include MediaPlayer, MediaCodec, OMX, StageFright, AudioTrack, and so on, as follows (a minimal playback sketch follows the list):

  • MediaPlayer: provides a playback control interface for the application layer
  • MediaCodec: provides applications with access to the low-level media codecs (encoders and decoders)
  • OpenMAX: Open Media Acceleration, abbreviated OMX, a multimedia application standard. Android's main multimedia engine, StageFright, uses OpenMAX (via IBinder) for its codec processing.
  • StageFright: introduced in Android 2.2 as a replacement for the previous default media playback engine, OpenCORE. Stagefright is a Native-layer media playback engine with built-in software codecs for popular media formats, built on the OpenMAX framework. What it carries over from OpenCORE is the OMX component part, and it exists in Android as a shared library, libstagefright.so.
  • AudioTrack: manages and plays a single audio resource. Only PCM streams are supported; for example, most WAV files contain PCM and can be played directly by AudioTrack.
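As a concrete illustration of the first component, here is a minimal Kotlin sketch of network playback through MediaPlayer; the URL is a placeholder, not a real resource:

```kotlin
import android.media.MediaPlayer

// MediaPlayer hides the whole pipeline (extractor, codec, renderer)
// behind one application-level control interface.
val player = MediaPlayer().apply {
    setDataSource("https://example.com/demo.mp4") // placeholder URL
    setOnPreparedListener { it.start() }          // start once prepared
    prepareAsync()                                // prepare off the main thread
}
// Call player.release() when done to free the underlying codec resources.
```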

Common multimedia frameworks and solutions

Common multimedia frameworks and solutions include VLC, FFmpeg, GStreamer, etc., as follows:

  • VLC: short for VideoLAN Client, a free, open-source, cross-platform multimedia player and framework.
  • FFmpeg: a multimedia solution rather than a full multimedia framework, widely used in audio and video development.
  • GStreamer: An open source multimedia framework for building streaming media applications.

Frame rate

Frame rate measures how many frames are displayed, in frames per second (FPS) or hertz (Hz): the number of frames shown per second, or the number of times per second a graphics processor can update the image. A higher frame rate gives smoother, more realistic animation. 30 fps is generally acceptable, and going up to 60 fps noticeably improves the sense of interaction and realism, but beyond roughly 75 fps further gains in smoothness are hard to perceive. If the frame rate exceeds the screen's refresh rate, the extra frames are wasted graphics processing, because the monitor cannot update that fast.
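On Android, one rough way to measure the frame rate actually achieved is to count Choreographer vsync callbacks. A sketch, assuming it runs on a thread with a Looper such as the UI thread:

```kotlin
import android.view.Choreographer

// Counts Choreographer frame callbacks and prints an FPS estimate
// roughly once per second.
class FpsMeter : Choreographer.FrameCallback {
    private var frames = 0
    private var startNanos = 0L

    fun start() = Choreographer.getInstance().postFrameCallback(this)

    override fun doFrame(frameTimeNanos: Long) {
        if (startNanos == 0L) startNanos = frameTimeNanos
        frames++
        val elapsed = frameTimeNanos - startNanos
        if (elapsed >= 1_000_000_000L) { // one second elapsed
            println("approx fps = %.1f".format(frames * 1e9 / elapsed))
            frames = 0
            startNanos = frameTimeNanos
        }
        Choreographer.getInstance().postFrameCallback(this) // keep measuring
    }
}
```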

Resolution

Video resolution refers to the dimensions of the image a video device produces. What do the common labels 1080P and 4K mean? The P stands for progressive scan, and the number before it counts rows of pixels, so 1080P video has 1080 horizontal lines; K counts columns of pixels, so 4K means roughly 4,000 pixel columns. In practice, 1080P is 1920 × 1080 and 4K is 3840 × 2160.
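Those pixel counts translate directly into memory cost. A small worked example, using the rule of thumb that an uncompressed YUV 4:2:0 frame averages 1.5 bytes per pixel (see the color space section) while RGBA uses 4:

```kotlin
// Size of one uncompressed frame in bytes.
fun frameBytes(width: Int, height: Int, bytesPerPixel: Double): Long =
    (width * height * bytesPerPixel).toLong()

fun main() {
    println(frameBytes(1920, 1080, 1.5)) // 1080p YUV420: 3_110_400 bytes, about 3 MB
    println(frameBytes(3840, 2160, 1.5)) // 4K YUV420: 12_441_600 bytes, about 12 MB
    println(frameBytes(1920, 1080, 4.0)) // 1080p RGBA: 8_294_400 bytes, about 8 MB
}
```

Numbers like these are why raw frames are always compressed before transmission.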

Refresh rate

The screen refresh rate is the number of times the picture is refreshed per second, measured in hertz (Hz). It is divided into vertical and horizontal refresh rates; "refresh rate" normally means the vertical one, i.e. how many times per second the screen's image is redrawn. The higher the refresh rate, the more stable the image, the more natural and clear the display, and the less strain on the eyes; the lower it is, the more the image flickers and judders, and the faster the eyes tire. Generally, at a refresh rate of 80 Hz or above, flicker and judder are essentially eliminated and the eyes do not tire too easily.
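On Android, the current refresh rate of the default display can be read through DisplayManager. A minimal sketch:

```kotlin
import android.content.Context
import android.hardware.display.DisplayManager
import android.view.Display

// Reads the refresh rate the default display is currently running at.
fun currentRefreshRate(context: Context): Float {
    val dm = context.getSystemService(Context.DISPLAY_SERVICE) as DisplayManager
    return dm.getDisplay(Display.DEFAULT_DISPLAY).refreshRate // e.g. 60.0f or 120.0f
}
```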

Coding format

For audio and video, the coding format corresponds to the audio coding and video coding standards above: each standard defines a coding algorithm whose purpose is to compress the data by reducing redundancy.

Encapsulation format

Quoting the gist of the Baidu Baike entry: an encapsulation format (also called a container) stores already-encoded, compressed video and audio tracks together in one file according to a given layout. It is just a shell, or you can think of it as a folder holding a video track and an audio track. Put plainly: if the video track is the rice and the audio track is the dish, the encapsulation format is the bowl, or the pot, that holds the meal.
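Android's MediaExtractor is exactly the tool that opens the "bowl" and lists the "dishes" inside. A minimal Kotlin sketch; the path is a placeholder, not a guaranteed file:

```kotlin
import android.media.MediaExtractor
import android.media.MediaFormat

// Opens a container file and prints the MIME type of each track inside it.
fun listTracks(path: String) {
    val extractor = MediaExtractor()
    try {
        extractor.setDataSource(path) // placeholder path, e.g. "/sdcard/demo.mp4"
        for (i in 0 until extractor.trackCount) {
            val mime = extractor.getTrackFormat(i).getString(MediaFormat.KEY_MIME)
            println("track $i: $mime") // e.g. video/avc, audio/mp4a-latm
        }
    } finally {
        extractor.release()
    }
}
```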

Bit rate

Bit rate is the number of bits transmitted or processed per unit of time, measured in bps (bits per second), also written b/s. The higher the bit rate, the more data moves per unit of time. When the multimedia industry quotes the data rate of audio or video over time, it usually uses kbps. Roughly speaking, on a 1 Mbps broadband line you can smoothly stream only video whose bit rate stays below 125 KB/s; video above that rate will keep pausing to buffer.
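The 125 KB/s figure is just a division by 8. A worked example:

```kotlin
// 1 Mbps of bandwidth moves 1_000_000 / 8 = 125_000 bytes, about 125 KB, per
// second, so streams encoded above ~125 KB/s stall and buffer on a 1 Mbps line.
fun bytesPerSecond(bitsPerSecond: Long): Long = bitsPerSecond / 8

// Approximate size of a stream: rate times duration.
fun fileSizeBytes(bitsPerSecond: Long, durationSeconds: Long): Long =
    bytesPerSecond(bitsPerSecond) * durationSeconds

fun main() {
    println(bytesPerSecond(1_000_000))     // 125000 bytes/s
    println(fileSizeBytes(1_000_000, 600)) // 10 minutes: 75_000_000 bytes, about 75 MB
}
```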

Bit rate is generally divided into fixed bit rate and variable bit rate:

  • A fixed (constant) bit rate keeps the stream's rate steady but sacrifices video quality: to hold the rate constant, images with rich content lose some detail and become blurred.
  • A variable bit rate lets the output rate change, because the information peaks of the source itself change. From the standpoint of guaranteeing transmission quality while making full use of the information, variable-bit-rate coding is the more reasonable choice.

Bit rate is directly proportional to video quality and file size, but once the bit rate exceeds a certain value, further increases no longer improve the video quality.

Color space

  • YUV: a color encoding method, generally used in image-processing pipelines. When encoding photos or video, YUV takes human perception into account and allows the bandwidth of the chrominance components to be reduced. Y stands for luminance, while U and V carry the chrominance information. The ranges denoted by Y'UV, YUV, YCbCr, and YPbPr often overlap or get confused: historically, YUV and Y'UV were used to encode analog TV signals, while YCbCr describes digital image signals and suits video and picture compression and transmission such as MPEG and JPEG; today "YUV" is commonly used loosely for all of these in computer systems.
  • RGB: the additive color model, also called the RGB color model or red-green-blue color model. Red, green, and blue light are added in different proportions to synthesize light of any color; most current displays use this color standard.

YUV was designed mainly to optimize the transmission of color television signals while staying backward compatible with black-and-white TVs; its biggest advantage is that it consumes far less bandwidth than transmitting RGB.
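The two spaces are related by a simple linear transform. Here is a sketch of the full-range BT.601 conversion often used for JPEG-style YCbCr; the exact coefficients are an assumption, since they vary by standard (BT.601 vs BT.709):

```kotlin
// Converts one full-range BT.601 YUV (YCbCr) pixel to RGB.
// U and V are stored offset by 128 so they fit in an unsigned byte.
fun yuvToRgb(y: Int, u: Int, v: Int): Triple<Int, Int, Int> {
    val d = u - 128.0 // Cb, centered on 128
    val e = v - 128.0 // Cr, centered on 128
    val r = (y + 1.402 * e).toInt().coerceIn(0, 255)
    val g = (y - 0.344136 * d - 0.714136 * e).toInt().coerceIn(0, 255)
    val b = (y + 1.772 * d).toInt().coerceIn(0, 255)
    return Triple(r, g, b)
}
```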

Sampling rate

The sampling rate is the number of samples per second extracted from a continuous signal to form a discrete signal, measured in hertz (Hz); it is the frequency at which an analog signal is sampled when converting it to digital. Human hearing generally spans 20 Hz to 20 kHz. By the sampling theorem, if the sampling frequency is more than twice the highest frequency in the signal, the sampled digital signal completely captures the original. Common sampling rates are as follows:

  • 8,000 Hz: the sampling rate used by telephony, sufficient for human speech
  • 11,025 Hz: the sampling rate used for AM radio
  • 22,050 Hz and 24,000 Hz: sampling rates used for FM radio
  • 44,100 Hz: audio CD, also commonly used for MPEG-1 audio (VCD, SVCD, MP3)
  • 47,250 Hz: the sampling rate used by commercial PCM recorders
  • 48,000 Hz: the sampling rate for digital sound in miniDV, digital TV, DVD, DAT, film, and professional audio

The standard sampling frequency of CD audio is 44.1 kHz, which is also the rate most commonly used by sound cards and computers today. Popular Blu-ray audio goes much higher, reaching 192 kHz. The vast majority of sound cards support 44.1 kHz, 48 kHz, and 96 kHz, and high-end products support 192 kHz or even more. In short, the higher the sampling rate, the better the resulting sound quality and the larger the storage it requires.
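Sampling rate and the AudioTrack component meet neatly in a tone generator. A minimal Kotlin sketch, assuming API 23+ for AudioTrack.Builder, that synthesizes one second of a 440 Hz sine wave at 44.1 kHz and plays the raw PCM:

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioTrack
import kotlin.math.PI
import kotlin.math.sin

fun playTone(sampleRate: Int = 44100, freqHz: Double = 440.0) {
    // One second of 16-bit mono PCM: sampleRate samples of a sine wave.
    val samples = ShortArray(sampleRate) { i ->
        (sin(2.0 * PI * freqHz * i / sampleRate) * Short.MAX_VALUE).toInt().toShort()
    }
    val track = AudioTrack.Builder()
        .setAudioAttributes(
            AudioAttributes.Builder().setUsage(AudioAttributes.USAGE_MEDIA).build()
        )
        .setAudioFormat(
            AudioFormat.Builder()
                .setEncoding(AudioFormat.ENCODING_PCM_16BIT) // 16-bit quantization
                .setSampleRate(sampleRate)                   // 44.1 kHz by default
                .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
                .build()
        )
        .setBufferSizeInBytes(samples.size * 2) // 2 bytes per 16-bit sample
        .build()
    track.play()
    track.write(samples, 0, samples.size) // blocks while the buffer drains
    track.release()
}
```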

Quantization precision

The sampling rate is not the only factor that determines how faithfully a sound wave survives conversion to a digital signal; quantization precision is another important one. Where the sampling rate is the number of samples per second, quantization precision is how finely the wave's amplitude is sliced: the maximum amplitude is cut into 2^n levels, where n is the number of bits. That bit count is the audio's resolution.

The bit count also determines the representable amplitude range, that is, the dynamic range: the gap between the loudest and the softest volume. More bits can represent larger values and describe the waveform more accurately. Each bit contributes roughly 6 dB of dynamic range, so 16-bit provides up to 96 dB (about 92 dB once dithering is applied), from which one can infer that 20-bit reaches about 120 dB. What does a large dynamic range buy? Dynamic range is the ratio of the system's maximum undistorted volume power to its output noise power; the larger the value, the more high-dynamic material the system can handle.
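The "about 6 dB per bit" rule is just 20 × log10(2^n). A quick check:

```kotlin
import kotlin.math.log10
import kotlin.math.pow

// Theoretical dynamic range of an n-bit quantizer: 20 * log10(2^n), about 6.02n dB.
fun dynamicRangeDb(bits: Int): Double = 20.0 * log10(2.0.pow(bits))

fun main() {
    println(dynamicRangeDb(16)) // about 96.3 dB
    println(dynamicRangeDb(20)) // about 120.4 dB
    println(dynamicRangeDb(24)) // about 144.5 dB
}
```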

Channels

A channel is an independent audio signal collected or played back at a distinct spatial position, so the channel count is the number of sound sources during recording, or the corresponding number of speakers during playback. Common layouts include mono, stereo, 4-channel, 5.1, and 7.1, as follows (a sketch mapping these onto Android's audio API follows the list):

  • Mono: a single speaker.
  • Stereo: extends mono to two symmetrically placed left and right speakers. During recording, the sound is assigned to two independent channels, achieving good sound localization; this is especially useful in music appreciation, where the listener can clearly tell which direction each instrument comes from, making the music more vivid and immediate. Stereo has been widely supported on sound cards since the Sound Blaster Pro and became a far-reaching audio standard.
  • 4-channel: 4-channel surround defines four sound points, front left, front right, rear left, and rear right, with the listener surrounded in the middle. Adding a subwoofer to strengthen low-frequency playback is also recommended, which is why 4.1-channel speaker systems became widely popular. A 4-channel system envelops the listener in sound from many directions, delivering the aural experience of being in a variety of environments and giving users a brand-new experience.
  • 5.1: the 5.1-channel system is in fact derived from the 4.1 system; it splits the surround channel into left surround and right surround and adds a center channel in the middle position.
  • 7.1: builds on the 5.1 system by adding two more sound points at center left and center right. Simply put, it establishes a relatively balanced sound field around the listener and adds a back-center channel.
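On Android, these layouts map onto AudioFormat channel masks. A small sketch showing the masks and the per-frame cost each layout implies at 16-bit quantization:

```kotlin
import android.media.AudioFormat

// Bytes needed per audio frame (one sample across all channels) at 16-bit PCM.
fun bytesPerFrame(channelMask: Int): Int {
    val channels = when (channelMask) {
        AudioFormat.CHANNEL_OUT_MONO -> 1
        AudioFormat.CHANNEL_OUT_STEREO -> 2
        AudioFormat.CHANNEL_OUT_QUAD -> 4             // 4-channel surround
        AudioFormat.CHANNEL_OUT_5POINT1 -> 6          // 5.1
        AudioFormat.CHANNEL_OUT_7POINT1_SURROUND -> 8 // 7.1
        else -> error("unhandled channel mask")
    }
    return channels * 2 // 2 bytes per 16-bit sample
}
```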

For more information, follow the WeChat official account.