Preface

I've recently wanted to learn audio and video development, so I borrowed the "Advanced Guide to Audio and Video Development" from the library. After reading it for a while I liked it enough to buy my own copy online.

I'll be immersed in audio and video for a while, so I'm starting this series to record reading notes for each chapter (from the perspective of iOS development; Android is not covered for now).

Chapter One: Basic concepts of audio and video

This chapter covers the concepts related to sound, images, and video. Beyond the book itself, I also looked into a few questions that came up while reading and recorded them here.

The physical properties of sound

The three elements of a sound wave are frequency, amplitude, and waveform. Frequency determines pitch, amplitude determines loudness, and waveform determines timbre.

The higher the frequency, the shorter the wavelength. Low-frequency sounds have longer wavelengths, so they bend around obstacles more easily, lose less energy, and travel farther. Human hearing covers a frequency range of roughly 20Hz to 20kHz.

Loudness reflects the energy of the wave.

The shape of the wave determines the timbre.

Digital audio

To digitize an analog signal, there are three steps: sampling, quantization and coding.

Sampling

Sampling: digitizing the signal along the time axis. It corresponds to the frequency of the sound.

According to the Nyquist theorem, if a sound is sampled at more than twice its highest frequency, the human ear will perceive no loss of quality after digitization. Since human hearing tops out around 20kHz, the sampling rate is commonly 44.1kHz.

Quantization

Quantization: The digitization of a signal along the amplitude axis. Corresponds to the amplitude of the sound.

For example, if each sample is represented by a 16-bit binary signal, the range is [-32768, 32767].


Together, sampling and quantization capture the shape of the wave, that is, the timbre. With that, all three elements of the sound wave have been digitized.


Encoding

Encoding is the recording of sampled and quantized digital data in a certain format, such as sequential storage or compressed storage, and so on.

  • The raw (uncompressed) audio data format is Pulse Code Modulation (PCM).

To describe PCM data, three attributes are needed: quantization format (sampleFormat), sampling rate (sampleRate), and number of channels (channel).

Quantization format and sampling rate were covered above. The number of channels is the number of independent audio signals carried; stereo, understandably, has 2 channels by default.

  • Data bit rate: the number of bits per second.

Take CD quality as an example: the quantization format (bit depth) is 16 bits, the sampling rate is 44100Hz, and the number of channels is 2.

Bit rate = 44100 × 16 × 2 = 1,411,200 bit/s ≈ 1378.125 Kbit/s

The storage space for one minute of such data is 1378.125 × 60 / 8 / 1024 ≈ 10.09 MB.
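
To make the arithmetic concrete, here is a minimal Swift sketch of the same calculation (the numbers come straight from the CD example above; the variable names are my own):

```swift
// CD quality: 44.1kHz sampling rate, 16-bit quantization, 2 channels.
let sampleRate = 44_100.0   // samples per second
let bitDepth   = 16.0       // bits per sample (quantization format)
let channels   = 2.0        // stereo

let bitRate = sampleRate * bitDepth * channels          // 1,411,200 bit/s
print(bitRate / 1024)                                   // ≈ 1378.125 Kbit/s

let bytesPerMinute = bitRate * 60 / 8                   // bits per minute → bytes
print(bytesPerMinute / 1024 / 1024)                     // ≈ 10.09 MB
```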


Audio compression

In fact, the principle of compression coding is to remove redundant signals. Redundant signals are those the human ear cannot perceive, including audio outside the audible frequency range and masked audio signals. Out-of-range audio was mentioned above; masked signals arise from the masking effect of the human ear, which manifests mainly as frequency-domain masking and time-domain masking.

The book does not cover the masking effect, presumably because application-layer developers don't need to understand it. But I looked it up anyway and came away marveling at how remarkable human hearing is.

The following is quoted from Baidu Baike; skip it if you're not interested.

Frequency-domain masking effect

A strong pure tone will mask weaker pure tones sounding at the same time near it in frequency; this is called frequency-domain masking, or simultaneous masking. For example, a 1000Hz pure tone with a sound intensity of 60dB will mask an 1100Hz pure tone that is 18dB lower; in that case our ears hear only the loud 1000Hz tone. If instead a 1000Hz pure tone is played together with a 2000Hz pure tone 18dB lower, our ears hear both sounds at the same time; to make the 2000Hz tone inaudible, it would have to be dropped to about 45dB below the 1000Hz tone. In general, the closer in frequency a weak pure tone is to a strong one, the more easily it is masked; low-frequency pure tones mask high-frequency ones effectively, while the masking of low-frequency tones by high-frequency tones is much less pronounced.

Since the relationship between sound frequency and the masking curve is not linear, the concept of the "critical band" is introduced to measure frequency on a perceptually uniform scale. It is generally considered that there are 24 critical bands in the range of 20Hz to 16kHz.

Time-domain masking effect

In addition to masking between simultaneous sounds, there is also masking between sounds adjacent in time, which is called time-domain masking. It is divided into pre-masking and post-masking. The main cause of temporal masking is that the human brain needs a certain amount of time to process information. In general, pre-masking is very short, only about 5 to 20ms, while post-masking can last 50 to 200ms.


The following describes several commonly used compression encoding formats. Just take a quick look.

  1. WAV encoding

One implementation of WAV encoding is to add 44 bytes in front of the PCM data; these bytes describe the PCM sampling rate, number of channels, data format, and other information.

Features: very good sound quality, supported by a large amount of software.

Applications: intermediate files in multimedia development, storing music and sound material.

  2. MP3 encoding

A medium-to-high bit rate MP3 file encoded with LAME sounds very close to the source WAV file.

Features: good sound quality at 128Kbit/s and above, high compression ratio, wide software and hardware support, good compatibility.

Applications: music listening at higher bit rates where compatibility is required.

  3. AAC encoding

Features: excellent performance at bit rates below 128Kbit/s.

Applications: audio encoding below 128Kbit/s, mostly used for the audio track in video.

  4. Ogg encoding

Features: achieves better sound quality than MP3 at a lower bit rate, and performs well at high, medium, and low bit rates; however, compatibility is poor and streaming is not supported.

Applications: audio messages in voice chat.


That covers the audio concepts in the book. But I was still puzzled: for a given audio clip, how does the player know its sampling rate, number of channels, and data format?


Composition of WAV files

So I looked up the article "WAV file format". Simply put, each field in the header has an agreed-upon meaning, and encoding and decoding simply follow that convention.
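
As a concrete sketch of what those 44 bytes contain, here is a small Swift helper that builds the canonical header for uncompressed PCM. The field layout is the standard RIFF/WAVE one; the function and parameter names are just my own illustration, not anything from the book:

```swift
import Foundation

// Builds the canonical 44-byte WAV header for uncompressed PCM data.
func wavHeader(pcmDataSize: UInt32, sampleRate: UInt32,
               channels: UInt16, bitsPerSample: UInt16) -> Data {
    var data = Data()
    let byteRate = sampleRate * UInt32(channels) * UInt32(bitsPerSample) / 8
    let blockAlign = channels * bitsPerSample / 8

    // Appends any fixed-width integer as raw bytes.
    func append<T>(_ value: T) { withUnsafeBytes(of: value) { data.append(contentsOf: $0) } }

    data.append(contentsOf: Array("RIFF".utf8))     // chunk ID
    append((36 + pcmDataSize).littleEndian)         // chunk size = header remainder + PCM data
    data.append(contentsOf: Array("WAVE".utf8))     // format
    data.append(contentsOf: Array("fmt ".utf8))     // sub-chunk 1 ID
    append(UInt32(16).littleEndian)                 // sub-chunk 1 size (16 for PCM)
    append(UInt16(1).littleEndian)                  // audio format: 1 = uncompressed PCM
    append(channels.littleEndian)                   // number of channels
    append(sampleRate.littleEndian)                 // sampling rate
    append(byteRate.littleEndian)                   // bytes per second
    append(blockAlign.littleEndian)                 // bytes per sample frame
    append(bitsPerSample.littleEndian)              // quantization format
    data.append(contentsOf: Array("data".utf8))     // sub-chunk 2 ID
    append(pcmDataSize.littleEndian)                // size of the PCM payload that follows
    return data                                     // exactly 44 bytes
}
```

Writing this header followed by the raw PCM samples is typically enough for a player to recognize the file, which answers the question above: the decoder reads these agreed-upon fields first.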

As for the file formats of other encodings, readers can look them up as needed.


The physical properties of images

  • Red, green, and blue cannot be decomposed into other colors, so they are called the three primary colors.

Say a phone screen has a resolution of 1280 × 720, meaning there are 1280 columns and 720 rows, so the whole screen has 1280 × 720 pixels. Each pixel consists of three sub-pixels, red, green, and blue, which together produce the pixel's color.

Numerical representation of the image

RGB representation

  • Floating-point representation: The value ranges from 0.0 to 1.0. For example, OpenGL ES uses this representation for each sub-pixel.

  • Integer representation: each value ranges from 0 to 255 (00 to FF in hex). Eight bits represent one sub-pixel, so 32 bits represent one pixel; this is the RGBA_8888 format found on some platforms. Another example is RGB_565 on Android, which uses 16 bits per pixel: 5 bits for R, 6 bits for G, and 5 bits for B.

Images are generally described with the integer representation. For example, the size of a 1280 × 720 RGBA_8888 image, which is also the memory its bitmap occupies, is:

1280 × 720 × 32 / 8 = 3,686,400 bytes ≈ 3.516 MB
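
The same arithmetic as a tiny Swift sketch (dimensions from the example above, names mine):

```swift
let width = 1280, height = 720
let bitsPerPixel = 32                                    // RGBA_8888: 8 bits × 4 sub-pixels
let bytes = width * height * bitsPerPixel / 8            // 3,686,400 bytes
print(Double(bytes) / 1024 / 1024)                       // ≈ 3.516 MB
```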

YUV representation

"Y" represents luminance, also known as the gray-scale value; "U" and "V" represent chrominance, which describes the color and saturation of the image and specifies the pixel's color.

In digital video, "U" and "V" correspond to Cb and Cr respectively: Cb reflects the difference between the blue part of the RGB input signal and the luminance value, and Cr the difference between the red part and the luminance value.

If there is only the Y component and no U or V components, the image represented this way is a black-and-white grayscale image.

  • The most common representation is that Y, U, and V are all represented by one byte. Therefore, the value ranges from 0 to 255.

  • The most common YUV sampling format is 4:2:0.

4:2:0 does not mean there are only Y and Cb components with no Cr component; the sampling diagram in the book makes this easy to understand.

The sampling ratios below refer to the UV components. 4:4:4 means full sampling: every Y has its own set of UV components. 4:2:2 means 2:1 horizontal sampling and full vertical sampling: every two Y's share one set of UV components. 4:2:0 means 2:1 horizontal sampling and 2:1 vertical sampling: every four Y's share one set of UV components.

Taking the example from the book, the chroma sampling rate is 4:1, that is, Y:U:V is 4:1:1. For 32 pixels there are 32 Y's, 8 U's, and 8 V's, taking up 48 bytes of storage.
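
To double-check the counts, here is a small Swift sketch of the byte count for a block of pixels at one byte per sample (the helper name and its parameter are my own):

```swift
// ySamplesPerChroma: how many Y samples share one U and one V.
func yuvByteCount(pixels: Int, ySamplesPerChroma: Int) -> Int {
    let y = pixels
    let u = pixels / ySamplesPerChroma
    let v = pixels / ySamplesPerChroma
    return y + u + v                                     // 1 byte per sample
}

print(yuvByteCount(pixels: 32, ySamplesPerChroma: 1))    // 4:4:4 → 96 bytes
print(yuvByteCount(pixels: 32, ySamplesPerChroma: 2))    // 4:2:2 → 64 bytes
print(yuvByteCount(pixels: 32, ySamplesPerChroma: 4))    // 4:2:0 → 48 bytes
```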

YUV and RGB conversion

This part is just a matter of applying the appropriate conversion formulas; a common set is sketched below.
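
The exact coefficients depend on the standard (BT.601 vs BT.709) and on the value range; the following Swift sketch uses one commonly quoted BT.601 full-range set, with R, G, B normalized to 0.0...1.0:

```swift
func rgbToYUV(r: Double, g: Double, b: Double) -> (y: Double, u: Double, v: Double) {
    let y = 0.299 * r + 0.587 * g + 0.114 * b
    let u = 0.492 * (b - y)        // blue difference (Cb)
    let v = 0.877 * (r - y)        // red difference (Cr)
    return (y, u, v)
}

func yuvToRGB(y: Double, u: Double, v: Double) -> (r: Double, g: Double, b: Double) {
    let r = y + 1.140 * v
    let g = y - 0.395 * u - 0.581 * v
    let b = y + 2.032 * u
    return (r, g, b)
}
```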


That's it for images in the book. I also looked into how image files are composed.

Image compression encoding format and file composition

JPG is lossy and PNG is lossless.

For the difference between the two, readers can look it up as needed.

For the JPG file format, check out the article "JPG file format analysis"; the idea is the same as with the WAV file format.

Interestingly, there is a piece of data in the file format that represents the thumbnail.

In iOS development, when displaying photos in an album, the thumbnail can be read first, and the full-size data loaded only when the photo is actually shown.
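
As a sketch of that idea with the Photos framework, the following uses the opportunistic delivery mode, where the result handler may be called first with a quickly available degraded (thumbnail) image and later with the full-quality one. The function name and the way the image view is updated are my own illustration:

```swift
import Photos
import UIKit

func loadImage(for asset: PHAsset, targetSize: CGSize, into imageView: UIImageView) {
    let options = PHImageRequestOptions()
    options.deliveryMode = .opportunistic        // thumbnail first, full quality later
    options.isNetworkAccessAllowed = true        // allow fetching iCloud originals

    PHImageManager.default().requestImage(for: asset,
                                          targetSize: targetSize,
                                          contentMode: .aspectFill,
                                          options: options) { image, info in
        // true for the degraded (thumbnail) pass, false once the full image arrives
        let isDegraded = (info?[PHImageResultIsDegradedKey] as? Bool) ?? false
        imageView.image = image
        if !isDegraded {
            // e.g. hide a loading indicator here
        }
    }
}
```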


Video encoding

The captured video source contains redundant information. Inter-frame and intra-frame coding techniques can remove this redundancy in time and space respectively. See the linked article for how and why this works; it is not covered in the book.

After removing this series of redundant information, the amount of video data is greatly reduced, which makes video much easier to store and transmit. This process of removing redundancy is called compression coding.


There are many compression coding standards. The most widely used is H.264 (AVC), formulated by the ITU-T, which is known for high compression, high quality, and support for streaming over a variety of networks.

Principle: grouping. Frames within a group can reference one another so that redundant information can be removed.

Let's start with the concepts:

  • IPB frame

    • I frame: intra-coded frame, encoded using only the data within the frame itself. An I frame is usually the first frame of each GOP.
    • P frame: forward-predicted frame, also called a predictive frame.
    • B frame: bidirectionally predicted (interpolated) frame, also called a bidirectional predictive frame.

Generally speaking, the compression ratio of I frames is about 7, that of P frames about 20, and that of B frames can reach 50.

  • GOP

The group of pictures between two I frames forms a GOP (Group Of Pictures). gop_size is the number of frames between two I frames.

  • IDR frame

As soon as the decoder receives an IDR frame, it immediately clears the reference frame buffer and treats the IDR frame as the new reference frame.

  • PTS and DTS

These are two different timestamps. DTS (Decoding Time Stamp) is mainly used for video decoding, while PTS (Presentation Time Stamp) is mainly used for synchronization and display output after decoding.


B frames are predicted from both earlier and later frames, so decoding order no longer matches display order. Once B frames are involved, PTS and DTS are bound to differ.
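
A toy illustration in Swift of why the two orders diverge (the frame values are made up for the example, not real encoder output):

```swift
struct Frame { let type: String; let dts: Int; let pts: Int }

// Display order is I B B P, but the two B frames need the later P frame as a
// reference, so the decoder must receive the P frame before the B frames.
let stream = [
    Frame(type: "I", dts: 0, pts: 0),
    Frame(type: "P", dts: 1, pts: 3),
    Frame(type: "B", dts: 2, pts: 1),
    Frame(type: "B", dts: 3, pts: 2),
]

let decodeOrder  = stream.sorted { $0.dts < $1.dts }.map { $0.type }   // ["I", "P", "B", "B"]
let displayOrder = stream.sorted { $0.pts < $1.pts }.map { $0.type }   // ["I", "B", "B", "P"]
print(decodeOrder, displayOrder)
```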


In addition to H.264, there are the MPEG standards developed by ISO's Moving Picture Experts Group; the MPEG algorithms are compression algorithms designed for moving video.

H.264 and MPEG are just encoding algorithms; the file extension depends on which container (encapsulation) format you choose.


That concludes the first chapter of the book.

Video compression encoding format and file composition

Compared with audio and images, the file composition of video is much more complicated.

  • MP4 file format brief analysis

  • Mp4 file format parsing

  • Video file format — Video encapsulation format — Video encoding format distinction



2019.5.17 update

Looking back at this chapter, I feel the need to add a little knowledge.

Encapsulation (muxing): packaging data that has been processed by an encoding algorithm into a file of a certain type. For example, PCM -> WAV: the PCM data is not compressed by any encoding algorithm; some information describing the audio is added as a header, and the result is packaged into a .wav file.

Unpacking (demuxing): the package carries information describing the audio and video inside it; when unpacking, the binary data is interpreted according to that description.

Encoding: recording the sampled and quantized digital data in a certain format, such as sequential storage or compressed storage. For example, video is compression-encoded to remove spatially redundant information: instead of storing every pixel of every frame, descriptive information is stored that tells the decoder how to reconstruct the frames.

Decoding: turning compressed, encoded video/audio data back into uncompressed raw video/audio data. For the video encoded above, each frame's image is regenerated according to that descriptive information.

Audio-video synchronization: keeping the picture and the sound in step. Both video and audio carry timestamps. There are three synchronization strategies, discussed in Chapter 3.

Protocol/unprotocol: analogous to a network protocol. Whatever protocol the stream is transmitted with, the receiving end processes it accordingly. Besides audio and video data, the protocol can also carry playback control instructions.

So when you receive a video stream from the network, the processing looks roughly like this:
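
A rough Swift sketch of that pipeline, with step names of my own rather than the book's:

```swift
// network stream → unprotocol → unpack (demux) → decode → sync → render
enum PlayerStep {
    case unprotocol   // strip the streaming protocol to get the container data
    case demux        // unpack the container into compressed audio and video streams
    case decode       // decode into raw PCM samples and raw video frames
    case sync         // align audio and video using their timestamps
    case render       // play the PCM and display the frames
}

let pipeline: [PlayerStep] = [.unprotocol, .demux, .decode, .sync, .render]
```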


References

  • [1] Zhan Xiaokai, Wei Xiaohong. Advanced Guide to Audio and Video Development: Based on Android and iOS Practice [M]. Beijing: China Machine Press, 2018: 1-13.

  • Zero-based learning method for video and audio codec technology