
This article was posted by Goo in the Tencent Cloud + Community.

Why are there so many audio and video formats in everyday life? What mechanisms support live streaming, video on demand, and real-time video? Faced with such a tangle of audio and video knowledge, how should we learn it? With a quick exploration, audio and video technology is no longer mysterious.

Preface

We are often both familiar and unfamiliar with a technology: we can skillfully use platform APIs to meet all kinds of requirements, and we understand the platform's features, frameworks, and principles. Yet as we dig deeper into the technical details, we discover a fundamental and far-reaching blind spot in our knowledge.

Limiting ourselves to platform API development won't get us very far. The key to breaking through the bottleneck of technical growth lies in combining technical depth with business direction, which requires us to accumulate and deepen our knowledge. This article shares the author's path toward building a knowledge network of audio and video technology, in the hope that it helps everyone.

1. Collection – Where does the data come from?

1.1 Sampling Principle

Definition: the process of converting an analog signal into a digital signal by discretizing an image that varies continuously in spatial coordinates; in other words, sampling the image.

In layman’s terms: collection is the process of converting what you see into a binary stream.

1.2 Basic Concepts

1.2.1 Image

“Image” here is a collection-level concept: a frame, a top field, or a bottom field can all be called an image.

  • Frame: a frame is usually a complete image; with progressive scanning, each scan of the signal produces one frame.
  • Top field and bottom field: when capturing a video signal, scanning is either progressive or interlaced. With progressive scanning, each scan yields a complete image. With interlaced scanning (odd and even lines), each scanned frame is split into two parts; each part is called a “field”, and they are named the “top field” and the “bottom field” according to their order.
  • Interlaced scanning: each frame is split into two fields that are displayed alternately, usually by scanning the odd-numbered lines as the first field and the even-numbered lines as the second field. Thanks to persistence of vision, the human eye sees smooth motion rather than flashing half-frames. Some flicker remains; it is hard to notice but tires the eyes, and it becomes especially visible as jagged artifacts when the picture contains horizontal stripes.
  • Progressive scanning: the entire frame is displayed at once; every scan draws the whole frame. If the frame rate of progressive scanning equals the field rate of interlaced scanning, the human eye sees a smoother picture with less flicker than interlaced scanning. Because the electron beam scans each image line by line in sequence, the process is called progressive scanning.
  • The difference between the two: take, for example, 25 fps video with 100 lines per image. Interlaced scanning performs 50 scans per second, but each scan covers only 50 lines; progressive scanning performs only 25 scans per second, but covers all 100 lines each time. Conclusion: the scan frequency of interlaced scanning is double that of progressive scanning, while the channel bandwidth is half. In the early days most displays used interlaced scanning, because the channel bandwidth was halved while the perceived image quality dropped very little. (A small arithmetic sketch follows this list.)
  • See also: a detailed explanation of progressive and interlaced scanning.
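
To make the bandwidth trade-off above concrete, here is a minimal arithmetic sketch in Python. It only re-states the 25 fps / 100-line example from the list; the variable names and printout are illustrative, not part of any real scanning system.

```python
# Illustrative arithmetic for the 25 fps, 100-line example above.
frame_rate = 25          # complete images per second
lines_per_frame = 100

# Progressive scanning: every scan draws the whole frame.
prog_scans_per_sec = frame_rate                  # 25 scans/s
prog_lines_per_scan = lines_per_frame            # 100 lines per scan

# Interlaced scanning: each frame is split into two fields (odd/even lines).
inter_scans_per_sec = frame_rate * 2             # 50 scans/s (field rate)
inter_lines_per_scan = lines_per_frame // 2      # 50 lines per scan

print(f"progressive: {prog_scans_per_sec} scans/s, {prog_lines_per_scan} lines each")
print(f"interlaced : {inter_scans_per_sec} scans/s, {inter_lines_per_scan} lines each")

# To refresh the screen 50 times per second, progressive scanning would need
# 50 * 100 = 5000 lines/s, while interlacing needs only 50 * 50 = 2500 lines/s:
# double the scan frequency of 25 fps progressive, at half the bandwidth of
# 50 fps progressive.
```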

1.2.2 Color model

RGB color model

RGB stands for red, green, and blue. Each pixel needs three values, one per color channel; each value takes 1 byte, so one pixel takes 3 bytes, or 24 bits.

Is there a more efficient color model? YUV.

YCbCr color model

As a member of the YUV family, the YCbCr color model is characterized by separating the luminance signal Y from the chrominance signals Cb and Cr. Even when the chrominance signals are missing, the Y signal alone can still represent a black-and-white image.

Y = kr*R + kg*G + kb*B

Y is the luminance, and kr, kg, kb are the weights of R, G, and B respectively.

Cr = R - Y; Cg = G - Y; Cb = B - Y
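
As a minimal sketch of the formulas above, the function below converts one RGB pixel to Y, Cb, Cr. It assumes the BT.601 luma weights (kr = 0.299, kg = 0.587, kb = 0.114) and, like the article's simplified formulas, omits the scaling and offsets that real standards apply to the color differences.

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple:
    """Convert one RGB pixel (0..255 per channel) to (Y, Cb, Cr).

    Assumes BT.601 luma weights; real pipelines also scale and offset the
    color-difference signals, which is omitted here for clarity.
    """
    kr, kg, kb = 0.299, 0.587, 0.114
    y = kr * r + kg * g + kb * b   # luminance
    cb = b - y                     # blue color difference
    cr = r - y                     # red color difference
    return y, cb, cr

# A pure-red pixel still carries luminance even if Cb/Cr are discarded.
print(rgb_to_ycbcr(255, 0, 0))
```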

Question: compared with the RGB model, the YCbCr model still needs three values per pixel. Why is it more efficient?

Optimization idea

Human eyes are more sensitive to brightness resolution than to color.

Given this property of human vision, the obvious place to start is the color information, so “chroma subsampling” was proposed: it can cut color storage by half or more, is easy to implement, adds little coding overhead, and yields a large payoff.

Optimization implementation

As each scan line is scanned, the chrominance values are transmitted at a lower frequency than the luminance values. There are various chroma sampling schemes; they are usually expressed relative to the luminance samples in the form 4:X:Y, where X and Y describe how many chrominance samples are kept for every four luminance samples in two consecutive rows:

Here’s another example:

We have an image with an array of pixels:

There will be the following sampling optimization methods:

As the figure above shows intuitively, the YCbCr color model does not require every pixel to carry all three components; after chroma subsampling, the amount of chrominance data is reduced substantially.
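
As a rough, back-of-the-envelope check of how much chroma subsampling saves, the sketch below compares the raw size of an 8-bit 1080p frame under a few common schemes. The numbers assume one byte per sample and ignore any compression; they are only meant to illustrate the idea.

```python
def frame_bytes(width: int, height: int, scheme: str) -> int:
    """Approximate raw frame size in bytes for 8-bit YCbCr (no compression)."""
    luma = width * height          # one Y sample per pixel
    chroma_factor = {
        "4:4:4": 1.0,              # full-resolution Cb and Cr
        "4:2:2": 0.5,              # chroma halved horizontally
        "4:2:0": 0.25,             # chroma halved horizontally and vertically
    }[scheme]
    chroma = 2 * luma * chroma_factor   # Cb plane + Cr plane
    return int(luma + chroma)

for scheme in ("4:4:4", "4:2:2", "4:2:0"):
    print(f"{scheme}: {frame_bytes(1920, 1080, scheme) / 1_000_000:.1f} MB per 1080p frame")
```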

1.3 Image perception and acquisition


Imaging sensor

  1. A sensor combines input electrical power with a material that responds to the particular type of energy being detected.
  2. The incoming light energy is converted into a voltage waveform.
  3. The amplitude and spatial behavior of the waveform are related to the physical phenomenon being sensed. To produce a digital image, the next steps are sampling and quantization.

1.4 Sampling and quantization

Take a black-and-white image as an example: image (a) is a continuous image, and converting it to digital form requires the following main operations:

  1. Sampling: the image in (a) is sampled at equal intervals along the line segment AB, giving the gray-level curve in (b).

  2. Quantization:

    In (c), the gray scale on the right is divided into 8 gray levels, and the continuous gray value of each sample is quantized to one of those 8 levels, giving (d). The quantized output of the sensor completes the process of generating a digital image.

    (Figure: a. image projected onto the sensor array; b. result of image sampling and quantization)
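
The sampling-and-quantization flow is easy to mimic in code. Below is a self-contained sketch (the sine-wave "brightness" values are made up purely for illustration) that quantizes continuous samples taken along one scan line into 8 gray levels, as in the example above.

```python
import math

# Fake "continuous" brightness along one scan line, values in [0.0, 1.0].
# In a real sensor these would come from the sampled voltage waveform.
samples = [0.5 + 0.5 * math.sin(2 * math.pi * i / 32) for i in range(32)]

LEVELS = 8  # quantize into 8 discrete gray levels

def quantize(value: float, levels: int = LEVELS) -> int:
    """Map a brightness in [0, 1] to an integer gray level in [0, levels - 1]."""
    return min(levels - 1, int(value * levels))

digital_line = [quantize(v) for v in samples]
print(digital_line)  # e.g. [4, 4, 5, 6, 6, 7, 7, ...]
```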

2. Rendering – How is the data presented?

2.1 Principles of the Player

To play a video from the Internet, a player needs to go through the following core steps: protocol parsing, decapsulation (demuxing), decoding, and audio-video synchronization.

  1. **Protocol parsing:** parses the streaming-media protocol data into standard container-format data. Besides audio and video, a streaming protocol also carries signaling data such as playback-control commands and network status descriptions. Common streaming protocols include HTTP, RTMP, and MMS.
  2. **Decapsulation:** separates the container-format data obtained from protocol parsing into compressed audio-stream data and compressed video-stream data. The container format, also called the encapsulation format, packs the encoded video track and audio track into a single file according to a defined layout. **Note** that two files with the same container format do not necessarily use the same codecs; the file suffix usually tells you the container format. Common container formats: AVI, RMVB, MP4, FLV, MKV, and so on. (A small file-inspection sketch follows this list.)
  3. **Decoding:** turns the compressed audio and video data into uncompressed raw audio and video data. Audio coding standards include AAC, MP3, AC-3, etc.; video coding standards include H.264, MPEG-2, VC-1, etc. Encoding and decoding are the core and most complex part of the whole pipeline.
  4. **Audio-video synchronization:** synchronizes the decoded audio and video data using the timing information obtained during decapsulation, and finally hands the data to the system, which calls the hardware to play it.
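
A quick way to see the container format and codecs of a real file is to inspect it with ffprobe, which ships with FFmpeg (mentioned later in this article). The sketch below shows one common invocation; the file name is a placeholder.

```python
import json
import subprocess

# Ask ffprobe (part of FFmpeg) to describe the container and its streams.
# "sample.mp4" is a placeholder path, not a file from this article.
result = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_format", "-show_streams", "sample.mp4"],
    capture_output=True, text=True, check=True,
)
info = json.loads(result.stdout)

print("container:", info["format"]["format_name"])
for stream in info["streams"]:
    print(stream["codec_type"], "->", stream["codec_name"])
```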

2.2 Video coding mode

Video codec is the process of digital video compression and decompression.

When selecting an audio and video coding scheme, you need to weigh a number of factors: video quality, bit rate, the complexity of the encoding and decoding algorithms, robustness against data loss and errors, ease of editing, random access, maturity of the algorithm design, end-to-end delay, and so on.

2.2.1 Overview of the H.26X series

The H.26X series, led by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T), includes H.261, H.262, H.263, H.264, and H.265.

  • H.261 was mainly used in older video conferencing and video telephony systems. It was the first digital video compression standard to see practical use, and virtually all subsequent standard video codecs build on it.
  • H.262, equivalent to MPEG-2 Part 2, is used in DVD, SVCD, and most digital video broadcasting and cable distribution systems.
  • H.263 is mainly used in video conferencing, video telephony, and network video products. It represents a significant improvement over its predecessors in compressing progressive video sources; especially at low bit rates, it can save a great deal of bandwidth while maintaining a given quality.
  • H.264, equivalent to MPEG-4 Part 10 and also known as Advanced Video Coding (AVC), is a video compression standard and a widely used format for high-definition video recording, compression, and distribution. The standard introduced a series of new techniques that greatly improve compression performance, clearly surpassing earlier standards at both high and low bit rates.
  • H.265, known as High Efficiency Video Coding (HEVC), is the successor to H.264. HEVC is expected not only to improve image quality but also to reach roughly twice the compression ratio of H.264 (equivalent to a 50% reduction in bit rate at the same picture quality), and to support 4K and even ultra-high-definition TV with resolutions up to 8192×4320 (8K), which is the current direction of development. (A transcoding sketch follows this list.)
  • A detailed explanation will be covered in a separate article.
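
In practice, moving content from H.264 to H.265 is often a single FFmpeg invocation. The sketch below shows one common form, assuming an FFmpeg build that includes libx265; the file names and the CRF value are placeholders, not recommendations.

```python
import subprocess

# Re-encode an H.264 MP4 to H.265/HEVC (requires FFmpeg built with libx265).
# "input.mp4", "output.mp4" and crf=28 are illustrative placeholders.
subprocess.run(
    ["ffmpeg",
     "-i", "input.mp4",   # source file
     "-c:v", "libx265",   # video codec: H.265/HEVC
     "-crf", "28",        # quality target (lower = better quality, larger file)
     "-c:a", "copy",      # keep the audio stream untouched
     "output.mp4"],
    check=True,
)
```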

2.2.2 Overview of MPEG series

The MPEG series is developed by the Moving Picture Experts Group (MPEG) under the International Organization for Standardization (ISO).

  • MPEG-1 Part 2 was mainly used on VCD and is also found in some online video. Its quality is roughly equivalent to that of the original VHS videotape.
  • MPEG-2 Part 2, equivalent to H.262, is used in DVD, SVCD, and most digital video broadcasting and cable distribution systems.
  • MPEG-4 Part 2 can be used for network transmission, broadcasting, and media storage. It improves compression performance over MPEG-2 Part 2 and the first version of H.263.
  • MPEG-4 Part 10, equivalent to H.264, is a standard defined jointly by the two standards organizations.
  • A detailed explanation will be covered in a separate article.

2.3 Audio codec mode

In addition to video, audio also needs to be encoded. Common audio encoding formats include:

  • Advanced Audio Coding (AAC) is an MPEG-2-based audio coding technology launched in 1997 by Fraunhofer IIS, Dolby Laboratories, AT&T, Sony, and others. After the MPEG-4 standard appeared in 2000, AAC was extended with SBR and PS technologies; to distinguish it from traditional MPEG-2 AAC, this version is also called MPEG-4 AAC. (AAC details will be covered in a separate article.)
  • MP3, that is MPEG-1 or MPEG-2 Audio Layer III, is a once-ubiquitous digital audio encoding and lossy compression format designed to drastically reduce the amount of audio data. It was invented and standardized in 1991 by a group of engineers at the Fraunhofer-Gesellschaft research organization in Erlangen, Germany. The popularity of MP3 had an enormous impact on the music industry.
  • WMA (Windows Media Audio) is a digital audio compression format developed by Microsoft; it includes both lossy and lossless variants. (A small encoding sketch follows this list.)
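
Encoding audio into one of these formats is likewise a short FFmpeg command. Below is a minimal sketch using FFmpeg's built-in AAC encoder; the file names and bit rate are placeholder choices.

```python
import subprocess

# Encode a WAV file to AAC with FFmpeg's built-in encoder.
# "voice.wav", "voice.m4a" and the 128k bit rate are illustrative placeholders.
subprocess.run(
    ["ffmpeg",
     "-i", "voice.wav",
     "-c:a", "aac",    # audio codec: AAC
     "-b:a", "128k",   # target audio bit rate
     "voice.m4a"],
    check=True,
)
```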

3. Processing – How is the data processed?

Audio and video processing is where the core business requirements live, and it is also where developers have the most freedom: all kinds of eye-catching effects can be built on top of it.

**Common image and video processing methods:** beautification, cropping, scaling, rotation, overlaying, encoding/decoding, and so on.

**Common audio processing methods:** resampling, noise reduction, echo cancellation, mixing, encoding/decoding, etc.
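
A few of the image operations listed above (cropping, scaling, rotation) can be sketched in a handful of lines with OpenCV, one of the frameworks listed next. The input and output file names are placeholders.

```python
import cv2  # OpenCV (pip package: opencv-python)

# "frame.png" is a placeholder input image.
img = cv2.imread("frame.png")

cropped = img[0:200, 0:300]                          # crop: rows 0-199, cols 0-299
scaled = cv2.resize(img, None, fx=0.5, fy=0.5)       # scale to 50% in both dimensions
rotated = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)   # rotate 90 degrees clockwise

cv2.imwrite("cropped.png", cropped)
cv2.imwrite("scaled.png", scaled)
cv2.imwrite("rotated.png", rotated)
```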

Common frameworks:

  1. Image processing: OpenGL, OpenCV, libyuv, FFmpeg, etc.
  2. Video codecs: x264, OpenH264, FFmpeg, etc.
  3. Audio processing: SpeexDSP, FFmpeg, etc.
  4. Audio codecs: libfaac, Opus, Speex, FFmpeg, etc.

(See also: a summary of open-source audio and video development projects.)

4. Transmission – How is data transmitted?

4.1 Streaming Media Protocol

Streaming media refers to media delivered over the Internet as a continuous stream, and a streaming media protocol is the set of rules that the server and the client follow when communicating. Any discussion of audio and video transmission has to mention streaming protocols. Common streaming media protocols include:

| Protocol | Overview | Characteristics | Application scenarios |
| --- | --- | --- | --- |
| RTP | Real-time Transport Protocol, a network transport protocol that specifies the standard packet format for delivering audio and video over the Internet. | Implemented on top of UDP. | Commonly used in streaming media systems, in conjunction with RTSP. |
| RTCP | Real-time Transport Control Protocol, a sister protocol of RTP that provides out-of-band control for RTP media streams. | Does not transmit media data itself; it works alongside RTP, periodically exchanging control data between the participants in a streaming session. | Provides feedback on the quality of service delivered by RTP. |
| RTSP | Real Time Streaming Protocol, which defines how one-to-many applications can efficiently transmit multimedia data over IP networks. | Sits architecturally above RTP and RTCP and can use TCP or UDP for data transfer. | With RTSP, both the client and the server can issue requests, i.e. RTSP is bidirectional. |
| RTMP | Real Time Messaging Protocol, an open protocol developed by Adobe Systems for audio, video, and data transfer between Flash players and servers. | Based on TCP; a protocol family that includes RTMP, RTMPT, RTMPS, and RTMPE. | Designed for real-time data communication, mainly used for audio, video, and data exchange between the Flash/AIR platform and streaming/interactive servers that support RTMP. |
| RTMFP | Real Time Media Flow Protocol, a newer communication protocol developed by Adobe. | Based on UDP; supports both client/server and P2P modes. | Enables direct communication between end users running Adobe Flash Player. |
| HTTP | HyperText Transfer Protocol, which runs on top of TCP. | A very familiar protocol. | Can also be used for video services. |
| HLS | HTTP Live Streaming, an HTTP-based streaming protocol implemented by Apple that supports both live and on-demand streaming. | The client continuously downloads and plays short media segments (MPEG-TS format). Because the data travels over plain HTTP, firewalls and proxies pose no problem, and because segments are short, the client can switch bit rates to adapt to changing bandwidth. These characteristics also mean HLS latency is generally higher than that of ordinary live-streaming protocols. | Mainly used on iOS, providing live and on-demand audio and video for iOS devices such as the iPhone and iPad. |
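
To make the HLS row concrete: FFmpeg can segment a file into an HLS playlist plus MPEG-TS chunks in one command. A minimal sketch; the file names and the 4-second segment length are placeholders.

```python
import subprocess

# Segment an MP4 into an HLS playlist (.m3u8) plus MPEG-TS chunks with FFmpeg.
# "input.mp4", "index.m3u8" and the 4-second segment length are placeholders.
subprocess.run(
    ["ffmpeg",
     "-i", "input.mp4",
     "-c", "copy",           # keep the existing (e.g. H.264/AAC) streams
     "-f", "hls",            # HLS muxer
     "-hls_time", "4",       # target segment duration in seconds
     "-hls_list_size", "0",  # keep every segment in the playlist (VoD style)
     "index.m3u8"],
    check=True,
)
```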

4.2 Network video on demand service

| Company | Protocol | Container | Video codec | Audio codec | Player |
| --- | --- | --- | --- | --- | --- |
| CNTV | HTTP | MP4 | H.264 | AAC | Flash |
| CNTV (part) | RTMP | FLV | H.264 | AAC | Flash |
| TV TV | HTTP | MP4 | H.264 | AAC | Flash |
| Youku | HTTP | FLV | H.264 | AAC | Flash |
| Tudou | HTTP | F4V | H.264 | AAC | Flash |
| com | HTTP | FLV | H.264 | AAC | Flash |
| YinYueTai | HTTP | MP4 | H.264 | AAC | Flash |
| LeTV | HTTP | FLV | H.264 | AAC | Flash |
| Sina Video | HTTP | FLV | H.264 | AAC | Flash |

Using HTTP for network video-on-demand services has two advantages:

  • HTTP is an application-layer protocol on top of TCP, so media data is not lost in transit, which protects video quality.
  • HTTP is supported by virtually every web server, so a streaming service does not need to invest in separate streaming servers, which saves cost.

As for the container formats, MP4, FLV, and F4V are just containers and make little difference; what matters are the audio and video codecs, H.264 and AAC, which remain the most widely used coding standards.

4.3 Network Live video service

| Company | Protocol | Container | Video codec | Audio codec | Player |
| --- | --- | --- | --- | --- | --- |
| TV TV | RTMP | FLV | H.264 | AAC | Flash |
| A back | RTMP | FLV | H.264 | AAC | Flash |
| China Education Television | RTMP | FLV | H.264 | AAC | Flash |
| Beijing Media Mobile TV | RTMP | FLV | H.264 | AAC | Flash |
| Shanghai IPTV | RTSP+RTP | TS | H.264 | MP2 | Set-top box |

The advantage of using RTMP as the protocol for live network video is that it is directly supported by the Flash player, which was extremely widespread in the PC era and integrates well with the browser. A streaming platform built this way can offer essentially "plugin-free" live broadcasting, which greatly lowers the cost for users.

The container format, video codec, audio codec, and player are almost always FLV, H.264, AAC, and Flash respectively. FLV, RTMP, and Flash are all Adobe products, so they naturally fit well together.

4.4 Summary

The data above is old data from the PC era. With the explosion of the mobile Internet and the popularity of H5 and native client applications, the industry's technology choices for video services are gradually shifting, and we need to make appropriate technology selections based on the current reality and the direction technology is heading.

Conclusion

Audio and video technology is a long road. This article aims to build a knowledge network of audio and video fundamentals; many topics have not been covered in depth, and we still need to keep learning and practicing, exploring with a spirit that pursues excellence. Let's grow together, quickly!
