Original article: Blogs on the Road

Full text Reference:

  • 1. Audio and video testing suggestions (Tencent Audio and Video Lab Quality Platform Group)
  • 2. Audio and video test on Android terminal (NetEase Yunxin)
  • 3. How to evaluate and optimize the video quality of Tencent Conference?
  • 4. Detailed differences among UGC, PGC and OGC
  • 5. Audio and video quality assessment Green paper
  • 6. Voice quality assessment
  • 7. Paper on speech enhancement and quality assessment
  • 8. VIPKID audio and video quality assessment and perception system

Congratulations to myself: working on this solution led to my first pull request to a well-known open source project.

1. Background

Our video call feature involves real-time audio and video quality assessment.

The audio and video transmission process is as follows:

As shown in the figure, factors affecting audio and video quality include:

  • Acquisition quality of the source video (determined by hardware);
  • Service quality of the audio and video SDK (determined by the SDK provider);
  • Network conditions;

Dedicated test items for real-time video transmission quality (under different network environments) include:

  • Performance, bit rate, resistance, delay, and audio-video synchronization (guaranteed by the SDK service, which provides the technical indicators)
  • Lag (fluency): tested on the mobile terminal
  • Video quality (previously manual): evaluated using open source algorithms

Image quality is not necessarily video quality, so be careful. If in doubt, see the picture below (source: Tencent Conference).

2. Audio and video quality evaluation scheme

2.1 Video evaluation scheme

Video quality assessment is dedicated to assessing the perceived quality of video by human eyes. Generally speaking, there are two assessment methods:

  • Subjective quality assessment: human viewers watch and score the video. These scores are the most accurate, but collecting them is time-consuming and hard to deploy at scale.
  • Objective quality assessment: computes a quality score for the damaged (distorted) video. An objective algorithm is itself evaluated by the correlation coefficient between its scores and subjective scores; generally, the higher the correlation, the better the algorithm.

Objective quality assessment algorithms can be divided into three categories, mainly depending on whether lossless source video is used as a reference.

  • Full reference: PSNR, for example, is a typical full-reference algorithm; it measures the quality of the damaged video by comparing it with the source video at various levels (a minimal PSNR sketch follows this list).
  • No reference: some algorithms use only the video at the receiving end, without the source video, to measure its quality.
  • Partial reference: for example, a feature vector is extracted from the source video and sent to the client along with the damaged video to compute the quality. Full reference is normally impractical for video conferencing, because the lossless source video cannot be sent to the client or elsewhere to compute quality there. What we did this time is transform the typical real-time conferencing scenario into one that can be optimized offline with a full-reference algorithm.
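To make the full-reference idea concrete, here is a minimal PSNR sketch in plain NumPy; the hypothetical load_frame helper and the file names are placeholders, and both frames are assumed to be decoded arrays of identical shape.

import numpy as np

def psnr(ref: np.ndarray, dist: np.ndarray, peak: float = 255.0) -> float:
    """Full-reference PSNR between a source frame and a damaged frame."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * np.log10(peak ** 2 / mse)

# Hypothetical usage: compare a source frame against the received frame.
# ref_frame = load_frame("source.yuv")      # placeholder loader
# dist_frame = load_frame("received.yuv")
# print(psnr(ref_frame, dist_frame))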

Performance of different video algorithms on video database:

According to our survey, Netflix's VMAF is taken as the baseline open source algorithm for video quality evaluation for now; Tencent's open source DVQA will be added to the evaluation later (the DVQA model was trained on PGC video and is not suitable for UGC user scenarios). Current application scenarios of the DVQA open source version:

(1) Netflix VMAF

  • Official introduction/source address
  • The official VMAF installation guide has many pitfalls in practice; the Mac installation guide is recommended instead.
  • Principle: Video Multimethod Assessment Fusion (VMAF) predicts subjective quality by combining multiple elementary quality metrics. The rationale is that each elementary metric has its own strengths and weaknesses with respect to source content characteristics, artifact types, and degree of distortion. By fusing the elementary metrics with a machine learning algorithm (in our case a support vector machine (SVM) regressor) that assigns a weight to each of them, the final metric preserves the strengths of the individual metrics and delivers a more accurate final score.
  • Development notes: feature extraction in the VDK core (including elementary metric calculation) is computationally intensive, so it is written in C for efficiency; the control code is written in Python for rapid prototyping.
  • Basic indicators for reference:
    • Visual Information Fidelity (VIF): a widely adopted image quality metric, premised on the idea that image quality is complementary to a measure of information fidelity loss. In its original form, the VIF score is a fidelity-loss measure combining four scales. In VMAF, a modified version of VIF is adopted in which the fidelity loss at each scale is used as an elementary metric.
    • Detail Loss Metric (DLM): an image quality metric based on the principle of separately measuring the loss of detail that affects content visibility and the redundant impairment that distracts the viewer. The original metric combines DLM and the Additive Impairment Measure (AIM) into a final score; VMAF uses only DLM as an elementary metric. Special attention is paid to corner cases, such as black frames, where the numerical computation of the original formula breaks down.
    • Motion: a simple measure of the temporal difference between adjacent frames, computed as the average absolute pixel difference of the luminance component.
  • Use:
PYTHONPATH=python ./python/vmaf/script/run_vmaf.py \
  yuv420p 576 324 \
  python/test/resource/yuv/src01_hrc00_576x324.yuv \
  python/test/resource/yuv/src01_hrc01_576x324.yuv \
  --out-fmt json
  • Results:

This produces the following JSON output:

"aggregate": {
    "VMAF_feature_adm2_score": 0.93458780776205741,
    "VMAF_feature_motion2_score": 3.8953518541666665,
    "VMAF_feature_vif_scale0_score": 0.36342081156994926,
    "VMAF_feature_vif_scale1_score": 0.76664738784617292,
    "VMAF_feature_vif_scale2_score": 0.86285338927816291,
    "VMAF_feature_vif_scale3_score": 0.91597186913930484,
    "VMAF_score": 76.699271371151269,
    "method": "mean"
}
  • VMAF_score is the final score, ranging from 0 (worst) to 100 (best).
  • ADM2 and VIF_scalex scores range from 0 (worst) to 1 (best).
  • Motion2 scores range from 0 (stationary) to 20 (high-speed motion).
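For batch use it can help to call the script from Python and parse its JSON output. A minimal sketch, assuming the repository layout and sample files from the command above and that the script prints the JSON shown above to stdout:

import json
import os
import subprocess

# Mirrors the shell invocation above; paths are the repo's sample files.
cmd = [
    "./python/vmaf/script/run_vmaf.py",
    "yuv420p", "576", "324",
    "python/test/resource/yuv/src01_hrc00_576x324.yuv",
    "python/test/resource/yuv/src01_hrc01_576x324.yuv",
    "--out-fmt", "json",
]
env = {**os.environ, "PYTHONPATH": "python"}  # as in the shell example
result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)

aggregate = json.loads(result.stdout)["aggregate"]
print(aggregate["VMAF_score"])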

2.2 Audio evaluation scheme

There are many audio quality evaluation algorithms; PESQ and STOI were selected here for their stability and the evaluation dimensions they cover. For related background and code on audio quality evaluation, see testerhome.com/topics/2505…

(1) PESQ

  • Git:github.com/vBaiCai/pyt…
  • Function: given the original file and the file to be evaluated, it outputs a PESQ score between -0.5 and 4.5; the higher the score, the better the voice quality.
  • The PESQ algorithm takes a noisy, degraded signal and the original reference signal. After level adjustment, input filtering, time alignment and compensation, and an auditory transform of the two speech signals, parameters are extracted from each signal, and the time and frequency characteristics are combined into a PESQ score, which is mapped to the subjective mean opinion score (MOS). A minimal usage sketch follows.
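A minimal sketch using the pesq package from PyPI (a different implementation from the repository linked above, swapped in here for illustration); the file paths are placeholders, and both signals must share a sample rate of 8000 or 16000 Hz.

from scipy.io import wavfile
from pesq import pesq

fs, ref = wavfile.read("reference.wav")   # placeholder paths
_, deg = wavfile.read("degraded.wav")

# 'wb' = wide-band mode (16 kHz); use 'nb' for narrow-band (8 kHz).
score = pesq(fs, ref, deg, "wb")
print(score)  # -0.5 (worst) .. 4.5 (best)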

(2) Short-Time Objective Intelligibility (STOI)

STOI (Short-Time Objective Intelligibility) evaluates the intelligibility of speech in noise, where the noise masks the speech in the time domain or weights it in the frequency domain via the short-time Fourier transform. STOI is computed from the clean speech signal and its noisy/processed counterpart as an intermediate measure d(k, m) per audio channel k (k = 1, …, K) and per 400 ms short-time segment m (m = 1, …, M):

  • First, the short-time Fourier transform (STFT) is applied to the clean and noisy speech signals, giving the short-time energy spectra |X(j, n)|² and |Y(j, n)|² of the j-th frequency bin at the n-th frame.
  • |X(j, n)|² and |Y(j, n)|² are summed over 1/3-octave bands to obtain the energy spectra |X(k, n)|² and |Y(k, n)|² of the k-th audio channel.
  • The noisy-speech energy spectrum |Y(k, n)|² is clipped so that the signal-to-distortion ratio is no lower than -15 dB.
  • The intermediate measure d(k, m) is the correlation coefficient between the clean and noisy band energy envelopes |X(k, n)|² and |Y(k, n)|² over segment m.

The STOI score d is the average intelligibility over all frequency bands and segments of the noisy speech:

  d = (1 / (K · M)) · Σ(k, m) d(k, m)

STOI scores are obtained by comparing pure speech with speech to be evaluated, with the value ranging from 0 to 1. The higher the value, the better the voice quality.

Git:github.com/mpariente/p…
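A minimal sketch using the pystoi package from the repository above; file paths are placeholders, and both signals must be aligned and share the same sample rate.

from scipy.io import wavfile
from pystoi import stoi

fs, clean = wavfile.read("clean.wav")       # placeholder paths
_, denoised = wavfile.read("denoised.wav")

d = stoi(clean, denoised, fs, extended=False)
print(d)  # 0 (unintelligible) .. 1 (fully intelligible)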

2.3 Fluency evaluation

  • Fluency is generally reflected by the stutter (lag) rate; stutter information mainly covers the number and duration of stutters. In live-streaming scenarios, a stutter is commonly defined as a frame rendering interval greater than 1 s, but subjective experiments show that viewers can already perceive lag at around 200 ms;
  • Stutter rate = sum(stutter durations > 200 ms) / call duration;
  • Stutter definitions in streaming media scenarios take a different approach;
  • Fluency evaluation principle (Android): obtain frame information from gfxinfo, then compute frame times and the stutter rate, as sketched below.
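A minimal sketch of the Android approach, assuming adb is available and that per-frame vsync timestamps can be read from adb shell dumpsys gfxinfo <package> framestats; the package name and the simple CSV parsing are assumptions.

import subprocess

PACKAGE = "com.example.videocall"  # hypothetical package name

def frame_intervals_ms(package: str) -> list[float]:
    """Pull framestats and return per-frame intervals in milliseconds."""
    out = subprocess.run(
        ["adb", "shell", "dumpsys", "gfxinfo", package, "framestats"],
        capture_output=True, text=True, check=True,
    ).stdout
    # framestats rows are CSV; assume column 0 is FLAGS and column 1 is
    # INTENDED_VSYNC in ns, and diff consecutive vsync timestamps.
    vsyncs = []
    for line in out.splitlines():
        cols = line.split(",")
        if len(cols) > 2 and cols[0].strip().isdigit():
            vsyncs.append(int(cols[1]))
    return [(b - a) / 1e6 for a, b in zip(vsyncs, vsyncs[1:])]

intervals = frame_intervals_ms(PACKAGE)
stutter_ms = sum(i for i in intervals if i > 200)  # stutter threshold: 200 ms
total_ms = sum(intervals)
print("stutter rate:", stutter_ms / total_ms if total_ms else 0.0)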

2.4 Network simulation tool

By simulating different network environments, we can verify on the one hand whether the performance indicators promised by the SDK are met, and on the other hand check audio and video quality under weak network conditions.

(1)QNET weak network testing tool

QNET is recommended: the cost is very low and the experience is very good. Just install the app on the device and log in with QQ. The weak-network simulation environment is stable and easy to set up.

(2) Network Emulator

Weak network testing tool from Microsoft: Network Emulator can impose limits on bandwidth, packet loss, delay, jitter, and other weak-network parameters, individually or in combination. Common weak-network test parameters:

(3) Facebook ATC

Build a weak-network test environment with ATC using Docker.

2.5 Easy to use testing framework

From: @rikiesxiao

  • scikit-image is mainly used for video and image algorithms such as PSNR and SSIM (see the sketch below)
  • QoSTestFramework is a testing framework that also integrates VMAF.
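A minimal sketch of the scikit-image route, assuming two decoded frames of equal size; the image file names are placeholders.

import numpy as np
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

ref = io.imread("reference_frame.png")    # placeholder paths
dist = io.imread("received_frame.png")

print("PSNR:", peak_signal_noise_ratio(ref, dist))
# channel_axis=-1 treats the last axis as the color channels (RGB).
print("SSIM:", structural_similarity(ref, dist, channel_axis=-1))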

3. Reference materials

UGC quality evaluation: evaluation objects include short video, live streaming, real-time video calls, etc.

3.1 SDK Performance Indicators

(1) Audio and video SDK performance indicators

Tencent data source: cloud.tencent.com/document/pr…

Push stream status data:

Get playback status data:

3.2 Video quality standards and algorithms

Objective evaluation of video quality is a method to quantify the degree of picture quality change (usually decline) when a video passes through a video transmission/processing system.

(1) Comparison of video evaluation algorithms

Index analysis:

  • PLCC: Pearson linear correlation coefficient, representing the linear correlation of the model.
  • SROCC: Spearman rank-order correlation coefficient, which measures the correlation between ranks and represents the monotonic (possibly nonlinear) correlation of the model.

Suppose there are two sets of sequences X and Y, the order of which is R(X) and R(Y), then SROCC(X, Y) = PLCC(R(X), R(Y)).
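The identity is easy to check numerically with SciPy; the sample sequences below are made up.

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 4.0, 3.5, 5.0])   # e.g. objective scores
y = np.array([1.2, 2.1, 3.9, 3.0, 5.5])   # e.g. subjective MOS

plcc, _ = stats.pearsonr(x, y)
srocc, _ = stats.spearmanr(x, y)

# SROCC(X, Y) == PLCC(R(X), R(Y)): Pearson correlation of the ranks.
ranks_plcc, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(plcc, srocc, ranks_plcc)  # srocc and ranks_plcc agree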

(2) Tencent conference open source DVQA

Real-time video full reference quality evaluation algorithm based on deep learning developed for Tencent conference scenario.

Tencent Conference designed a new network using deep learning to automatically learn features relevant to video quality, then trained it on PGC datasets to produce a general model.

  • Features:
    • Sufficient accuracy and distinction to measure codec performance;
    • Unlike PSNR, SSIM, MS-SSIM, and VMAF, it is not based purely on image quality assessment;
    • Using deep learning to automatically learn quality-related features;
    • Use transfer learning to extend existing models to new scenarios;
  • Git address: github.com/Tencent/DVQ…
  • Tencent Conference's online video quality scoring platform (for collecting subjective video data): mos.medialab.qq.com

Tencent Video has also developed an end-to-end automatic quality evaluation system; this is the overall framework. Its strategy is relatively straightforward: the sending end plays the source video; after passing through a network with controllable impairment, the receiving end captures the picture rendered by the conference client, and the quality score is computed by comparing that capture against the source video from the sending end. Absolute indicators such as the performance and bit rate mentioned above can be measured directly, while resistance depends more on which network conditions produce a particularly bad experience. Delay, stutter, audio-video synchronization, and frame rate can all be obtained by comparing these two videos.

(3) Netflix VMAF

  • The official introduction
  • Git:github.com/Netflix/vma…
  • VMAF installation guide and MAC Installation guide
  • Principle: Video Multimethod Assessment Fusion (VMAF) predicts subjective quality by combining multiple elementary quality metrics. The rationale is that each elementary metric has its own strengths and weaknesses with respect to source content characteristics, artifact types, and degree of distortion. By fusing the elementary metrics with a machine learning algorithm (in our case a support vector machine (SVM) regressor) that assigns a weight to each of them, the final metric preserves the strengths of the individual metrics and delivers a more accurate final score.
  • Development notes: feature extraction in the VDK core (including elementary metric calculation) is computationally intensive, so it is written in C for efficiency; the control code is written in Python for rapid prototyping.
  • Basic indicators for reference:
    • Visual Information Fidelity (VIF): a widely adopted image quality metric, premised on the idea that image quality is complementary to a measure of information fidelity loss. In its original form, the VIF score is a fidelity-loss measure combining four scales. In VMAF, a modified version of VIF is adopted in which the fidelity loss at each scale is used as an elementary metric.
    • Detail Loss Metric (DLM): an image quality metric based on the principle of separately measuring the loss of detail that affects content visibility and the redundant impairment that distracts the viewer. The original metric combines DLM and the Additive Impairment Measure (AIM) into a final score; VMAF uses only DLM as an elementary metric. Special attention is paid to corner cases, such as black frames, where the numerical computation of the original formula breaks down.
    • Motion: a simple measure of the temporal difference between adjacent frames, computed as the average absolute pixel difference of the luminance component.
    • VMAF basic usage
      • Run in single mode: run_vmaf.py
      • Command format:
PYTHONPATH=python ./python/vmaf/script/run_vmaf.py format width height reference_path distorted_path [--out-fmt output_format]
  • Command parsing:

format can be one of: yuv420p, yuv422p, yuv444p (8-bit YUV) or yuv420p10le, yuv422p10le, yuv444p10le (10-bit little-endian YUV). width and height are the video's width and height, in pixels.

  • Result parsing

This produces the following JSON output:

"aggregate": {
    "VMAF_feature_adm2_score": 0.93458780776205741,
    "VMAF_feature_motion2_score": 3.8953518541666665,
    "VMAF_feature_vif_scale0_score": 0.36342081156994926,
    "VMAF_feature_vif_scale1_score": 0.76664738784617292,
    "VMAF_feature_vif_scale2_score": 0.86285338927816291,
    "VMAF_feature_vif_scale3_score": 0.91597186913930484,
    "VMAF_score": 76.699271371151269,
    "method": "mean"
}
  • VMAF_score is the final score; the others are VMAF's elementary metric scores.
  • ADM2 and VIF_scalex scores range from 0 (worst) to 1 (best).
  • Motion2 scores range from 0 (stationary) to 20 (high-speed motion).
  • Run in batch mode: run_vmaf_in_batch.py
  • The command-line tool ffmpeg2vmaf accepts compressed video streams as input.

3.3 Audio quality standards

PESQ and POLQA are both industry-recognized speech quality evaluation algorithms.

(1) Audio evaluation dimensions

A. Absolute Category Rating / Mean Opinion Score (MOS)

Generally, a MOS of 4 or higher can be considered relatively good voice quality; a MOS below 3.6 means most subjects are not satisfied with the voice quality. General requirements for a MOS test:

  1. A sufficiently diverse sample (i.e., number of listeners and sentences) to ensure statistically significant results;
  2. A controlled experimental environment and equipment kept consistent across listeners;
  3. Every listener following the same assessment criteria.

B. Degradation Category Rating (DCR)

C. Comparative Category Rating (CCR)

(2) Audio evaluation algorithm

A. PESQ (python-pesq)

  • Git:github.com/vBaiCai/pyt…
  • Function: given the original file and the file to be evaluated, it outputs a PESQ score between -0.5 and 4.5; the higher the score, the better the voice quality.
  • The PESQ algorithm takes a noisy, degraded signal and the original reference signal. After level adjustment, input filtering, time alignment and compensation, and an auditory transform of the two speech signals, parameters are extracted from each signal, and the time and frequency characteristics are combined into a PESQ score, which is mapped to the subjective mean opinion score (MOS). (See the usage sketch in section 2.2.)

B. Segmental SNR (SegSNR)

Because speech is a short-time stationary signal that changes slowly, the signal-to-noise ratio differs across time segments. To address this shortcoming of a single global SNR, the segmental SNR can be used: the SNR is computed per short frame and then averaged, as sketched below.
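A minimal NumPy sketch of segmental SNR under common assumptions (frame-wise SNR in dB, clipped to [-10, 35] dB as is conventional, then averaged); the signals are placeholders and must be time-aligned and of equal length.

import numpy as np

def seg_snr(clean: np.ndarray, processed: np.ndarray,
            frame_len: int = 256,
            snr_min: float = -10.0, snr_max: float = 35.0) -> float:
    """Segmental SNR: mean of per-frame SNRs (dB), clipped to a sane range."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        e = s - processed[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        num, den = np.sum(s ** 2), np.sum(e ** 2)
        if num == 0 or den == 0:
            continue  # skip silent or error-free frames
        snrs.append(np.clip(10 * np.log10(num / den), snr_min, snr_max))
    return float(np.mean(snrs)) if snrs else 0.0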

C. Logarithmic likelihood ratio measure (LLR)

The LLR measure is computed via linear predictive (LPC) analysis of the speech signal. It is based on the difference between two sets of linear prediction parameters, derived from synchronized frames of the original and processed speech respectively. LLR can be regarded as a kind of Itakura distance (IS), but the IS distance also takes the model gain into account, whereas LLR ignores the amplitude offset caused by model gain and focuses on the similarity of the overall spectral envelope.

D. Logarithmic spectral Distance (LSD)

E. Short-term objective intelligibility (STOI)

In the range of 0-1, the greater the value, the higher the intelligibility.

F. Weighted spectral slope measure (WSS)

A smaller WSS value means less distortion; the smaller, the better.

G. Perceptual Objective Listening Quality Assessment (POLQA)

POLQA is a full-reference (FR) algorithm that rates a degraded or processed speech signal against the original signal. It compares each sample of the reference signal (talker side) with each corresponding sample of the degraded signal (listener side), and scores the perceived difference between the two signals. POLQA's quality assessment covers intelligibility, stutter, and other aspects of auditory perception. Because it is a parametric algorithm, it is not suitable for evaluating scenes where the voice changes. Besides the quality score itself, the stability of the sound quality also has a large impact on perceived listening quality.

4. Audio and video processing tool FFmpeg

(1) Checking the bit rate

ffmpeg -i  /Users/lizhen/Downloads/mask.mp4  -hide_banner

Output:

ffmpeg version 4.3.1 Copyright (c) 2000-2020 the FFmpeg developers
  built with Apple clang version 11.0.3 (clang-1103.0.32.62)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.3.1 --enable-shared --enable-pthreads --enable-version3 --enable-avresample --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librtmp --enable-libspeex --enable-libsoxr --enable-videotoolbox --disable-libjack --disable-indev=jack
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
  libpostproc    55.  7.100 / 55.  7.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/Users/lizhen/Downloads/mask.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf57.41.100
  Duration: 00:06:19.39, start: 0.000000, bitrate: 142 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 640x360 [SAR 1:1 DAR 16:9], 113 kb/s, 25 fps, 25 tbr, 12800 tbn, 50 tbc (default)
    Metadata:
      handler_name    : VideoHandler
    Stream #0:1(und): Audio: aac (HE-AAC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 24 kb/s (default)
    Metadata:
      handler_name    : SoundHandler

Command:

ffprobe -print_format json  -show_streams https://madialab-storage-1256380422.cos.ap-guangzhou.myqcloud.com/test/SHARPROUND1ST/SRC0002_720x1280_30_0.mp4

Output:

ffprobe version 4.3.1 Copyright (c) 2007-2020 the FFmpeg developers
  built with Apple clang version 11.0.3 (clang-1103.0.32.62)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.3.1 --enable-shared --enable-pthreads --enable-version3 --enable-avresample --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librtmp --enable-libspeex --enable-libsoxr --enable-videotoolbox --disable-libjack --disable-indev=jack
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
  libpostproc    55.  7.100 / 55.  7.100
{
    "streams": [
        {
            "index": 0,
            "codec_name": "h264",
            "codec_long_name": "H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10",
            "profile": "High",
            "codec_type": "video",
            "codec_time_base": "1/50",
            "codec_tag_string": "avc1",
            "codec_tag": "0x31637661",
            "width": 640,
            "height": 360,
            "coded_width": 640,
            "coded_height": 368,
            "closed_captions": 0,
            "has_b_frames": 2,
            "sample_aspect_ratio": "1",
            "display_aspect_ratio": "16:9",
            "pix_fmt": "yuv420p",
            "level": 30,
            "chroma_location": "left",
            "refs": 1,
            "is_avc": "true",
            "nal_length_size": "4",
            "r_frame_rate": "25/1",
            "avg_frame_rate": "25/1",
            "time_base": "1/12800",
            "start_pts": 0,
            "start_time": "0.000000",
            "duration_ts": 4855808,
            "duration": "379.360000",
            "bit_rate": "113664",
            "bits_per_raw_sample": "8",
            "nb_frames": "9484",
            "disposition": {
                "default": 1, "dub": 0, "original": 0, "comment": 0,
                "lyrics": 0, "karaoke": 0, "forced": 0, "hearing_impaired": 0,
                "visual_impaired": 0, "clean_effects": 0, "attached_pic": 0,
                "timed_thumbnails": 0
            },
            "tags": {
                "language": "und",
                "handler_name": "VideoHandler"
            }
        },
        {
            "index": 1,
            "codec_name": "aac",
            "codec_long_name": "AAC (Advanced Audio Coding)",
            "profile": "HE-AAC",
            "codec_type": "audio",
            "codec_time_base": "1/44100",
            "codec_tag_string": "mp4a",
            "codec_tag": "0x6134706d",
            "sample_fmt": "fltp",
            "sample_rate": "44100",
            "channels": 2,
            "channel_layout": "stereo",
            "bits_per_sample": 0,
            "r_frame_rate": "0/0",
            "avg_frame_rate": "0/0",
            "time_base": "1/44100",
            "start_pts": 0,
            "start_time": "0.000000",
            "duration_ts": 16726028,
            "duration": "379.275011",
            "bit_rate": "24001",
            "max_bit_rate": "24001",
            "nb_frames": "8170",
            "disposition": {
                "default": 1, "dub": 0, "original": 0, "comment": 0,
                "lyrics": 0, "karaoke": 0, "forced": 0, "hearing_impaired": 0,
                "visual_impaired": 0, "clean_effects": 0, "attached_pic": 0,
                "timed_thumbnails": 0
            },
            "tags": {
                "language": "und",
                "handler_name": "SoundHandler"
            }
        }
    ]
}

5. Audio and video parameters

(1) Video

  • Frame rate: frame rate affects video quality far more than resolution and QP do.
  • Resolution: the size of the image in each frame. For 640*480 video, a video bit rate above 700 kbps is recommended, with an audio sample rate of 44100 Hz. A file with a 128 Kbps audio bit rate and an 800 Kbps video bit rate has a total bit rate of 928 Kbps, meaning each second of encoded data is represented with 928K bits.
  • QP: quantization parameter, reflecting how spatial detail is compressed. The smaller the value, the finer the quantization, the higher the image quality, and the longer the generated bitstream.
  • Performance:
  • Bit rate: the number of bits of data transmitted per unit of time, usually in Kbps (kilobits per second). Intuitively it is like a sampling rate: the more bits per unit of time, the higher the precision, and the closer the processed data is to the original.
  • Resistance:
  • Delay: an important metric of network transmission, measuring the time it takes data to travel from one endpoint to another, usually in milliseconds. Delay is also commonly called latency, although latency sometimes refers to the round-trip time from sending a packet to receiving its response; that round-trip time can be measured with network monitoring tools as the difference between the send and receive timestamps. The one-way time is the delay.
  • Jitter: because of packet size, route selection, and many other factors, packet delays cannot be guaranteed to be consistent; the delay variation between packets is called jitter. In other words, because the packet delay fluctuates up and down, we call it jitter.
  • Stutter:
  • Audio-video synchronization:
  • YUV video formats (YUV420 is commonly used on Android): the stream output by a typical video capture chip is generally a YUV data stream, and subsequent video processing also encodes and parses YUV data (a frame-size sketch follows this list).
    • YUV444: the 4 pixels keep 4 Y, 4 U, and 4 V values; no data is discarded.
    • YUV422: the 4 pixels keep 4 Y, 2 U, and 2 V values; odd pixels discard V and even pixels discard U.
    • YUV420: data is discarded in both the horizontal and vertical directions.
      • Sampling method: even pixels discard UV; on top of that, odd rows further discard V and even rows further discard U.
    • Play YUV: ffplay -video_size 1080x2220 mini_yuvj420p_1080_2220.yuv
    • MP4 to YUV: ffmpeg -i mp4_file yuv_file
    • Change MP4 resolution: ffmpeg -i clean_mp4 -vf scale=1080:1024 denoised_mp4 -hide_banner
    • VMAF on raw YUV: python run_vmaf.py yuv420p 1080 2220 demo.yuv demo.yuv --out-fmt json
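As a concrete note on YUV420 sizes: each frame occupies width * height * 3/2 bytes (full-resolution Y plus quarter-resolution U and V planes). A minimal sketch for reading one frame from a raw .yuv file; the file name and dimensions are placeholders.

import numpy as np

width, height = 1080, 2220            # placeholder dimensions
frame_size = width * height * 3 // 2  # YUV420: Y (w*h) + U (w*h/4) + V (w*h/4)

with open("demo.yuv", "rb") as f:     # placeholder file
    raw = np.frombuffer(f.read(frame_size), dtype=np.uint8)

y = raw[:width * height].reshape(height, width)
u = raw[width * height:width * height * 5 // 4].reshape(height // 2, width // 2)
v = raw[width * height * 5 // 4:].reshape(height // 2, width // 2)
print(y.shape, u.shape, v.shape)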

(2) Audio parameters (Resources)

  • Sample rate: the number of samples taken by the recording device per unit of time, in Hz (hertz). The higher the sample rate, the wider the representable frequency range.

Some common audio sample rates:

  • 8 kHz: telephone
  • 22.05 kHz: radio broadcast
  • 44.1 kHz: audio CD, also commonly used for MPEG-1 audio (VCD, SVCD, MP3)
  • 48 kHz: miniDV, digital television, DVD, DAT, film, and professional digital audio

  • Sampling depth (bit depth, sample format, sample size, sample width): the number of binary bits the capture card uses per sample when recording and playing sound, i.e., the bits contained in each sample; commonly 8 or 16 bits.
  • Channel refers to the number of sound channels used by the acquisition card when collecting, which is divided into Mono and two-channel/Stereo.
  • Bit rate, also known as bitrate, refers to the number of bits transmitted per second (bps). The higher the bit rate, the faster the data transfer. For sound, the bit rate is the amount of binary data per unit of time after the analog audio signal has been converted to a digital one.

Its calculation formula is: bit rate = sampling frequency * sampling bits * number of channels
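For example, CD-quality stereo audio: 44100 Hz * 16 bits * 2 channels = 1,411,200 bps, or about 1411.2 kbps.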