Speech activity detection algorithms can be roughly divided into three categories. The first is the simplest: threshold-based discrimination, which was covered in the earlier post on speech activity detection. The second is the GMM-based method used by WebRTC. The third is the deep-learning-based method, also covered earlier, for example using an LSTM for endpoint detection. Without further ado, let's get down to business.

I. Introduction

WebRTC VAD supports sampling rates of 8/16/32/48 kHz, but the input is always resampled to 8 kHz internally for the actual computation. The frame length can be 10, 20, or 30 ms (80, 160, or 240 samples at 8 kHz). The VAD has the following four modes, representing normal, low-bitrate, aggressive, and very aggressive operation respectively. The Gaussian mixture model parameters and decision thresholds differ between modes.

    enum Aggressiveness {
        kVadNormal = 0,
        kVadLowBitrate = 1,
        kVadAggressive = 2,
        kVadVeryAggressive = 3
    };
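
For orientation, here is a minimal usage sketch of the public API. Note that the exact signatures have changed across WebRTC versions (older releases create the instance via an output parameter), so treat this as illustrative rather than definitive:

    #include <stdint.h>
    #include <stdio.h>

    #include "webrtc_vad.h"  // common_audio/vad/include/webrtc_vad.h

    int main(void) {
        // One 10 ms frame at 16 kHz = 160 samples; fill with real PCM data.
        int16_t frame[160] = {0};

        VadInst *handle = WebRtcVad_Create();
        if (handle == NULL || WebRtcVad_Init(handle) != 0) {
            return -1;
        }

        // Aggressiveness 0..3, matching the enum above (2 = kVadAggressive).
        if (WebRtcVad_set_mode(handle, 2) != 0) {
            return -1;
        }

        // Returns 1 for speech, 0 for non-speech, -1 on error.
        int is_speech = WebRtcVad_Process(handle, 16000, frame, 160);
        printf("VAD result: %d\n", is_speech);

        WebRtcVad_Free(handle);
        return 0;
    }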

WebRTC uses a GMM statistical model to make the VAD decision, dividing the 0-4 kHz range into six frequency bands: 80-250 Hz, 250-500 Hz, 500 Hz-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. The subband energies of these bands are used as the GMM features.
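
Concretely, the feature for each band is the log energy of the subband signal. As a simplified floating-point illustration (the actual WebRtcVad_CalculateFeatures is fixed point, with offsets and clamping, and derives the subbands with the splitting filters described in Section III), the per-band feature is conceptually:

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    // Illustrative sketch only: the log energy of one subband's samples,
    // which is conceptually what is fed to the GMM for that band.
    static float SubbandLogEnergy(const int16_t *subband, size_t length) {
        float energy = 0.0f;
        for (size_t i = 0; i < length; i++) {
            energy += (float)subband[i] * (float)subband[i];
        }
        return log2f(energy + 1.0f);  // +1 avoids log2(0) on silence.
    }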

II. Initialization

Talk is cheap, so let's look directly at the code. The WebRtcVad_InitCore function initializes the following:

  • Initial VAD state: the state is set to speech present (vad = 1). As a result, the VAD output for the first frames of some samples is 1.

  • Parameters related to hang-over.

  • The coefficients of the downsampling filter; as mentioned above, whatever the input sampling rate, the signal is eventually downsampled to 8 kHz for processing.

  • The means and standard deviations of the speech and noise GMMs; kTableSize = 12 corresponds to 2 Gaussian components for each of the 6 subbands in each model, and the values are stored in Q7 format.

  • Minimum value vectors, used for noise tracking.

  • Splitting filter states; WebRTC performs the subband division of the speech with splitting filters rather than an FFT.

  • The aggressiveness mode, which defaults to kVadNormal and can be changed with the WebRtcVad_set_mode function.

    int WebRtcVad_InitCore(VadInstT *self) {
      int i;

      if (self == NULL) {
        return -1;
      }

      // Initialization of general struct variables.
      self->vad = 1;  // Speech active (=1).
      self->frame_counter = 0;
      self->over_hang = 0;
      self->num_of_speech = 0;

      // Initialization of downsampling filter state.
      memset(self->downsampling_filter_states, 0,
             sizeof(self->downsampling_filter_states));

      // Initialization of 48 to 8 kHz downsampling.
      WebRtcSpl_ResetResample48khzTo8khz(&self->state_48_to_8);

      // Read initial PDF parameters.
      for (i = 0; i < kTableSize; i++) {
        self->noise_means[i] = kNoiseDataMeans[i];
        self->speech_means[i] = kSpeechDataMeans[i];
        self->noise_stds[i] = kNoiseDataStds[i];
        self->speech_stds[i] = kSpeechDataStds[i];
      }

      // Initialize Index and Minimum value vectors.
      for (i = 0; i < 16 * kNumChannels; i++) {
        self->low_value_vector[i] = 10000;
        self->index_vector[i] = 0;
      }

      // Initialize splitting filter states.
      memset(self->upper_state, 0, sizeof(self->upper_state));
      memset(self->lower_state, 0, sizeof(self->lower_state));

      // Initialize high pass filter states.
      memset(self->hp_filter_state, 0, sizeof(self->hp_filter_state));

      // Initialize mean value memory, for WebRtcVad_FindMinimum().
      for (i = 0; i < kNumChannels; i++) {
        self->mean_value[i] = 1600;
      }

      // Set aggressiveness mode to default (=|kDefaultMode|).
      if (WebRtcVad_set_mode_core(self, kDefaultMode) != 0) {
        return -1;
      }

      self->init_flag = kInitCheck;
      return 0;
    }

III. VAD Decision

The following describes the VAD processing flow of WebRTC (WebRtcVad_Process). The specific steps are as follows:

  1. Some basic checks (WebRtcVad_ValidRateAndFrameLength) are performed: whether the VAD struct has been initialized and whether the sampling rate and frame length are valid.

  2. The speech is downsampled to 8 kHz, and WebRTC does not do this in one step. Taking 48 kHz to 8 kHz as an example (WebRtcSpl_Resample48khzTo8khz): the 48 kHz data is first downsampled to 24 kHz, then low-pass filtered at 24 kHz (this step does not change the sampling rate), then downsampled from 24 kHz to 16 kHz, and finally from 16 kHz to 8 kHz.

  3. After the 8 kHz speech data is obtained, the energy of each subband is computed as the GMM features (WebRtcVad_CalculateFeatures). The 0-4 kHz data is first split into 0-2 kHz and 2-4 kHz, and the 2-4 kHz part is then split into 2-3 kHz and 3-4 kHz. The 0-2 kHz part is split into 0-1 kHz and 1-2 kHz, and the 0-1 kHz part is further split into 0-250 Hz and 250-500 Hz. Finally, an 80 Hz high-pass filter is applied to the 0-250 Hz part to obtain the 80-250 Hz band. This yields the six subbands 80-250 Hz, 250-500 Hz, 500 Hz-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz, whose energies are computed as the GMM features. In addition, a total_energy is computed, which is used later in the GMM computation (WebRtcVad_GmmProbability).

  4. Next comes the GMM computation; for the principles of GMMs, see the earlier post "Implementing Machine Learning Algorithms from Scratch (19): Gaussian Mixture Models". Before the computation, different decision thresholds are selected according to the speech frame length.

  5. First, check whether the total_energy computed in the previous step is greater than the energy threshold kMinEnergy. If so, process the current frame; otherwise, set vad_flag to 0.

  6. For each subband, the Gaussian probability (WebRtcVad_GaussianProbability) is computed and multiplied by the subband's weight to obtain the final speech/noise probability. To simplify the computation, WebRTC assumes that the speech and noise Gaussian components are independent. (A simplified sketch of steps 6-8 follows this list.)

  7. The log likelihood ratio (LLR) is computed for each subband; each subband's ratio is compared with a threshold as a local VAD decision, while the weighted sum of the log likelihood ratios over all subbands is compared with another threshold as the global VAD decision. If either the local or the global decision indicates speech, the current frame is classified as a speech frame.

  8. The result is smoothed using a hangover scheme.
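
To make steps 6-8 concrete, the sketch below mimics the decision logic in simplified floating point. The real WebRtcVad_GmmProbability is fixed point, uses mode- and frame-length-dependent Q-format thresholds, and interleaves the parameter updates of the next section; all names and constants here are illustrative placeholders:

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define kNumChannels 6   // Six subbands.
    #define kNumGaussians 2  // Two Gaussians per subband and model.

    // Illustrative Gaussian PDF; WebRtcVad_GaussianProbability computes a
    // fixed-point equivalent.
    static double Gaussian(double x, double mean, double std) {
        double d = (x - mean) / std;
        return exp(-0.5 * d * d) / (std * sqrt(2.0 * M_PI));
    }

    // Simplified local/global LLR decision with hangover smoothing. The
    // thresholds and hangover length are made-up placeholders; in WebRTC
    // they depend on the mode and the frame length.
    static int VadDecision(const double feature[kNumChannels],
                           const double mean[2][kNumChannels][kNumGaussians],
                           const double std[2][kNumChannels][kNumGaussians],
                           const double weight[2][kNumChannels][kNumGaussians],
                           int *over_hang) {
        enum { kNoise = 0, kSpeech = 1 };
        const double kLocalThreshold = 3.0;   // Placeholder.
        const double kGlobalThreshold = 9.0;  // Placeholder.
        const int kOverHangMax = 6;           // Placeholder.

        double global_llr = 0.0;
        int vad = 0;
        for (int ch = 0; ch < kNumChannels; ch++) {
            double p[2] = {0.0, 0.0};
            for (int m = kNoise; m <= kSpeech; m++) {
                for (int g = 0; g < kNumGaussians; g++) {
                    p[m] += weight[m][ch][g] *
                            Gaussian(feature[ch], mean[m][ch][g], std[m][ch][g]);
                }
            }
            double llr = log(p[kSpeech] / p[kNoise]);
            if (llr > kLocalThreshold) {
                vad = 1;  // Local decision: one subband can trigger speech.
            }
            global_llr += llr;
        }
        if (global_llr > kGlobalThreshold) {
            vad = 1;  // Global decision over all subbands.
        }

        // Hangover: keep reporting speech briefly after it seems to end.
        if (vad) {
            *over_hang = kOverHangMax;
        } else if (*over_hang > 0) {
            (*over_hang)--;
            vad = 1;
        }
        return vad;
    }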

IV. Parameter Update

WebRTC’s VAD is adaptive because it updates the GMM parameters after the VAD decision.

  1. Calculate the local (per-subband) speech and noise probabilities used to update the GMM parameters.

  2. Track the minimum of each subband feature (WebRtcVad_FindMinimum). For each feature the function maintains the 16 smallest values seen over the last 100 frames; each of these minima has an age, capped at 100. If the current feature value qualifies as one of the 16 minima, the list is updated, and the median of the 5 smallest values is returned. This minimum is used later to update the noise model. (A simplified sketch follows this list.)

  3. Update the GMM parameters, i.e., the means and variances of the speech/noise models; the noise mean is only updated when the current frame is classified as non-speech.

  4. When the speech Gaussian model and the noise Gaussian model become too similar, push them apart.
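
A simplified sketch of the minimum tracking in step 2 is shown below. The real WebRtcVad_FindMinimum is maintained per subband feature, smooths the returned value, and works on Q-format data; the structure here only illustrates the 16-minima / median-of-5 bookkeeping, and all names are illustrative:

    #include <stdint.h>

    #define kMaxAge 100    // A stored minimum expires after 100 frames.
    #define kNumMinima 16  // The 16 smallest values of the last 100 frames.

    // Initialize low_value entries to a large value before first use
    // (WebRTC initializes its low_value_vector to 10000).
    typedef struct {
        int16_t low_value[kNumMinima];  // Kept sorted in ascending order.
        int16_t age[kNumMinima];
    } MinTracker;

    // Illustrative only: update the tracker with the current feature value
    // and return the median of the 5 smallest values as the noise estimate.
    static int16_t FindMinimum(MinTracker *t, int16_t feature) {
        int i, j, kept = 0;

        // Age all stored minima, then compact away the expired ones.
        for (i = 0; i < kNumMinima; i++) {
            t->age[i]++;
        }
        for (i = 0; i < kNumMinima; i++) {
            if (t->age[i] <= kMaxAge) {
                t->low_value[kept] = t->low_value[i];
                t->age[kept] = t->age[i];
                kept++;
            }
        }
        for (; kept < kNumMinima; kept++) {
            t->low_value[kept] = INT16_MAX;
            t->age[kept] = 0;
        }

        // Insert the current value if it belongs among the 16 minima.
        for (i = 0; i < kNumMinima; i++) {
            if (feature < t->low_value[i]) {
                for (j = kNumMinima - 1; j > i; j--) {
                    t->low_value[j] = t->low_value[j - 1];
                    t->age[j] = t->age[j - 1];
                }
                t->low_value[i] = feature;
                t->age[i] = 0;
                break;
            }
        }

        // The list is sorted, so the median of the 5 smallest is index 2.
        return t->low_value[2];
    }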

V. Conclusion

WebRTC VAD gives good results at high SNR, but its detection quality degrades as the SNR decreases. In addition, WebRTC's VAD implementation uses fixed-point and approximate arithmetic, which keeps resource usage low but makes the code difficult to read. This article has only analyzed the overall flow of WebRTC's VAD, leaving some of the details to the reader. The WebRTC VAD covers a lot of ground; please forgive any omissions.

The code for this article can be obtained by following the public account "Voice Algorithm Group" and clicking Code in the menu bar.

References:

[1] Practical Guide for Real-Time Speech Processing.

[2] J. Sohn, N. S. Kim, and W. Sung, "A Statistical Model-Based Voice Activity Detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, 1999.