1. MFCC

MFCC (Mel-Frequency Cepstral Coefficients): The Mel frequency scale is based on the auditory characteristics of the human ear, and it has a nonlinear correspondence with frequency in Hz. Mel-frequency cepstral coefficients are spectral features computed by exploiting this relationship. They are mainly used for feature extraction and dimensionality reduction of voice data. For example, if a frame has 512 dimensions (sampling points), MFCC can extract the most important 40 dimensions (generally speaking), achieving dimensionality reduction. MFCC extraction generally goes through the following steps: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel filter banks, and discrete cosine transform (DCT). The most important of these are the FFT and the Mel filter banks, which perform the main reduction work.

1. Pre-emphasis. The sampled digital speech signal s(n) is passed through a high-pass filter:

H(z) = 1 - a·z^(-1)

where a is generally about 0.95. The signal after pre-emphasis is:

s'(n) = s(n) - a·s(n-1)

The purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter and stays flat over the whole band from low to high frequency, allowing the spectrum to be computed with the same signal-to-noise ratio throughout. At the same time, it eliminates the effect of the vocal cords and lips during speech production, compensates for the high-frequency part of the speech signal that is suppressed by the articulation system, and highlights the high-frequency formants.
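
As a minimal sketch (the input signal and variable names here are illustrative stand-ins, not part of the original code), pre-emphasis is just a first-order high-pass FIR filter:

% Pre-emphasis sketch: s'(n) = s(n) - a*s(n-1)
a = 0.95;                   % pre-emphasis coefficient, generally about 0.95
s = randn(16000, 1);        % stand-in for one second of speech at 16 kHz
y = filter([1 -a], 1, s);   % H(z) = 1 - a*z^(-1)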

2. Framing. To facilitate speech analysis, the speech is divided into small segments, called frames. First, N sampling points are gathered into one observation unit, called a frame. Generally N is 256 or 512, covering about 20 to 30 ms. To avoid excessive change between two adjacent frames, there is an overlap region of M sample points between them; usually M is about 1/2 or 1/3 of N. The sampling frequency of speech signals used for speech recognition is generally 8 kHz or 16 kHz. At 8 kHz, a frame length of 256 sampling points corresponds to a duration of 256/8000 × 1000 = 32 ms.
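
A sketch of the framing step under the figures above (N = 256 with 1/2 overlap; the input signal is a stand-in, and the loop-based layout is just one way to do it):

N = 256;                          % frame size: 256 samples = 32 ms at 8 kHz
M = 128;                          % overlap between adjacent frames (N/2)
step = N - M;                     % hop between frame starts
y = randn(8000, 1);               % stand-in for the pre-emphasized signal
numFrames = floor((length(y) - N) / step) + 1;
frames = zeros(N, numFrames);     % one column per frame
for t = 1:numFrames
    frames(:, t) = y((t-1)*step + (1:N));
end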

3. Windowing. Speech keeps changing over long stretches and cannot be processed as if it had fixed characteristics, so each frame is substituted into a window function and the values outside the window are set to 0, the aim being to eliminate the signal discontinuities that could otherwise appear at the two ends of each frame. Common window functions include the rectangular window, the Hamming window and the Hanning window; based on the frequency-domain characteristics of these window functions, the Hamming window is most often used.

Each frame is multiplied by the Hamming window to increase the continuity between its left and right ends. Assume the signal after framing is S(n), n = 0, 1, …, N-1, where N is the frame size; after multiplying by the Hamming window we have S'(n) = S(n) × W(n), where W(n) has the following form:

W(n, a) = (1 - a) - a·cos(2πn/(N-1)), 0 ≤ n ≤ N-1

Different values of a produce different Hamming windows; in general, a is taken as 0.46.
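
A sketch of the window itself; with a = 0.46 this formula reproduces MATLAB's own hamming(N), and the framed signal here is a stand-in:

N = 256;
a = 0.46;                                   % standard Hamming coefficient
n = (0:N-1)';
w = (1 - a) - a * cos(2*pi*n / (N-1));      % W(n, a), same as hamming(N)
frames = randn(N, 61);                      % stand-in framed signal
framesW = frames .* repmat(w, 1, size(frames, 2));   % window every frame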

4. Fast Fourier transform. Since the characteristics of a signal are usually hard to see from its time-domain waveform, it is usually converted into an energy distribution in the frequency domain; different energy distributions represent the characteristics of different speech sounds. So after multiplication by the Hamming window, each frame also goes through a fast Fourier transform to obtain its energy distribution over the spectrum. The spectrum of each frame is obtained by applying the FFT to the windowed frame signal, and the power spectrum of the speech signal is then obtained by taking the squared modulus of that spectrum. The DFT of the speech signal is:

X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N-1

where x(n) is the input speech signal and N is the number of points of the Fourier transform.

Here we first need to introduce the Nyquist frequency. The Nyquist frequency is half of the sampling frequency of a discrete signal system, named after Harry Nyquist and the Nyquist-Shannon sampling theorem. The sampling theorem states that aliasing can be avoided as long as the Nyquist frequency of the discrete system is higher than the highest frequency (or bandwidth) of the sampled signal. In speech systems a sampling rate of 16 kHz is usually used, while the frequency of the human voice lies between 300 Hz and 3400 Hz. By definition, the Nyquist frequency is then 8 kHz, which is higher than the highest frequency of the human voice and thus satisfies the sampling theorem. Accordingly, the FFT output is kept only up to the Nyquist frequency, i.e. half the sampling rate. Specifically, if a frame has 512 sampling points and the Fourier transform also uses 512 points, the number of points kept after the FFT is 257 (N/2 + 1). These N/2 + 1 points represent the frequency components from 0 Hz to samplerate/2 Hz. In other words, the FFT not only transfers the signal from the time domain to the frequency domain and removes components above the highest frequency of the sampled signal, but also reduces the dimensionality.
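
A sketch of this step for one windowed frame (the frame is a stand-in), keeping only the N/2 + 1 points up to the Nyquist frequency:

N = 512;                         % frame size = FFT size
x = randn(N, 1);                 % stand-in for one windowed frame
X = fft(x, N);                   % N-point FFT
P = abs(X(1:N/2+1)).^2;          % power spectrum: 257 points, 0 Hz .. fs/2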

5. Mel filter banks. Because the human ear has different sensitivity to different frequencies, and the relationship is nonlinear, the spectrum is divided into multiple Mel filter banks according to the sensitivity of the human ear. Within the Mel scale, the center frequencies of the filters are linearly distributed with equal spacing, but they are not equally spaced on the Hz frequency axis. This is due to the conversion formula between frequency and Mel frequency, which is as follows:

mel(f) = 2595·log10(1 + f/700)

where the log is base 10 and f is the frequency in Hz.

The energy spectrum is passed through a set of Mel-scale triangular filter banks: define a filter bank with M filters (the number of filters is close to the number of critical bands), where the filters are triangular with center frequencies f(m), m = 1, 2, …, M. M is usually 22 to 26. The spacing between the f(m) narrows as m decreases and widens as m increases, as shown in the figure. The frequency response of the m-th triangular filter is:

H_m(k) = 0 for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0 for k > f(m+1)

In the formula, k refers to the index of a point after the FFT, i.e. 0 to 256 in the previous example, and f(m) likewise corresponds to a point index. The f(m) are determined as follows:

1. Determine the lowest frequency of the voice signal (generally 0 Hz), the highest frequency (generally 1/2 of the sampling rate), and the number of Mel filters.

2. Calculate the Mel frequencies corresponding to the lowest and highest frequencies.

3. Calculate the distance between the center frequencies of two adjacent Mel filters: (highest Mel frequency - lowest Mel frequency)/(number of filters + 1).

4. Convert each center Mel frequency back into an actual frequency.

5. Calculate the index of the FFT point corresponding to each actual frequency.

For example, suppose the sampling rate is 16 kHz, the lowest frequency is 0 Hz, the number of filters is 26, and the frame size is 512, so the number of Fourier transform points is also 512. Then the lowest Mel frequency is 0 and the highest is 2840.02. Substituting into the conversion formula between Mel frequency and actual frequency, the distance between adjacent center frequencies is (2840.02 - 0)/(26 + 1) = 105.19, which gives the Mel filter bank center frequencies [0, 105.19, 210.38, …, 2840.02]. This group of center frequencies is then converted back into actual frequencies (via the formula above; not listed here), and finally the index of the FFT point corresponding to each actual frequency is computed as: each frequency in the group / sampling rate × (number of Fourier transform points + 1). This yields the group of FFT point indices [0, 2, 4, 7, 10, 13, 16, …, 256], i.e. f(0), f(1), …, f(27). With this in hand, the log-energy output of each filter is calculated as:

s(m) = ln( Σ_{k=0}^{N-1} |X(k)|²·H_m(k) ), 1 ≤ m ≤ M

where M is the number of filters and N is the number of FFT points kept (257 in our example above). After this calculation, each frame of data has a dimension equal to the number of filters, reducing the dimensionality (to 26 dimensions in this case).
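
The whole five-step recipe can be sketched in a few lines; the constants match the worked example (16 kHz, 26 filters, 512-point FFT), while the helper functions mel/imel, the variable names, and the stand-in power spectrum are illustrative:

fs = 16000; numFilt = 26; nfft = 512;
mel  = @(f) 2595 * log10(1 + f/700);              % Hz -> Mel
imel = @(m) 700 * (10.^(m/2595) - 1);             % Mel -> Hz
melPts = linspace(mel(0), mel(fs/2), numFilt+2);  % 28 equally spaced Mel points
bins = floor((nfft + 1) * imel(melPts) / fs);     % FFT bin indices f(0)..f(27)
H = zeros(numFilt, nfft/2 + 1);                   % triangular filter bank
for m = 1:numFilt
    for k = bins(m):bins(m+1)                     % rising edge of filter m
        H(m, k+1) = (k - bins(m)) / (bins(m+1) - bins(m));
    end
    for k = bins(m+1):bins(m+2)                   % falling edge of filter m
        H(m, k+1) = (bins(m+2) - k) / (bins(m+2) - bins(m+1));
    end
end
P = abs(fft(randn(nfft, 1), nfft)).^2;            % stand-in power spectrum
logE = log(H * P(1:nfft/2+1));                    % 26 log filter-bank energies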

6. Discrete cosine transform. DCT is often used in signal processing and image processing for lossy compression of signals and images, because of its strong "energy compaction" property: after a DCT, most of the energy of natural signals (including sound and images) is concentrated in the low-frequency part. Here it in effect performs another dimensionality reduction on each frame of data. The logarithmic energy of each of the above filters is substituted into the discrete cosine transform to obtain the Mel-scale cepstral parameters of order L:

C(n) = Σ_{m=1}^{M} s(m)·cos(πn(m - 0.5)/M), n = 1, 2, …, L

where the order L of the MFCC coefficients is usually taken as 12-16, and M is the number of triangular filters.
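
A sketch of the DCT step, taking L = 13 coefficients from M = 26 log energies (the input values are stand-ins):

M = 26; L = 13;                        % number of filters, MFCC order
logE = randn(M, 1);                    % stand-in log filter-bank energies
m = (1:M)';
C = zeros(L, 1);
for n = 1:L
    C(n) = sum(logE .* cos(pi * n * (m - 0.5) / M));   % C(n) as in the formula
end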

The cepstral parameters of standard MFCC reflect only the static characteristics of the speech; the dynamic characteristics can be described by difference spectra of these static features. Experimental results show that combining dynamic and static features can effectively improve the recognition performance of a system. The difference parameters can be calculated by the following formula:

d_t = c_{t+1} - c_t, if t < K
d_t = (Σ_{k=1}^{K} k·(c_{t+k} - c_{t-k})) / sqrt(2·Σ_{k=1}^{K} k²), otherwise
d_t = c_t - c_{t-1}, if t ≥ Q - K

where d_t is the t-th first-order difference, c_t is the t-th cepstral coefficient, Q is the order of the cepstral coefficients, and K is the time span of the first derivative, which can be taken as 1 or 2. Substituting the result of the above equation back in yields the second-order difference parameters.

The complete MFCC feature is therefore composed of: N-dimensional MFCC parameters (N/3 MFCC coefficients + N/3 first-order differences + N/3 second-order differences) + frame energy (this can be swapped out according to demand). Frame energy, the volume (energy) of a frame, is another important speech feature and is very easy to calculate. One therefore usually adds a frame's logarithmic energy (defined as the sum of the squares of the signal within the frame, with the base-10 logarithm taken and multiplied by 10), which adds one dimension to the basic speech features of each frame: one logarithmic energy plus the cepstral parameters. To explain the 40 dimensions we started with: if the DCT order is 13, then after the first- and second-order differences we have 39 dimensions, and adding the frame energy gives 40; of course this can be adjusted dynamically according to the actual situation.
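
A sketch of the first-order difference with K = 2, applied frame by frame to a stand-in MFCC matrix; feeding D back through the same code yields the second-order differences:

C = randn(13, 100);                     % stand-in: 13 MFCCs x 100 frames
K = 2;                                  % time span of the derivative
T = size(C, 2);
D = zeros(size(C));
for t = 1:T
    if t <= K
        D(:, t) = C(:, t+1) - C(:, t);                 % head frames
    elseif t > T - K
        D(:, t) = C(:, t) - C(:, t-1);                 % tail frames
    else
        acc = zeros(size(C, 1), 1);
        for k = 1:K
            acc = acc + k * (C(:, t+k) - C(:, t-k));
        end
        D(:, t) = acc / sqrt(2 * sum((1:K).^2));       % interior frames
    end
end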

2. Source code

% ====== Load wave data and do feature extraction
clc, clear
waveDir = 'trainning\';
speakerData = dir(waveDir);
% dir(path) lists the subfolders and files in the given directory; it can
% also take a path or a wildcard, e.g. dir('G:\Matlab') or dir('*.m').
% It returns an array of structures with the fields:
%   name    -- file name
%   date    -- modification date
%   bytes   -- number of bytes allocated to the file
%   isdir   -- 1 if name is a directory, 0 if not
%   datenum -- modification date as a MATLAB serial date number
% Here it is used to collect the file names for reading and saving.
speakerData(1:2) = [];               % drop the '.' and '..' entries
speakerNum = length(speakerData);    % number of speakers

% ====== Feature extraction
fprintf('\n... ');
% cd('D:\MATLAB7\toolbox\dcpr\');
for i = 1:speakerNum
    fprintf('%s ', speakerData(i,1).name(1:end-4));
    [y, fs, nbits] = wavread(['trainning\' speakerData(i,1).name]);
    epInSampleIndex = epdByVol(y, fs);               % endpoint detection
    y = y(epInSampleIndex(1):epInSampleIndex(2));    % keep the voiced segment
    speakerData(i).mfcc = wave2mfcc(y, fs);
    fprintf('Done!\n');
end
% Since feature extraction is slow, save the data for future use
% as long as the features are not changed.
save speakerData speakerData;
graph_MFCC;
fprintf('\n');
clear all;
fprintf('Feature parameter extraction complete!\n\nPress any key to continue...');
pause;

% ====== GMM training
fprintf('\nTraining a Gaussian mixture model for every speaker...\n\n');
load speakerData.mat
gaussianNum = 12;                    % number of Gaussians in each GMM
speakerNum = length(speakerData);
for i = 1:speakerNum
    fprintf('\nTraining GMM %d: %s......\n', i, speakerData(i).name(1:end-4));
    [speakerGmm(i).mu, speakerGmm(i).sigm, speakerGmm(i).c] = ...
        gmm_estimate(speakerData(i).mfcc, gaussianNum);
    fprintf('Done!\n');
end
fprintf('\n');
save speakerGmm speakerGmm;
pause(10);
clear all;
fprintf('Gaussian mixture model training over!\n\nPress any key to continue...');
pause;

% ====== Recognition
fprintf('\n \n\n');
load speakerData;
load speakerGmm;
[filename, pathname] = uigetfile('*.wav', 'select a wave file to load');
if pathname == 0
    errordlg('ERROR! No file selected!');
    return;
end
wav_file = [pathname filename];
[testing_data, fs, nbits] = wavread(wav_file);
pause(10);
match = MFCC_feature_compare(testing_data, speakerGmm);
disp('Matching the model under test, please wait 10 seconds!')
pause(10);
[max_1, index] = max(match);
if length(filename) > 7
    fprintf('\n\n\nThe speaker is %s.\n', speakerData(index).name(1:end-4));
else
    fprintf('\n\n\nThe speaker is %s.\n', filename(1:end-4));
end

3. Operation results