In any automatic speech recognition (ASR) system, the first step is feature extraction. In other words, we need to extract the components of the audio signal that identify the linguistic content and discard the rest of the clutter, such as background noise, emotion, and so on.

Knowing how speech is produced helps a lot in understanding it. People produce sound through the vocal tract, and the shape of the vocal tract determines what sound comes out. That shape is formed by the tongue, teeth, and so on. If we could determine this shape accurately, we could describe the resulting phoneme accurately. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCCs are features that accurately describe this envelope.

MFCCs (Mel Frequency Cepstral Coefficients) are features widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in 1980. Since then, among hand-crafted features for speech recognition, MFCCs have stood out and never really been surpassed (feature learning with deep learning is another story).

So we have one very important keyword: the shape of the vocal tract. We know it matters, and that it shows up in the envelope of the short-time power spectrum of speech. But what is the power spectrum? What is an envelope? What are MFCCs? Why do they work? How do we compute them? Let's take it step by step.

 

1. Spectrogram

We are dealing with speech signals, so how we describe them matters, because different descriptions expose different aspects of the signal. Which description is best for observing and understanding speech? Let's start with something called the spectrogram.

Here, the speech is divided into many frames, and each frame corresponds to a spectrum (computed by a short-time FFT) that describes the relationship between frequency and energy. In practice, three kinds of spectra are used: the linear amplitude spectrum, the logarithmic amplitude spectrum, and the power spectrum (in the logarithmic amplitude spectrum, the amplitude of each spectral line has been converted to a logarithmic scale, so the vertical axis is in decibels (dB); this transformation raises the low-amplitude components relative to the high-amplitude ones, making it possible to observe periodic signals hidden in low-amplitude noise).

  

We first plot the spectrum of one frame of speech in coordinates, as shown on the left. Rotating that spectrum by 90 degrees gives the middle graph. The amplitudes are then mapped to a grayscale representation, where 0 is black and 255 is white: the larger the amplitude, the darker the corresponding area. That gives the rightmost graph. Why do this? The purpose is to add the dimension of time, so that the spectrum of a whole utterance, rather than of a single frame, can be displayed, and static and dynamic information can be seen at a glance. The advantages will become clear later.
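A minimal sketch of that grayscale mapping, assuming `mag` holds the amplitude spectrum of one frame (the min-max scaling is an illustrative choice, not something specified here):

g = (mag - min(mag)) / (max(mag) - min(mag));   % normalize amplitudes to [0, 1]
g = 255 - round(255 * g);                       % larger amplitude -> darker (closer to 0 = black)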

Stacking these frames over time gives a spectrum that varies with time, which is the spectrogram of the speech signal.

Below is the spectrogram of a piece of speech; the dark areas are the peaks of the spectrum (formants).

So why do we represent speech in a spectrogram?

First, the properties of phones can be observed better this way. In addition, sounds can be recognized better by observing the formants and their transitions; hidden Markov models implicitly model the acoustic spectrum to achieve good recognition performance. Another use is to evaluate the quality of a TTS (text-to-speech) system intuitively, by directly comparing how well the spectrogram of the synthesized speech matches that of natural speech.

The FFT spectrum of each frame is obtained by a time-frequency transform of the speech frames, and the per-frame spectra are then arranged in time order to obtain a time-frequency-energy distribution. It shows intuitively how the frequency content of the speech signal changes over time.
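Here is a minimal sketch of that process in MATLAB (the file name, the 25 ms frame length, the 10 ms shift, and the Hamming window are illustrative assumptions, not values from this article; hamming comes from the Signal Processing Toolbox):

[x, fs] = audioread('speech.wav');          % hypothetical input file
x = x(:, 1);                                % use the first channel
frameLen = round(0.025 * fs);               % 25 ms frames
frameHop = round(0.010 * fs);               % 10 ms frame shift
win  = hamming(frameLen);                   % analysis window
nfft = 2^nextpow2(frameLen);
numFrames = floor((length(x) - frameLen) / frameHop) + 1;
S = zeros(nfft/2 + 1, numFrames);           % one spectrum per column
for m = 1:numFrames
    idx = (m - 1) * frameHop + (1:frameLen);
    X = fft(x(idx) .* win, nfft);           % short-time FFT of one frame
    S(:, m) = abs(X(1:nfft/2 + 1));         % keep the non-negative frequencies
end
S_dB = 20 * log10(S + eps);                 % log amplitude spectrum, in dB
imagesc((0:numFrames - 1) * frameHop / fs, (0:nfft/2) * fs / nfft, S_dB);
axis xy; xlabel('Time (s)'); ylabel('Frequency (Hz)');   % time-frequency-energy plot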

 

 

2. Cepstrum Analysis

Here is a spectrum of speech. The peaks represent the dominant frequency components of the speech signal, and we call them formants. Formants carry the identifying properties of a sound (they work like a personal ID card), so they are particularly important and can be used to recognize different sounds.

If they are that important, we need to extract them! And we need to extract not only the positions of the formants, but also how they change over time. So we extract the spectral envelope, a smooth curve connecting the formants.

We can think of the original spectrum as consisting of two parts: the envelope and the spectral details. Since we are using the logarithmic spectrum, both are in dB. Now we need to separate the two parts so that we can get the envelope.

So how do we separate them? In other words, given log X[k], how do we find log H[k] and log E[k] such that log X[k] = log H[k] + log E[k]?

To achieve this we use a mathematical trick. What is the trick? Taking an FFT of the spectrum. Taking the Fourier transform of a spectrum is essentially the same as taking the inverse Fourier transform (IFFT). One thing to note is that we operate on the logarithm of the spectrum, which is part of the trick. Taking the IFFT of the log spectrum is equivalent to describing the signal on a pseudo-frequency axis (often called the quefrency axis).

As we can see in the diagram above, the envelope is mainly a low-frequency component (here we need to shift our thinking: instead of treating the horizontal axis as frequency, treat it as time). We can regard the envelope as a sinusoid with four cycles per second, so we give it a peak at 4 Hz on the pseudo-frequency axis. The spectral details are mainly high-frequency; we can regard them as a sinusoid with 100 cycles per second, so we give them a peak at 100 Hz on the pseudo-frequency axis.

Add them together and you get the original spectral signal.

In fact, we already know log X[k], so we can obtain x[k]. As the figure shows, h[k] is the low-frequency part of x[k], so we can obtain h[k] by passing x[k] through a low-pass filter! At this point we can separate the two parts and get the h[k] we want, which is the envelope of the spectrum.

x[k] is in fact the cepstrum, and the h[k] we care about is the low-frequency part of the cepstrum. h[k] describes the spectral envelope, which is widely used as a feature in speech recognition.

So to summarize cepstrum analysis, it’s actually a process like this:

1) Take the Fourier transform of the original speech signal to obtain its spectrum: X[k] = H[k] E[k];

Considering only the magnitude: |X[k]| = |H[k]| |E[k]|;

2) Take the logarithm of both sides: log|X[k]| = log|H[k]| + log|E[k]|.

3) Take the inverse Fourier transform of both sides to obtain: x[k] = h[k] + e[k].

There is a technical name for this: homomorphic signal processing. Its purpose is to turn a nonlinear problem into a linear one. In our case, the original speech signal is a convolved signal (the vocal tract acts as a linear time-invariant system, and speech production can be understood as an excitation passing through it). The first step turns the convolved signal into a multiplicative one (convolution in the time domain corresponds to a product in the frequency domain). The second step turns the multiplicative signal into an additive one by taking logarithms. The third step applies the inverse transform to bring it back toward the time domain; although the resulting sequence is time-like, it clearly differs from the original discrete time domain, so the latter is called the cepstral (quefrency) domain.

To sum up, the cepstrum is obtained by taking the Fourier transform of a signal, taking the logarithm, and then taking the inverse Fourier transform. Its calculation proceeds as follows:
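A minimal sketch of this calculation for one frame, assuming `frame` is a windowed speech frame (for example `x(idx) .* win` from the spectrogram sketch above); the liftering cutoff of 20 is an illustrative choice:

nfft   = 2^nextpow2(length(frame));
logMag = log(abs(fft(frame, nfft)) + eps);    % FFT, then log magnitude spectrum: log|X[k]|
c      = real(ifft(logMag));                  % inverse FFT -> cepstrum x[k]
L      = 20;                                  % liftering cutoff (low quefrency ~ envelope)
keep   = zeros(nfft, 1);
keep([1:L+1, nfft-L+1:nfft]) = 1;             % keep the symmetric low-quefrency coefficients h[k]
envLog = real(fft(c .* keep));                % back in the log-spectral domain: the envelope
plot(logMag(1:nfft/2)); hold on;
plot(envLog(1:nfft/2), 'LineWidth', 2);       % smooth curve following the formant peaks
legend('log spectrum', 'spectral envelope');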



 

 

3. Mel-frequency Analysis

All right, let's review what we just did: given a piece of speech, we can obtain its spectral envelope (the smooth curve connecting the formants). However, experiments on human auditory perception show that human hearing focuses only on certain specific regions, rather than on the whole spectral envelope.

Mel-frequency analysis is based on such experiments on human auditory perception. The observations show that the human ear acts like a bank of filters, attending only to certain frequency components (human hearing is frequency-selective): it lets signals in certain frequency bands through and simply ignores frequencies it does not want to perceive. These filters are not uniformly spaced along the frequency axis: in the low-frequency region there are many filters, densely spaced, while in the high-frequency region the filters become fewer and sparsely spaced.

The human auditory system is a special nonlinear system whose sensitivity varies with frequency. At extracting speech features it is remarkably good: it can extract not only semantic information but also the speaker's personal characteristics, far beyond what existing speech recognition systems can do. If a speech recognition system could mimic the way human auditory perception works, it might be possible to improve the recognition rate.

The Mel Frequency Cepstral Coefficients (MFCCs) take human auditory characteristics into account: the linear spectrum is first mapped onto the Mel nonlinear scale, which is based on auditory perception, and then converted to the cepstrum.

The formula for converting an ordinary (linear) frequency f, in Hz, to the Mel scale is:

mel(f) = 2595 * log10(1 + f / 700)

As can be seen from the figure below, this mapping turns the non-uniformly spaced frequencies into uniformly spaced ones, that is, filters that are uniformly spaced on the Mel scale (see the sketch at the end of this section).

In the Mel frequency domain, the perception of pitch is linear. For example, if the Mel frequencies of two speech segments differ by a factor of two, the human ear perceives their pitches as differing by roughly a factor of two as well.
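A minimal sketch of the Hz-to-Mel mapping and of such a filter bank (the 16 kHz sampling rate, 512-point FFT, 26 triangular filters, and their shape are common conventions assumed here, not values given in this article):

hz2mel = @(f) 2595 * log10(1 + f / 700);      % linear frequency -> Mel
mel2hz = @(m) 700 * (10.^(m / 2595) - 1);     % Mel -> linear frequency
fs = 16000; nfft = 512; numFilt = 26;
melPts = linspace(hz2mel(0), hz2mel(fs/2), numFilt + 2);   % evenly spaced on the Mel scale
hzPts  = mel2hz(melPts);                                   % dense at low Hz, sparse at high Hz
bin    = floor((nfft + 1) * hzPts / fs);                   % FFT bin index of each point
fbank  = zeros(numFilt, nfft/2 + 1);
for i = 1:numFilt
    for k = bin(i):bin(i+1)                   % rising edge of triangle i
        fbank(i, k+1) = (k - bin(i)) / (bin(i+1) - bin(i));
    end
    for k = bin(i+1):bin(i+2)                 % falling edge of triangle i
        fbank(i, k+1) = (bin(i+2) - k) / (bin(i+2) - bin(i+1));
    end
end
plot((0:nfft/2) * fs / nfft, fbank');         % many narrow filters at low frequency, few wide ones at high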

 

4. Mel-frequency Cepstral Coefficients

We pass the spectrum through the set of Mel filters to get the Mel spectrum. In formula form: log X[k] = log(Mel-spectrum). We then perform cepstrum analysis on log X[k]:

1) Take logarithms: log X[k] = log H[k] + log E[k].

2) Take the inverse transform: x[k] = h[k] + e[k].

The cepstral coefficients h[k] obtained on the Mel spectrum are called Mel-frequency cepstral coefficients, or MFCCs for short.
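A minimal sketch of these two steps, assuming `melE` holds the Mel filter-bank energies of one frame (in practice the inverse transform is implemented with a DCT, as step 4 of the summary below notes; dct is in the Signal Processing Toolbox):

logMelE = log(melE + eps);    % 1) take logarithms
mfccVec = dct(logMelE);       % 2) "inverse transform", done with a DCT in practice
mfccVec = mfccVec(2:13);      % keep the 2nd-13th coefficients as the MFCC vector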

Now let's summarize the process of extracting MFCC features (the detailed mathematics is readily available online, so I won't repeat it here):

1) Pre-emphasis, framing, and windowing of the speech signal (preprocessing that improves the signal for later stages: SNR, processing accuracy, and so on);

2) For each short-time analysis window, obtain the corresponding spectrum by FFT (this yields the spectra distributed over the different time windows along the time axis);

3) Pass the above spectrum through the Mel filter bank to obtain the Mel spectrum (this converts the linear natural spectrum into a Mel spectrum that reflects human auditory characteristics);

4) Perform cepstrum analysis on the Mel spectrum (take the logarithm and apply the inverse transform; in practice the inverse transform is implemented with the DCT, the discrete cosine transform, and the 2nd through 13th coefficients after the DCT are kept as the MFCC coefficients) to obtain the Mel-frequency cepstral coefficients, MFCC, which are the features of this frame of speech (cepstrum analysis yields MFCCs as speech features).

At this point, the speech can be described by a sequence of cepstral vectors, where each vector is the MFCC feature vector of one frame; a sketch of the whole pipeline is given below.
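Putting the steps together, here is a minimal end-to-end sketch (the file name, the pre-emphasis coefficient 0.97, the 25 ms / 10 ms framing, and the Hamming window are assumptions, and fbank is the Mel filter bank from the sketch in section 3, built for the same fs and nfft; hamming and dct come from the Signal Processing Toolbox):

[x, fs] = audioread('speech.wav');            % hypothetical input file
x = x(:, 1);
x = filter([1 -0.97], 1, x);                  % 1) pre-emphasis
frameLen = round(0.025 * fs); frameHop = round(0.010 * fs);
win  = hamming(frameLen);
nfft = 2^nextpow2(frameLen);
numFrames = floor((length(x) - frameLen) / frameHop) + 1;
mfccs = zeros(12, numFrames);                 % coefficients 2-13 of each frame
for m = 1:numFrames
    idx  = (m - 1) * frameHop + (1:frameLen);
    X    = fft(x(idx) .* win, nfft);          % 2) short-time FFT of the windowed frame
    P    = abs(X(1:nfft/2 + 1)).^2;           %    power spectrum of the frame
    melE = fbank * P;                         % 3) Mel filter-bank energies
    c    = dct(log(melE + eps));              % 4) log, then DCT (cepstrum analysis)
    mfccs(:, m) = c(2:13);                    %    keep the 2nd-13th coefficients
end
% each column of mfccs is the MFCC feature vector of one frame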

A speech classifier can then be trained on these cepstral vectors and used for recognition.

 

5. References

[1] Here’s another good tutorial:

Practicalcryptography.com/miscellaneo…

[2] The main reference for this article, a CMU tutorial:

www.speech.cs.cmu.edu/15-492/slid…

[3] C library for computing Mel Frequency Cepstral Coefficients (MFCC)

function varargout = GUI(varargin)

% GUI MATLAB code for GUI.fig

% GUI, by itself, creates a new GUI or raises the existing

% singleton*.

%

% H = GUI returns the handle to a new GUI or the handle to

% the existing singleton*.

%

% GUI('CALLBACK',hObject,eventData,handles,...) calls the local

% function named CALLBACK in GUI.M with the given input arguments.

%

% GUI('Property','Value',...) creates a new GUI or raises the

% existing singleton*. Starting from the left, property value pairs are

% applied to the GUI before GUI_OpeningFcn gets called. An

% unrecognized property name or invalid value makes property application

% stop. All inputs are passed to GUI_OpeningFcn via varargin.

%

% *See GUI Options on GUIDE's Tools menu. Choose "GUI allows only one

% instance to run (singleton)".

%

% See also: GUIDE, GUIDATA, GUIHANDLES


% Edit the above text to modify the response to help GUI



% Begin initialization code - DO NOT EDIT

gui_Singleton = 1;

gui_State = struct('gui_Name', mfilename, ...

'gui_Singleton', gui_Singleton, ...

'gui_OpeningFcn', @GUI_OpeningFcn, ...

'gui_OutputFcn', @GUI_OutputFcn, ...

'gui_LayoutFcn', [] , ...

'gui_Callback', []);

if nargin && ischar(varargin{1})

gui_State.gui_Callback = str2func(varargin{1});

end


if nargout

[varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});

else

gui_mainfcn(gui_State, varargin{:});

end

% End initialization code - DO NOT EDIT



% --- Executes just before GUI is made visible.

function GUI_OpeningFcn(hObject, eventdata, handles, varargin)

% This function has no output args, see OutputFcn.

% hObject handle to figure

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)

% varargin command line arguments to GUI (see VARARGIN)


% Choose default command line output for GUI

handles.output = hObject;


% Update handles structure

guidata(hObject, handles);


% UIWAIT makes GUI wait for user response (see UIRESUME)

% uiwait(handles.figure1);



% --- Outputs from this function are returned to the command line.

function varargout = GUI_OutputFcn(hObject, eventdata, handles)

% varargout cell array for returning output args (see VARARGOUT);

% hObject handle to figure

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)


% Get default command line output from handles structure

varargout{1} = handles.output;



% --- Executes on button press in pushbutton1.

function pushbutton1_Callback(hObject, eventdata, handles)

%% Load the speech database

% Database path

dirName = './wav/Database';

dirName = uigetdir(dirName);

if isequal(dirName, 0)

return;

end

handles.dirName = dirName;

guidata(hObject, handles);

set(handles.text1,'string','语音库选择完毕!')   % status text: "Speech database selected"

% hObject handle to pushbutton1 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)



% --- Executes on button press in pushbutton2.

function pushbutton2_Callback(hObject, eventdata, handles)

%% Extract feature parameters

if isequal(handles.dirName, 0)

msgbox('请选择音频库目录', '提示信息', 'modal');   % "Please select the speech database directory" / "Notice"

return;

end

S = GetDatabase(handles.dirName);

handles.S = S;

guidata(hObject, handles);

set(handles.text1,'string','特征参数提取完毕!')   % status text: "Feature extraction finished"

% hObject handle to pushbutton2 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)



% --- Executes on button press in pushbutton3.

function pushbutton3_Callback(hObject, eventdata, handles)

%% Select a test file

file = './wav/Test/1.wav';

[Filename, Pathname] = uigetfile('*.wav', '打开新的语音文件',... % dialog title: "Open a new speech file"

file);

if Filename == 0

return;

end

fileurl = fullfile(Pathname,Filename);

[signal, fs] = audioread(fileurl);

plot(signal); title('待识别语音信号', 'FontWeight', 'Bold');   % title: "Speech signal to be recognized"

handles.fileurl = fileurl;

handles.signal = signal;

handles.fs = fs;

guidata(hObject, handles);

%% Play the test file

if isequal(handles.fileurl, 0)

msgbox('请选择音频文件', '提示信息', 'modal');   % "Please select an audio file" / "Notice"

return;

end

sound(handles.signal, handles.fs);

set(handles.text1,'string','选择语音完毕!')   % status text: "Speech file selected"

% hObject handle to pushbutton3 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)



% --- Executes on button press in pushbutton4.

function pushbutton4_Callback(hObject, eventdata, handles)

%% Recognition

set(handles.text1,'string','识别中。。。')   % status text: "Recognizing..."

pause(3)

if isequal(handles.fileurl, 0)

msgbox('请选择音频文件', '提示信息', 'modal');   % "Please select an audio file" / "Notice"

return;

end

%if isequal(handles.S, 0)

% msgbox('请计算音频库MFCC特征', '提示信息', 'modal');

% return;

%end

S = handles.S;

[num, MC] = Reco(S, handles.fileurl);

result = S(num).name;

result = result(1:2);

set(handles.edit1,'string',result);

% Determine the waste category from the recognized word

if strcmp(result,'废纸')||strcmp(result,'瓶子')||strcmp(result,'塑料')||strcmp(result,'毛毯')||strcmp(result,'剪刀')||strcmp(result,'床单')||strcmp(result,'罐头')||strcmp(result,'纸盒')||strcmp(result,'纸箱')||strcmp(result,'塑料')||strcmp(result,'镜子')||strcmp(result,'酒瓶')

set(handles.edit2,'string','可循环利用垃圾')   % "Recyclable waste"

elseif strcmp(result,'剩菜')||strcmp(result,'果皮')||strcmp(result,'剩饭')||strcmp(result,'菜叶')||strcmp(result,'果壳')||strcmp(result,'骨头')||strcmp(result,'贝壳')||strcmp(result,'羽毛')||strcmp(result,'鱼鳞')||strcmp(result,'果核')||strcmp(result,'菜梗')

set(handles.edit2,'string','厨余垃圾')   % "Kitchen (food) waste"

elseif strcmp(result,'电池')||strcmp(result,'灯管')||strcmp(result,'电池')||strcmp(result,'药品')||strcmp(result,'化妆品')||strcmp(result,'杀虫剂')||strcmp(result,'胶片')||strcmp(result,'农药')||strcmp(result,'相纸')||strcmp(result,'油漆')||strcmp(result,'矿物油')

set(handles.edit2,'string','有害垃圾')   % "Hazardous waste"

else

set(handles.edit2,'string','其他垃圾')   % "Other waste"


end

% hObject handle to pushbutton4 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)



% --- Executes on button press in pushbutton5.

function pushbutton5_Callback(hObject, eventdata, handles)

clc

close

% hObject handle to pushbutton5 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)




function edit1_Callback(hObject, eventdata, handles)

% hObject handle to edit1 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)


% Hints: get(hObject,'String') returns contents of edit1 as text

% str2double(get(hObject,'String')) returns contents of edit1 as a double



% --- Executes during object creation, after setting all properties.

function edit1_CreateFcn(hObject, eventdata, handles)

% hObject handle to edit1 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles empty - handles not created until after all CreateFcns called


% Hint: edit controls usually have a white background on Windows.

% See ISPC and COMPUTER.

if ispc && isequal(get(hObject,'BackgroundColor'), get(0,'defaultUicontrolBackgroundColor'))

set(hObject,'BackgroundColor','white');

end




function edit2_Callback(hObject, eventdata, handles)

% hObject handle to edit2 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles structure with handles and user data (see GUIDATA)


% Hints: get(hObject,'String') returns contents of edit2 as text

% str2double(get(hObject,'String')) returns contents of edit2 as a double



% --- Executes during object creation, after setting all properties.

function edit2_CreateFcn(hObject, eventdata, handles)

% hObject handle to edit2 (see GCBO)

% eventdata reserved - to be defined in a future version of MATLAB

% handles empty - handles not created until after all CreateFcns called


% Hint: edit controls usually have a white background on Windows.

% See ISPC and COMPUTER.

if ispc && isequal(get(hObject,'BackgroundColor'), get(0,'defaultUicontrolBackgroundColor'))

set(hObject,'BackgroundColor','white');

end