I. Introduction

1. Mel Frequency Cepstral Coefficients (MFCC)

In any automatic speech recognition system, the first step is to extract features. In other words, we need to extract the discriminative components of the audio signal and throw away the rest of the clutter, such as background noise, mood, and so on.



Knowing how speech is produced helps a lot in understanding it. People produce sound through the vocal tract, and the shape of the vocal tract determines what kind of sound is made. This shape is set by the tongue, teeth, and so on. If we could know the shape accurately, we could describe the resulting phoneme accurately. The shape of the vocal tract shows up in the envelope of the short-time power spectrum of speech, and MFCCs are a feature that accurately describes this envelope.

MFCCs (Mel Frequency Cepstral Coefficients) are a feature widely used in automatic speech and speaker recognition. They were developed by Davis and Mermelstein in 1980. Since then, MFCCs have stood out among hand-crafted features in speech recognition and have never really been surpassed (feature learning in deep learning is another story).

OK, so here we have a very important keyword: the shape of the vocal tract. We know that it matters and that it shows up in the envelope of the short-time power spectrum of speech. But what is the power spectrum? What is an envelope? What are MFCCs? Why do they work? How do you compute them? Let's take it step by step.

2. Spectrogram

We're dealing with speech signals, and how we describe them matters, because different representations reveal different aspects. So what kind of representation makes the signal easy to observe and understand? Let's start with something called the spectrogram.



Here, the speech is divided into many frames, and each frame corresponds to a spectrum (computed by short-time FFT) that represents the relationship between frequency and energy. In practice, there are three kinds of spectrum plot: the linear amplitude spectrum, the logarithmic amplitude spectrum, and the auto-power spectrum (in the logarithmic amplitude spectrum, the amplitude of each spectral line is log-transformed, so the ordinate unit is dB). The purpose of this transformation is to raise the low-amplitude components relative to the high-amplitude ones, so that periodic signals hidden in low-amplitude noise can be observed.



We first plot the spectrum of one frame of speech in coordinates, as shown on the left. Rotating this spectrum by 90 degrees gives the middle graph. The amplitudes are then mapped to grayscale (0 is black, 255 is white): the larger the amplitude, the darker the corresponding region. That yields the rightmost graph. Why do this? The purpose is to add the dimension of time, so that the spectrum of a whole utterance rather than a single frame can be displayed, and static and dynamic information can be seen at a glance. The advantages will become clear later.

Stacked over time, this gives a spectrum that varies with time: the spectrogram of the speech signal.
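To make this concrete, here is a minimal MATLAB sketch of the construction (my own illustration, assuming a mono column vector y with sampling rate fs; the frame length, frame shift, and FFT size are arbitrary example values):

% Minimal spectrogram sketch: frame, window, FFT, log amplitude, display.
% Assumes y is a mono column vector and fs its sampling rate.
wlen=256; inc=128; nfft=256;                 % frame length, shift, FFT size
w=0.54-0.46*cos(2*pi*(0:wlen-1)'/(wlen-1));  % Hamming window (hand-rolled)
nframes=fix((length(y)-wlen)/inc)+1;
S=zeros(nfft/2+1,nframes);
for i=1:nframes
    seg=y((i-1)*inc+(1:wlen)).*w;            % windowed frame
    X=fft(seg,nfft);
    S(:,i)=20*log10(abs(X(1:nfft/2+1))+eps); % log amplitude in dB
end
t=((0:nframes-1)*inc+wlen/2)/fs;             % frame-center times
f=(0:nfft/2)*fs/nfft;                        % frequency scale
imagesc(t,f,S); axis xy; colormap(flipud(gray)); % darker = more energy
xlabel('Time/s'); ylabel('Frequency/Hz');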



Below is the spectrogram of a segment of speech; the dark areas are the peaks of the spectrum (formants).



So why do we represent speech in a spectrogram?

First, the properties of phones can be observed better this way. In addition, sounds can be recognized better by observing formants and their transitions; hidden Markov models implicitly model the acoustic spectrum to achieve good recognition performance. Spectrograms are also useful for visually evaluating the quality of a TTS (text-to-speech) system, by directly comparing how well the spectrogram of the synthesized speech matches that of natural speech.

3. Cepstrum Analysis

Here is the spectrum of a frame of speech. The peaks represent the dominant frequency components of the sound; we call them formants, and they carry the identifying properties of the sound (rather like a personal ID card). So they are particularly important, and we can use them to recognize different sounds.



If it's so important, we need to extract it! And we need to extract not only the locations of the formants but also how they change over time. So we extract the spectral envelope: a smooth curve connecting the formants.



We can think of the original spectrum as consisting of two parts: the envelope and the spectral details. We are using the logarithmic spectrum, so the unit is dB. Now we need to separate the two parts in order to get the envelope.



So how do we separate them? In other words, given log X[k], how do we find log H[k] and log E[k] such that log X[k] = log H[k] + log E[k]?

To achieve this, we need a mathematical trick. What is the trick? Taking an FFT of the spectrum. Because the log amplitude spectrum is real and symmetric, taking its Fourier transform is equivalent to taking its inverse Fourier transform (IFFT). One thing to note is that we operate in the logarithmic domain of the spectrum, which is part of the trick. Doing an IFFT on the log spectrum is then equivalent to describing the signal on a pseudo-frequency axis.



As we can see from the diagram above, the envelope is mainly a low-frequency component (we need to shift our thinking here: instead of treating the horizontal axis as frequency, think of it as time). We can regard the envelope as a sinusoid with four cycles per "second", so we give it a peak at 4 Hz on the pseudo-frequency axis. The spectral details are mainly high-frequency; we can regard them as a sinusoid with 100 cycles per "second", giving a peak at 100 Hz on the pseudo-frequency axis.

Add them together and you get the original spectral signal.
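Here is a toy MATLAB sketch of the pseudo-frequency idea (entirely synthetic numbers chosen to match the 4-cycle and 100-cycle story above, not real speech):

% Toy illustration of the pseudo-frequency axis (synthetic example).
k=(0:999)/1000;                 % 'frequency' axis, treated as 1 second of 'time'
env=sin(2*pi*4*k);              % envelope: 4 cycles across the axis
dtl=0.2*sin(2*pi*100*k);        % spectral details: 100 cycles
c=abs(fft(env+dtl));            % 'spectrum of the spectrum'
plot(0:199,c(1:200)); grid;     % peaks near pseudo-frequencies 4 and 100
xlabel('Pseudo-frequency'); ylabel('Amplitude');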



In fact, we already know log X[k], so we can take its IFFT to get x[k]. As the figure shows, h[k] is the low-frequency part of x[k], so we can obtain h[k] by passing x[k] through a low-pass filter! At this point we can separate the two parts and obtain the h[k] we want, the envelope of the spectrum.

This x[k] is in fact the cepstrum, and the h[k] we care about is the low-frequency (low-quefrency) part of the cepstrum. h[k] describes the spectral envelope, which is widely used as a feature in speech recognition.
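As a minimal MATLAB sketch of this separation (an illustration, not the source code below; xw is an assumed windowed speech frame, and the FFT size nfft and cutoff L are example values):

% Recover the spectral envelope by low-pass 'liftering' the cepstrum.
nfft=1024;
logX=log(abs(fft(xw,nfft))+eps);   % log amplitude spectrum, log|X[k]|
c=real(ifft(logX));                % cepstrum x[k]
L=30;                              % cutoff: keep only the low-quefrency part
c(L+2:nfft-L)=0;                   % zero the high-quefrency details e[k]
logH=real(fft(c));                 % log spectral envelope, log|H[k]|
plot(logX(1:nfft/2),'k'); hold on; % original log spectrum
plot(logH(1:nfft/2),'r');          % its smooth envelope
hold off; grid;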

So to summarize cepstrum analysis, it’s actually a process like this:

1) The spectrum of the original speech signal is obtained by Fourier transform: X[k]=H[k]E[k];

Considering only the amplitude: |X[k]| = |H[k]| |E[k]|;

2) Take the logarithm of both sides: log|X[k]| = log|H[k]| + log|E[k]|;

3) Take the inverse Fourier transform of both sides to get: x[k] = h[k] + e[k].

There is actually a technical name for this: homomorphic signal processing. Its purpose is to transform a nonlinear problem into a linear one. In our case, the original speech signal is a convolutional signal (the vocal tract acts as a linear time-invariant system, and sound production can be understood as an excitation passing through it). The first step transforms the convolution into a product via the Fourier transform (convolution in the time domain is equivalent to a product in the frequency domain). The second step converts the multiplicative signal into an additive one by taking logarithms. The third step applies the inverse transform to get back a convolution-like sequence. Although the sequences before and after the process both look like time-domain signals, they live in clearly different discrete domains, so the latter domain is called the quefrency (cepstral) domain.

To sum up, the cepstrum is obtained by taking the Fourier transform of a signal, taking the logarithm, and then taking the inverse Fourier transform. The calculation process is as follows:
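In MATLAB terms, the same recipe can be sketched in one line (assuming a signal vector s; adding eps avoids taking the log of zero):

% Cepstrum: FFT -> log magnitude -> inverse FFT (real part).
ceps=real(ifft(log(abs(fft(s))+eps)));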



4. Mel-frequency Analysis

All right, let's review what we just did. Given a piece of speech, we can obtain its spectral envelope (the smooth curve connecting the resonant peaks). However, experiments on human auditory perception show that the ear focuses only on certain specific regions, rather than on the whole spectral envelope.

Mel-frequency analysis is based on such human auditory perception experiments. The observations show that the human ear acts like a filter bank, attending only to certain frequency components (human hearing is frequency selective): it lets signals in certain frequency bands pass and simply ignores frequencies it does not want to perceive. However, these filters are not uniformly distributed along the frequency axis: in the low-frequency region there are many, densely spaced filters, while in the high-frequency region the filters become fewer and sparsely distributed.



The human auditory system is a special nonlinear system whose sensitivity varies with the frequency of the signal. At extracting speech features, the human auditory system excels: it extracts not only semantic information but also the speaker's personal characteristics, far beyond what existing speech recognition systems can do. If a speech recognition system could simulate the characteristics of human auditory perception, it might improve the recognition rate.

Mel-frequency cepstral coefficients (MFCC) take these human auditory characteristics into account: the linear spectrum is first mapped onto the Mel nonlinear spectrum based on auditory perception, and then converted to the cepstrum.

The standard formula for converting an ordinary frequency f (in Hz) to Mel frequency is:

mel(f) = 2595 * log10(1 + f/700)

As can be seen from the figure below, this mapping turns the non-uniformly spaced frequencies into uniformly spaced ones, i.e., a uniform filter bank on the Mel scale.
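For illustration, here is a hand-rolled MATLAB sketch of such a filter bank (independent of the melbankm routine used in the source code below; the choice of 24 filters, a 256-point FFT, and an 8 kHz sampling rate are assumptions mirroring that code):

% Hand-rolled Mel filter bank: uniform spacing on the Mel scale,
% triangular filters on the linear-frequency axis (assumed parameters).
nfilt=24; nfft=256; fs=8000;
melmax=2595*log10(1+(fs/2)/700);       % top of the Mel axis
melpts=linspace(0,melmax,nfilt+2);     % uniform points on the Mel scale
fpts=700*(10.^(melpts/2595)-1);        % map back to Hz
bins=floor((nfft+1)*fpts/fs);          % FFT bin index of each point
bank=zeros(nfilt,nfft/2+1);
for m=1:nfilt
    for k=bins(m):bins(m+1)            % rising edge of triangle m
        bank(m,k+1)=(k-bins(m))/(bins(m+1)-bins(m));
    end
    for k=bins(m+1):bins(m+2)          % falling edge of triangle m
        bank(m,k+1)=(bins(m+2)-k)/(bins(m+2)-bins(m+1));
    end
end
ff=(0:nfft/2)*fs/nfft;                 % frequency scale in Hz
plot(ff,bank'); grid;                  % filters are denser at low frequencies
xlabel('Frequency/Hz'); ylabel('Amplitude');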



5. Mel-Frequency Cepstral Coefficients

We pass the spectrum through the set of Mel filters to obtain the Mel spectrum. In formulas: log X[k] = log(Mel-spectrum). We then perform cepstrum analysis on log X[k]:

1) Take logarithms: log X[k] = log H[k] + log E[k].

2) Inverse transformation: x[k] = h[k] + e[k].

The cepstral coefficients h[k] obtained on the Mel spectrum are called Mel-frequency cepstral coefficients, MFCC for short.



Now let's summarize the process of extracting MFCC features (the detailed mathematics is widely available online and is not reproduced here):

1) Pre-emphasize, frame, and window the speech signal;

2) For each short-time analysis window, obtain the corresponding spectrum via FFT;

3) Pass the above spectrum through the Mel filter bank to obtain the Mel spectrum;

4) Perform cepstrum analysis on the Mel spectrum (take the logarithm, then the inverse transform; in practice the inverse transform is implemented with the discrete cosine transform (DCT), and the 2nd through 13th DCT coefficients are kept) to obtain the Mel-frequency cepstral coefficients, MFCC. These are the features of this frame of speech.
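Putting these steps together, here is a minimal per-frame MATLAB sketch (an illustration: frame is an assumed speech frame that has already been pre-emphasized, and bank is a Mel filter bank like the one sketched in section 4):

% Minimal per-frame MFCC following steps 1-4 above (illustrative).
% Assumes: frame = one pre-emphasized speech frame, bank = Mel filter bank.
wlen=200; nfft=256;
w=0.54-0.46*cos(2*pi*(0:wlen-1)'/(wlen-1));   % step 1: Hamming window
X=fft(frame(:).*w,nfft);                      % step 2: spectrum via FFT
P=abs(X(1:nfft/2+1)).^2;                      % power spectrum (half spectrum)
logmel=log(bank*P+eps);                       % step 3: Mel filter bank + log
c=dct(logmel);                                % step 4: DCT as inverse transform
mfcc=c(2:13);                                 % keep the 2nd..13th coefficients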



At this point, the speech can be described by a series of cepstral vectors, each vector being the MFCC feature vector of one frame.



A speech classifier can then be trained on these cepstral vectors and used for recognition.

II. Source code

% Cepstrum calculation and display
clear all; clc; close all;
[y,fs]=wavread('C3_4_y_1.wav');          % read the speech file
y=y(1:1000);                             % take the first 1000 samples
N=1024;                                  % FFT length
len=length(y); time=(0:len-1)/fs;        % time scale
figure(1), subplot 311; plot(time,y,'k');        % signal waveform
title('(a) Signal waveform'); axis([0 max(time) -1 1]);
ylabel('Value'); xlabel(['Time/s' 10]); grid;

nn=1:N/2; ff=(nn-1)*fs/N;                % frequency scale
z=Nrceps(y);                             % cepstrum (custom function)
figure(1), subplot 312; plot(time,z,'k');        % cepstrum plot
title('(b) Cepstrum of the signal'); axis([0 time(512) -0.2 0.2]); grid;
ylabel('Value'); xlabel(['Quefrency/s' 10]);

% DCT coefficient calculation and signal recovery
clear all; clc; close all;
f=50;                                    % signal frequency
fs=1000;                                 % sampling frequency
N=1000;                                  % number of samples
n=0:N-1;
xn=cos(2*pi*f*n/fs);                     % cosine sequence
y=dct(xn);                               % DCT
num=find(abs(y)<5);                      % DCT coefficients with magnitude below 5
y(num)=0;                                % set them to zero
zn=idct(y);                              % inverse DCT
subplot 211; plot(n,xn,'k');             % original signal
title('(a) Original signal'); xlabel(['Sample' 10]); ylabel('Value');
subplot 212; plot(n,zn,'k');             % reconstructed signal
title('(b) Reconstructed signal'); xlabel(['Sample' 10]); ylabel('Value');

% Frequency response of the Mel filter bank
clear all; clc; close all;
% design 24 triangular filters over the [0, 0.5] normalized-frequency interval
bank=melbankm(24,256,8000,0,0.5,'t');
bank=full(bank);
bank=bank/max(bank(:));                  % amplitude normalization
df=8000/256;                             % frequency resolution
ff=(0:128)*df;                           % frequency scale
for k=1:24                               % plot all 24 filters
    plot(ff,bank(k,:),'k'); hold on;
end
hold off; grid;
xlabel('Frequency/Hz'); ylabel('Relative amplitude');

% MFCC calculation program
clear all; clc; close all;
[x1,fs]=wavread('C3_4_y_4.wav');         % read the signal C3_4_y_4.wav
wlen=200;                                % frame length
inc=80;                                  % frame shift
num=8;                                   % parameter passed to Nmfcc
x1=x1/max(abs(x1));                      % amplitude normalization
time=(0:length(x1)-1)/fs;
subplot 211; plot(time,x1,'b');
title('(a) Speech signal');
ylabel('Value'); xlabel(['Time/s']);
ccc1=Nmfcc(x1,fs,num,wlen,inc);          % MFCC (custom function)
fn=size(ccc1,1)+4; cn=size(ccc1,2);
z=zeros(1,cn);

III. Operation results

IV. Remarks

MATLAB version: 2014a