Speech Recognition Based on MATLAB Dynamic Time Warping (DTW) Isolated Word speech recognition

A list,

Dynamic Time Warping (DTW) has a certain history (proposed by Itakura, a Japanese scholar), and its purpose is relatively simple: it is a method to measure the similarity of two Time series with different lengths. It is also widely applied, mainly in template matching, such as isolated word speech recognition (recognition of whether two segments of speech represent the same word), gesture recognition, data mining and information retrieval.

1 overview

Time series is a common representation of data in most disciplines. For time series processing, a common task is to compare the similarity of two sequences. In time series, the length of two time series that need to be compared may not be the same, which is manifested in the different speed of speech in the field of speech recognition. Because speech signals are quite random, even if the same person speaks the same sound at different times, it is impossible to have a complete length of time. And different phonemes within the same word are pronounced at different speeds. For example, some people make the “A” sound very long, or the “I” sound very short. In these complex cases, the distance (or similarity) between two time series cannot be effectively calculated using the traditional Euclidean distance. 2. Principle of DTW method

In time series, the length of two time series that need to be compared may not be the same, which is manifested in the different speed of speech in the field of speech recognition. And different phonemes within the same word are pronounced at different speeds. For example, some people make the “A” sound very long, or the “I” sound very short. In addition, different time series may only have displacement on the time axis, that is, in the case of reduced displacement, the two time series are consistent. In these complex cases, the distance (or similarity) between two time series cannot be effectively calculated using the traditional Euclidean distance.

DTW calculates the similarity between two time series by extending and shortening the time series:

As shown in the figure above, the upper and lower solid lines represent two time series, and the dotted lines between time series represent similar points between two time series. DTW uses the sum of the distances between all these similarity points, called the Warp Path Distance, to measure the similarity between two time series.

2 DTW calculation method:

Order to compute the similarity of two time series for X and Y, the length of the | | and | Y | X respectively.

Warping paths

The form of the consolidation path is W=w1,w2… , wK, a Max (| | | | X, Y) < = K < = | | | X + Y |.

Wk is of the form (I,j), where I represents the I coordinate in X and j represents the j coordinate in Y.

Attribute path W must begin from w1 = (1, 1), to wK = a (| | | | X, Y) at the end, to ensure that each of the X and Y coordinates are in the W.

In addition, the I and j of W (I,j) in W must be monotonically increasing to ensure that the dotted lines in FIG. 1 do not intersect. The so-called monotonically increasing refers to:

The figure above is Cost Matrix D, where D(I,j) represents the distance of the merging path between two time series of length I and j.

Ii. Source code

Trimmed_X = my_vad(x) %; Input is input voice, output is useful signal Ini =0.1; % Initial silent time Ts =0.01; % Window duration Tsh =0.005; % Frame duration Fs =16000; % Sampling frequency Counter1 =0; The following four arguments are used to find the start and end points counter2 =0;
counter3 = 0;
counter4 = 0;
ZCRCountf = 0; % Stores the zero pass detection result ZCRCountb =0;     
ZTh = 40; % Zero threshold w_SAM = fix(Ts*Fs); O_sam = fix(Tsh*Fs); LengthX = length(x); segs = fix((lengthX-w_sam)/o_sam)+1; % Number of frames sil = fix((ini-ts)/Tsh)+1; Win = hamming(w_sam); Limit = o_sam*(segs- 1) +1; % Start position of last frame FrmIndex =1:o_sam:Limit; % Start position of each frame ZCR_Vector = zeros(1,segs); % Record the number of zeros crossed in each frame % short zeros crossedfor t = 1:segs
    ZCRCounter = 0; 
    nextIndex = (t- 1)*o_sam+1;
    for r = nextIndex+1:(nextIndex+w_sam- 1)
        if (x(r) >= 0) && (x(r- 1) > =0)
         
        elseif (x(r) > 0) && (x(r- 1) < 0)
         ZCRCounter = ZCRCounter + 1;
        elseif (x(r) < 0) && (x(r- 1) < 0)
         
        elseif (x(r) < 0) && (x(r- 1) > 0)
         ZCRCounter = ZCRCounter + 1;
        end
    end
    ZCR_Vector(t) = ZCRCounter; End % short mean amplitude Erg_Vector = zeros(1,segs);
for u = 1:segs
    nextIndex = (u- 1)*o_sam+1;
    Energy = x(nextIndex:nextIndex+w_sam- 1).*win;
    Erg_Vector(u) = sum(abs(Energy));
end

IMN = mean(Erg_Vector(1:sil)); % Mean silent energy (noise mean) IMX = Max (Erg_Vector); % Maximum short-time mean amplitude I1 =0.03* (IMX-IMN) + IMN; %I1, I2 is the initial energy threshold I2 =4 * IMN;
ITL = 100*min(I1,I2); % lower limit of energy threshold, the front coefficient is changed according to the actual situation to obtain appropriate results ITU =10* ITL; % energy threshold upper limit IZC = mean(ZCR_Vector(1:sil));  
stdev = std(ZCR_Vector(1:sil)); % Standard deviation of zero crossing rate in silent stage IZCT = min(ZTh,IZC+2*stdev); % Zero crossing threshold indexi = zeros(1,lengthX); indexj = indexi; indexk = indexi; indexl = indexi; % Searches for the portion that exceeds the upper limit of the energy thresholdfor i = 1:length(Erg_Vector)
    if (Erg_Vector(i) > ITU)
        counter1 = counter1 + 1;
        indexi(counter1) = i;
    end
end
ITUs = indexi(1); % The first frame whose energy exceeds the upper threshold % searches for the part whose energy exceeds the lower thresholdfor j = ITUs:- 1:1
    if (Erg_Vector(j) < ITL)
        counter2 = counter2 + 1;
        indexj(counter2) = j;
    end
end
start = indexj(1) +1; % first level decision start frame Erg_Vectorf = fliplr(Erg_Vector); % symmetry the energy matrix with respect to the center, if it is a row of vectors is equivalent to reverse the process % repeat the above process is equivalent to finding the end framefor k = 1:length(Erg_Vectorf)
    if (Erg_Vectorf(k) > ITU)
        counter3 = counter3 + 1; indexk(counter3) = k; Scores1 = zeros(end end % Scores1 = zeros(1,N);                
Scores2 = zeros(1,N);
Scores3 = zeros(1,N); % s1 = load('Vectors1.mat');
fMatrixall1 = struct2cell(s1);
s2 = load('Vectors2.mat');
fMatrixall2 = struct2cell(s2);
s3 = load('Vectors3.mat'); fMatrixall3 = struct2cell(s3); % calculation DTWfor i = 1:N
    fMatrix1 = fMatrixall1{i,1};
    fMatrix1 = CMN(fMatrix1);
    Scores1(i) = myDTW(fMatrix1,rMatrix);
end

for j = 1:N
    fMatrix2 = fMatrixall2{j,1};
    fMatrix2 = CMN(fMatrix2);
    Scores2(j) = myDTW(fMatrix2,rMatrix);
end
Copy the code

3. Operation results

Fourth, note

Version: 2014 a

Speech Recognition Based on MATLAB Dynamic Time Warping (DTW) Isolated Word speech recognition

A list,

Ii. Source code

3. Operation results

Fourth, note

Related Posts

A selection of bizarre products at CES 2021

Use Python for exception detection

Python Data Analysis Toolkit pandas