Abstract: This paper introduces in detail the process of converting speech into acoustic features, and the application of different acoustic features in different models.

Do you really understand the principles behind speech features? Author: White Horse Across the Pingchuan.

Speech data is often used in artificial intelligence tasks, but unlike images, raw speech cannot be fed directly into model training: it shows no obvious characteristic change over a long time span, so it is difficult for a model to learn features from it. In addition, the time-domain data of speech usually has a 16 kHz sampling rate, i.e. 16,000 sample points per second; feeding the time-domain sample points directly into training produces a large amount of data and makes it hard to train to a useful result. Therefore, speech tasks usually convert speech data into acoustic features that serve as the input or output of the model. This paper describes in detail the process of converting speech into acoustic features, and the application of different acoustic features in different models.

Understanding how sounds are produced in the first place is a great help in understanding them. People produce sound through the vocal tract, and the shape of the vocal tract determines what kind of sound is produced. The shape of the vocal tract involves the tongue, the teeth and so on. If we knew this shape exactly, we could describe the resulting phonemes accurately. The shape of the vocal tract usually shows itself in the envelope of the short-time power spectrum of speech. How to obtain the power spectrum, or the spectral envelope on the basis of the power spectrum, is therefore the core of speech feature extraction.

1. Time domain diagram

Figure 1: Time domain diagram of audio

In the time-domain plot, the speech signal is represented directly by its waveform over time. Figure 1 is an audio time-domain plot opened with Adobe Audition. The quantization precision of this speech waveform is 16 bit. The starting position of each sound can be read from the plot, but it is difficult to see much more useful information. However, if we zoom in to about 100 ms, we get the image shown in Figure 2 below.

Figure 2: Short time domain diagram of audio

From the above we can see that, over a short time span, the speech waveform shows a certain periodicity, and different pronunciations often correspond to different periods. We can therefore apply a Fourier transform to the short-time waveform to convert it into a frequency-domain representation, observe the periodic features of the audio, and thereby obtain useful audio features.
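As a small sketch of working with such a waveform in the time domain (the file name audio.wav and the 16 kHz sample rate are assumptions for illustration):

import librosa

# Load a mono waveform; librosa returns float samples in [-1, 1] regardless of the 16-bit storage.
wav, sr = librosa.load("audio.wav", sr=16000)
print(len(wav) / sr)            # duration in seconds
segment = wav[:int(0.1 * sr)]   # a 100 ms slice, like the zoomed view in Figure 2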

The short-time Fourier transform (STFT) is the most classical time-frequency analysis method. As the name suggests, it is the Fourier transform of a short-time signal. Since the speech waveform only shows a certain periodicity in the short time domain, the short-time Fourier transform can observe speech changes in the frequency domain more accurately. A schematic of how the Fourier transform works is shown below:

Figure 3: Schematic diagram of Fourier transform from time domain to frequency domain

The figure above shows how the Fourier transform converts a time-domain waveform into a frequency-domain spectrum. However, the direct discrete Fourier transform is hard to apply in practice because of its O(N^2) complexity; in computer applications the fast Fourier transform (FFT) is used instead. A derivation of the transform can be found on Zhihu (zhuanlan.zhihu.com/p/31584464).
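As a minimal, self-contained sketch of this time-to-frequency conversion (not code from the original article), the FFT of a pure 200 Hz sine wave shows a single peak at 200 Hz:

import numpy as np

sr = 16000                              # assumed sample rate
t = np.arange(0, 0.1, 1.0 / sr)         # 100 ms of samples
sine = np.sin(2 * np.pi * 200 * t)      # a 200 Hz tone

spectrum = np.abs(np.fft.rfft(sine))    # np.fft uses an O(N log N) FFT
freqs = np.fft.rfftfreq(len(sine), 1.0 / sr)
print(freqs[np.argmax(spectrum)])       # -> 200.0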

2. Get audio features

From the previous section we know the principle behind obtaining audio frequency-domain features. Turning raw audio into audio features for model training still requires a number of auxiliary operations; the specific process is shown in Figure 4 below:

Figure 4: Flow chart of audio to audio features

(1) Pre-emphasis

Pre-emphasis essentially passes the speech signal through a high-pass filter:

H(z) = 1 - \mu z^{-1}

where \mu is usually taken as 0.97. The purpose of pre-emphasis is to boost the high-frequency part so that the spectrum of the signal becomes flatter, keeping a comparable signal-to-noise ratio across the whole band from low to high frequency when the spectrum is computed. At the same time, it eliminates the effect of the vocal cords and lips during production, compensates for the high-frequency part of the speech signal that is suppressed by the articulatory system, and highlights the high-frequency formants.

pre_emphasis = 0.97
emphasized_signal = np.append(original_signal[0], original_signal[1:] - pre_emphasis * original_signal[:-1])

(2) Framing

Since the Fourier transform requires the input signal to be stationary, it makes no sense to take the Fourier transform of a non-stationary signal. From the above we know that speech is non-stationary over long spans but periodic over short spans. That is, speech is non-stationary in a macro sense: when the mouth moves, the characteristics of the signal change. But from the micro point of view, over a sufficiently short time the mouth does not move that fast, so the speech signal can be regarded as stationary and a segment can be cut out for the Fourier transform. This is why framing, i.e. cutting out short speech segments, is performed.

So how long is a frame? The frame length must meet two conditions:

  • Macroscopically, it must be short enough to ensure that the signal within the frame is stationary. As mentioned above, the change of mouth shape is the cause of signal instability, so the mouth shape should not change significantly during a frame; that is, the length of a frame should be less than the length of a phoneme. At normal speaking speed, the duration of a phoneme is about 50~200 ms, so the frame length is generally taken to be less than 50 ms.
  • At the micro level, it has to include enough vibration periods, because the Fourier transform analyzes frequency, and a pattern has to repeat enough times for its frequency to be analyzed. The fundamental frequency of speech is around 100 Hz for male voices and 200 Hz for female voices, which translates into periods of 10 ms and 5 ms. Since a frame has to contain multiple periods, a frame length of at least 20 ms is generally taken.

Note: Framing does not cut the signal into strictly disjoint segments; there is also the concept of a frame shift. A frame window size is determined, and each time the window slides forward by the frame shift, a short audio segment is cut out. The frame shift is usually 5-10 ms and the window size is usually 2-3 times the frame shift, i.e. 20-30 ms. The main reason for using a frame shift is the subsequent windowing operation. The specific framing process is shown below:

Figure 5: Schematic diagram of audio frame
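As a concrete sketch of this framing step (assuming a 25 ms window, a 10 ms shift, a sample_rate variable, and the emphasized_signal array from the pre-emphasis snippet above; all names here are illustrative):

frame_size, frame_stride = 0.025, 0.010      # 25 ms window, 10 ms frame shift (assumed values)
frame_length = int(round(frame_size * sample_rate))
frame_step = int(round(frame_stride * sample_rate))
signal_length = len(emphasized_signal)
num_frames = 1 + int(np.ceil(max(signal_length - frame_length, 0) / frame_step))

# Zero-pad so the last frame is complete, then gather the sample indices of every frame.
pad_length = (num_frames - 1) * frame_step + frame_length
padded = np.append(emphasized_signal, np.zeros(pad_length - signal_length))
indices = (np.tile(np.arange(frame_length), (num_frames, 1)) +
           np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T)
frames = padded[indices]                     # shape: (num_frames, frame_length)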

(3) Windowing

Before taking the Fourier transform, the extracted frame signal must be “windowed”, that is, multiplied by a “window function”, as shown in the figure below:

Figure 6: Audio windowing schematic

The purpose of windowing is to taper the amplitude of a frame gradually to 0 at both ends. This tapering benefits the Fourier transform: it makes the peaks on the spectrum narrower and less likely to smear together (the technical term is "mitigating spectral leakage"). The cost of windowing is that the two ends of a frame are attenuated and carry less weight than the central part. To make up for this, frames are not cut back to back but overlap one another; the time difference between the starting positions of two adjacent frames is called the frame shift.

Usually the Hamming window is used for windowing. Each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame. Suppose the signal after framing is S(n), n = 0, 1, ..., N-1, where N is the frame size; it is then multiplied by the Hamming window:

W(n, a) = (1 - a) - a \times \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \leq n \leq N - 1

Different values of a produce different Hamming windows; a is generally taken as 0.46.

S'(n) = S(n) \times W(n, a)

Implementation code:

N = 200
x = np.arange(N)
y = 0.54 * np.ones(N) - 0.46 * np.cos(2 * np.pi * x / (N - 1))
# Equivalent built-in helper:
# np.hamming(frame_length)
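Applied to the frame matrix from the framing sketch above, the windowing itself is a single element-wise multiplication (frames and frame_length are the illustrative names used earlier):

frames *= np.hamming(frame_length)   # taper both ends of every frame toward 0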

(4) Fast Fourier transform (FFT)

Because the characteristics of a signal are often hard to see from its time-domain form, it is usually converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech sounds. After multiplication by the Hamming window, each frame is therefore passed through a fast Fourier transform to obtain its energy distribution on the spectrum, i.e. the spectrum of each frame. The principle of the fast Fourier transform was illustrated above and is not repeated here; a detailed derivation and implementation can be found on Zhihu (zhuanlan.zhihu.com/p/31584464). Note: the FFT of the audio returns complex values, whose modulus gives the amplitude and whose argument gives the phase of each frequency component.

There are many libraries that contain FFT functions. Let’s list a few:

import librosa
import torch
import scipy

x_stft = librosa.stft(wav, n_fft=fft_size, hop_length=hop_size, win_length=win_length)   # STFT of a numpy waveform
x_stft = torch.stft(wav, n_fft=fft_size, hop_length=hop_size, win_length=win_size)       # here wav must be a torch.Tensor
x_stft = scipy.fftpack.fft(wav)                                                          # plain FFT of the whole signal, without framing

The FFT yields the amplitude and phase spectrum of the audio; taking the modulus squared of the spectrum gives the power spectrum of the speech signal, which in speech synthesis is often called the linear spectrum.
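As a sketch of that step, assuming frames is the windowed frame matrix from the earlier sketches and NFFT is the FFT size (512 is a common choice):

NFFT = 512
mag_frames = np.absolute(np.fft.rfft(frames, NFFT))   # amplitude spectrum, shape (num_frames, NFFT/2 + 1)
pow_frames = (1.0 / NFFT) * (mag_frames ** 2)         # power spectrum, the "linear spectrum"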

(5) Mel spectrum

The frequency range that the human ear can hear is 20-20000 Hz, but the ear does not perceive frequency in Hz on a linear scale. For example, if we are adapted to a tone of 1000 Hz and the pitch is raised to 2000 Hz, our ears perceive only a slight increase in pitch, not a doubling at all. Converting the ordinary frequency scale into the mel frequency scale therefore makes it more consistent with human auditory perception. The mapping is as follows:

mel(f) = 2595 \times \log_{10}\left(1 + \frac{f}{700}\right)

f = 700 \times \left(10^{\frac{m}{2595}} - 1\right)
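In code, the two mappings are direct transcriptions of the formulas (a small sketch; the function names are illustrative):

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700.0)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595.0) - 1)

print(hz_to_mel(1000))   # roughly 1000 mel, by construction of the scale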

In computing, the transformation from the linear frequency scale to the mel scale is usually implemented with band-pass filters, most commonly triangular band-pass filters. The triangular band-pass filters serve two main purposes: they smooth the spectrum and suppress the effect of harmonics, and they highlight the formants of the original speech. A schematic of the triangular band-pass filter bank construction is shown below:

Figure 7: Schematic diagram of triangular band-pass filter construction

This is a schematic of a filter bank built from unequal-height triangles: because humans perceive high-frequency energy only weakly, the energy preserved at low frequencies is noticeably larger than at high frequencies. The construction code for the triangular band-pass filter bank is as follows:

low_freq_mel = 0
high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))  # Convert Hz to Mel
mel_points = numpy.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced in Mel scale
hz_points = (700 * (10**(mel_points / 2595) - 1))  # Convert Mel to Hz
bin = numpy.floor((NFFT + 1) * hz_points / sample_rate)
fbank = numpy.zeros((nfilt, int(numpy.floor(NFFT / 2 + 1))))

for m in range(1, nfilt + 1):
    f_m_minus = int(bin[m - 1])   # left
    f_m = int(bin[m])             # center
    f_m_plus = int(bin[m + 1])    # right

    for k in range(f_m_minus, f_m):
        fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
    for k in range(f_m, f_m_plus):
        fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
filter_banks = numpy.dot(pow_frames, fbank.T)
filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)  # Numerical Stability
filter_banks = 20 * numpy.log10(filter_banks)  # dB

The mel spectrum is then obtained simply by multiplying the linear spectrum by the triangular band-pass filters and taking the logarithm. Generally speaking, this is where audio feature extraction ends for speech synthesis: the mel spectrum as an audio feature basically meets the needs of most speech synthesis tasks. In speech recognition, however, a discrete cosine transform (DCT) is also applied. Because adjacent mel filters overlap, the resulting filter-bank energies are correlated; the DCT removes this correlation, which can improve recognition accuracy. Speech synthesis needs to keep this correlation, so the DCT is performed only for recognition. An explanation of the DCT can be found on Zhihu (zhuanlan.zhihu.com/p/85299446).
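For completeness, a sketch of that recognition-side DCT step, turning the filter_banks output above into MFCCs (keeping coefficients 1-12 is a common convention, not a requirement):

from scipy.fftpack import dct

num_ceps = 12                                                        # assumed number of cepstral coefficients
mfcc = dct(filter_banks, type=2, axis=1, norm='ortho')[:, 1:num_ceps + 1]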
