Most existing voice recognition is implemented on the server side, which brings two problems:

1) When the network is poor, recognition suffers large delays, resulting in a poor user experience.

2) When traffic is heavy, it consumes a lot of server resources.

To solve these two problems, we chose to implement the voice recognition function on the client. This paper uses a machine learning method to recognize the human voice. The framework used is Google's TensorFlow Lite, which is as small as its name suggests: while maintaining accuracy, the framework itself is only about 300 KB, and a model generated after compression is about one quarter the size of the corresponding TensorFlow model [1]. TensorFlow Lite is therefore well suited for use on the client side.

To improve the recognition rate of the human voice, audio features must be extracted as input samples for the machine learning framework. The feature extraction algorithm used in this paper is the Mel cepstrum algorithm (MFCC), which is based on the hearing mechanism of the human ear [2].

Since voice recognition on the client is time-consuming, a great deal of optimization was needed in the project, in the following areas:

  1. Instruction set acceleration: the ARM instruction set is introduced for multi-instruction optimization to accelerate computation.
  2. Multi-threaded acceleration: multi-threaded concurrent processing is used for time-consuming operations.
  3. Model acceleration: select a model that supports NEON optimization and preload the model to reduce preprocessing time.
  4. Algorithm acceleration: I) Reduce the audio sampling rate. II) Select the human voice frequency band (20 Hz–20 kHz) and eliminate the non-human-voice bands. III) Apply reasonable windowing and slicing to avoid redundant computation. IV) Use silence detection to discard unnecessary time segments.

1. Overview

1.1 Human voice recognition process

Human voice recognition is divided into two parts: training and prediction. Training refers to generating the prediction model, and prediction refers to using the model to produce recognition results.

Firstly, I will introduce the training process, which is divided into the following three parts:

  1. Sound features are extracted with the Mel cepstrum algorithm and converted into spectrogram images.
  2. The neural network model is trained using human-voice spectrograms as positive samples and non-human sounds, such as animal sounds and noise, as negative samples.
  3. From the files produced by training, a runnable prediction model is generated for the device.

In short, training for human voice recognition is divided into three parts: extracting voice features, training the model, and generating the on-device model. Prediction then works as follows: first extract the voice features, then load the trained model to obtain the prediction results.

1.2 Artificial intelligence framework

In November 2017, Google released TensorFlow Lite, announced earlier at Google I/O: a lightweight TensorFlow solution for mobile and embedded devices. TensorFlow itself can run on many platforms, from rack servers to small IoT devices, but the widespread use of machine learning models in recent years has created a need to deploy them on mobile and embedded devices. TensorFlow Lite enables low-latency inference of machine learning models on the device.

This paper is based on TensorFlow Lite, a machine learning system developed by Google. Its name comes from its operating principle: a tensor is an n-dimensional array, flow refers to computation based on dataflow graphs, and TensorFlow is the computation of tensors flowing from one end of a dataflow graph to the other. TensorFlow is a system that feeds complex data structures into an artificial neural network for analysis and processing.

The following figure shows the architecture of TensorFlow Lite [1]:

Figure 1.1 TensorFlow Lite architecture diagram

2. Mel cepstrum algorithm

2.1 Overview

The Mel cepstrum algorithm [2] used in this chapter for feature extraction is divided into the following steps, which are described in detail in subsequent sections; a minimal end-to-end sketch follows the list.

  1. Read the input sound file and parse it into raw sound data (a time-domain signal).
  2. The time-domain signal is windowed into frames and converted into a frequency-domain signal by the short-time Fourier transform (STFT).
  3. Through the Mel spectrum transformation, frequency is converted to a scale on which the human ear's perception is linear.
  4. Through Mel cepstrum analysis, the DC component and the sinusoidal components are separated by the DCT transform [3].
  5. The feature vectors of the sound spectrum are extracted and converted into images.
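
As a rough illustration of these five steps, the sketch below uses the librosa library (an assumption; any equivalent audio library would do); the file name and parameter values are illustrative placeholders.

```python
import numpy as np
import librosa

# 1. Parse the sound file into raw time-domain samples.
signal, sample_rate = librosa.load("voice.wav", sr=16000)

# 2-4. Windowed STFT, Mel filtering, log and DCT (cepstrum) in a single call.
mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13,
                            n_fft=400, hop_length=160)

# 5. Linearly map the coefficient matrix to [0, 255] so it can be stored as an image.
image = (255 * (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min())).astype(np.uint8)
```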

The STFT framing exploits the short-term stationarity of speech in the time domain; the Mel spectrum transform turns the human ear's frequency perception into a linear relationship; and cepstrum analysis builds on the Fourier transform, by which any signal can be decomposed into a DC component plus the sum of several sinusoidal components.

Figure 2.1 Time domain signal of sound

Figure 2.1 shows the time-domain signal of a sound, from which it is difficult to see the pattern of frequency change intuitively. Figure 2.2 shows the frequency-domain signal, which reflects the volume and frequency of the sound. Figure 2.3 shows the sound features obtained through the Mel cepstrum, which capture the characteristics of the sound.

Figure 2.2 Frequency domain signal of sound

Figure 2.3 Cepstrum characteristics of sound

Figure 2.4 Implementation process of the Mel cepstrum algorithm

2.2 STFT

A sound signal is a one-dimensional time-domain signal, and it is hard to see the pattern of frequency change from it directly. Transforming it into the frequency domain with the Fourier transform reveals the frequency distribution of the signal, but the time-domain information is lost: one cannot see how the frequency distribution changes over time. Many time-frequency analysis methods arose to solve this problem; the short-time Fourier transform, the wavelet transform, and the Wigner distribution are among the commonly used ones.

Figure 2.5 Schematic diagram of the FFT and STFT transforms

The Fourier transform gives the spectrum of a signal. The spectrum is widely used; signal compression and noise reduction, for example, can be based on it. The Fourier transform, however, assumes that the signal is stationary, that is, that its statistical properties do not change with time. A sound signal is not stationary: many components appear for a while and then quickly disappear. Taking the Fourier transform of the whole signal therefore does not reflect how the sound changes over time.

The short-time Fourier transform (STFT) used in this paper is the most classical time-frequency analysis method. The STFT is a mathematical transform related to the Fourier transform (FT) and is used to determine the frequency and phase of the local sinusoidal components of a time-varying signal. The idea is to select a time-frequency localized window function h(t), assume that the signal is stationary within the short analysis window, so that f(t)h(t) is stationary over each finite time width, and then compute the power spectrum at each time. The STFT uses a fixed window function; commonly used windows include the Hanning window, the Hamming window, and the Blackman-Harris window. This paper uses the Hamming window, a kind of cosine window, which reflects well how energy decays with time around a given moment.

Therefore, the STFT used in this paper adds a window function to the original Fourier transform

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-j\omega t}\,dt$$

so that the STFT becomes

$$\mathrm{STFT}(t,\omega) = \int_{-\infty}^{\infty} f(\tau)\,h(\tau - t)\,e^{-j\omega\tau}\,d\tau$$

where $h(\tau - t)$ is the Hamming window function

$$h(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right),\quad 0 \le n \le N-1.$$
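
A minimal sketch of the windowed framing and FFT, assuming NumPy; the frame and hop sizes (400 and 160 samples) are illustrative.

```python
import numpy as np

def stft_hamming(signal, frame_len=400, hop=160):
    """Short-time Fourier transform with a Hamming window (sketch)."""
    window = np.hamming(frame_len)                        # h(n), the Hamming window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window  # window one frame of f(t)
        frames.append(np.fft.rfft(frame))                 # spectrum of the windowed frame
    return np.array(frames)                               # shape: (num_frames, frame_len // 2 + 1)
```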

Figure 2.6 STFT transform based on Hamming window

2.3 Mel spectrum

The spectrogram is usually a very large matrix, so to obtain sound features of an appropriate size it is often transformed into the Mel spectrum through a Mel-scale filter bank. What is a Mel filter bank? It starts with the Mel scale.

The Mel scale was named by Stevens, Volkmann, and Newman in 1937. The unit of frequency is the Hertz (Hz), and the range of frequencies the human ear can hear is 20–20,000 Hz, but the ear's perception of the Hz scale is not linear. For example, if we are listening to a 1000 Hz tone and the frequency is raised to 2000 Hz, our ears perceive only a slight increase in pitch, not a doubling. Converting the ordinary frequency scale f (in Hz) to the Mel frequency scale m uses the following mapping:

$$m = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$

After this transformation, the human ear's perception of frequency becomes a linear relationship [4]. That is, on the Mel scale, if the Mel frequencies of two sounds differ by a factor of two, the pitches perceived by the human ear also differ by about a factor of two.
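
The mapping can be expressed in a few lines; the function name below is illustrative.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map an ordinary frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000))   # ~1000 Mel
print(hz_to_mel(2000))   # ~1521 Mel: doubling the Hz does not double the perceived pitch
```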

Consider the mapping from Hz to Mel frequency. Because the mapping is logarithmic, the Mel frequency changes rapidly with Hz at low frequencies, while at high frequencies it rises very slowly and the slope of the curve is small. This shows that the human ear is sensitive to low-frequency tones and insensitive to high-frequency tones, an observation that inspires the Mel-scale filter bank.

Figure 2.7 Schematic diagram of the mapping from frequency to Mel frequency

As shown in Figure 2.8, 12 triangular filters form a filter bank: the filters are dense with high peaks at low frequencies and sparse with low peaks at high frequencies, corresponding to the fact that the higher the frequency, the less sensitive the ear. This filter form is called the equal-area Mel filter bank and is widely used in the human voice field (speech recognition, speaker recognition) and elsewhere; a small sketch of constructing such a filter bank follows the figure.

Figure 2.8 Schematic diagram of the Mel filter bank
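
A small sketch of constructing such a triangular filter bank, assuming the librosa library; the parameter values are illustrative.

```python
import librosa

# 12 triangular filters over a 16 kHz signal analysed with 400-point frames;
# the filters are narrow and dense at low frequencies, wide and sparse at high ones.
mel_fb = librosa.filters.mel(sr=16000, n_fft=400, n_mels=12)
print(mel_fb.shape)   # (12, 201): one row per filter, one column per frequency bin
```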

2.4 Mel cepstrum

Based on the Mel log spectrum of Section 2.3, the DC component and the sinusoidal components are separated by the DCT transform, and the final result is called the Mel cepstrum. The transform has the standard form

$$c_n = \sum_{m=1}^{M}\log(s_m)\,\cos\!\left[\frac{\pi n\,(m-0.5)}{M}\right],\quad n = 1,2,\dots,L,$$

where $s_m$ is the output of the m-th Mel filter, $M$ is the number of filters, and $L$ is the number of cepstral coefficients retained.

Since the Mel cepstrum output is a vector of coefficients, it cannot be displayed directly as a picture and must be converted into an image matrix: the value range of the output vector is linearly mapped to the value range of the image.
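
A minimal sketch of this step, assuming SciPy's DCT; a random matrix stands in for the log Mel spectrum of Section 2.3.

```python
import numpy as np
from scipy.fftpack import dct

# Placeholder standing in for the log Mel spectrum (n_mels x n_frames).
log_mel_spectrum = np.log(np.random.rand(12, 100) + 1e-6)

# DCT along the filter axis separates the DC and sinusoidal components.
cepstrum = dct(log_mel_spectrum, type=2, axis=0, norm='ortho')

# Linearly map the coefficients to the image value range so they can be drawn.
pixels = (255 * (cepstrum - cepstrum.min()) /
          (cepstrum.max() - cepstrum.min())).astype(np.uint8)
```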

Figure 2.9 Schematic diagram of the drawing color scale

2.5 Algorithm processing speed optimization

Since the algorithm must run on the client, its speed needs to be improved [5]. The optimizations are as follows:

1) Instruction set acceleration: since the algorithm performs a large number of matrix additions and multiplications, the ARM instruction set is introduced for multi-instruction optimization to accelerate these operations. The speed can be increased by a factor of 4 to 8 [6].

2) Algorithm acceleration: I) Select the vocal frequency band (20 Hz–20 kHz) and eliminate the non-vocal bands to reduce redundant computation.

II) Reduce the audio sampling rate. Since the human ear is not sensitive to very high frequencies, lowering the sampling rate reduces unnecessary computation.

III) Apply reasonable windowing and slicing to avoid redundant computation.

IV) Use silence detection to discard unnecessary time segments.

3) Sampling frequency acceleration: if the sampling frequency of the audio is too high, down-sample it; the maximum processing frequency is set to 32 kHz.

4) Multithreading acceleration: split the audio into multiple segments and process them in parallel with multiple threads. The number of threads is allocated according to the capability of the machine; the default is 4 threads.
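
A minimal sketch of the multithreaded scheme, assuming Python's standard thread pool; the feature-extraction function is a placeholder.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def extract_features(segment):
    """Placeholder for the Mel-cepstrum feature extraction of one segment."""
    return segment.mean()

def parallel_features(signal, num_threads=4):
    """Split the audio into segments and process them concurrently (default 4 threads)."""
    segments = np.array_split(signal, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(extract_features, segments))
```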

Figure 2.10 Parameters selected in the engineering implementation of the algorithm

3. Human voice recognition model

3.1 Model selection

A convolutional neural network (CNN) is a feedforward neural network whose artificial neurons respond to surrounding units within their coverage area; CNNs perform excellently in large-scale image processing.

The convolutional neural network was inspired by the work of Hubel and Wiesel in the 1960s: studying neurons responsible for local sensitivity and orientation selection in the cat visual cortex, they found a unique network structure that could effectively reduce the complexity of a feedback neural network. CNNs have since become a research hotspot in many scientific fields, especially pattern classification, and are widely used because the network avoids complex image preprocessing and can take raw images directly as input. The neocognitron proposed by K. Fukushima in 1980 was the first implemented network of this kind, and many researchers later improved it; a representative result is the "improved neocognitron" proposed by Alexander and Taylor, which integrates the advantages of various improvements and avoids time-consuming error back-propagation.

Generally, the basic structure of a CNN consists of two kinds of layers. One is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, from which local features are extracted; once a local feature is extracted, its positional relationship to other features is also determined. The other is the feature mapping layer: each computational layer of the network is composed of multiple feature maps, and each feature map is a plane on which all neurons share the same weights. In the feature mapping structure, activation functions with small influence kernels, such as sigmoid and ReLU, are used so that the feature maps are shift-invariant. In addition, because neurons on the same mapping plane share weights, the number of free parameters in the network is reduced. Each convolutional layer in a CNN is followed by a computational layer for local averaging and secondary extraction; this two-stage feature extraction structure reduces the feature resolution.
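
A minimal sketch of this two-part structure using the Keras API (an assumption; the layer sizes are illustrative only).

```python
import tensorflow as tf

# A convolutional feature-extraction layer followed by a pooling (local averaging /
# sub-sampling) layer, then a classifier for the two classes used in this paper.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(299, 299, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation='softmax'),   # human voice vs. non-human sound
])
```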

CNNs are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Because the feature detection layer of a CNN learns from training data, explicit feature extraction is avoided; the network learns implicitly from the training data. Moreover, since neurons on the same feature map share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks in which neurons are fully connected. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing: its layout is closer to a real biological neural network, weight sharing reduces the complexity of the network, and, in particular, images with multi-dimensional input vectors can be fed directly into the network, avoiding the complexity of data reconstruction during feature extraction and classification.

Figure 3.1 Inception-V3 model

The most important improvement in Inception-V3 is factorization: the 7×7 convolution is decomposed into two one-dimensional convolutions (1×7 and 7×1), and likewise 3×3 into 1×3 and 3×1. This not only accelerates computation but also further increases the network depth and its non-linearity. In addition, the network input was changed from 224×224 to 299×299, and the 35×35, 17×17, and 8×8 modules were designed more carefully. A sketch of the factorization idea is shown below.
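
A minimal sketch of the factorization idea, assuming the Keras API; the channel counts and input size are illustrative.

```python
import tensorflow as tf

# A 7x7 convolution replaced by a 1x7 convolution followed by a 7x1 convolution,
# which reduces computation and adds an extra non-linearity between the two layers.
inputs = tf.keras.Input(shape=(35, 35, 64))
x = tf.keras.layers.Conv2D(64, (1, 7), padding='same', activation='relu')(inputs)
x = tf.keras.layers.Conv2D(64, (7, 1), padding='same', activation='relu')(x)
factorized_block = tf.keras.Model(inputs, x)
```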

At the code level, the TensorFlow Session module can be used to implement the training and prediction functions. See the TensorFlow official website for specific usage [10].

Figure 3.2 Usage of the TensorFlow Session
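
For reference, a hedged sketch of loading a frozen PB graph and running prediction with a TensorFlow 1.x Session; the file name and tensor names are assumptions.

```python
import numpy as np
import tensorflow as tf

# Load a frozen PB graph and run prediction (TensorFlow 1.x style).
with tf.gfile.GFile("voice_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    inp = graph.get_tensor_by_name("input:0")              # assumed input tensor name
    out = graph.get_tensor_by_name("final_result:0")       # assumed output tensor name
    batch = np.zeros((1, 299, 299, 3), dtype=np.float32)   # placeholder spectrogram image
    with tf.Session(graph=graph) as sess:
        scores = sess.run(out, feed_dict={inp: batch})
```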

3.2 Model Samples

In supervised machine learning, samples are generally divided into three independent sets: the training set, the validation set, and the test set. The training set is used to fit the model, the validation set is used to determine the network structure or the parameters controlling the model's complexity, and the test set is used to verify the performance of the final model.

Specific definitions are as follows:

Training set: the sample data used for learning; a classifier is built by fitting its parameters. It is mainly used to train the model.

Validation set: used to tune the parameters of the learned classifier, such as the number of hidden units in the neural network, and to determine the network structure or the parameters that control model complexity, in order to prevent overfitting.

Test set: mainly used to test the discriminative ability of the trained model (recognition rate, etc.).

Using the Mel cepstrum algorithm from Chapter 2, sample files for sound recognition can be obtained. Human-voice spectrograms are taken as positive samples and non-human sounds, such as animal sounds and noise, as negative samples, and the Inception-V3 model is trained on them.

In this paper, TensorFlow is adopted as the training framework; 5000 human-voice and non-human-voice samples are selected as the test set and 1000 samples as the validation set.

3.3 Model training

After sample preparation is complete, the Inception-V3 model can be trained. When the training converges, a PB model can be generated for use on the device. When building for armeabi-v7a or later, NEON optimization is enabled by default, that is, the USE_NEON macro is enabled, which provides instruction set acceleration. For example, more than half of the operations in a CNN are convolution (CONV) operations, and instruction set optimization can accelerate them by at least a factor of 4.

Figure 3.3 Convolution processing function

The Lite model is then generated using the TOCO tool provided by TensorFlow; this model can be invoked directly on the client side through the TensorFlow Lite framework.

Figure 3.4 TOCO tool invocation interface
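
In newer TensorFlow 1.x releases the TOCO converter is also exposed through the Python API; a hedged sketch, with assumed file and tensor names.

```python
import tensorflow as tf

# Convert the frozen PB model produced by training into a .tflite model.
converter = tf.lite.TFLiteConverter.from_frozen_graph(
    "voice_model.pb",                  # assumed file name
    input_arrays=["input"],            # assumed input tensor name
    output_arrays=["final_result"])    # assumed output tensor name
tflite_model = converter.convert()
with open("voice_model.tflite", "wb") as f:
    f.write(tflite_model)
```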

3.4 Model prediction

The Mel cepstrum algorithm is used to extract features from the sound file and generate the image to be predicted. The Lite model generated by training can then be used for prediction; the prediction results are shown below, and a sketch of invoking the model follows the figure.

Figure 3.5 Prediction results of the model
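
A hedged sketch of invoking the generated Lite model with the TensorFlow Lite Python interpreter; the file name, input contents, and label order are assumptions.

```python
import numpy as np
import tensorflow as tf

# Run prediction on a spectrogram image with the generated Lite model.
interpreter = tf.lite.Interpreter(model_path="voice_model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

image = np.zeros(input_details[0]['shape'], dtype=np.float32)  # placeholder input
interpreter.set_tensor(input_details[0]['index'], image)
interpreter.invoke()
scores = interpreter.get_tensor(output_details[0]['index'])    # e.g. [p(human), p(non-human)]
```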

References:

[1] www.tensorflow.org/mobile/tfli…
[2] Liu Liyan. Research on speaker recognition based on MFCC and IMFCC [D].
[3] Yu Ming, Yuan Yuqian, Dong Hao, Wang Zhe. A novel approach to speaker recognition based on MFCC and LPCC [J]. 2006(04).
[4] Kumar Pawan, Jakhanwal Nitika, Chandra Mahesh. Study on Noise Dependent Speaker Identification in Noisy Environment [C]. International Conference on Devices and Communications, 2011.
[5] github.com/weedwind/MF…
[6] baike.baidu.com/item/ARM instruction set…
[7] www.tensorflow.org/api_docs/py…
