
Author: Hou Yixin

Preface

To summarize the current state of speech recognition: DNN, RNN/LSTM, and CNN are the mainstream directions. In 2012, Microsoft researchers Deng Li and Yu Dong introduced the feed-forward deep neural network (FFDNN) into acoustic model modeling, using the output-layer probabilities of the FFDNN to replace the output probabilities previously computed by the GMM in the GMM-HMM system, which started the trend toward DNN-HMM hybrid systems. The LSTM (Long Short-Term Memory) network is one of the most widely used structures in speech recognition; it can model the long-term correlations in speech and improve recognition accuracy. Bidirectional LSTM networks achieve even better performance, but they also suffer from high training complexity and high decoding latency, which is a problem especially in industrial real-time recognition systems.

Looking back at the development of speech recognition over the past year, deep CNN is definitely a hot keyword, and many companies have invested heavily in this direction. In fact, CNN has been used in speech recognition for quite a while: Ossama Abdel-Hamid introduced CNN into speech recognition around 2012. At that time, convolution layers and pooling layers appeared alternately, the convolution kernels were relatively large, and the number of CNN layers was small; CNN was mainly used to preprocess features so that a DNN could classify them better. With the success of CNN in the image field, VGGNet, GoogLeNet, and ResNet have provided new ideas for CNN in speech recognition, such as placing a pooling layer after several convolution layers and shrinking the convolution kernels, which allows us to train deeper CNN models with better results.

1 Why CNN is used for speech recognition

In general, speech recognition is performed on the spectrogram obtained from time-frequency analysis of the speech signal, and the speech spectrogram has structural characteristics. To improve the recognition rate, we need to overcome the variability of speech signals, including speaker variability (both within and across speakers) and environment variability. A convolutional neural network provides convolution that is translation-invariant in time and space, so applying the idea of convolutional neural networks to the acoustic modeling of speech recognition lets us use this invariance to overcome the variability of speech signals. From this point of view, the spectrogram obtained from analyzing the whole speech signal can be treated as an image and recognized with the deep convolutional networks widely used on images.
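To make this concrete, here is a minimal NumPy sketch (not from the article; the 25 ms window and 10 ms shift are assumed typical values) that turns a waveform into exactly such a time-frequency "image" that a CNN can convolve over:

```python
import numpy as np

def log_spectrogram(wave, sr=16000, win_ms=25, hop_ms=10):
    """Turn a 1-D waveform into a 2-D time-frequency 'image' (frames x freq bins)."""
    win = int(sr * win_ms / 1000)          # e.g. 400 samples per 25 ms frame
    hop = int(sr * hop_ms / 1000)          # e.g. 160 samples per 10 ms shift
    n_frames = 1 + (len(wave) - win) // hop
    window = np.hanning(win)
    frames = np.stack([wave[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrum per frame
    return np.log(spec + 1e-8)                   # log compression, shape (time, freq)

# A 1-second dummy signal becomes a 98 x 201 "image" a CNN can convolve over.
image = log_spectrogram(np.random.randn(16000))
print(image.shape)
```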

From a practical point of view, CNN is also relatively easy to parallelize at large scale. Although the convolution operation involves many small-matrix operations that are individually slow, accelerated computation of CNN is by now quite mature. For example, Chellapilla et al. [4] proposed a technique that turns all of these small matrices into the product of one large matrix, and general frameworks such as TensorFlow and Caffe provide parallel acceleration for CNN, all of which makes trying CNN in speech recognition feasible.
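The trick of Chellapilla et al. [4] is essentially what is now called im2col: lay every receptive-field patch out as one row of a big matrix, so that all the small per-patch products collapse into a single large matrix multiply. A naive sketch for illustration (not the optimized implementation):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold every kh x kw patch of x into one row of a big matrix."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

# Many small per-patch products collapse into a single large matrix multiply:
x = np.random.randn(40, 100)             # e.g. a freq x time spectrogram
kernels = np.random.randn(9 * 9, 256)    # 256 filters of size 9x9, flattened
out = im2col(x, 9, 9) @ kernels          # (32*92, 256): all filter responses at once
print(out.shape)
```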

The following sections introduce the application of CNN in speech recognition, from “shallow” to “deep”.

2 CLDNN

When it comes to the application of CNN in speech recognition, one has to mention the CLDNN (Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks) [1]. CLDNN contains two CNN layers and can be regarded as representative of shallow CNN usage. CNN and LSTM each achieve better performance than DNN on speech recognition tasks. In terms of modeling ability, CNN is good at reducing frequency-domain variation, LSTM provides long-term memory and is therefore widely used along the time domain, and DNN is suitable for mapping features to a more separable space. CLDNN combines CNN, LSTM, and DNN into a single network that performs better than any of them alone.

The general structure of the CLDNN network is: the input layer carries time-domain context; several CNN layers follow to reduce frequency-domain variation; the CNN output is fed into several LSTM layers to reduce time-domain variation; and the output of the last LSTM layer is fed into fully connected DNN layers, which map the feature space to an output layer that is easier to classify. There had been earlier attempts to combine CNN, LSTM, and DNN, but generally the three networks were trained separately and merged through a fusion layer, whereas CLDNN trains the three networks jointly. Experiments show that LSTM performance improves when the LSTM is given better features. Inspired by this, the authors use CNN to reduce frequency-domain variation so that the LSTM input is better adapted, and add a DNN to increase the depth between the hidden layer and the output layer for stronger prediction ability.

2.1 CLDNN network structure


Fig 1. CLDNN Architecture

The network structure is shown in Fig. 1. Assume the center frame is x_t; the input feature sequence is then [x_{t-l}, ..., x_t, ..., x_{t+r}], and each feature vector is a 40-dimensional log-mel feature.

The CNN part consists of two CNN layers with 256 feature maps each. The first layer uses a 9×9 time-frequency filter and the second a 4×3 filter. Pooling uses a max-pooling strategy with pooling size 3 after the first layer; no pooling layer follows the second CNN layer.

Since the output of the last CNN layer is large (its size is feature_maps × time × frequency), a linear layer is inserted after the CNN and before the LSTM to reduce the dimensionality. Experiments show that this dimensionality reduction has little effect on accuracy; the linear layer output is 256-dimensional.

The CNN is followed by two LSTM layers; each LSTM layer uses 832 cells with a 512-dimensional projection layer for dimensionality reduction. The output state label is delayed by 5 frames, so that information about future frames helps the DNN better predict the current frame. Since the input features of the CNN are expanded by l frames to the left and r frames to the right, the authors set r to 0 to ensure that the LSTM never sees more than 5 future frames. Finally, after the frequency-domain and time-domain modeling, the LSTM output is passed to several fully connected DNN layers.
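Putting the pieces described above together, a minimal PyTorch sketch of such a CLDNN might look as follows. The paddings, the DNN hidden size (1024), and the output state count (20,000, borrowed from the experiments below) are assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class CLDNN(nn.Module):
    """Sketch of the CLDNN in [1]: CNN -> linear dim-reduction -> LSTM -> DNN."""
    def __init__(self, freq_bins=40, n_states=20000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=(9, 9)),    # layer 1: 256 maps, 9x9 time-freq filters
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 3)),         # max-pool by 3 along frequency only
            nn.Conv2d(256, 256, kernel_size=(4, 3)),  # layer 2: 4x3 filters, no pooling after
            nn.ReLU(),
        )
        conv_out = 256 * ((freq_bins - 8) // 3 - 2)   # channels x remaining freq bins
        self.reduce = nn.Linear(conv_out, 256)        # linear layer to cut CNN output dims
        self.lstm = nn.LSTM(input_size=256, hidden_size=832, proj_size=512,
                            num_layers=2, batch_first=True)
        self.dnn = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_states),                # per-frame state scores
        )

    def forward(self, x):                     # x: (batch, time, freq) log-mel frames
        h = self.conv(x.unsqueeze(1))         # -> (batch, 256, time', freq')
        h = h.permute(0, 2, 1, 3).flatten(2)  # -> (batch, time', 256 * freq')
        h = self.reduce(h)
        h, _ = self.lstm(h)
        return self.dnn(h)

logits = CLDNN()(torch.randn(2, 100, 40))
print(logits.shape)  # (2, 89, 20000)
```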

Borrowing from CNN applications in the image field, the authors also tried combining short-term and long-term features: the CNN input features are passed directly to the LSTM as part of its input, and the CNN output features are passed directly to the DNN as part of its input.

2.2 Experimental Results

For the CLDNN structure, we ran a series of experiments on our own Chinese data. The experimental data is 300 hours of noisy Chinese speech; all model input features are 40-dimensional Fbank features with a 10 ms frame shift. Models were trained with the cross-entropy (CE) criterion, and the network output corresponds to roughly 20,000 states. Since the CNN input requires setting l and r, r is set to 0, and l = 10 was found optimal by experiment; the results below all use l = 10 and r = 0 by default.
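To illustrate this l/r context expansion, the small NumPy sketch below splices l = 10 left frames and r = 0 right frames onto each 40-d Fbank vector; padding the edges by repeating the first/last frame is an assumption:

```python
import numpy as np

def splice(frames, l=10, r=0):
    """Stack each frame with l left and r right context frames (edges pad by repetition)."""
    T, _ = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], l, axis=0),
                             frames,
                             np.repeat(frames[-1:], r, axis=0)])
    return np.stack([padded[t:t + l + r + 1].ravel() for t in range(T)])

feats = np.random.randn(300, 40)   # 3 s of 40-d Fbank at a 10 ms frame shift
print(splice(feats).shape)         # (300, 440): 11 stacked frames per output vector
```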

The LSTM baseline has 3 layers with 1024 cells and a 512-dimensional projection. The CNN+LSTM and CNN+LSTM+DNN systems use slightly adjusted network parameters, as shown in Fig. 2. In addition, a set of experiments combined a two-layer CNN with a three-layer LSTM, but increasing the number of LSTM layers did not help much.


Fig 2. Experimental structures of CLDNN

method              WER (%)
LSTM                13.8
CNN + 2-layer LSTM  14.1
CNN + 3-layer LSTM  13.6
CNN + LSTM + DNN    13.0
LSTM + DNN          13.2

Table 1. Results on test set 1

method              WER (%)
LSTM                21.6
CNN + 2-layer LSTM  21.8
CNN + 3-layer LSTM  21.5
CNN + LSTM + DNN    20.6
LSTM + DNN          20.8

Table 2. Results on test set 2

3 Deep CNN

Over the past year there have been major breakthroughs in speech recognition: IBM, Microsoft, Baidu, and other institutions have successively launched their deep CNN models to improve recognition accuracy, and Residual/Highway networks have been proposed so that deeper neural networks can be trained. In applying deep CNN, there are roughly two strategies. One is an acoustic model based on a deep CNN structure within the HMM framework, where the CNN can be a VGG or residual CNN structure, or a CLDNN structure. The other is the end-to-end structure that has become very popular in the past two years, such as end-to-end modeling with CNN or CLDNN in the CTC framework, or the recently proposed coarse-grained modeling-unit techniques such as Low Frame Rate and the chain model.

At the input end, systems can be roughly divided into two kinds. The first is to input traditional signal-processing features, processed with different filter banks and then expanded with left/right context or frame skipping.


Fig 3. Multi-scale input features. Stack 31140

The second is to input the original spectrum directly, treating the spectrogram as an image.

Fig 4. Frequency bands input

3.1 Baidu Deep Speech

Baidu applied deep CNN to its speech recognition research, using VGGNet and deep CNN structures with residual connections, combined with LSTM and CTC end-to-end speech recognition technology, which reduced the recognition error rate by more than 10% relative (i.e., to 90% of the original error rate).

Baidu Speech has kept updating its model algorithms over the years, from DNN, to discriminative models, to the CTC model, and now to deep CNN; an LSTM-CTC based acoustic model has also been deployed in all voice-related products since the end of 2015. The milestones are: 1) the Mel sub-band CNN model in 2013; 2) sequence discriminative training in 2014; 3) LSTM-HMM based speech recognition in early 2015; 4) LSTM-CTC based end-to-end speech recognition at the end of 2015; 5) deep CNN in 2016. Baidu is also developing Deep Speech 3 based on deep CNN, which is said to be trained on big data, with tens of thousands of hours for hyperparameter tuning and possibly even 100,000 hours for building the production model.


Fig 5. Baidu speech recognition development

Baidu found that deep CNN structures significantly improve the performance not only of HMM-based speech recognition systems but also of CTC-based systems. End-to-end modeling with deep CNN alone performs relatively poorly, so combining CNN with recurrent hidden layers such as LSTM or GRU is a better choice. Performance can be improved by using small 3×3 kernels as in the VGG structure or by using residual connections. The number of layers and filters of the convolutional network significantly affects the modeling capacity of the whole model, and Baidu needs different deep CNN configurations to achieve optimal performance.
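As a generic illustration of the 3×3-kernel-plus-residual-connection pattern described here (a sketch, not Baidu's actual model), a residual convolution block in PyTorch might look like this:

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection, the pattern reported to help."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))   # residual: gradients flow through the skip

# Stacking such blocks over a spectrogram keeps the time/freq sizes intact:
x = torch.randn(2, 64, 100, 40)             # (batch, channels, time, freq)
print(ResidualConvBlock(64)(x).shape)       # unchanged: (2, 64, 100, 40)
```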

Baidu therefore concludes:

1) In the model structure, deep CNN gives the model good translation invariance in the time-frequency domain, making it more robust (noise-resistant).

2) On this basis, deep LSTM and CTC focus on sequence classification, integrating long-span information through the LSTM's recurrent connections.

3) In the deep CNN study, the convolution structure along the time axis and the number of filters play a very important role in the performance of speech recognition models trained on databases of different sizes.

4) Training an optimal model on tens of thousands of hours of speech requires a large amount of hyperparameter tuning, relying on a high-performance multi-machine, multi-GPU computing platform.

5) A deep-CNN-based end-to-end speech recognition engine also increases the computational complexity of the model to some extent; hardware developed by Baidu enables the model to serve its large base of speech recognition users.

3.2 IBM

In 2015, IBM Watson announced a major milestone in English conversational speech recognition: the system achieved an 8% word error rate (WER) on the Switchboard database, a very popular benchmark. In May 2016, the IBM Watson team announced that their system achieved a 6.9% word error rate on the same task. The decoder was HMM-based, and the language model was a neural network language model. The acoustic model combines three different models: a recurrent neural network with maxout activations, a very deep convolutional neural network with 3×3 kernels, and a bidirectional long short-term memory (BLSTM) network. Let's look at their internal structure in detail below.


Fig.6. IBM Deep CNN framework

The very deep convolutional neural network is inspired by the VGG network from the 2014 ImageNet competition. The central idea is to replace large convolution kernels with smaller 3×3 kernels: by stacking multiple convolution layers (with ReLU activations) before each pooling layer, the same receptive field is obtained with fewer parameters and more nonlinearity.
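A quick back-of-the-envelope check of that claim (the channel count C is a made-up example, not a value from IBM's system):

```python
# Two stacked 3x3 convolutions cover the same 5x5 receptive field as one 5x5 layer,
# with fewer parameters and an extra ReLU nonlinearity in between.
C = 256
params_5x5 = 5 * 5 * C * C          # one 5x5 layer: 1,638,400 weights
params_3x3 = 2 * (3 * 3 * C * C)    # two 3x3 layers: 1,179,648 weights
print(params_3x3 / params_5x5)      # 0.72 -- about 28% fewer parameters
```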

As shown in the figure above, the leftmost is the most classical convolutional neural network, which uses only two convolution layers with a pooling layer between them. Its convolution kernels are large, 9×9 and 4×3, and each convolution layer has many feature maps: 512.

The second, third, and fourth from the left are all deep convolutional network structures. Note that, unlike the classical network, the number of feature maps grows from 64 to 128 and then to 256, the pooling layers are placed before each increase in the number of feature maps, all convolution kernels use the smaller 3×3 size, and the pooling size grows from 2×1 to 2×2.

The rightmost 10-conv network has the same number of parameters as the leftmost classical network, but converges fully 5 times faster, although its computational complexity is higher.

3.3 Microsoft

On the industry-standard Switchboard speech recognition task, Microsoft researchers achieved in September 2016 the lowest word error rate (WER) in the industry at the time, 6.3%, by developing neural-network-based acoustic and language models, combining several acoustic models, and applying ResNet to speech recognition.

In October 2016, a team from Microsoft's AI & Research division reported that their speech recognition system achieved a 5.9% word error rate, equal to or even lower than that of professional stenographers. The 5.9% word error rate matches that of humans transcribing the same conversations, and it is the lowest ever recorded on the Switchboard speech recognition task. This milestone means that, for the first time, a computer can recognize the words in a conversation as well as humans do. The system makes systematic use of convolutional and LSTM neural networks, combined with a novel spatial smoothing method and lattice-free MMI acoustic training.

While the accuracy breakthrough has its numerical benchmark, Microsoft's research is more academic in nature: it was done on the standard Switchboard spoken-language database, which contains only about 2,000 hours of data.

3.4 Google

According to Mary Meeker's annual Internet Trends report, Google's machine-learning-based speech recognition system had achieved a 95% word accuracy rate for English as of March 2017, close to human speech recognition accuracy; in cumulative terms, Google's performance has improved by 20% since 2013.


Figure 7. Google speech recognition performance development

As can be seen from Google's papers at various conferences in recent years, Google has mainly explored deep CNN through a fusion of methods and models, such as Network-in-Network (NiN), Batch Normalization (BN), and Convolutional LSTM (ConvLSTM). Take, for example, the structure Google presented at ICASSP 2017:


Fig 8. The architecture in [5]: two convolutional layers at the bottom, followed by four residual blocks and an LSTM NiN block; each residual block contains one convolutional LSTM layer and one convolutional layer.

3.5 iFLYTEK DFCNN

In 2016, after proposing the feedforward sequential memory network (FSMN) framework, iFLYTEK went further and proposed the deep fully convolutional neural network (DFCNN) speech recognition framework, which uses a large number of convolution layers to model entire sentences of speech directly and better captures the long-term correlations of speech.

The DFCNN structure is shown in the figure below. Not only does it take the spectral signal as input, it goes a step further and treats speech directly as an image: each frame of speech is first Fourier-transformed, with time and frequency as the two dimensions of the image; then a large number of convolution and pooling layer combinations model the whole sentence, and the output units correspond directly to the final recognition result, such as syllables or Chinese characters.

Fig 9. DFCNN framework

First, at the input end, traditional speech features apply various filter banks after the Fourier transform to extract hand-designed features, which causes information loss in the frequency domain, especially in the high-frequency region; moreover, for computational reasons, traditional features must use a very large frame shift, which undoubtedly damages time-domain information, especially when the speaker talks fast. DFCNN therefore takes the spectrogram directly as input, giving it a natural advantage over speech recognition frameworks that use traditional speech features. Second, in model structure, DFCNN differs from the CNNs traditionally used in speech recognition: it borrows the best-performing network configurations from image recognition, each convolution layer uses small 3×3 kernels, and a pooling layer is added after several convolution layers, which greatly enhances the expressive power of the CNN. By stacking a large number of such convolution-pooling pairs, DFCNN can see very long history and future information, which ensures that it captures the long-term correlations of speech well and makes it more robust than RNN structures. Finally, at the output end, DFCNN combines neatly with the recently popular CTC scheme to train the whole model end-to-end, and structural features such as its pooling layers make this end-to-end training more stable.
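A hedged sketch of such a DFCNN-style model follows: a spectrogram image in, stacked 3×3-convolution-plus-pooling pairs, and per-frame outputs trained with CTC. The layer counts, channel sizes, 200 frequency bins, and 1,000 output units are all assumptions for illustration, not iFLYTEK's actual configuration:

```python
import torch
import torch.nn as nn

def conv_pool_block(cin, cout):
    """Two 3x3 convolutions followed by 2x2 max-pooling: the DFCNN building pattern."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                       # halves both time and frequency
    )

class DFCNN(nn.Module):
    def __init__(self, n_units=1000):          # output units: syllables/characters + blank
        super().__init__()
        self.blocks = nn.Sequential(
            conv_pool_block(1, 64),
            conv_pool_block(64, 128),
            conv_pool_block(128, 256),
        )
        self.out = nn.Linear(256 * (200 // 8), n_units)  # assumes 200 freq bins in

    def forward(self, spec):                   # spec: (batch, 1, time, freq) spectrogram
        h = self.blocks(spec)                  # -> (batch, 256, time/8, freq/8)
        h = h.permute(0, 2, 1, 3).flatten(2)   # -> (batch, time/8, 256 * freq/8)
        return self.out(h).log_softmax(-1)     # per-frame log-probs for CTC

model = DFCNN()
logp = model(torch.randn(2, 1, 400, 200))      # 4 s of 10 ms frames, 200 freq bins
# CTC aligns the downsampled frames to the label sequence end-to-end:
labels = torch.randint(1, 1000, (2, 20))
loss = nn.CTCLoss()(logp.transpose(0, 1), labels,
                    input_lengths=torch.full((2,), logp.size(1), dtype=torch.long),
                    target_lengths=torch.full((2,), 20, dtype=torch.long))
print(logp.shape, float(loss))
```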

4 Summary

Thanks to the translation invariance of CNN convolution in the frequency domain, and to the proposal of deep CNN architectures such as VGG and residual networks, CNN has seen a new wave of development and has become one of the hottest directions in speech recognition over the past two years. Deep CNN applications have grown from shallow networks of 2-3 layers to deep networks of more than 10 layers, and from the HMM-CNN framework to the end-to-end CTC framework; companies have achieved remarkable results with deep CNN.

To sum up, the development trend of CNN is as follows:

1) Deeper and more complex networks, with CNN generally used as the first few layers of the network, which can be understood as extracting features with CNN, followed by LSTM or DNN layers, combined with a variety of other mechanisms such as attention models and ResNet technology.

2) End-to-end recognition systems, using end-to-end technologies such as CTC and LFR.

3) Coarser-grained modeling units that tend to become larger and larger, from state to phone to character.

However, CNN also has limitations. Studies [2,3] show that convolutional neural networks help most on tasks where the training set or the data varies little; for most other tasks, the relative word error rate reduction is generally only in the 2%-3% range. In any case, as an important branch of speech recognition, CNN has great research value.

References:

[1] Sainath, T.N., Vinyals, O., Senior, A., Sak, H.: Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks. In: Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015)

[2] Sainath, T.N., Mohamed, A.-r., Kingsbury, B., Ramabhadran, B.: Deep Convolutional Neural Networks for LVCSR. In: Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614-8618 (2013)

[3] Deng, L., Abdel-Hamid, O., Yu, D.: A Deep Convolutional Neural Network Using Heterogeneous Pooling for Trading Acoustic Invariance with Phonetic Confusion. In: Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6669-6673 (2013)

[4] Chellapilla, K., Puri, S., Simard, P.: High Performance Convolutional Neural Networks for Document Processing. In: Tenth International Workshop on Frontiers in Handwriting Recognition (2006)

[5] Zhang, Y., Chan, W., Jaitly, N.: Very Deep Convolutional Networks for End-to-End Speech Recognition. In: Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017)



This article is published in the Tencent Cloud Technology Community with the author's authorization.