At the Wuhan Summit of the Cloud Computing Conference held recently, an "AI cashier" equipped with the DFSMN speech recognition model accurately recognized customers' spoken orders in a noisy environment, taking orders for 34 cups of coffee in just 49 seconds in a contest with a human shop assistant. Ticket vending machines equipped with the same speech recognition technology have also been installed in the Shanghai subway.

Xie Lei, a well-known speech recognition expert and professor at Northwestern Polytechnical University, said: "The DFSMN model open-sourced by Alibaba is a breakthrough in the steady improvement of speech recognition accuracy. It is one of the most representative achievements of deep learning in the field of speech recognition in recent years, and it will have a great impact on the global academic community and on AI technology applications."



Figure: Alibaba has open-sourced its DFSMN speech recognition model on GitHub

Speech recognition has long been an important part of human-computer interaction. With speech recognition, machines can understand speech much as humans do, and then reason about it and respond.

In recent years, with the application of deep learning, the performance of speech recognition systems based on deep neural networks has improved dramatically, making them practical for real use. Applications built on speech recognition, such as voice input, transcription, voice search and speech translation, have become widespread.

At present, mainstream speech recognition systems generally adopt acoustic models based on deep neural networks and hidden Markov models (DNN-HMM); the model structure is shown in Figure 1. The input to the acoustic model consists of spectral features, such as PLP, MFCC and filter-bank (FBK) features, extracted from the speech waveform after windowing and framing. The output uses acoustic modeling units of different granularities, such as monophones, monophone states and triphone states. From input to output, different neural network structures can be used to map the input acoustic features to posterior probabilities over the output modeling units, which are then combined with HMM decoding to obtain the final recognition result.
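To make the last step concrete, here is a minimal sketch (my own illustration, not Alibaba's implementation) of the standard scoring trick in hybrid DNN-HMM systems: the network's per-frame state posteriors are converted into scaled likelihoods for the HMM decoder by dividing out the state priors.

```python
import numpy as np

def acoustic_scores(posteriors: np.ndarray, state_priors: np.ndarray) -> np.ndarray:
    """Convert per-frame DNN posteriors P(state | x_t) into scaled
    log-likelihoods log P(x_t | state) for the HMM decoder, using
    Bayes' rule: P(x|s) is proportional to P(s|x) / P(s)."""
    eps = 1e-10  # numerical floor to avoid log(0)
    return np.log(posteriors + eps) - np.log(state_priors + eps)

# posteriors:   (num_frames, num_states) softmax outputs of the acoustic model
# state_priors: (num_states,) relative frequency of each state in the
#               training alignments (a hypothetical precomputed array)
```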

The feedforward fully connected neural network (FNN) was the earliest network structure adopted. An FNN realizes a one-to-one mapping from a fixed input to a fixed output, but its defect is that it cannot effectively exploit the long-term dependencies inherent in speech signals. An improved approach is to employ recurrent neural networks (RNNs) based on long short-term memory (LSTM). An LSTM-RNN stores historical information in its hidden-layer nodes through recurrent feedback connections, so that the long-term dependencies of the speech signal can be exploited effectively.



Figure 1. Block diagram of a speech recognition system based on DNN-HMM

Furthermore, bidirectional recurrent neural networks (bidirectional RNNs) can exploit both the history and the future of the speech signal, which further benefits acoustic modeling. Compared with feedforward fully connected networks, acoustic models based on recurrent networks achieve significant performance improvements. However, recurrent networks are more complex and usually contain more parameters, so training and inference require more computing resources.

In addition, acoustic models based on bidirectional recurrent networks have a large latency, which makes them unsuitable for real-time speech recognition tasks. Improved models address this, such as the latency-controlled LSTM (LCBLSTM) [1-2] and the feedforward sequential memory network (FSMN) [3-5]. Last year we launched the industry's first LCBLSTM-based acoustic model for speech recognition. Built on Alibaba's large-scale computing platform and big data, and trained with multi-machine, multi-GPU parallelism, 16-bit quantization and other optimizations, it reduced the relative recognition error rate by 17-24% compared with the FNN model.

FSMN is a recently proposed network structure. By adding learnable memory blocks to the hidden layers of an FNN, it can effectively model the long-term dependencies in speech. Compared with LCBLSTM, FSMN not only allows latency to be controlled more conveniently, but also achieves better performance while requiring fewer computing resources. However, a standard FSMN is difficult to train with very deep structures: gradients vanish and training results suffer. Deep models have been shown to have stronger modeling power in many fields. We therefore propose an improved FSMN model, called Deep FSMN (DFSMN), and combine it with low frame rate (LFR) technology to build an efficient real-time acoustic model for speech recognition. Compared with the LCBLSTM acoustic model we launched last year, it improves relative performance by more than 20% and speeds up training and decoding by 2-3x, significantly reducing the computing resources our system needs in production.



Figure 2. FSMN model structure and comparison with RNN

The originally proposed FSMN structure [3] is shown in Figure 2(a). It is essentially a feedforward fully connected neural network that models surrounding context by attaching memory blocks to the hidden layers, enabling the model to capture the long-term dependencies of sequential signals. The memory block uses the tapped-delay structure shown in Figure 2(b): the hidden-layer outputs at the current moment and the previous N moments are encoded by a set of coefficients into a fixed-size representation. FSMN is inspired by filter design theory in digital signal processing: any infinite impulse response (IIR) filter can be approximated by a high-order finite impulse response (FIR) filter. From this perspective, the recurrent layer of an RNN, shown in Figure 2(c), can be viewed as the first-order IIR filter in Figure 2(d), while the FSMN memory block in Figure 2(b) can be viewed as a high-order FIR filter. FSMN can therefore model the long-term dependencies of signals as effectively as an RNN; and because FIR filters are more stable than IIR filters, FSMN training is simpler and more stable than RNN training.
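A toy numeric illustration of this FIR-approximates-IIR idea (my own example, not from the paper): a first-order IIR recursion h_t = a*h_{t-1} + x_t has impulse response (1, a, a^2, ...), so truncating it at order N gives an FIR filter that matches it closely when |a| < 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)   # an arbitrary input signal
a, N = 0.6, 10                 # IIR coefficient and FIR truncation order

# First-order IIR recursion (the "RNN-like" loop), zero initial state
h_iir = np.zeros_like(x)
for t in range(len(x)):
    h_iir[t] = a * (h_iir[t - 1] if t > 0 else 0.0) + x[t]

# Order-N FIR approximation (the "FSMN-like" tapped delay line)
taps = a ** np.arange(N + 1)              # (1, a, a^2, ..., a^N)
h_fir = np.convolve(x, taps)[: len(x)]

# The gap comes only from the truncated tail, which is of order a**(N+1)
print(np.max(np.abs(h_iir - h_fir)))      # small
```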

Depending on how the memory block's encoding coefficients are chosen, FSMN comes in two variants: 1) scalar FSMN (sFSMN); 2) vector FSMN (vFSMN). As their names suggest, sFSMN uses scalars and vFSMN uses vectors as the memory block's encoding coefficients. Following [3], the sFSMN and vFSMN memory blocks are expressed as:

\[
\tilde{\mathbf{h}}_t^{\ell} = \sum_{i=0}^{N} a_i^{\ell}\, \mathbf{h}_{t-i}^{\ell} \quad \text{(sFSMN)}
\]
\[
\tilde{\mathbf{h}}_t^{\ell} = \sum_{i=0}^{N} \mathbf{a}_i^{\ell} \odot \mathbf{h}_{t-i}^{\ell} \quad \text{(vFSMN)}
\]

where \(\mathbf{h}_t^{\ell}\) is the output of the \(\ell\)-th hidden layer at time \(t\), \(N\) is the memory order, and \(\odot\) denotes element-wise multiplication.
The FSMN above considers only the influence of historical information on the current moment, and is called a unidirectional FSMN. When the influence of future information is also taken into account, the unidirectional FSMN can be extended to a bidirectional FSMN. The memory blocks of the bidirectional sFSMN and vFSMN are encoded as follows:

\[
\tilde{\mathbf{h}}_t^{\ell} = \sum_{i=0}^{N_1} a_i^{\ell}\, \mathbf{h}_{t-i}^{\ell} + \sum_{j=1}^{N_2} c_j^{\ell}\, \mathbf{h}_{t+j}^{\ell} \quad \text{(bidirectional sFSMN)}
\]
\[
\tilde{\mathbf{h}}_t^{\ell} = \sum_{i=0}^{N_1} \mathbf{a}_i^{\ell} \odot \mathbf{h}_{t-i}^{\ell} + \sum_{j=1}^{N_2} \mathbf{c}_j^{\ell} \odot \mathbf{h}_{t+j}^{\ell} \quad \text{(bidirectional vFSMN)}
\]

where \(N_1\) is the lookback order (how far the memory block looks into the past) and \(N_2\) is the lookahead order (how far it looks into the future).
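A minimal sketch of the bidirectional vFSMN memory block defined above (my own illustration, not the released code): a per-dimension weighted sum over the current frame, N1 past frames and N2 future frames, with zero-padding at the utterance edges.

```python
import numpy as np

def vfsmn_memory(h: np.ndarray, a: np.ndarray, c: np.ndarray) -> np.ndarray:
    """h: (T, D) hidden-layer outputs; a: (N1+1, D) lookback coefficients,
    with a[0] weighting the current frame; c: (N2, D) lookahead coefficients.
    Returns the (T, D) memory-block outputs."""
    T, _ = h.shape
    n1, n2 = a.shape[0] - 1, c.shape[0]
    out = np.zeros_like(h)
    for t in range(T):
        for i in range(n1 + 1):            # current + past frames
            if t - i >= 0:
                out[t] += a[i] * h[t - i]
        for j in range(1, n2 + 1):         # future frames
            if t + j < T:
                out[t] += c[j - 1] * h[t + j]
    return out
```

The double loop is written for clarity; in practice this is a depthwise 1-D convolution over time and can be computed as such.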

Figure 3. Block diagram of the cFSMN structure

Compared with an FNN, an FSMN feeds the output of the memory block to the next hidden layer as an additional input, which introduces extra model parameters; the more nodes the hidden layer contains, the more parameters are introduced. Combining this with the idea of low-rank matrix factorization, the study [4] proposed an improved FSMN structure called the compact FSMN (cFSMN). Figure 3 is a block diagram of a cFSMN whose hidden layer is paired with a memory block.

In a cFSMN, a low-dimensional linear projection layer is added after each hidden layer of the network, and the memory blocks are attached to these projection layers. cFSMN also modifies the memory block's encoding formula: the output of the current moment is added explicitly into the memory block's expression, so that only the memory block's output is passed to the next layer. This effectively reduces the number of model parameters and speeds up training. Following [4], the unidirectional and bidirectional cFSMN memory blocks are expressed as:

\[
\tilde{\mathbf{p}}_t^{\ell} = \mathbf{p}_t^{\ell} + \sum_{i=1}^{N} \mathbf{a}_i^{\ell} \odot \mathbf{p}_{t-i}^{\ell} \quad \text{(unidirectional)}
\]
\[
\tilde{\mathbf{p}}_t^{\ell} = \mathbf{p}_t^{\ell} + \sum_{i=1}^{N_1} \mathbf{a}_i^{\ell} \odot \mathbf{p}_{t-i}^{\ell} + \sum_{j=1}^{N_2} \mathbf{c}_j^{\ell} \odot \mathbf{p}_{t+j}^{\ell} \quad \text{(bidirectional)}
\]

where \(\mathbf{p}_t^{\ell}\) is the output of the \(\ell\)-th linear projection layer at time \(t\).
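An illustrative cFSMN layer under the description above (my sketch, not the released implementation; the dimensions are hypothetical): the hidden layer is factored into a large nonlinear layer followed by a small linear projection, and the memory block sits on the projection.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_hidden, D_proj = 512, 2048, 320

W1 = rng.standard_normal((D_in, D_hidden)) * 0.01    # input -> hidden
W2 = rng.standard_normal((D_hidden, D_proj)) * 0.01  # hidden -> low-rank projection

def cfsmn_layer(x: np.ndarray, a: np.ndarray, c: np.ndarray) -> np.ndarray:
    """x: (T, D_in) layer input; a: (N1, D_proj) lookback coefficients;
    c: (N2, D_proj) lookahead coefficients. Returns (T, D_proj): only the
    memory-block output is fed to the next layer."""
    h = np.maximum(x @ W1, 0.0)        # nonlinear hidden layer (ReLU)
    p = h @ W2                         # low-dimensional linear projection
    T = p.shape[0]
    out = p.copy()                     # current frame added explicitly
    for t in range(T):
        for i in range(1, a.shape[0] + 1):
            if t - i >= 0:
                out[t] += a[i - 1] * p[t - i]
        for j in range(1, c.shape[0] + 1):
            if t + j < T:
                out[t] += c[j - 1] * p[t + j]
    return out
```

The low-rank split (2048 -> 320 here) is what shrinks the parameter count: the memory taps and the next layer's weights both operate on the small projection instead of the full hidden layer.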

Figure 4 shows the network structure of our proposed Deep-FSMN (DFSMN), in which the first box on the left is the input layer and the last box on the right is the output layer. By adding skip connections between the cFSMN memory blocks (the red boxes), the output of a lower memory block is added directly into the higher memory block. During training, the gradients of the higher memory block are then passed directly to the lower one, which overcomes the vanishing-gradient problem caused by network depth and makes it possible to train deep networks stably. We also modified the expression of the memory block: borrowing the idea of dilated convolution [6], we introduced stride factors into the memory block. The memory block is computed as:

\[
\tilde{\mathbf{p}}_t^{\ell} = \mathcal{H}\big(\tilde{\mathbf{p}}_t^{\ell-1}\big) + \mathbf{p}_t^{\ell} + \sum_{i=0}^{N_1^{\ell}} \mathbf{a}_i^{\ell} \odot \mathbf{p}_{t-s_1 i}^{\ell} + \sum_{j=1}^{N_2^{\ell}} \mathbf{c}_j^{\ell} \odot \mathbf{p}_{t+s_2 j}^{\ell}
\]

where \(\mathcal{H}(\cdot)\) denotes the skip connection from the previous memory block (an identity mapping in our experiments), \(N_1^{\ell}\) and \(N_2^{\ell}\) are the lookback and lookahead orders of the \(\ell\)-th memory block, and \(s_1\) and \(s_2\) are the stride factors applied to the history and the future, respectively.
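A sketch of this DFSMN memory block with an identity skip connection and stride factors (an illustration of the formula above, not the released code):

```python
import numpy as np

def dfsmn_memory(p, prev_mem, a, c, s1=2, s2=2):
    """p: (T, D) projection outputs of this layer; prev_mem: (T, D) memory
    output of the previous layer (identity skip connection); a: (N1+1, D)
    lookback taps with a[0] weighting the current frame; c: (N2, D)
    lookahead taps; s1/s2: strides on the history and the future."""
    T = p.shape[0]
    out = prev_mem + p                   # skip connection + current-frame term
    for t in range(T):
        for i in range(a.shape[0]):      # i = 0..N1, strided look into the past
            if t - s1 * i >= 0:
                out[t] += a[i] * p[t - s1 * i]
        for j in range(1, c.shape[0] + 1):   # j = 1..N2, strided lookahead
            if t + s2 * j < T:
                out[t] += c[j - 1] * p[t + s2 * j]
    return out
```

With stride s1 = s2 = 2, a memory block of the same order covers twice the temporal span, which is exactly the dilated-convolution effect the text refers to.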
For real-time speech recognition systems, we can control the model's latency by flexibly setting the lookahead order. In the extreme case where the lookahead order of every memory block is set to 0, the acoustic model has no latency at all. For tasks that can tolerate some latency, a small lookahead order can be used.

Compared with the earlier cFSMN, the advantage of our DFSMN is that skip connections make it possible to train deep networks. In the original cFSMN, each hidden layer is already split into a two-layer structure by the low-rank matrix factorization, so a network containing four cFSMN layers and two DNN layers already totals 13 layers. Using more cFSMN layers would increase the depth further and cause vanishing gradients, making training unstable. Our DFSMN avoids the vanishing-gradient problem of deep networks through skip connections, so training deep networks becomes stable. Note that skip connections can be added not only between adjacent layers but also between non-adjacent layers, and the connection itself can be a linear or a nonlinear transformation. In our experiments we were able to train DFSMN networks with dozens of layers and obtain significant performance gains over cFSMN.

Moving from the original FSMN to cFSMN not only effectively reduces model parameters but also achieves better performance [4]. Going further, our DFSMN significantly improves performance over cFSMN. The table below compares acoustic models based on BLSTM, cFSMN and DFSMN on a 2,000-hour English task.

| Model   | BLSTM | cFSMN | DFSMN |
|---------|-------|-------|-------|
| WER (%) | 10.9  | 10.8  | 9.4   |

As the table shows, on the 2,000-hour task the DFSMN model achieves about a 14% relative error reduction over the BLSTM acoustic model (WER 10.9% to 9.4%), a significant improvement in acoustic model performance.




Figure 5. Block diagram of the LFR-DFSMN acoustic model

In current acoustic models, the input is acoustic features extracted from each frame of the speech signal; each frame typically covers 10 ms, and every input frame has a corresponding output target. Recently, a low frame rate (LFR) modeling scheme [7] was proposed: adjacent speech frames are stacked together as the input, and their target outputs are averaged into a single output target. Experiments show that three (or more) frames can be stacked without any loss in model performance. This reduces the input and output frame rates to a third or less, which greatly improves the efficiency of acoustic score computation and decoding for the recognition service. We combined LFR with the DFSMN described above to build the LFR-DFSMN acoustic model shown in Figure 5. After several rounds of experiments, we settled on a DFSMN with 10 cFSMN layers + 2 DNN layers as the acoustic model, with LFR reducing the frame rate to one third of the original. The recognition results, compared against the best LCBLSTM baseline we launched last year, are shown in the table below (parenthesized figures are relative error reductions).

| CER (%)     | Product line A  | Product line B |
|-------------|-----------------|----------------|
| LFR-LCBLSTM | 18.92           | 10.21          |
| LFR-DFSMN   | 15.00 (-20.72%) | 8.04 (-21.25%) |

By incorporating LFR, we obtain a threefold recognition speedup. As the table shows, in industrial-scale applications the LFR-DFSMN model reduces the error rate by about 20% relative to the LFR-LCBLSTM model, demonstrating stronger modeling capability on large-scale data.
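For concreteness, here is a minimal sketch of the LFR input-stacking step described above (my illustration; a production front end may also splice additional left/right context frames):

```python
import numpy as np

def lfr_stack(features: np.ndarray, n: int = 3) -> np.ndarray:
    """features: (T, D) per-frame acoustic features (e.g. 10 ms hop).
    Returns (ceil(T/n), n*D) stacked super-frames, padding the tail by
    repeating the last frame so every super-frame is complete."""
    T, D = features.shape
    pad = (-T) % n
    if pad:
        features = np.vstack([features, np.repeat(features[-1:], pad, axis=0)])
    return features.reshape(-1, n * D)

# Example: 100 frames of 80-dim features -> 34 super-frames of 240 dims,
# so the acoustic model runs roughly 3x fewer forward passes per utterance.
```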

Real-world speech recognition services face very complex speech data. The acoustic model must cover as many scenarios as possible, including all kinds of conversations, channels, noise conditions and even accents, which implies an enormous amount of training data. How quickly massive data can be turned into a trained acoustic model and a live service directly determines how fast the business can respond.

We used Alibaba's MaxCompute computing platform and a multi-machine, multi-GPU parallel training tool. With 8 machines (16 GPU cards) and 5,000 hours of training data, the training speeds of the LFR-DFSMN and LFR-LCBLSTM acoustic models are as follows:


| Model       | Time per epoch |
|-------------|----------------|
| LFR-LCBLSTM | 10.8 hours     |
| LFR-DFSMN   | 3.4 hours      |

Compared with the baseline LCBLSTM model, DFSMN trains about 3 times faster per epoch. Training LFR-DFSMN on 20,000 hours of data generally takes only 3-4 epochs to converge. Since 20,000 hours is four times the data in the setup above, one epoch takes roughly 4 x 3.4, or about 14 hours, so with 16 GPU cards we can finish training an LFR-DFSMN acoustic model on 20,000 hours of data in about 2 days.

To build a practical speech recognition system, we must not only improve recognition performance as much as possible but also keep the system real-time, so as to give users a good experience. Service cost also matters in practice, which places requirements on the computational load of the recognition system. Traditional FNN systems, which rely on frame splicing, typically have a decoding latency of 5-10 frames, about 50-100 ms. The LCBLSTM system launched last year solved BLSTM's whole-sentence latency problem and kept the latency at about 20 frames, roughly 200 ms. For online tasks with stricter latency requirements, the latency can be kept within 100 ms at a small cost in recognition accuracy (about 0.2%-0.3% absolute), which meets the needs of most tasks. However, although LCBLSTM achieves a relative improvement of more than 20% over the best FNN, it recognizes more slowly on the same CPU (i.e., consumes more power), mainly because of its model complexity.

Our latest LFR-DFSMN speeds up recognition more than threefold through LFR, and the DFSMN itself is about three times less complex than LCBLSTM. The table below shows the recognition time each model needs on the same test set; the shorter the time, the lower the required computational power:

| Model       | Time to decode the entire test set |
|-------------|------------------------------------|
| LCBLSTM     | 956 seconds                        |
| DFSMN       | 377 seconds                        |
| LFR-LCBLSTM | 339 seconds                        |
| LFR-DFSMN   | 142 seconds                        |

As for the decoding latency of LFR-DFSMN, we can reduce it by shrinking the lookahead order of the memory blocks' filters. We verified different configurations in experiments: when the latency of LFR-DFSMN is kept to 5-10 frames, only about 3% relative performance is lost.
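As a rough guide to how lookahead orders translate into latency (the per-layer orders below are hypothetical, not our production configuration), the future context consumed by stacked memory blocks accumulates across layers:

\[
\text{lookahead (frames)} \;=\; \sum_{\ell=1}^{L} N_2^{\ell}\, s_2^{\ell}
\]

For example, if five of the memory blocks use \(N_2^{\ell} = 1\) with stride \(s_2^{\ell} = 1\) and the rest use \(N_2^{\ell} = 0\), the total lookahead is 5 frames, within the 5-10 frame range quoted above.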

In addition, compared with the complex LFR-LCBLSTM model, the LFR-DFSMN model is structurally much simpler: despite containing 10 DFSMN layers, its overall size is only about half that of the LFR-LCBLSTM model, a 50% reduction in model size.


The original post was published on June 7, 2018

Author: Zhang Shiliang