Comparison of mainstream acoustic models

Contents

Overview

Basic concepts

Speech frame

Speech recognition system

Mainstream acoustic modeling techniques

HMM

DNN-HMM

FFDNN

CNN

RNN and LSTM

CTC

Other modeling techniques

Language modeling techniques

Voice wake-up technology

About the future


Overview

Modeling is an indispensable part of speech recognition: different modeling techniques usually mean different recognition performance, so modeling is the direction that speech recognition teams focus on optimizing, and new recognition models keep emerging. Among them, language models include N-gram, RNNLM, and others; acoustic models include HMM, DNN, RNN, and other model types.

Simply put, the acoustic model’s task is to describe the physical realization of speech, while the language model expresses the linguistic knowledge contained in natural language. In this article, Chen Wei, head of the voice technology department of the Sogou Voice Interaction Center, shares the evolution of speech recognition modeling technology under the current wave of artificial intelligence, hoping to help clarify the landscape of mainstream recognition modeling and the thinking behind it.

The Sogou Zhiyin engine is an intelligent voice technology for natural interaction, independently developed by Sogou and officially released on August 3, 2016. It integrates speech recognition, semantic understanding, voice interaction, and service provision: it can not only listen and speak, but also understand and think. This article explains speech recognition modeling technology in combination with its use in the Zhiyin engine.

Figure 1 Sogou Zhiyin engine

Basic concepts

Speech frame

Given the short-time stationarity of speech, the speech signal is windowed and divided into frames during front-end signal processing, and recognition features are extracted frame by frame, as shown in Figure 2. (Editor’s note: speech features are extracted frame by frame from the speech signal for acoustic modeling.)

Figure 2 Division of speech frames

Speech recognition system

After front-end signal processing and endpoint detection, phonetic features are extracted from the speech signal frame by frame; traditional feature types include MFCC, PLP, and FBANK. The extracted features are sent to the decoder, which, under the guidance of the acoustic model, the language model, and the pronunciation dictionary, finds the best-matching word sequence and outputs it as the recognition result. The overall process is shown in Figure 3, and the recognition formula in Figure 4. As can be seen, the acoustic model mainly describes the likelihood of the features given the pronunciation units, the language model mainly describes the linking probability between words, and the pronunciation dictionary completes the conversion between words and sounds. The acoustic modeling unit is generally the triphone. Taking “Sogou speech” as an example, the triphone sequence is:

sil-s+ou1 s-ou1+g ou1-g+ou3 g-ou3+y ou3-y+u3 y-u3+y u3-y+in1 y-in1+sil

Figure 3 Speech recognition system flow

Figure 4 Principle of speech recognition

It should be noted that the input feature vector X represents the features of speech.
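Written out in standard notation (reconstructed from the description above, not copied from the figure), the recognition formula of Figure 4 is the Bayes decomposition:

W^* = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W} P(X \mid W)\,P(W)

where P(X|W) is the likelihood given by the acoustic model, P(W) is the probability given by the language model, and P(X) is constant for a given utterance and can be dropped.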

Mainstream acoustic modeling techniques

In recent years, with the rise of deep learning, the HMM (Hidden Markov Model), which served as the acoustic model for speech recognition for nearly 30 years, has gradually been replaced by deep neural networks (DNN), and model accuracy has improved by leaps and bounds. Acoustic modeling has changed markedly along three dimensions: the modeling unit, the model structure, and the modeling process, as shown in Figure 5:

Figure 5 Summary of acoustic modeling evolution

The powerful feature-learning ability of deep neural networks greatly simplifies feature extraction and reduces the dependence of modeling on expert experience. The modeling process has therefore evolved from the earlier complex multi-step pipeline toward simple end-to-end modeling, and the modeling unit has accordingly evolved from states and triphones toward larger units such as syllables and words. The model structure has changed from the classical GMM-HMM to DNN+CTC (where DNN refers generically to deep neural networks), with the DNN-HMM hybrid structure as the intermediate stage of this evolution.

HMM

The HMM first appeared in the 1970s, spread and developed through the 1980s, and became an important direction in signal processing. It has been successfully applied in fields such as speech recognition, behavior recognition, character recognition, and fault diagnosis.

In detail, the classical HMM modeling framework is as follows:

Figure 6 HMM modeling framework

Here, the output (emission) probability of each HMM state is modeled by a Gaussian mixture model (GMM), as shown below:
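In standard notation, the output probability of state j is a weighted sum of Gaussians:

b_j(x) = \sum_{m=1}^{M} c_{jm}\,\mathcal{N}(x;\,\mu_{jm},\,\Sigma_{jm}), \qquad \sum_{m=1}^{M} c_{jm} = 1

where M is the number of mixture components and c_{jm}, \mu_{jm}, \Sigma_{jm} are the weight, mean, and covariance of the m-th component of state j.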

DNN-HMM

In 2012, Microsoft researchers Li Deng and Dong Yu introduced the feed-forward deep neural network (FFDNN) into acoustic modeling, using the FFDNN output-layer probabilities to replace the output probabilities previously computed by the GMM in GMM-HMM. This started the trend of DNN-HMM hybrid systems, and many researchers have since modeled the output probability with FFDNN, CNN, RNN, LSTM, and other network structures, achieving good results, as shown in Figure 7.

Figure 7 DNN-HMM hybrid modeling framework

In the DNN-HMM modeling framework, the input features splice several frames around the current frame, so that the model captures the long-term correlation of the time-series signal, while the model output keeps the tied triphone states (senones) used by GMM-HMM. For large-vocabulary continuous Chinese speech recognition, the number of states is generally set at around 10,000. See Figure 8.

Figure 8 DNN-HMM modeling process

FFDNN

The model structure of FFDNN is as follows:

Figure 9 FFDNN modeling process

CNN

Editor’s note: CNNs were initially used only for image recognition, and were not applied to speech recognition systems until 2012.

Figure 10 CNN modeling process

RNN and LSTM

The coarticulation phenomenon in speech indicates that an acoustic model needs to account for long-term correlation between speech frames. Although the DNN-HMM above models context information by splicing frames, the number of spliced frames is limited and the modeling ability is weak, so the RNN (recurrent neural network) was introduced to strengthen long-term modeling. An RNN hidden layer receives not only the output of the layer below it but also its own output from the previous time step as part of the current input. Through this recurrent feedback, long-term historical information is retained, which greatly enhances the memory of the model, and the temporal nature of speech is also well captured. However, the simple RNN structure is prone to gradient vanishing/explosion during BPTT (Backpropagation Through Time) training, so the LSTM (Long Short-Term Memory) model was introduced on top of the RNN. The LSTM is a special kind of RNN: long-term information is modeled through the cell and the special structure of three gates, which solves the gradient problem of the RNN. Practice also shows that the long-term modeling ability of the LSTM is better than that of the ordinary RNN.

Figure 11 RNN structure

Figure 12 RNN to LSTM

CTC

The modeling techniques above all require one condition to hold during training: each frame of training data must have a predetermined label, i.e., the DNN output state sequence and the feature sequence must be of equal length. To obtain these labels, an existing model must be used to align the label sequence to the feature sequence on the training data. With big data, preparing such alignments is time-consuming, the model used for alignment is often inaccurate, and the resulting training labels contain errors. The Connectionist Temporal Classification (CTC) criterion was therefore introduced to solve the problem of unequal lengths between label sequences and feature sequences: the boundaries of the modeling units in the speech features are learned automatically through the forward-backward algorithm. Combining this criterion with a neural network suited to temporal modeling, such as the LSTM, can be used directly for end-to-end modeling, upending the nearly 30-year-old HMM framework for speech recognition.
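A minimal training sketch under this criterion (using PyTorch’s built-in torch.nn.CTCLoss; all shapes and sizes here are illustrative assumptions):

import torch
import torch.nn as nn

T, N, C = 200, 8, 100   # frames, batch size, output units (incl. blank=0)
lstm = nn.LSTM(input_size=40, hidden_size=320, num_layers=2)
proj = nn.Linear(320, C)
ctc = nn.CTCLoss(blank=0)  # label 0 reserved for the blank category

feats = torch.randn(T, N, 40)             # (time, batch, feature dim)
log_probs = proj(lstm(feats)[0]).log_softmax(dim=-1)

targets = torch.randint(1, C, (N, 30))    # label sequences, shorter than T
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 30, dtype=torch.long)

# Forward-backward over all alignments; no frame-level labels needed.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()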

The CTC criterion introduces a blank category to absorb the confusable frames within a pronunciation unit and to sharpen the differences between units, so CTC has a very pronounced spike effect. Figure 13 shows the output probability distribution of a triphone-LSTM-CTC model recognizing the utterance “Sogou speech”: most regions are absorbed by blank, and the recognized triphones appear as distinct spikes.

Figure 13 CTC peak effect demonstration

It can be expected that end-to-end recognition technologies based on CTC, or on criteria that draw on its ideas such as LF-MMI, will gradually become mainstream, and that the HMM framework will gradually be replaced.

Other modeling techniques

Language modeling techniques

RNNLM technology has now gradually been introduced into speech recognition. By modeling longer histories, the RNNLM improves recognition performance over the traditional N-gram. However, given the large vocabulary of speech recognition, completely replacing the N-gram would greatly increase computation and latency, so in the Zhiyin engine the RNNLM is used to rescore the N-best candidate list produced by N-gram decoding.
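A minimal sketch of this rescoring pass (the rnnlm_logprob scorer and the interpolation weight lm_weight are hypothetical, for illustration only):

def rescore_nbest(nbest, rnnlm_logprob, lm_weight=0.5):
    """Re-rank N-best hypotheses from the first-pass N-gram decoder.

    nbest: list of (word_sequence, first_pass_score) pairs.
    rnnlm_logprob: hypothetical function returning the RNNLM
        log-probability of a word sequence.
    """
    rescored = [(words, first_pass + lm_weight * rnnlm_logprob(words))
                for words, first_pass in nbest]
    # The hypothesis with the best interpolated score wins.
    return max(rescored, key=lambda pair: pair[1])[0]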

Voice wake-up technology

For the fixed wake-up word currently used in the Zhiyin engine, end-to-end wake-word modeling is carried out based on a DNN, as follows:

Figure 14 End-to-end voice wake-up process

Although this method achieves a very low false wake-up rate, it has an obvious disadvantage: the wake-up word cannot be customized. Therefore, the Zhiyin engine also uses a DNN to extract bottleneck features and trains an HMM-based wake-up model on them, which achieves better results than the traditional MFCC-based approach.
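A minimal sketch of bottleneck feature extraction (in PyTorch; the layer sizes and the 42-dimensional bottleneck are illustrative assumptions, not the engine’s actual configuration):

import torch.nn as nn

# A DNN with a deliberately narrow hidden layer; after training on a
# frame classification task, the narrow layer's activations are used
# as compact "bottleneck" features in place of raw MFCCs.
encoder = nn.Sequential(
    nn.Linear(440, 1024), nn.Sigmoid(),
    nn.Linear(1024, 42),  nn.Sigmoid(),   # 42-dim bottleneck layer
)
classifier = nn.Sequential(
    nn.Linear(42, 1024), nn.Sigmoid(),
    nn.Linear(1024, 10000),               # used for training, then discarded
)

def bottleneck_features(spliced_feats):
    """Features on which the HMM-based wake-up model is trained."""
    return encoder(spliced_feats)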

About the future

Despite the large increase in speech recognition modeling capability, problems such as far-field pickup, noise, accents, and pronunciation habits (such as swallowed syllables) still exist. As Andrew Ng has argued, raising accuracy from 95% to 99% is only a four-point gap, yet it could change the way people interact, turning speech from something rarely used into something used all the time.

At present, the cost of acquiring raw voice data is getting lower and lower. The industry is already using tens of thousands of hours of labeled data to update models, and in the future it may be possible to use hundreds of thousands of hours of training data.

Data level: unsupervised, weakly supervised, and semi-supervised data are used for training, while the data sent for annotation are selected more efficiently; the Zhiyin engine has adopted active learning for data screening;

Computing level: clusters based on heterogeneous computing can efficiently complete model training on big data, and upgrades in computing capability have extended from offline training to online testing;

Model level: learning from big data requires models with stronger capability. Composite structures built from multiple model types (such as CNN-LSTM-DNN) have already proved feasible, and sequence-learning frameworks based on encoder-attention-decoder have also been combined with speech recognition.

Although speech recognition can now achieve high accuracy, the leap from 95% to 99% or even 100% is a process of quantitative change becoming qualitative change, and it will determine whether voice interaction can become one of the most important mainstream modes of interaction. Some old problems of speech recognition still exist and cannot yet be completely solved technically, so beyond technology, product innovation is also very important and can effectively make up for the lack of accuracy.

Take the Zhiyin engine as an example: it provides a voice correction solution to this problem, letting users fix recognition errors with natural speech. For instance, if a user wants to say “my name is Chen Wei” but it is recognized as “my name is Chen Hui”, the user can describe the intended characters by voice (“ear-east Chen, wei as in great”) and the result is corrected. After several rounds of product iteration, voice correction now has a success rate of 80%; it has been applied in the Zhiyin engine’s voice interaction, and the Sogou iOS input method also integrates the voice correction capability.