Abstract: Since the hidden Markov model (HMM) was applied to speech recognition in the 1980s, it has remained the mainstream method in practical speech recognition systems.

This article is shared from the Huawei Cloud community post "New Voice (4): An Introduction to Traditional Speech Recognition Technology | Acoustic/Language Hidden Markov Models | WFST Decoding", author: yellow spicy chicken.

Since the hidden Markov model (HMM) was applied to speech recognition in the 1980s, it has remained the mainstream method in practical speech recognition systems.

1. Acoustic model

1.1 The hidden Markov model (HMM) and the acoustic model

As mentioned in the previous post of this series, P(X|W) corresponds to the acoustic model. The first thing to consider is that speech and text are not in one-to-one correspondence: the two sequences have different lengths. The hidden Markov model solves this problem.

Figure 1 Hidden Markov model (I and E represent the start and the end)

For example, P(X|W) = P(x_1, x_2, x_3 | w_1, w_2) can be expressed in the form of a hidden Markov model as in the figure, where w is a hidden state of the HMM and x is an observation. The number of hidden states and the number of observations do not constrain each other, which solves the problem of variable-length input and output, and we have:

P(X|W) = P(w_1) P(x_1|w_1) P(w_2|w_1) P(x_2|w_2) P(w_2|w_2) P(x_3|w_2)

Here the HMM initial-state probability P(w_1) and the state transition probabilities P(w_2|w_1) and P(w_2|w_2) can be estimated from samples with conventional statistical methods, while the main difficulty lies in computing the HMM emission probabilities P(x_1|w_1), P(x_2|w_2) and P(x_3|w_2). The acoustic model problem is therefore further refined to learning the HMM emission probabilities.
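As a minimal sketch of the factorization above, the snippet below multiplies the initial, transition and emission probabilities along the single state alignment w_1 → w_2 → w_2; all numbers are made up for illustration and are not from the article.

```python
# Toy HMM parameters (made-up values) for the two-state, three-observation
# example: P(X|W) = P(w1) P(x1|w1) P(w2|w1) P(x2|w2) P(w2|w2) P(x3|w2).
initial = {"w1": 1.0}                      # P(w1)
transition = {("w1", "w2"): 0.6,           # P(w2|w1)
              ("w2", "w2"): 0.7}           # P(w2|w2), self-loop
emission = {("w1", "x1"): 0.5,             # P(x1|w1)
            ("w2", "x2"): 0.4,             # P(x2|w2)
            ("w2", "x3"): 0.3}             # P(x3|w2)

path = ["w1", "w2", "w2"]                  # state alignment
obs = ["x1", "x2", "x3"]                   # observation sequence X

p = initial[path[0]] * emission[(path[0], obs[0])]
for t in range(1, len(obs)):
    p *= transition[(path[t - 1], path[t])] * emission[(path[t], obs[t])]

print(p)  # probability of X along this particular alignment
```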

Another issue that needs to be addressed is the granularity of the basic unit of text. For speech, the frame granularity can be controlled by adjusting the width of the processing window. For text, word-level granularity is too coarse, so as shown in the figure, we decompose it further:

Figure 2 Relationship among Phone, Triphone and Senone (#N, #N^3 and #3N^3 indicate orders of magnitude)

Words are made up of phonemes (Phones). Since the pronunciation of a phoneme is affected by its neighbors, each phoneme is modeled together with its left and right context as a Triphone; for example, /l-d-sil/ and /u-d-l/ are two triphones of the phoneme /d/.

Each triphone can in turn be modeled with a separate three-state HMM, so the basic unit of text becomes these tiny HMM states. Since many triphones do not appear in the corpus, or appear only a few times, the triphone states are ultimately shared through decision trees. For a language with N phonemes, the number of tied triphone states finally retained is far less than 3N^3, usually a few thousand, and these are generally called Senones. The correspondence between each frame and each Senone is expressed as the emission probability of the triphone HMM, P(x_i|s_j), where s_j denotes the j-th Senone; the frame x_i usually spans 25 ms, with a frame shift of 10 ms. A Senone is a unit defined by the mathematical model rather than by direct auditory experience. #N is the number of Phones; #N^3 and #3N^3 are the possible orders of magnitude of Triphones and Senones respectively.

A Sentence breaks down into Words, Words into Phones, and Phones into Triphones; each Triphone is modeled by an HMM, and the long HMM chain formed by connecting all the related HMMs end to end in pronunciation order represents the Sentence.
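The toy sketch below walks through this decomposition with a two-word hypothetical lexicon (the pronunciations and the silence handling are assumptions for illustration, not taken from a real dictionary):

```python
# Hypothetical mini-lexicon mapping words to phone sequences.
lexicon = {"data": ["d", "ey", "t", "ax"],
           "is":   ["ih", "z"]}

def sentence_to_triphones(words, sil="sil"):
    """Sentence -> Word -> Phone, then attach left/right context to form Triphones."""
    phones = [sil] + [p for w in words for p in lexicon[w]] + [sil]
    # A triphone is written as left-center-right, e.g. sil-d-ey.
    return [f"{phones[i-1]}-{phones[i]}-{phones[i+1]}"
            for i in range(1, len(phones) - 1)]

print(sentence_to_triphones(["data", "is"]))
# ['sil-d-ey', 'd-ey-t', 'ey-t-ax', 't-ax-ih', 'ax-ih-z', 'ih-z-sil']
```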

Then P(X|W) is the probability that this long HMM chain produces the observation sequence X. Because the number of phones is fixed, the basic set of triphone HMMs in the system is also fixed; the long chains corresponding to different W differ because they contain different triphones, but they are all built from the same dictionary.

Denoting a Phone by p and a Triphone by c (one p can correspond to multiple c), the conversion of P(X|W) proceeds roughly as follows:

P(X|W) = P(x_1, …, x_t | w_1, …, w_l),   w_1 = {p_1, p_2, …}
= P(x_1, …, x_t | p_1, …, p_m),   p_1 = c_1, p_2 = c_2, p_3 = c_3, …
= P(x_1, …, x_t | c_1, …, c_n),   c_1 = {s_1, s_2, s_3, …}
= P(x_1, …, x_t | s_1, …, s_o),   o > n = m > l

Although the modeling granularity has been refined, the problem remains the same: given an HMM, compute the probability that it produces the observation sequence; the HMM is simply longer. In the end we still need to model the emission probability P(x_i|s_j).
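A minimal sketch of that computation, assuming toy parameters: the forward algorithm below sums over all state alignments of a small HMM to obtain P(X|W), rather than scoring a single path as in the earlier example.

```python
import numpy as np

def forward(initial, transition, emission, observations):
    """Forward algorithm: initial (S,), transition (S, S), emission[s, o] = P(o|s)."""
    alpha = initial * emission[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ transition) * emission[:, o]
    return alpha.sum()

# Two states, three observation symbols (as indices), made-up numbers.
initial = np.array([1.0, 0.0])
transition = np.array([[0.4, 0.6],
                       [0.0, 1.0]])
emission = np.array([[0.5, 0.3, 0.2],
                     [0.1, 0.4, 0.5]])

print(forward(initial, transition, emission, [0, 1, 2]))  # P(X|W)
```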

Conclusion: The design of the acoustic model for speech recognition is a decomposition process from large to small, from macro to micro, while decoding is the reverse process: from Frame to Senone, from Senone to Triphone, then to Phone, and finally to Word and Sentence.

1.2 The GMM-HMM model

As stated above, how the HMM emission probability P(x_i|s_j) is modeled directly determines the quality of the acoustic model.

The Gaussian Mixture Model (GMM) is the most commonly used statistical model for this. Given enough Gaussian components, a GMM can fit any probability distribution, so the GMM became the preferred model for the emission probability.

Each GMM corresponds to one Senone and is represented by its own probability density function (PDF).
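The sketch below shows how such an emission probability P(x|s) could be evaluated for one Senone: a weighted sum of Gaussian densities. The weights, means and covariances are made-up illustrations rather than trained values (scipy is assumed here for the Gaussian PDF).

```python
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.6, 0.4])            # mixture weights, sum to 1
means = [np.zeros(3), np.ones(3)]         # one mean per Gaussian (3-dim features)
covs = [np.eye(3), 2.0 * np.eye(3)]       # one covariance per Gaussian

def gmm_likelihood(x):
    """P(x|s) under this Senone's GMM: weighted sum of Gaussian PDFs."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

x = np.array([0.2, -0.1, 0.5])            # one feature frame
print(gmm_likelihood(x))
```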

The following figure shows the GMM-HMM structure of a single triphone:

Figure 3 GMM-HMM structure of a triphone

A GMM treats each frame as an isolated point in space with no dependence between points, so the GMM ignores the timing information in the speech signal. In addition, MFCC (Mel-Frequency Cepstral Coefficient) features, whose dimensions within a frame have low correlation with each other, are more conducive to GMM modeling.
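As a sketch of this feature extraction step (the article does not prescribe a tool; librosa and the file name "utt.wav" are assumptions here), MFCCs with the 25 ms window and 10 ms shift mentioned above could be computed like this:

```python
import librosa

# Load a hypothetical 16 kHz utterance and compute 13-dimensional MFCCs.
y, sr = librosa.load("utt.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400,        # 25 ms window at 16 kHz
                            hop_length=160)   # 10 ms frame shift at 16 kHz
print(mfcc.shape)  # (13, number_of_frames); each column is one frame x_i
```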

Once GMM training is complete, the emission probability P(x_i|s_j) can be computed from each Senone's PDF; combined with the HMM initial-state and state transition probabilities in the HMM formula above, this allows P(X|W) to be computed.

1.3 The DNN-HMM model

A GMM is a generative model: it focuses on the internal distribution of the data and can model P(x_i|s_j) directly. By Bayes' rule, P(x_i|s_j) = P(s_j|x_i)P(x_i)/P(s_j), where P(x_i) does not need to be computed (it is the same for every Senone) and P(s_j) can be obtained by conventional statistics. The problem is thus further reduced to computing P(s_j|x_i), which is a typical classification problem, exactly what discriminative models such as deep neural networks do best. Here P(x_i|s_j) is the likelihood, P(s_j) is the prior probability, and P(s_j|x_i) is the posterior probability.
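A minimal sketch of this conversion (one frame, three Senones, illustrative numbers): the DNN posterior P(s|x) divided by the Senone prior P(s) gives a quantity proportional to the likelihood P(x|s), with P(x) dropped as a constant.

```python
import numpy as np

posteriors = np.array([0.7, 0.2, 0.1])   # DNN output P(s|x) for one frame
priors = np.array([0.5, 0.3, 0.2])       # P(s) counted from the training labels

scaled_likelihood = posteriors / priors  # proportional to P(x|s)
print(scaled_likelihood)
```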

A DNN used for classification is trained with supervision and therefore needs labels. Since a speech training set usually only pairs each utterance with its full transcript, frame-level labels are not given explicitly, so an additional step is needed to label the data, and the method chosen is the GMM described above: a trained GMM-HMM is good at capturing the internal relationships of the known data, and the frame-level labels it produces are highly reliable. The following figure shows the structure of the basic DNN-HMM acoustic model: the speech features are the input of the DNN, and the output of the DNN is used to compute the emission probabilities of the HMM.
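The sketch below shows this setup with PyTorch (an assumption of this post, not a requirement of the method): feature frames go in, the output layer has one logit per Senone, and the frame-level Senone labels stand in for those obtained by aligning a previously trained GMM-HMM to the transcripts. Sizes and data are placeholders.

```python
import torch
from torch import nn

feat_dim, num_senones = 39, 3000          # assumed feature and Senone counts
model = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, num_senones))   # logits over Senones

frames = torch.randn(32, feat_dim)                   # a batch of feature frames
labels = torch.randint(0, num_senones, (32,))        # stand-in aligned Senone labels

loss = nn.CrossEntropyLoss()(model(frames), labels)  # ordinary classifier training
loss.backward()                                      # one supervised training step
```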

Figure 4 Classical structure of DNN-HMM

Compared with the GMM-HMM structure, the only difference in DNN-HMM is that the emission probability is computed by the DNN instead of the GMM.

2. Language model

The problem the language model has to solve is how to compute P(W). Common methods are based on the N-Gram grammar or on an RNN.

2.1 N-Gram language model

A language model is a typical autoregressive model, in which the probability of a given sequence W = [w_1, w_2, …, w_m] is expressed as

P(W) = P(w_1, w_2, …, w_m) = ∏_{i=1}^{m} P(w_i | w_1, w_2, …, w_{i-1}) ≈ ∏_{i=1}^{m} P(w_i | w_{i-n+1}, w_{i-n+2}, …, w_{i-1})

The last step makes the assumption that "a near neighbor beats a distant relative": this is the so-called N-gram model, which assumes that the probability of the current word depends only on the n-1 words before it. Each factor in the equation has to be estimated statistically from a certain amount of text corpus, which is the training process of the language model, and all possible P(w_i | w_{i-n+1}, …, w_{i-1}) need to be enumerated.

The calculation can be simplified to counting the proportion in which the corresponding word strings occur in the corpus, i.e.

P(w_i | w_{i-n+1}, w_{i-n+2}, …, w_{i-1}) = count(w_{i-n+1}, w_{i-n+2}, …, w_i) / count(w_{i-n+1}, w_{i-n+2}, …, w_{i-1})

Here count denotes the number of times a word string appears in the corpus. Some word strings never appear in the training text, either because the corpus is too small or because the strings are uncommon; this is handled with smoothing algorithms.
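The toy sketch below applies this count ratio to a bigram (n = 2) model on a two-sentence made-up corpus; a real system would add a smoothing scheme for the unseen word strings just mentioned.

```python
from collections import Counter

corpus = [["using", "data", "is", "better"],
          ["using", "intuition", "is", "worse"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def p_bigram(w_prev, w):
    # P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("using", "data"))   # 0.5 in this toy corpus
```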

2.2 RNN language model

As can be seen from the factors in the probability formula above, the current result depends on the preceding information, so it is natural to model it with a unidirectional recurrent neural network.

The general practice is to use historical words in a sentence to predict the current word.

Figure 5. Basic structure of RNN language model

As shown in Figure 5, the output layer of the basic RNN language model is wide: each output node corresponds to one word, and the whole output layer covers the vocabulary used by the language model, so training it is essentially training a classifier. The output of each node gives the probability of producing that node's word, namely P(w_i | w_1, …, w_{i-1}), and P(W) can then be computed according to the formula above.
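A minimal sketch of this structure with PyTorch (sizes and word IDs are assumptions): an embedding layer, a unidirectional recurrent layer, and an output layer as wide as the vocabulary whose softmax gives the next-word probability.

```python
import torch
from torch import nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128   # assumed sizes

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # unidirectional
        self.out = nn.Linear(hidden_dim, vocab_size)                # one node per word

    def forward(self, word_ids):                      # (batch, seq_len)
        h, _ = self.rnn(self.embed(word_ids))
        return self.out(h)                            # logits for the next word

model = RNNLM()
history = torch.tensor([[3, 17, 52]])                 # hypothetical word IDs
probs = torch.softmax(model(history)[0, -1], dim=-1)  # P(w_i | history)
print(probs.shape)                                    # torch.Size([1000])
```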

3. Decoder

The goal of speech recognition is to choose the W that maximizes P(W|X) ∝ P(X|W)P(W), so decoding is essentially a search problem, and the search for the optimal path can be carried out uniformly with a Weighted Finite-State Transducer (WFST).

A WFST is composed of state nodes and edges. Each edge carries an input symbol, an output symbol and a weight, written as x:y/w, meaning the edge has input symbol x, output symbol y and weight w. The weight can be defined as a probability (the larger the better), a penalty (the smaller the better), and so on; the weights along a path are usually added up, and a complete path must run from the start state to the end state.

Figure 6. Example of a WFST for the language model

The figure above shows a language model represented as a WFST. Sentences consist of words, so an N-gram LM can be represented as a WFST, denoted G. Note that the input and output symbols of G are the same (both are words), and the weights are converted from the probability values in the language model. According to the figure, the sentence "using data is better" scores 1+0.66+0.5+0.7=2.86, and the sentence "using intuition is worse" scores 1+0.33+1+0.3=2.63.

Figure 7. Example pronunciation dictionary WFST

The figure above shows a pronunciation dictionary represented as a WFST. Since words are made up of phonemes, the dictionary can be represented as a WFST, denoted L. The ε in the figure is a placeholder indicating no input or no output. According to the graph, the word "data = /d ey t ax/" has a score of 1+0.5+0.3+1=2.8, while "dew = /d uw/" has a score of 1+1=2, so "dew" is the more likely word if the weight is defined as a penalty.
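The toy sketch below scores a complete path by accumulating edge weights, reproducing the two sums quoted for G and L above; the exact placement of symbols and weights on each arc is an assumption, since only the totals are given in the figures.

```python
def path_score(edges):
    """edges: list of (input_symbol, output_symbol, weight) along one path."""
    return sum(w for _, _, w in edges)

g_path = [("using", "using", 1.0), ("data", "data", 0.66),
          ("is", "is", 0.5), ("better", "better", 0.7)]
l_path = [("d", "data", 1.0), ("ey", "<eps>", 0.5),
          ("t", "<eps>", 0.3), ("ax", "<eps>", 1.0)]

print(path_score(g_path))  # ≈ 2.86: "using data is better" in G
print(path_score(l_path))  # ≈ 2.8:  "data = /d ey t ax/" in L
```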

In the same way, the WFST with input Triphone and output Phone is defined as C, and the WFST with input Senone and output Triphone is defined as H. So far we have four WFSTs, namely H, C, L and G. Since the output of each is the input of the next, they can be composed into a single WFST that takes us from Senone to Triphone (H), Triphone to Phone (C), Phone to Word (L) and Word to Sentence (G). This composed WFST is called the decoding graph, HCLG.
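A toy sketch of the composition operation behind this fusion (epsilon handling omitted, and each word's pronunciation collapsed onto a single arc for brevity): an edge of the first WFST whose output symbol matches the input symbol of an edge of the second produces a composed edge whose states are state pairs and whose weight is the sum.

```python
def compose(t1, t2):
    """t1, t2: lists of edges (src, dst, in_sym, out_sym, weight); epsilon-free toy case."""
    result = []
    for s1, d1, i1, o1, w1 in t1:
        for s2, d2, i2, o2, w2 in t2:
            if o1 == i2:                   # output of t1 feeds the input of t2
                result.append(((s1, s2), (d1, d2), i1, o2, w1 + w2))
    return result

# Made-up fragments: L maps a (collapsed) pronunciation to a word,
# G carries the language-model weight for that word.
L = [(0, 1, "d uw", "dew", 2.0), (0, 1, "d ey t ax", "data", 2.8)]
G = [(0, 1, "dew", "dew", 0.9), (0, 1, "data", "data", 0.3)]

print(compose(L, G))   # pronunciation-to-word edges with LM weights added
```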

For the final decoding, all we need besides HCLG is the GMM or DNN. Given the speech feature sequence X, the GMM or DNN provides the emission probabilities P(x_i|s_j) (for the DNN, via the posterior P(s_j|x_i) as described above). With the help of HCLG, computing P(W|X) ∝ P(X|W)P(W) becomes simple: add up the weights along the path of W (assuming they are penalties) and subtract each state's emission probability score for its input to obtain the final score. The smaller the score, the more likely it is that W is the transcript of the speech X.
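The toy sketch below illustrates this scoring as a Viterbi search over a tiny made-up graph: each edge carries a penalty-style graph weight and a Senone label, each frame adds the edge weight plus the frame's negative log emission score, and the cheapest path ending in the final state wins. The graph, Senones and scores are all placeholders.

```python
import numpy as np

edges = [  # (src_state, dst_state, senone_id, graph_weight)
    (0, 1, 0, 0.5), (0, 1, 1, 0.9),
    (1, 1, 1, 0.2), (1, 2, 2, 0.4),
]
neg_log_emission = -np.log(np.array([   # rows: frames, columns: Senones
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]))

cost = {0: 0.0}                                      # start in state 0
for frame in neg_log_emission:                       # one edge traversal per frame
    new_cost = {}
    for src, dst, senone, w in edges:
        if src in cost:
            c = cost[src] + w + frame[senone]        # graph cost + acoustic cost
            if c < new_cost.get(dst, float("inf")):
                new_cost[dst] = c
    cost = new_cost

print(cost[2])   # cost of the best path ending in the final state 2; smaller is better
```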

