Abstract: In the work introduced in this post, we describe a speech recognition model based on an RNN trained with CTC, in which WFST-based decoding effectively fuses the lexicon and the language model.

This article is shared from the Huawei Cloud community post "How to Solve the Context Shift Problem? The Road to Proprietary-Domain End-to-End ASR (III)", original author: Xiaoye0829.

In this article we describe a work that combines CTC with WFSTs (Weighted Finite-State Transducers): "EESEN: End-to-End Speech Recognition Using Deep RNN Models and WFST-Based Decoding".

In this work, the acoustic model is an RNN that predicts context-independent phonemes or characters, and CTC is used to align the speech with the labels. What makes this paper different is that it proposes a general, WFST-based decoding method that incorporates the lexicon and the language model when decoding CTC outputs. In this approach, the CTC labels, the lexicon, and the language model are each encoded as a WFST and then composed into a single comprehensive search graph. This WFST-based approach can easily handle the blank label in CTC as well as beam search.

In this blog post, we will not cover RNNs and CTC; the main focus is on how the decoding module uses WFSTs. A WFST is essentially a finite-state acceptor (FSA) extended so that each transition has an input symbol, an output symbol, and a weight.

Above is a schematic diagram of a language model WFST. The weight on each arc is the probability of emitting the next word given the previous word. Node 0 is the start node, and node 4 is the end node. A path through the WFST consumes a sequence of input symbols and emits a sequence of output symbols. Our decoding method represents the CTC labels, the lexicon, and the language model as separate WFSTs; then, using a highly optimized FST library such as OpenFst, we can efficiently fuse these WFSTs into a single search graph. Let's start by showing how each individual WFST is built.

1. Grammar. A grammar WFST encodes the word sequences permitted by a language. The figure above shows a simplified language model with two sequences: "how are you" and "how is it". The basic symbolic unit of this WFST is the word, and the weights on the arcs are language model probabilities. With this WFST representation, CTC decoding can in principle use any language model that can be converted into a WFST. Following the convention in Kaldi, the WFST of the language model is denoted G.
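As a concrete illustration, the sketch below writes a toy grammar like the one above in OpenFst's text format (one arc per line: source state, destination state, input label, output label, optional weight). The states and probabilities here are made up for illustration; a real G would be converted from an actual n-gram language model.

```python
# Minimal sketch: a toy grammar WFST G in OpenFst text format.
# States and probabilities are illustrative, not taken from the paper.
import math

def neg_log(p):
    # OpenFst's default (tropical) semiring stores weights as -log(probability).
    return -math.log(p)

arcs = [
    # src dst  input   output  weight
    (0, 1, "how", "how", neg_log(1.0)),
    (1, 2, "are", "are", neg_log(0.5)),
    (1, 3, "is",  "is",  neg_log(0.5)),
    (2, 4, "you", "you", neg_log(1.0)),
    (3, 4, "it",  "it",  neg_log(1.0)),
]

with open("G.txt", "w") as f:
    for src, dst, isym, osym, w in arcs:
        f.write(f"{src} {dst} {isym} {osym} {w:.4f}\n")
    f.write("4\n")  # state 4 is final
# G.txt can then be compiled into a binary FST with OpenFst's fstcompile,
# given input/output word symbol tables.
```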

2. Lexicon. A lexicon WFST encodes the mapping from sequences of lexicon units to words. There are two cases, depending on the modeling unit used as the RNN's label. If the labels are phonemes, the lexicon is the same standard lexicon used in conventional hybrid models. If the labels are characters, the lexicon simply contains the spelling of each word. The difference between the two cases is that a spelling lexicon can easily be expanded to include arbitrary OOV (out-of-vocabulary) words. In contrast, expanding a phoneme lexicon is less straightforward: it relies on some grapheme-to-phoneme method or model and is prone to errors. The lexicon WFST is denoted L, and the figure below shows the construction of L for two example entries:

The first example shows the construction for a phoneme-lexicon entry, "is IH Z"; the second shows the construction for a spelling-lexicon entry, "is i s". For spelling lexicons there is an additional complication: when characters are used as CTC labels, we usually insert an extra space label between words to model the word delimiters in the original transcripts. During decoding, we therefore allow spaces to optionally appear at the beginning and end of a word, a situation that WFSTs handle easily.

In addition to English, we also show a Chinese dictionary entry here.
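To make the two English entries above concrete, the sketch below writes them in OpenFst text format. The structure (emitting the word on the first arc, a hypothetical <space> label for the optional trailing word boundary) is my own illustration and not copied from the paper's figures; an optional leading space would be handled analogously.

```python
# Minimal sketch: two lexicon entries as OpenFst text-format arcs.
# <eps> is the epsilon (empty) label; weights are omitted (default 0 in the tropical semiring).
phoneme_entry = [
    # "is IH Z": consume the phonemes, emit the word on the first arc.
    "0 1 IH is",
    "1 2 Z <eps>",
    "2",            # final state
]

spelling_entry = [
    # "is i s": consume the characters; a <space> label may optionally follow the word.
    "0 1 i is",
    "1 2 s <eps>",
    "2 3 <space> <eps>",  # optional trailing space
    "2",                  # final state without the space
    "3",                  # final state with the space
]

with open("L_is_phoneme.txt", "w") as f:
    f.write("\n".join(phoneme_entry) + "\n")
with open("L_is_spelling.txt", "w") as f:
    f.write("\n".join(spelling_entry) + "\n")
```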

3. Token. The third WFST maps a sequence of frame-level CTC labels to a single lexicon unit (a phoneme or a character). For each lexicon unit, the token-level WFST subsumes all frame-level label sequences that collapse to it. This WFST therefore allows occurrences of the blank label ∅ as well as repetitions of any non-blank label. For example, after five frames, the RNN model might output label sequences such as "AAAAA", "∅∅AA∅", or "∅AAA∅"; the token WFST maps all of these sequences to the single lexicon unit "A". The figure below shows the WFST for the phoneme "IH", which allows occurrences of the blank label and repetitions of the non-blank label "IH". We denote this token WFST as T.
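The mapping that T encodes is exactly the standard CTC collapse rule: merge consecutive repeated labels, then drop blanks. A minimal sketch of that rule, with the blank written as "∅", is shown below; this is my own illustrative code, not the paper's.

```python
# Minimal sketch of the collapse rule that the token WFST T encodes:
# merge consecutive repeated labels, then remove blanks.
BLANK = "∅"

def collapse(frame_labels):
    collapsed = []
    prev = None
    for lab in frame_labels:
        if lab != prev:          # merge repetitions
            if lab != BLANK:     # drop blanks
                collapsed.append(lab)
        prev = lab
    return collapsed

# All three frame-level sequences from the example map to the single unit "A".
assert collapse(list("AAAAA")) == ["A"]
assert collapse(["∅", "∅", "A", "A", "∅"]) == ["A"]
assert collapse(["∅", "A", "A", "A", "∅"]) == ["A"]
```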

4. Search graph. After compiling the three individual WFSTs, we compose them into a comprehensive search graph. First, the lexicon WFST L is composed with the grammar WFST G; during this step, determinization and minimization are applied, two operations that compress the search space and speed up decoding. The resulting WFST LG is then composed with the token WFST T to produce the search graph S = T ∘ min(det(L ∘ G)). The search graph S encodes the mapping from a sequence of frame-level CTC labels to a sequence of words. Concretely, the words in the language model are first expanded into phonemes to form the LG graph; the RNN then outputs a label (a phoneme or blank) for each frame, and this frame-level label sequence is used to search the composed graph.
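If the three graphs have already been compiled into binary FSTs (e.g. with OpenFst's fstcompile), the composition pipeline can be sketched with OpenFst's Python extension pywrapfst roughly as follows. The file names are placeholders, and details such as the disambiguation symbols that a real Kaldi-style recipe adds before determinization are omitted here.

```python
# Rough sketch of building S = T o min(det(L o G)) with OpenFst's pywrapfst.
# Assumes T.fst, L.fst, G.fst were compiled beforehand with consistent symbol tables.
import pywrapfst as fst

T = fst.Fst.read("T.fst")  # frame-level CTC labels -> lexicon units
L = fst.Fst.read("L.fst")  # lexicon units -> words
G = fst.Fst.read("G.fst")  # word-level language model

LG = fst.compose(L.arcsort(sort_type="olabel"), G)  # L o G
LG = fst.determinize(LG)                            # det(L o G)
LG.minimize()                                       # min(det(L o G)), in place
S = fst.compose(T.arcsort(sort_type="olabel"), LG)  # T o min(det(L o G))
S.write("TLG.fst")
```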

When decoding a hybrid DNN model, we need to scale the state posteriors from the DNN by the state priors, which are usually estimated from forced alignments on the training data. A similar procedure can be used for decoding CTC-trained models: we run the final RNN model over the whole training set, take the label with the largest posterior at each frame as the frame-level alignment, and use this alignment to estimate the label priors. However, this method did not perform well in our experiments, partly because the post-softmax output of a CTC-trained model has a highly peaked distribution: the model tends to emit a non-blank label only at sharp spikes, so most frames are assigned the blank label and non-blank labels appear only in narrow regions, which makes the prior estimate dominated by the number of blank frames. Instead, we estimate more robust label priors from the label sequences in the training set, that is, we compute the priors from augmented label sequences. If the original label sequence is "IH Z", then an augmented sequence might be "∅ IH ∅ Z", and so on. By counting how the labels are distributed across frames, we obtain the label priors.
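A minimal sketch of this prior estimation is shown below, under the assumption that the augmented sequence simply inserts a blank before, between, and after the original labels; the exact augmentation used in the paper may differ. It also shows how the priors turn frame posteriors into scaled pseudo-likelihoods for decoding.

```python
# Minimal sketch: estimate label priors from augmented training label sequences,
# then turn frame posteriors into pseudo-likelihoods by dividing by the priors.
from collections import Counter
import math

BLANK = "∅"

def augment(labels):
    # Illustrative augmentation: insert a blank before, between, and after the labels.
    out = [BLANK]
    for lab in labels:
        out += [lab, BLANK]
    return out

def estimate_priors(transcripts):
    counts = Counter()
    for labels in transcripts:
        counts.update(augment(labels))
    total = sum(counts.values())
    return {lab: c / total for lab, c in counts.items()}

# Example usage with toy transcripts.
priors = estimate_priors([["IH", "Z"], ["HH", "AW"]])

def pseudo_log_likelihood(log_posterior, label, priors, acoustic_scale=0.7):
    # Scaled log(posterior / prior); acoustic_scale corresponds to the
    # 0.5-0.9 scaling factor tuned in the experiments below.
    return acoustic_scale * (log_posterior - math.log(priors[label]))
```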

With the WFST-based approach described above, let's look at the experiments. After the posteriors are normalized by the priors, the acoustic model scores need to be scaled down by a factor between 0.5 and 0.9, with the optimal value determined experimentally. The experiments were conducted on WSJ. The best model in the paper is a phoneme-based RNN model that achieves a WER of 7.87% on the eval92 test set when using both the lexicon and the language model; the WER rises sharply to 26.92% when only the lexicon is used. The table below compares the Eesen model with traditional hybrid models. From this table we can see that Eesen is slightly worse than the hybrid HMM/DNN model, but on larger data sets, such as Switchboard, CTC-trained models can achieve better results than traditional models.

A significant advantage of Eesen is that decoding is much faster than with the hybrid HMM/DNN model. This acceleration comes from a large reduction in the number of states. As can be seen from the decoding speeds in the table below, Eesen achieves a speedup of more than 3.2×. Moreover, the TLG graph used by Eesen is significantly smaller than the HCLG graph used in the HMM/DNN system, which also saves disk space for storing the model.

In summary, the work presented in this paper demonstrates an RNN- and CTC-based speech recognition model in which WFST-based decoding effectively fuses the lexicon and the language model.
