Abstract: In the work presented in this paper, we demonstrate a speech recognition model based on RNNs and CTC, in which WFST-based decoding can effectively integrate the dictionary and the language model.

This article is shared from Huawei Cloud Community “How to Solve Context Shift? The Way to End-to-End ASR for Proprietary Domain (III)”, the original author: Xiaoye0829.

In this post we introduce an approach that combines CTC with WFSTs (Weighted Finite-State Transducers): “EESEN: End-to-End Speech Recognition Using Deep RNN Models and WFST-Based Decoding”.

In this work, the acoustic model is an RNN that predicts context-independent phonemes or characters, and CTC is used to learn the alignment between speech and label sequences. A distinctive feature of the paper is a general WFST-based decoding method that integrates the dictionary and the language model into CTC decoding. In this method, the CTC labels, the dictionary, and the language model are each encoded as a WFST and then composed into a single comprehensive search graph. This WFST-based approach makes it easy to handle blank labels and beam search in CTC.

In this blog post, we will not cover the RNN and CTC parts; instead, we focus on how decoding with WFSTs works. A WFST is a finite-state acceptor (FSA) in which each transition carries an input symbol, an output symbol, and a weight.
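
To make the definition concrete, here is a minimal sketch of a WFST as a plain Python data structure. The names `Arc`, `transduce`, and `EPS`, the toy two-arc transducer, and its weights are all illustrative assumptions, not part of EESEN or OpenFst. Weights are treated as costs that add up along a path.

```python
from dataclasses import dataclass

EPS = "<eps>"  # assumed name for the empty output symbol


@dataclass(frozen=True)
class Arc:
    src: int       # source state
    dst: int       # destination state
    ilabel: str    # input symbol
    olabel: str    # output symbol
    weight: float  # cost of taking this transition


# Toy transducer (hypothetical weights): reads "i", "s" and outputs the word "is".
ARCS = [
    Arc(0, 1, "i", "is", 0.5),
    Arc(1, 2, "s", EPS, 0.25),
]
START, FINALS = 0, {2}


def transduce(arcs, start, finals, inputs):
    """Follow one matching arc per input symbol.

    Returns (output symbols, total weight), or None if the input is rejected.
    Assumes the transducer is deterministic on its input labels.
    """
    state, outputs, total = start, [], 0.0
    for sym in inputs:
        matches = [a for a in arcs if a.src == state and a.ilabel == sym]
        if not matches:
            return None
        arc = matches[0]
        if arc.olabel != EPS:
            outputs.append(arc.olabel)
        total += arc.weight
        state = arc.dst
    return (outputs, total) if state in finals else None


result = transduce(ARCS, START, FINALS, ["i", "s"])
```

A real toolkit stores the same information (states, arcs with input label, output label, and weight, plus final states) in a far more compact form and supports non-deterministic graphs.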

The figure above is a schematic of a language model WFST. The weight on each arc is the probability of emitting the next word given the previous word. Node 0 is the start node and node 4 is the end node. A path through the WFST takes a sequence of input symbols and emits a sequence of output symbols. In our decoding method, the CTC labels, the lexicon, and the language model are each represented as a separate WFST. Then, using a highly optimized FST library such as OpenFst, we can efficiently combine these WFSTs into a single search graph. Let’s begin by describing how each individual WFST is built.

  • 1. Grammar. A grammar WFST encodes the word sequences that the language allows. The figure above is a simplified language model. It contains two sequences: “How are you” and “How is it”. The basic symbol unit of this WFST is the word, and the weight on each arc is a language model probability. With this WFST representation, CTC decoding can in principle make use of any language model that can be converted into a WFST. Following the notation used in Kaldi, the language model WFST is denoted G.
  • 2. Dictionary. A dictionary (lexicon) WFST encodes the mapping from sequences of dictionary units to words. There are two cases, depending on the modeling unit of the RNN’s labels. If the labels are phonemes, the dictionary is the same standard pronunciation dictionary used in traditional hybrid models. If the labels are characters, the dictionary simply contains the spelling of each word. The difference between the two cases is that a spelling dictionary can easily be extended to include any OOV (out-of-vocabulary) word. In contrast, extending a phoneme dictionary is less straightforward: it relies on some grapheme-to-phoneme rules or model and is error-prone. The dictionary WFST is denoted L, and the following figure shows two examples of how L is built:

The first example shows the construction for a phoneme dictionary, where the entry is “is IH Z”. The second example shows the construction for a spelling dictionary, where the entry is “is i s”. For spelling dictionaries, there is one more complication to deal with. When the CTC labels are characters, we usually insert an extra space label between every two words in the original transcript to model word boundaries. When decoding, we allow the space to appear optionally at the beginning and at the end of a word. This case is easily handled by the WFST.
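
The spelling-dictionary case can be sketched in a few lines of Python. This is only an illustration of the mapping the dictionary WFST encodes, not how EESEN builds L: the label name `<space>` and the helper names `spelling_entry`, `build_spelling_lexicon`, and `match_word` are assumptions made for this example.

```python
SPACE = "<space>"  # assumed name for the word-delimiter label


def spelling_entry(word):
    """Dictionary entry for a word: simply its sequence of characters."""
    return list(word.lower())


def build_spelling_lexicon(words):
    """Map each word to its spelling, e.g. "is" -> ["i", "s"]."""
    return {word: spelling_entry(word) for word in words}


def match_word(lexicon, chars):
    """Map a character sequence to a word.

    One optional SPACE is accepted at the start and at the end,
    mirroring the optional space arcs described in the text.
    """
    chars = list(chars)
    if chars and chars[0] == SPACE:
        chars = chars[1:]
    if chars and chars[-1] == SPACE:
        chars = chars[:-1]
    for word, spelling in lexicon.items():
        if spelling == chars:
            return word
    return None


lexicon = build_spelling_lexicon(["is", "it"])
# Adding an out-of-vocabulary word needs no grapheme-to-phoneme model.
lexicon["eesen"] = spelling_entry("eesen")
```

For instance, `match_word(lexicon, [SPACE, "i", "s"])` and `match_word(lexicon, ["i", "s", SPACE])` both resolve to the word "is".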

In addition to English, we also show an entry from a Chinese dictionary here.

  • 3. Token. The third WFST maps a frame-level sequence of CTC labels to a single dictionary unit (a phoneme or a character). For each dictionary unit, the token-level WFST covers all the frame-level label sequences that can correspond to it. This WFST therefore allows blank labels ∅ to appear, as well as repetitions of any non-blank label. For example, over 5 input frames, the RNN model might output three label sequences: “AAAAA”, “∅∅AA∅”, and “∅AAA∅”. The token WFST maps all three sequences to a single dictionary unit: “A”. The figure below shows the WFST for the phoneme “IH”, which allows blank labels to occur and the non-blank label “IH” to repeat. We denote this token WFST as T.
  • 4. Search graph. After compiling the three WFSTs separately, we compose them into a comprehensive search graph. First, the dictionary WFST L and the grammar WFST G are composed. In this process, determinization and minimization are used to compress the search space and speed up decoding. The resulting WFST LG is then composed with the token WFST T to produce the search graph. The overall order of FST operations is: S = T ∘ min(det(L ∘ G)). The search graph S encodes the mapping from the sequence of CTC labels emitted over the speech frames to a sequence of words. Specifically, the words in the language model are first expanded into phonemes to form the LG graph. The RNN then outputs a label (a phoneme or blank) for each frame, and the LG graph is searched according to this label sequence.
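
The mapping that S encodes can be illustrated by applying the three steps one after another in plain Python. This is a functional sketch of what T, L, and G each contribute, not an implementation of WFST composition, determinization, or minimization, and it ignores acoustic scores and beam search. The pronunciations other than “is IH Z”, the bigram probabilities, and all function names are assumptions chosen for the example.

```python
import math

BLANK = "∅"

# Illustrative phoneme dictionary; only "is IH Z" comes from the text above.
LEXICON = {
    "how": ["HH", "AW"],
    "are": ["AA", "R"],
    "you": ["Y", "UW"],
    "is": ["IH", "Z"],
    "it": ["IH", "T"],
}

# Hypothetical bigram probabilities for "how are you" / "how is it".
BIGRAMS = {
    ("<s>", "how"): 1.0,
    ("how", "are"): 0.5,
    ("how", "is"): 0.5,
    ("are", "you"): 1.0,
    ("is", "it"): 1.0,
}


def collapse(frames):
    """Token step (T): merge repeated labels, then drop blanks."""
    units, prev = [], None
    for label in frames:
        if label != prev and label != BLANK:
            units.append(label)
        prev = label
    return units


def segment(phones, lexicon):
    """Dictionary step (L): all word sequences whose pronunciations spell `phones`."""
    if not phones:
        return [[]]
    results = []
    for word, pron in lexicon.items():
        if phones[:len(pron)] == pron:
            for rest in segment(phones[len(pron):], lexicon):
                results.append([word] + rest)
    return results


def score(words, bigrams):
    """Grammar step (G): log probability, or None if the sequence is not allowed."""
    logp, prev = 0.0, "<s>"
    for word in words:
        prob = bigrams.get((prev, word))
        if prob is None:
            return None
        logp += math.log(prob)
        prev = word
    return logp


def decode(frames, lexicon=LEXICON, bigrams=BIGRAMS):
    """Return the best-scoring word sequence for a frame-level label sequence."""
    candidates = []
    for words in segment(collapse(frames), lexicon):
        logp = score(words, bigrams)
        if logp is not None:
            candidates.append((logp, words))
    return max(candidates, key=lambda c: c[0])[1] if candidates else None


frames = ["∅", "HH", "HH", "AW", "∅", "IH", "IH", "∅", "Z", "IH", "∅", "T", "T"]
words = decode(frames)
```

In the real system these three steps are not run in sequence: composing them ahead of time into one graph lets the decoder prune unlikely hypotheses frame by frame.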

When decoding with a hybrid DNN model, we need to scale the state posteriors from the DNN by the state priors, which are usually estimated from forced alignments of the training data. A similar procedure is needed to decode CTC-trained models. One option is to run the final RNN model over the whole training set, take the label with the maximum posterior at each frame as the frame-level alignment, and then estimate the label priors from this alignment. However, this method did not perform well in our experiments, partly because the softmax output of a CTC-trained model is highly peaked (that is, the model emits each non-blank label as a narrow spike, so the whole output contains many spikes). As a result, most frames are assigned the blank label, while non-blank labels appear only in very narrow regions, so the prior estimate is dominated by the number of blank frames. Instead, we estimate more robust label priors from the label sequences in the training set, that is, from the augmented label sequences used in CTC training. If the original label sequence is “IH Z”, the augmented sequence is “∅ IH ∅ Z ∅”, and so on. By counting how often each label occurs in these augmented sequences, we obtain the label priors.
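
The counting procedure can be sketched as follows. The sketch assumes the usual CTC augmentation (a blank before the first label, between every pair of labels, and after the last label); the function names and the two toy transcripts are made up for illustration. The last function shows the normalization step in the log domain, where dividing a posterior by a prior becomes a subtraction.

```python
import math
from collections import Counter

BLANK = "∅"


def augment(labels):
    """Insert a blank before, between, and after the labels (assumed CTC convention)."""
    augmented = [BLANK]
    for label in labels:
        augmented += [label, BLANK]
    return augmented


def estimate_priors(transcripts):
    """Estimate label priors by counting labels in the augmented transcripts."""
    counts = Counter()
    for labels in transcripts:
        counts.update(augment(labels))
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def normalize_log_posteriors(posteriors, priors):
    """Return log(posterior) - log(prior) per label.

    Assumes every posterior and prior is strictly positive.
    """
    return {
        label: math.log(p) - math.log(priors[label])
        for label, p in posteriors.items()
    }


# Toy training transcripts (hypothetical).
priors = estimate_priors([["IH", "Z"], ["IH", "T"]])
```

With these two toy transcripts the blank label accounts for 6 of the 10 augmented labels, which is still a large share but far less extreme than a count over peaked frame-level outputs would give.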

That covers the WFST-based decoding approach, so let’s take a look at the experiments. After the posteriors are normalized by the priors, the acoustic model scores need to be scaled down by a factor between 0.5 and 0.9, and the best value is determined empirically. The experiments in the paper were carried out on WSJ. The best model in the paper is a phoneme-based RNN model. On the eval92 test set, this model reaches a WER of 7.87% when both the dictionary and the language model are used; when only the dictionary is used, the WER rises sharply to 26.92%. The following table compares the EESEN model in this paper with the traditional hybrid model. From this table, we can see that the EESEN model is somewhat worse than the hybrid HMM/DNN model. However, on larger data sets, such as Switchboard, CTC-trained models can achieve better results than the traditional model.

One significant advantage of EESEN is that decoding is much faster than with the hybrid HMM/DNN model. This speedup comes from a large reduction in the number of states. As the decoding speeds in the table below show, EESEN achieves a speedup of more than 3.2 times. In addition, the TLG graph used by the EESEN model is significantly smaller than the HCLG graph used by the HMM/DNN model, which also saves disk space when storing the model.

In summary, in the work presented in this paper, we demonstrate a speech recognition model based on RNN and CTC, in which WFST-based decoding can effectively integrate the dictionary and the language model.
