Abstract: In this article, we show that CLAS, a fully neural, end-to-end contextual ASR model, integrates contextual information by embedding the contextual phrases and attending to them. In our experimental evaluation, the proposed CLAS model outperformed the standard Shallow Fusion biasing method.

This article is shared from the Huawei Cloud Community post “How to Solve Context Bias? The Road to End-to-End ASR for Proprietary Domains (II)”, original author: Xiaoye0829.

Here we present a piece of work on end-to-end ASR for proprietary domains, “Deep Context: End-to-End Contextual Speech Recognition”, also from the same research team at Google.

In ASR, what a user says depends on the context in which they are speaking, and this context is usually represented by a set of n-grams. In this work, we investigate how to exploit such contextual information in an end-to-end model. The core approach of this paper can be seen as a contextual LAS [1] model that is jointly optimized with embeddings of the contextual n-grams; this contrasts with Shallow Fusion, in which independently trained n-gram and LAS models are combined only during beam search.

In this work, we consider dynamically integrating contextual information into the recognition process. In traditional ASR systems, a mainstream approach to incorporating contextual information is an independently trained on-the-fly rescoring framework that dynamically adjusts the weights of a small number of n-grams relevant to a particular scenario. It is therefore important to extend this technique to seq2seq ASR models. To bias the recognition process toward a specific task, previous work has incorporated a separately trained LM into recognition, commonly known as Shallow Fusion or Cold Fusion. In [2], Shallow Fusion was used to build a contextual LAS, in which the output probabilities of the LAS were modified by a WFST constructed from the speaker's context, improving the results.

Previous work used an externally and independently trained LM for on-the-fly rescoring, which runs counter to the benefits of jointly optimizing a seq2seq model. Therefore, in this article we propose Contextual LAS (CLAS), which takes a set of contextual phrases as additional input to enhance recognition. Our approach first maps each phrase to a fixed-dimensional embedding, and then uses an attention mechanism at each output prediction step to summarize the available contextual information. Our method can be viewed as a generalization of the streaming keyword spotting technique in [3], allowing a variable number of contextual phrases at inference time. The proposed model does not require the specific contextual information to be known at training time, does not require careful tuning of rescoring weights, and can still handle out-of-vocabulary (OOV) terms.

This article covers the standard LAS model, the standard contextual biasing approach, and our proposed CLAS model.

The LAS model is a seq2seq model consisting of an encoder and an attention-based decoder. At each decoding step, the attention mechanism dynamically computes a weight for each encoder hidden state and produces the current context vector as their weighted sum. The input x of the model is the speech signal, and the output y is a sequence of graphemes (characters: a–z, 0–9, &lt;space&gt;, &lt;comma&gt;, &lt;period&gt;, &lt;apostrophe&gt;, &lt;unk&gt;).

The LAS output distribution at each step is:

$$P(y_t \mid \mathbf{x}, y_{<t}) = \mathrm{Softmax}\left(W_s\,[c_t; d_t] + b_s\right)$$

This formula depends on the encoder state vectors $h^x$, the decoder hidden state $d_t$, and the context vector $c_t$, which an attention mechanism computes by aggregating the decoder state and the encoder outputs.
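To make this concrete, here is a minimal NumPy sketch of a single additive-attention step. The weight names (W_h, W_d, v) and the toy dimensions are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(h_x, d_t, W_h, W_d, v):
    """One additive-attention step: score every encoder frame against
    the current decoder state, then take the weighted sum of frames."""
    energies = np.tanh(h_x @ W_h.T + d_t @ W_d.T) @ v  # (T,)
    alpha = softmax(energies)                          # attention weights
    c_t = alpha @ h_x                                  # context vector (H,)
    return c_t, alpha

# Toy dimensions: T=5 encoder frames, H=4 encoder dim, D=3 decoder dim, A=6 attention dim.
rng = np.random.default_rng(0)
h_x = rng.normal(size=(5, 4))   # encoder state vectors h^x
d_t = rng.normal(size=(3,))     # decoder hidden state d_t
W_h, W_d, v = rng.normal(size=(6, 4)), rng.normal(size=(6, 3)), rng.normal(size=(6,))
c_t, alpha = attention_step(h_x, d_t, W_h, W_d, v)
# P(y_t | x, y_<t) would then be Softmax(W_s @ np.concatenate([c_t, d_t]) + b_s)
```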

In the standard contextual LAS setup, we assume that a set of word-level bias phrases is known in advance and compiled into a WFST. This word-level WFST G is composed with a "speller" FST S, which maps sequences of graphemes or word-pieces into words. The composition yields a contextual language model $C = \min(\det(S \circ G))$. The score $P_C(y)$ from this contextual LM can then be used during decoding to augment the standard log-probability term:

$$s(\mathbf{y}) = \log P(\mathbf{y} \mid \mathbf{x}) + \lambda \log P_C(\mathbf{y})$$

Here, λ is a tunable parameter that controls how much the contextual language model influences the overall score. In this scheme, the bias score is applied only at the word level, as illustrated in the figure below.

(Figure: example bias WFSTs; the subword-level variant in panel (c) includes negative-weight subtractive costs.)

Consequently, if the relevant word never makes it into the beam, this technique cannot improve the result. Furthermore, we observed that while this method works well when the number of contextual phrases is small (e.g., "yes", "no", "cancel"), it degrades when the contextual phrases contain many proper nouns (e.g., song titles, contact names). Therefore, as shown in panel (c) of the figure, we explore assigning weights to the subword units of each word. To avoid having to hand-tune the weight of prefixes (subword sequences that match the beginning of a phrase but not the entire phrase), we also include a subtractive cost: the negative weights in panel (c), which take the boost back when a partial match fails. A minimal sketch of this scoring scheme follows.
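As a rough illustration (not the paper's WFST implementation), the following Python sketch applies a per-subword boost for prefixes of bias phrases and subtracts the accumulated boost when a partial match fails. The trie-style prefix set stands in for the compiled contextual WFST, and all names and weights are illustrative.

```python
def biased_score(log_p_asr, hyp, phrases, lam=0.5, boost=1.0):
    """score(y) = log P(y|x) + lam * bias(y), with per-subword boosts.

    Each subword unit that extends a prefix of some bias phrase earns
    `boost`; if the match fails before completing a phrase, the
    accumulated boost is taken back (the subtractive cost), so only
    completed phrases keep their reward."""
    prefixes = {tuple(p[:i]) for p in phrases for i in range(1, len(p) + 1)}
    full = {tuple(p) for p in phrases}
    bias, match, pending = 0.0, (), 0.0
    for u in hyp:
        if match + (u,) in prefixes:
            match += (u,)
            bias += boost
            pending += boost
            if match in full:               # full phrase matched: keep its boost
                match, pending = (), 0.0
        else:
            bias -= pending                 # failed partial match: subtract boost
            match, pending = (), 0.0
            if (u,) in prefixes:            # unit may start a new match
                match, bias, pending = (u,), bias + boost, boost
    bias -= pending                         # incomplete match at end of hypothesis
    return log_p_asr + lam * bias

# Example: bias toward the phrase "play back", split into subword units.
phrases = [["_play", "_back"]]
print(biased_score(-7.2, ["_play", "_back"], phrases))  # boosted: -6.2
print(biased_score(-7.2, ["_play", "_ball"], phrases))  # boost taken back: -7.2
```

A real implementation would track matches incrementally inside the beam search rather than rescoring whole hypotheses, but the subtractive-cost bookkeeping is the same.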

We now introduce the contextual LAS (CLAS) model proposed in this paper. It uses a set of bias phrases Z as additional contextual information to model $P(y \mid x, Z)$ effectively. Each element of Z is a phrase relevant to a particular context, such as a contact name or a song title. We write the bias phrases as $Z = \{z_1, z_2, \ldots, z_N\}$. These phrases are used to bias the model toward outputting particular phrases. However, not all bias phrases are relevant to the utterance currently being processed; the model must determine which phrases are likely to be relevant and use them to modify its output distribution. We augment LAS with a bias-encoder that encodes the phrases as $h^z = \{h^z_0, h^z_1, \ldots, h^z_N\}$; the superscript z distinguishes them from the acoustic vectors. Here $h^z_i$ is the embedding of $z_i$. Since none of the bias phrases may be relevant to the current utterance, we include an additional learnable vector $h^z_0 = h^z_{nb}$, corresponding to the no-bias option, i.e., using no bias phrase in the output; this lets the model ignore all bias phrases. The bias-encoder is a multi-layer LSTM network: $h^z_i$ is obtained by feeding the subword embedding sequence of $z_i$ into the bias-encoder and taking the final LSTM state as the embedding of the entire phrase. We then use an additional attention mechanism to summarize $h^z$:

$$u^z_{it} = v_z^{\top}\tanh\left(W^z_h h^z_i + W^z_d d_t + b^z\right), \qquad \alpha^z_t = \mathrm{Softmax}(u^z_t), \qquad c^z_t = \sum_{i=0}^{N} \alpha^z_{it}\, h^z_i$$

The input to the decoder then becomes $c_t = [c^x_t; c^z_t]$; everything else is identical to the standard LAS model. A code sketch of the bias-encoder and this secondary attention is given below.
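Here is a minimal PyTorch sketch of the bias-encoder and the secondary (bias) attention described above. The layer sizes, class names, and the single-layer LSTM are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasEncoder(nn.Module):
    """Embeds each bias phrase z_i into a fixed-dimensional vector h_i^z:
    run the phrase's subword embeddings through an LSTM and keep the
    final state. Index 0 is the learnable no-bias vector h_nb^z."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.no_bias = nn.Parameter(torch.zeros(hid_dim))  # h_nb^z

    def forward(self, phrases):
        # phrases: list of 1-D LongTensors of subword ids, one per phrase
        embs = []
        for z in phrases:
            _, (h_n, _) = self.lstm(self.emb(z).unsqueeze(0))
            embs.append(h_n[-1, 0])                 # final LSTM state
        return torch.stack([self.no_bias] + embs)   # (N+1, hid_dim)

class BiasAttention(nn.Module):
    """Secondary attention over the phrase embeddings h^z. Its output
    c_t^z is concatenated with the acoustic context: c_t = [c_t^x; c_t^z]."""
    def __init__(self, dec_dim, hid_dim=128, att_dim=128):
        super().__init__()
        self.w_d = nn.Linear(dec_dim, att_dim)
        self.w_z = nn.Linear(hid_dim, att_dim)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, d_t, h_z):
        # d_t: (dec_dim,) decoder state; h_z: (N+1, hid_dim)
        u = self.v(torch.tanh(self.w_z(h_z) + self.w_d(d_t))).squeeze(-1)
        alpha = F.softmax(u, dim=0)   # alpha_i ~ relevance of phrase i
        return alpha @ h_z, alpha     # c_t^z and the phrase distribution
```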

It is worth noting that the attention distribution $\alpha^z_t$ above explicitly models the probability of attending to each particular phrase at the current step, given the audio and the previous outputs.
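Continuing the sketch above, the attention weights can be read directly as this phrase distribution (sizes and names remain illustrative):

```python
# Toy usage of the sketch above: three bias phrases plus the no-bias option.
torch.manual_seed(0)
enc = BiasEncoder(vocab_size=100)
att = BiasAttention(dec_dim=256)
phrases = [torch.tensor([5, 17]), torch.tensor([42]), torch.tensor([8, 9, 3])]
h_z = enc(phrases)            # (4, 128): h_nb^z plus one vector per phrase
d_t = torch.randn(256)        # current decoder state
c_t_z, alpha = att(d_t, h_z)  # alpha[0] is the weight of the no-bias option
# alpha[i] can be read as P(phrase i is relevant | x, y_<t)
```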

Now let's look at the experiments, which were carried out on 25,000 hours of English data. The training data were augmented with a room simulator that added noise and reverberation of varying intensity, artificially corrupting clean speech so that the SNR ranged from 0 to 30 dB; the noise sources came from YouTube and recordings of noisy everyday environments. The encoder consists of 10 unidirectional LSTM layers with 256 cells each, the bias-encoder is a single-layer LSTM with 512 cells, and the decoder consists of 4 LSTM layers with 256 cells each. The experiments were evaluated on several test sets targeting different biasing scenarios.

First, to check whether the introduced bias module would hurt decoding when no bias phrases are present, we compared our CLAS model with the plain LAS model. The CLAS model was trained with random bias phrases but was given no bias phrases at test time. Surprisingly, CLAS achieved better performance than LAS even when no bias phrases were provided.

We further compared different on-the-fly rescoring schemes, which differ in how they assign weights to subword units. The best models apply the bias to every subword unit, which helps keep the relevant words in the beam. All subsequent on-the-fly rescoring experiments therefore apply the bias at the subword level.

Next, we compared CLAS against the schemes above.

CLAS significantly outperforms the traditional approaches and requires no additional hyperparameter tuning.

Finally, we combined CLAS with the traditional approach and found that biasing and on-the-fly rescoring are complementary: both contribute to the improvement.

In this article, we showed that CLAS, a fully neural, end-to-end contextual ASR model, integrates contextual information by embedding the contextual phrases and attending to them during decoding. In our experimental evaluation, the proposed CLAS model outperformed the standard Shallow Fusion biasing method.

[1] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP, 2016.

[2] I. Williams, A. Kannan, P. Aleksic, D. Rybach, and T. N. Sainath, "Contextual speech recognition in end-to-end neural network systems using beam search," in Proc. Interspeech, 2018.

[3] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, "Streaming small-footprint keyword spotting using sequence-to-sequence models," in Proc. ASRU, 2017.
