
INTRODUCTION

Deep Voice 3 is a new fully convolutional TTS architecture proposed by Baidu. The paper's main contributions fall into the following five areas:

  1. A fully convolutional character-to-spectrogram architecture is proposed, which enables fully parallel computation and is faster than architectures built on recurrent units
  2. Deep Voice 3 trains very quickly and scales to the LibriSpeech speech dataset, which contains 820 hours of audio from 2,484 speakers
  3. It can produce monotonic attention behavior, avoiding common errors in seq2seq speech synthesis
  4. The quality of several waveform synthesis methods, including WORLD, Griffin-Lim, and WaveNet, is compared
  5. The implementation of the Deep Voice 3 inference kernel is described, which can serve up to ten million inferences per day on a single GPU

ARCHITECTURE

Deep Voice 3 converts various text features (such as characters, phonemes, and stresses) into various vocoder parameters, such as the mel spectrogram, linear-scale log-magnitude spectrogram, fundamental frequency, and spectral envelope. These vocoder parameters can then be used as input to a waveform synthesis model

The Deep Voice 3 architecture consists of three components:

  • Encoder: fully convolutional, used to extract text features
  • Decoder: also fully convolutional; it decodes the extracted text features into low-dimensional audio features in an autoregressive way using a multi-hop convolutional attention mechanism
  • Converter: again fully convolutional; it takes the decoder's hidden states and predicts the final vocoder parameters (which depend on the choice of vocoder). Unlike the decoder, the converter is non-causal, so it can rely on future context information

The optimization target is a linear combination of the decoder and converter losses. The authors separate the decoder and converter and apply multi-task training because it makes attention learning easier in practice. Specifically, the loss on mel spectrogram prediction guides the training of the attention mechanism, since attention is trained with the gradients from mel spectrogram prediction in addition to those from vocoder parameter prediction
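As a rough illustration only, the combined objective could be written like the following PyTorch-style sketch; the loss weights and tensor names here are assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def deep_voice3_loss(mel_pred, mel_target,
                     done_pred, done_target,
                     vocoder_pred, vocoder_target,
                     w_mel=1.0, w_done=1.0, w_vocoder=1.0):
    """Linear combination of decoder and converter losses (illustrative weights)."""
    # Decoder: L1 loss on the predicted mel spectrogram frames
    mel_loss = F.l1_loss(mel_pred, mel_target)
    # Decoder: binary cross-entropy on the "stop" (done) prediction
    done_loss = F.binary_cross_entropy_with_logits(done_pred, done_target)
    # Converter: L1 loss on the vocoder parameters (e.g. a linear spectrogram)
    vocoder_loss = F.l1_loss(vocoder_pred, vocoder_target)
    return w_mel * mel_loss + w_done * done_loss + w_vocoder * vocoder_loss
```

Because all three terms contribute gradients to the shared encoder, the mel term alone is already enough to shape the attention alignment, which is the point the paper makes about multi-task training.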

TEXT PREPROCESSING

  1. Uppercase all letters
  2. Remove all intermediate punctuation marks
  3. End every utterance with exactly one period or question mark
  4. Replace the spaces between words with special separator characters that indicate how long the speaker pauses between words. There are four types of separators: slurred-together words, standard pronunciation with a space character, a short pause between words, and a long pause between words. For example, "Either way, you should shoot very slowly," becomes "Either way%you should shoot/very slowly%.", with % marking a long pause and / marking a short pause. Pause durations can be annotated manually or obtained from a text-audio aligner (a minimal sketch of steps 1-3 follows this list)
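A minimal sketch of steps 1-3 in plain Python, assuming the pause markers of step 4 (% and /) are supplied separately by manual annotation or a text-audio aligner; the function name is mine:

```python
import re

def normalize_text(text: str) -> str:
    """Uppercase, strip punctuation, and force a sentence-final period or question mark."""
    text = text.upper()
    # Remember whether the utterance was a question before stripping punctuation
    ends_with_question = text.rstrip().endswith("?")
    # Remove all punctuation except the pause markers "%" and "/"
    text = re.sub(r"[^\w\s%/]", "", text).strip()
    return text + ("?" if ends_with_question else ".")

print(normalize_text("Either way, you should shoot very slowly,"))
# EITHER WAY YOU SHOULD SHOOT VERY SLOWLY.
```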

CONVOLUTION BLOCKS

The convolution block contains a one-dimensional convolution filter, a gated linear unit as a learnable nonlinearity, a residual connection, and a scaling factor of $\sqrt{0.5}$. To introduce speaker-dependent features, the speaker embedding is passed through a softsign activation and added as a bias to the output of the convolution filter. The weights of the convolution filter are initialized from a standard normal distribution

Softsign function:


$$
y = F(x) = \frac{x}{1 + |x|}
$$
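A minimal PyTorch-style sketch of such a convolution block, assuming a (batch, channels, time) tensor layout; dropout and the exact initialization are omitted, and the speaker projection is an assumption about how the bias is produced:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """1-D convolution + gated linear unit + residual connection, scaled by sqrt(0.5)."""
    def __init__(self, channels, kernel_size, speaker_dim=None, dilation=1):
        super().__init__()
        # The convolution outputs 2*channels so the GLU can split it into value and gate
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=(kernel_size - 1) // 2 * dilation,
                              dilation=dilation)
        # Optional speaker embedding projection, added as a bias after a softsign
        self.speaker_proj = nn.Linear(speaker_dim, channels) if speaker_dim else None

    def forward(self, x, speaker_embed=None):
        out = self.conv(x)                     # (B, 2C, T)
        value, gate = out.chunk(2, dim=1)      # split channels for the gated linear unit
        if self.speaker_proj is not None and speaker_embed is not None:
            bias = F.softsign(self.speaker_proj(speaker_embed))   # (B, C)
            value = value + bias.unsqueeze(-1)
        out = value * torch.sigmoid(gate)      # gated linear unit
        return (out + x) * math.sqrt(0.5)      # residual connection and scaling
```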

ENCODER

The encoder network starts with a text embedding, converting characters or phonemes into a trainable vector representation $h_e$. $h_e$ is then fed into a fully connected layer (the PreNet) to project it onto the target dimension. The PreNet output is passed through a series of convolution blocks to extract time-dependent text information, and finally projected back to the text embedding dimension to create the attention key vectors $h_k$. The attention value vectors are computed from the key vectors and the text embedding as $h_v = \sqrt{0.5}(h_k + h_e)$, so that they carry both the local information in $h_e$ and the long-term context information in $h_k$. The key vectors $h_k$ are used by each attention block to compute attention weights, and the final context vector is computed as the weighted average of the value vectors $h_v$
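The key/value combination itself is a simple element-wise operation; a minimal sketch (tensor names are mine):

```python
import math
import torch

def encoder_outputs(h_e: torch.Tensor, h_k: torch.Tensor):
    """Combine text embedding h_e and convolution output h_k into the attention value h_v."""
    # sqrt(0.5) keeps the variance of the sum comparable to that of the inputs
    h_v = math.sqrt(0.5) * (h_k + h_e)
    return h_k, h_v
```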

DECODER

The decoder predicts the next r (r > 1) mel spectrogram frames in an autoregressive manner. Causal convolution (also called masked convolution) is used in the decoder because it must not use data from later timesteps

The mel spectrogram data first passes through a PreNet and is then transformed into the query matrix by causal convolution layers. The encoder outputs provide the key and value matrices for the attention operation. Multiple such layers are stacked, and the next r mel spectrogram frames are predicted through a fully connected layer, which also predicts whether generation should stop (similar to Tacotron 2). The loss function combines an L1 loss on the spectrogram and a cross-entropy loss on the stop prediction
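Causal (masked) convolution can be implemented by padding only on the left, so that the output at frame t never sees inputs after t; a minimal PyTorch-style sketch, assuming a (batch, channels, time) layout:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at the current and past timesteps."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))   # left padding preserves causality
        return self.conv(x)                       # output length stays T
```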

ATTENTION BLOCK

The attention module uses the standard dot-product formulation: it first computes attention weights from the query and key matrices, then takes a weighted sum of the value matrix to obtain the context vector. In addition, the attention block introduces a positional encoding $h_p(i)$ to help align text and spectrogram


$$
h_p(i) = \sin\!\left(\omega_s i / 10000^{k/d}\right), \quad i = 0, 2, 4, \ldots
$$
$$
h_p(i) = \cos\!\left(\omega_s i / 10000^{k/d}\right), \quad i = 1, 3, 5, \ldots
$$

Here $i$ is the timestep index, $k$ is the channel index in the positional encoding, $d$ is the total number of channels in the positional encoding, and $\omega_s$ is the position rate of the encoding. The position rate determines the average slope of the line in the attention distribution, roughly corresponding to speaking rate. For a single speaker, $\omega_s$ is fixed to 1 for the query and, for the key, to the ratio of output timesteps to input timesteps. For multiple speakers, $\omega_s$ is computed from a per-speaker embedding (left side of the figure below)
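A minimal NumPy sketch of this positional encoding with position rate ω_s, following the formula above (the function name is mine):

```python
import numpy as np

def positional_encoding(num_steps: int, d: int, w_s: float = 1.0) -> np.ndarray:
    """h_p[i, k] = sin(w_s * i / 10000**(k/d)) for even i, cos(...) for odd i."""
    i = np.arange(num_steps)[:, None]            # timestep index, shape (T, 1)
    k = np.arange(d)[None, :]                    # channel index, shape (1, d)
    angles = w_s * i / np.power(10000.0, k / d)
    h_p = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
    return h_p                                   # shape (T, d), added to keys/queries
```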

The detailed process is shown in the figure below

In machine translation, the word order of the source sentence and the target sentence does not correspond strictly monotonically, whereas in speech synthesis the audio is read out in the order of the text, so the alignment is much stricter, namely monotonic

CONVERTER

The converter network takes the activations from the last hidden layer of the decoder as input, passes them through several non-causal convolution blocks, and then predicts the parameters of the downstream vocoder. Unlike the decoder, the converter is non-causal and non-autoregressive, so it can use future information from the decoder to make its predictions. Several vocoders can be used, such as Griffin-Lim or WaveNet; WaveNet gives better quality. The overall model framework is shown in the figure below
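As a rough illustration of the lightest-weight option, Griffin-Lim can reconstruct a waveform from a predicted linear-scale magnitude spectrogram, for example with librosa; the placeholder spectrogram and STFT parameters below are assumptions:

```python
import librosa
import numpy as np

# linear_spec: predicted linear-scale magnitude spectrogram, shape (1 + n_fft // 2, frames)
linear_spec = np.abs(np.random.randn(513, 200)).astype(np.float32)  # placeholder values

# Iteratively estimate the phase and invert the STFT to obtain a waveform
waveform = librosa.griffinlim(linear_spec, n_iter=60, hop_length=256)
```

WaveNet replaces this phase-estimation step with a learned autoregressive model of the waveform, which is why it yields noticeably higher audio quality at a much higher inference cost.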

RESULTS

The Deep Voice 3 model uses fully convolutional blocks instead of GRUs to extract text and spectrogram features, which greatly improves GPU utilization during training. At the same batch size, training is about 10 times faster than Tacotron, and the number of steps required for convergence is only about a quarter of Tacotron's. The naturalness of the synthesized speech also improves after monotonic attention is added

REFERENCE

  • Introduction to Neural Network Speech Synthesis Models - DeepVoice3
  • Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
  • Deep Voice 3 paper