Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification

Contents

ABSTRACT

1. INTRODUCTION

2. PREVIOUS WORK

3. DNN FOR SPEAKER VERIFICATION

3.1. DNN as a feature extractor

3.2. Enrollment and evaluation

3.3. DNN training procedure

4. EXPERIMENTAL RESULTS

4.1. The baseline system

4.2. The d-vector verification system

4.3. Effect of enrollment data

4.4. Noise robustness

4.5. System combination

5. CONCLUSIONS

Acknowledgments

6. REFERENCES



ABSTRACT

In this paper, we investigate the use of deep neural networks (DNNs) for a small footprint text-dependent speaker verification task. In the development stage, a DNN is trained to classify speakers at the frame level. During speaker enrollment, the trained DNN is used to extract speaker-specific features from the last hidden layer. The average of these speaker features, the d-vector, is taken as the speaker model. In the evaluation stage, a d-vector is extracted for each utterance and compared with the enrolled speaker model to make a verification decision. Experimental results show that the DNN-based speaker verification system achieves good performance compared with a popular i-vector system on a small footprint text-dependent speaker verification task. In addition, the DNN-based system is more robust to additive noise and outperforms the i-vector system at low false reject operating points. Finally, the combined system outperforms the i-vector system with 14% and 25% relative improvement in equal error rate (EER) in clean and noisy conditions, respectively.

1. INTRODUCTION

Speaker verification (SV) is the task of accepting or rejecting the identity claim of a speaker based on information in his/her speech signal. SV systems fall into two categories depending on the text to be spoken: text-dependent and text-independent. Text-dependent SV systems require the utterance to follow a fixed or prompted text phrase, while text-independent SV systems impose no constraint on the spoken text. In this paper, we focus on a small footprint text-dependent SV task with a fixed phrase, although the proposed technique can be extended to text-independent tasks.

The SV process can be divided into three stages:

  • Development stage: the background model is trained on a large dataset to capture general speaker characteristics. Background models range from a simple Gaussian mixture model (GMM) based universal background model (UBM) [1] to more complex joint factor analysis (JFA) based models [2,3,4].
  • Enrollment stage: new speakers are enrolled by deriving speaker-dependent models from their utterances. The enrollment speakers do not overlap with those used to train the background model.
  • Evaluation stage: each test utterance is scored against the enrolled speaker model and the background model, and the claimed identity is accepted or rejected (a code skeleton of these three stages follows below).
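To make these stages concrete, here is a minimal Python skeleton of the pipeline. All function and variable names are hypothetical placeholders introduced for illustration; they are not part of the original paper.

```python
# Hypothetical skeleton of the three SV stages (all names are placeholders).

def develop(background_corpus):
    """Development: train a background model (e.g., a UBM/JFA model or,
    as proposed in this paper, a DNN) on a large multi-speaker corpus."""
    ...

def enroll(background_model, enrollment_utterances):
    """Enrollment: derive a speaker-dependent model for a new speaker
    (disjoint from the development speakers) from a few utterances."""
    ...

def evaluate(background_model, speaker_model, test_utterance, threshold):
    """Evaluation: score the test utterance against the claimed speaker's
    model and accept the identity claim iff the score clears the threshold."""
    ...
```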

A wide variety of statistical methods has been studied for each of the three SV stages. State-of-the-art SV systems are based on i-vectors [5] and probabilistic linear discriminant analysis (PLDA). In these systems, a factor analysis model is used as a feature extractor that maps each utterance to a low-dimensional i-vector, a compact representation of the utterance for SV.

Inspired by the powerful feature extraction capability of deep neural networks (DNNs) and their recent success in speech recognition [6], we propose a DNN-based SV technique that uses a DNN as a speaker feature extractor. A new DNN-based background model is trained to model speakers directly: the DNN learns to map frame-level features in a given context window to the corresponding speaker identity target. During enrollment, the speaker model is computed as the average of activations derived from the last hidden layer of the DNN, which we call a deep vector or "d-vector". In the evaluation stage, we make verification decisions from the distance between the target d-vector and the test d-vector, similar to the i-vector SV system. One significant advantage of using DNNs for SV is that they are easy to integrate into a state-of-the-art speech recognition system, since the two can share the same DNN inference engine and a simple filterbank-energy front end.

The rest of this paper is organized as follows. Section 2 reviews previous work on SV. Section 3 describes the proposed DNN-based SV system. Section 4 presents experimental results on a small footprint text-dependent SV task: the DNN-based system is compared with the i-vector system in both clean and noisy conditions, we evaluate performance with different numbers of enrollment utterances, and we describe the improvement obtained by combining the two systems. Finally, Section 5 concludes the paper and discusses future work.

2. PREVIOUS WORK

The combination of i-vectors and PLDA [5,7] has become the dominant approach for text-independent speaker recognition. The i-vector represents an utterance in a low-dimensional space named the total variability space. Given an utterance, the speaker- and session-dependent GMM supervector is modeled as follows:


M = m + Tw (1)

where m is a speaker- and session-independent supervector, generally taken from the UBM, T is a low-rank rectangular matrix called the total variability matrix (TVM), and w is a random vector with a standard normal distribution N(0, I). The vector w contains the total factors and is called the i-vector.
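As a toy illustration of Eq. (1), and not the actual i-vector extractor (which computes the posterior of w from Baum-Welch statistics), the following numpy sketch generates a supervector from the model and recovers the total factors w by least squares. All dimensions and values are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

D, R = 2048, 400                               # supervector / i-vector dims (illustrative)
T = rng.standard_normal((D, R)) / np.sqrt(D)   # low-rank total variability matrix
m = rng.standard_normal(D)                     # speaker/session-independent UBM supervector

w_true = rng.standard_normal(R)                # total factors, w ~ N(0, I)
M = m + T @ w_true                             # Eq. (1): utterance-dependent supervector

# Least-squares point estimate of w given an observed supervector.
w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)
print(np.allclose(w_hat, w_true))              # True: exact in this noiseless toy
```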

Moreover, applying PLDA to i-vectors decomposes the total variability into speaker and session variability more effectively than JFA. The i-vector/PLDA technique and its variants have also been used successfully for text-dependent speaker recognition tasks [8,9,10].

In earlier studies, neural networks were investigated for speaker recognition [11,12]. As nonlinear classifiers, neural networks can discriminate between the characteristics of different speakers. They are typically used as binary classifiers of target and non-target speakers, or as multi-class classifiers for speaker identification. The auto-associative neural network (AANN) [13] was proposed to use the difference between the reconstruction errors computed from a UBM-AANN and a speaker-specific AANN as the verification score. Multi-layer perceptrons (MLPs) with a bottleneck layer have been used to extract robust features for speaker recognition [14]. More recently, there have been preliminary studies on applying deep learning to speaker recognition, such as the use of convolutional deep belief networks [15] and Boltzmann machine classifiers [16].

3. DNN FOR SPEAKER VERIFICATION


Figure 1. The proposed DNN-based background model for speaker verification.

The proposed background DNN model for SV is shown in Figure 1. The idea is similar to [15] in the sense that a neural network is used to learn speaker-specific features. The main differences are that we perform supervised training here and use a DNN instead of a convolutional neural network. Furthermore, in this paper we evaluate on an SV task rather than a simpler speaker identification task.

3.1. DNN as a feature extractor

At the heart of the approach proposed in this work is the idea of using a DNN architecture as a speaker feature extractor. As in the i-vector approach, we look for a more abstract and compact representation of the speaker's acoustic frames, but using a DNN rather than a generative factor analysis model.

To do this, we first build a supervised DNN, operating at the frame level, to classify the speakers in the development set. The input of this background network is formed by stacking each training frame with its neighboring context frames. The number of outputs corresponds to the number of speakers in the development set, N. The target label is an N-dimensional one-hot vector whose only non-zero component is the one corresponding to the speaker identity. Figure 1 illustrates the DNN topology.
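A minimal numpy sketch of the one-hot target construction described above; the per-frame speaker labels here are synthetic.

```python
import numpy as np

num_speakers = 496                      # N, the number of development speakers
frame_labels = np.array([3, 3, 41])     # synthetic per-frame speaker identities

# One-hot targets: N-dimensional vectors whose only non-zero component
# corresponds to each frame's speaker identity.
targets = np.eye(num_speakers)[frame_labels]
print(targets.shape, targets.sum(axis=1))   # (3, 496) [1. 1. 1.]
```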

Once the DNN has been trained, we use the accumulated output activations of the last hidden layer as the new speaker representation. That is, for every frame of a given utterance from a new speaker, we compute the output activations of the last hidden layer using standard feedforward propagation through the trained DNN, and then accumulate those activations to form a new compact representation of that speaker, the d-vector. We choose to use the output of the last hidden layer rather than the softmax output layer for several reasons.

First, we can reduce the DNN model size at run time by pruning away the output layer, which also allows us to use a large number of development speakers without increasing the run-time DNN size (for example, we could train against 10,000 or 1,000 speakers by changing only the size of the final softmax layer; the preceding structure, and thus the extracted feature vector, remains unchanged). Second, we observed better generalization to unseen speakers from the output of the last hidden layer: the softmax output is closely tied to the training labels and tends to represent specific training speakers rather than a general feature vector. (The penultimate hidden layer can be expected to behave similarly.)

The underlying assumption here is that a trained DNN, having learned compact representations of the development-set speakers in the output of its last hidden layer, may also be able to represent unseen speakers.

3.2. Enrollment and evaluation

Given a set of utterances Xs = {Os1, Os2, ..., Osn} from a speaker s, where each utterance Osi = {o1, o2, ..., om} consists of m observation frames, the enrollment process can be described as follows. First, every observation in utterance Osi, together with its context, is fed into the supervised-trained DNN. The output activations of the last hidden layer are then computed, L2-normalized, and accumulated over all observations in Osi. We call the resulting accumulated vector the d-vector associated with the utterance Osi. The final representation of speaker s is derived by averaging the d-vectors corresponding to all utterances in Xs. (Note: the L2 normalization is convenient because the dot product of two L2-normalized vectors is exactly their cosine similarity.)
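The enrollment procedure can be sketched as follows. Here, `last_hidden` is a hypothetical stand-in for feedforward propagation through the trained background DNN (a single random linear+ReLU layer, purely for illustration), and all shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((41 * 40, 256)) * 0.01       # hypothetical DNN weights

def last_hidden(frames):
    """Stand-in for feedforward propagation through the trained background DNN,
    returning the last hidden layer activations for a matrix of input frames."""
    return np.maximum(frames @ W, 0.0)               # one random layer, illustration only

def utterance_d_vector(frames):
    h = last_hidden(frames)                          # (m, hidden_dim)
    h /= np.linalg.norm(h, axis=1, keepdims=True)    # L2-normalize each frame's activations
    return h.sum(axis=0)                             # accumulate over the m observations

def speaker_model(utterances):
    """Average the d-vectors of all enrollment utterances of speaker s."""
    return np.mean([utterance_d_vector(o) for o in utterances], axis=0)

enroll_utts = [rng.standard_normal((50, 41 * 40)) for _ in range(4)]
model = speaker_model(enroll_utts)                   # the enrolled speaker's d-vector
```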

In the evaluation stage, we first extract the normalized d-vector from the test utterance. We then compute the cosine distance between the test d-vector and the claimed speaker's d-vector model. The verification decision is made by comparing this distance to a threshold, which is typically chosen on development trials to reach the desired operating point.
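A matching sketch of the evaluation step. The threshold value below is purely illustrative; in practice it would be tuned on development trials.

```python
import numpy as np

def cosine_score(d_test, d_model):
    """Cosine similarity between the test d-vector and the enrolled model."""
    return float(d_test @ d_model /
                 (np.linalg.norm(d_test) * np.linalg.norm(d_model)))

def verify(d_test, d_model, threshold=0.5):
    """Accept the claimed identity iff the score clears the threshold.
    The threshold here is an illustrative placeholder, not a tuned value."""
    return cosine_score(d_test, d_model) >= threshold
```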

3.3. DNN training procedure

Given the low-resource conditions explored in this study (see Section 4), we train the background DNN as a maxout DNN [17] using dropout [18].

Dropout is a useful strategy to prevent overfitting when fine-tuning a DNN on small training sets [18,19]. In essence, dropout training randomly omits a fraction of the hidden units for each training example. Maxout DNNs [17] were conceived to make proper use of dropout's properties. Maxout networks differ from standard multi-layer perceptrons (MLPs) in that the hidden units of each layer are divided into non-overlapping groups, each of which generates a single activation via a max-pooling operation. In this way, training a maxout network also optimizes the activation function of each unit.

Specifically, in this study we train a maxout DNN with four hidden layers and 256 nodes per layer within the DistBelief framework [20]. A pool size of two is used in every layer. The first two layers do not use dropout, while the last two layers drop 50% of their activations, as shown in Figure 1.
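A hedged numpy sketch of a forward pass through this topology (four maxout hidden layers of 256 units, pool size two, 50% dropout in the last two layers). The weight initialization, the input dimensionality (taken from the frame stacking described next), and the use of plain linear units before the max are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_layer(x, W, b, pool_size=2, dropout_rate=0.0, train=True):
    """One maxout layer: linear units are split into non-overlapping groups
    of `pool_size`, and each group emits the maximum of its activations;
    dropout then randomly omits a fraction of the resulting hidden units."""
    z = x @ W + b                                          # (batch, units * pool_size)
    z = z.reshape(x.shape[0], -1, pool_size).max(axis=2)   # max-pool within each group
    if train and dropout_rate > 0.0:
        mask = rng.random(z.shape) >= dropout_rate         # randomly omit hidden units
        z = z * mask / (1.0 - dropout_rate)                # inverted-dropout rescaling
    return z

# Four hidden layers of 256 units, pool size 2; dropout only in the last two.
input_dim = 41 * 40                       # stacked context frames (see below)
dims = [input_dim, 256, 256, 256, 256]
drops = [0.0, 0.0, 0.5, 0.5]

h = rng.standard_normal((8, input_dim))   # a synthetic batch of stacked frames
for d_in, d_out, p in zip(dims[:-1], dims[1:], drops):
    W = rng.standard_normal((d_in, d_out * 2)) * 0.01      # 2 linear units per maxout unit
    b = np.zeros(d_out * 2)
    h = maxout_layer(h, W, b, pool_size=2, dropout_rate=p)
print(h.shape)                            # (8, 256): the last hidden layer activations
```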

As for the other configuration parameters, ReLU [21] is used as the nonlinear activation function of the hidden units, with a learning rate of 0.001 and exponential decay (a factor of 0.1 every 5M steps). The input to the DNN is formed by stacking the 40-dimensional log-filterbank energy features of a given frame with those of its context, 30 frames to the left and 10 frames to the right. The training target vector has dimension 496, equal to the number of speakers in the development set (see Section 4). The final maxout DNN model contains approximately 600K parameters, comparable in size to the smallest baseline i-vector system.
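The input stacking just described might look like the following numpy sketch. Padding the edge frames by repetition is an assumption, since the paper does not specify how utterance boundaries are handled.

```python
import numpy as np

def stack_context(feats, left=30, right=10):
    """Stack each 40-dim log-filterbank frame with its context (30 frames to
    the left, 10 to the right) into a single input vector per frame.
    Edge frames are padded by repeating the first/last frame (an assumption)."""
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].ravel()
                     for t in range(len(feats))])

feats = np.random.randn(100, 40)        # 100 frames of log-filterbank energies
X = stack_context(feats)
print(X.shape)                          # (100, 1640): (30 + 1 + 10) * 40 inputs
```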

4. EXPERIMENTAL RESULTS

The experiments were conducted on a small footprint text-dependent SV task. The dataset contains 646 speakers, each saying the same phrase "OK Google" multiple times in multiple sessions. The gender distribution of the dataset is balanced. 496 randomly selected speakers are used to train the background model, and the remaining 150 speakers are used for enrollment and evaluation. The number of utterances per speaker used for background model training ranges from 60 to 130. For each enrolled speaker, the first 20 utterances are reserved for possible use in enrollment, and the remaining utterances are used for evaluation. By default, only the first four enrollment utterances are used to extract the speaker model. In each trial, one of the 150 enrolled speakers serves as the claimed target; in total, approximately 12,750 trials were conducted.

4.1. The baseline system

In this small footprint text-dependent SV task, our goal is to keep the model size small while still achieving good performance. The baseline is an i-vector-based SV system similar to [5]. The GMM-UBM is trained on 13-dimensional perceptual linear prediction (PLP) features with appended delta and delta-delta features. We evaluated the equal error rate (EER) performance of i-vector systems with three different model sizes, varying the number of Gaussian components in the UBM, the dimensionality of the i-vector, and the output dimensionality of linear discriminant analysis (LDA). The TVM was initialized with PCA and refined with 10 EM iterations, while 7 EM iterations were used for UBM training. As shown in Table 1, the performance of the i-vector system degrades as the model size shrinks, but not drastically. EER results with t-norm [22] score normalization are consistently much better than those with raw scores. The smallest i-vector system contains about 540K parameters and is used as our baseline system.

Table 1. EER comparison of i-vector systems with different numbers of UBM Gaussian components, i-vector dimensionalities, and LDA output dimensionalities.
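Since EER is the figure of merit throughout this section, here is a small self-contained sketch of how it can be computed from target and impostor trial scores; the scores below are synthetic.

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal error rate: sweep the decision threshold over all observed
    scores and return the point where the false reject rate (FRR) and
    false accept rate (FAR) are closest to equal."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2.0

rng = np.random.default_rng(0)
tgt = rng.normal(0.7, 0.1, 1000)     # synthetic target-trial scores
imp = rng.normal(0.4, 0.1, 10000)    # synthetic impostor-trial scores
print(f"EER = {eer(tgt, imp):.2%}")
```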

4.2. The d-vector verification system

The left plot of Figure 2 compares the detection error tradeoff (DET) curves of the i-vector and d-vector systems. An interesting finding is that raw scores are slightly better than t-norm scores for the d-vector system, whereas t-norm scores are significantly better for the i-vector system. A histogram analysis of the d-vector system's raw scores shows that their distribution is heavy-tailed rather than normal. This suggests that a more sophisticated score normalization method may be needed for d-vector SV systems. In addition, since t-norm requires extra storage and computation at run time, we evaluate the d-vector system with raw scores unless otherwise noted.
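For reference, t-norm [22] shifts and scales each raw trial score by the statistics of the same test utterance scored against a cohort of impostor models, computed at test time. A minimal sketch, assuming the cohort scores are already available:

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """t-norm [22]: normalize a trial's raw score by the mean and standard
    deviation of the same test utterance scored against a cohort of
    impostor models."""
    cohort_scores = np.asarray(cohort_scores)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()
```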

The overall performance of the i-vector system is better than that of the d-vector system: 2.83% EER with t-norm scores for the i-vector system versus 4.54% with raw scores for the d-vector system. However, in the low false reject region, shown in the lower right part of the left plot in Figure 2, the d-vector system outperforms the i-vector system.

We also tried training the DNN with different configurations. Without the maxout and dropout techniques, the EER of the trained DNN is about 2% worse in absolute terms. Increasing the number of nodes per hidden layer to 512 does not help much, while reducing it to 128 worsens the EER to 7.0%. Reducing the context window to 10 frames on the left and 5 frames on the right also degrades the EER, to 5.67%.

Figure 2. DET curves of the i-vector and d-vector systems under clean (left) and noisy (right) conditions.

4.3. Effect of enrollment data

In the d-vector SV system, no speaker adaptation statistics are involved in the enrollment stage; instead, the background DNN model is used to extract speaker-specific features for every utterance in both the enrollment and evaluation stages. In this experiment, we investigate how verification performance varies with the number of enrollment utterances per speaker. We compare results using 4, 8, 12, and 20 utterances for speaker enrollment.

The EER results are listed in Table 2. Both SV systems perform better as the number of enrollment utterances increases, and the trends are similar for the two systems.

4.4. Noise robustness

In practice, there is often a mismatch between development and run-time conditions. In this experiment, we study the robustness of the d-vector SV system in noisy conditions and compare it with the i-vector system. The background models are trained on clean data, and 10 dB cafeteria noise is added to the enrollment and evaluation data. The DET curves are compared in the right plot of Figure 2. As shown in the figure, the performance of both systems degrades with noise, but the performance loss of the d-vector system is smaller. The overall performance of the d-vector system is very close to that of the i-vector system under 10 dB noise, and at operating points with a false reject probability of 2% or lower, the d-vector system is actually better than the i-vector system.

4.5. System combination

The results above show that the proposed d-vector system is a viable alternative to the i-vector system for SV, particularly in noisy environments, in applications requiring small footprint models, and at low false reject operating points. Here, we additionally provide an analysis of a combined i-vector/d-vector system.

Although more sophisticated combinations could be designed at the feature level, our preliminary results in Figure 3 were obtained with a simple combination, sum fusion, which adds up the scores provided by each individual system for every trial. A preceding t-norm stage is applied in both systems to make the scores comparable before combination. The results show that the combined system outperforms either component system at virtually all operating points and under all noise conditions. In terms of EER, the combined i/d-vector system beats the i-vector system by 14% and 25% relative in clean and noisy conditions, respectively.
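Sum fusion as used here is simply an element-wise addition of the two systems' per-trial scores after t-norm; a minimal sketch:

```python
import numpy as np

def sum_fusion(i_vector_scores, d_vector_scores):
    """Sum fusion: add the two systems' scores for each trial. Both score
    arrays are assumed to have been t-normed first, so that they are on
    comparable scales before being combined."""
    return np.asarray(i_vector_scores) + np.asarray(d_vector_scores)
```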

5. CONCLUSIONS

In this paper, we proposed a new DNN-based speaker verification method for a small footprint text-dependent speaker verification task. A DNN is trained to classify speakers from frame-level acoustic features and is then used to extract speaker-specific features. The average of these features, the d-vector, is used for speaker verification in the same way as the usual i-vector. Experimental results show that the d-vector SV system performs reasonably well compared with the i-vector system, and that fusing the two systems outperforms the standalone i-vector system: a simple sum fusion improves on the i-vector system at all operating points, achieving 14% and 25% better EER in clean and noisy conditions, respectively. In addition, the d-vector system is more robust to additive noise in the enrollment and evaluation data, and it outperforms the i-vector system at low false reject operating points.

Future work includes improving the current cosine distance scoring, for example with normalization schemes such as Gaussianization of the raw scores. We will also explore different combination approaches, such as applying a PLDA model to stacked i-vectors and d-vectors. Finally, we aim to examine the effect of increasing the number of development speakers and how speaker clustering affects performance.

Acknowledgments

The authors would like to thank our 


6. REFERENCES

[1] D. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1, pp. 19–41, 2000.
[2] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1435–1447, 2007.
[3] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Speaker and session variability in GMM-based speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1448–1460, 2007.
[4] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A study of interspeaker variability in speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, pp. 980–988, 2008.
[5] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 788–798, 2011.
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, pp. 82–97, November 2012.
[7] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Proc. Odyssey Speaker and Language Recognition Workshop, 2010.
[8] T. Stafylakis, P. Kenny, P. Ouellet, J. Perez, M. Kockmann, and P. Dumouchel, "Text-dependent speaker recognition using PLDA with uncertainty propagation," in Proc. Interspeech, 2013.
[9] H. Aronowitz, "Text-dependent speaker verification using a small development set," in Proc. Odyssey Speaker and Language Recognition Workshop, 2012.
[10] A. Larcher, K.-A. Lee, B. Ma, and H. Li, "Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances," in Proc. ICASSP, 2013.
[11] J. Oglesby and J. S. Mason, "Optimisation of neural models for speaker identification," in Proc. ICASSP, 1990.
[12] Y. Bennani and P. Gallinari, "Connectionist approaches for automatic speaker recognition," in ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, 1994.
[13] B. Yegnanarayana and S. P. Kishore, "AANN: an alternative to GMM for pattern recognition," Neural Networks, vol. 15, no. 3, pp. 459–469, 2002.
[14] L. P. Heck, Y. Konig, M. K. Sonmez, and M. Weintraub, "Robustness to telephone handset distortion in speaker recognition by discriminative feature design," Speech Communication, vol. 31, no. 2, pp. 181–192, 2000.

[15] H. Lee, Y. Largman, P. Pham, and A. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in NIPS, 2009.
[16] T. Stafylakis, P. Kenny, M. Senoussaoui, and P. Dumouchel, "Preliminary investigation of Boltzmann machine classifiers for speaker recognition," in Proc. Odyssey Speaker and Language Recognition Workshop, 2012.
[17] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in Proc. JMLR, 2013, pp. 1319–1327.
[18] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint, 2012.
[19] G. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Proc. ICASSP, 2013.
[20] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," in NIPS, 2012.
[21] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.
[22] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems," Digital Signal Processing, vol. 10, pp. 42–54, 2000.