Translator | Hao Yi
Edit | Debra
AI Front Line introduction: UBC Launch Pad has proposed a speaker recognition method based on transfer learning, which aims to distinguish the audio of different speakers in multi-speaker conversation scenarios. It currently achieves good results in simple scenarios, but still needs improvement in more complex situations. The authors have also released their speaker recognition library, which can be used directly through a Python interface.







Speaker recognition refers to separating and labeling audio according to the identities of the speakers in a complex speech environment. The technology has a wide range of applications in smart hardware; for example, Google Home needs to detect who is talking (rather than what is being said, which is a different problem). Speaker recognition is an open problem, but it has developed rapidly with modern deep learning techniques. During our year on the UBC Launch Pad team, we (the UBC Launch Pad team; this article is written in the first person) built a library for speaker diarisation. The library is written in Python, and we call it Minutes.

Research goals

There are two main difficulties in speaker identification. First, because the environment in which data is collected cannot be controlled, the system must handle a great deal of variation in the audio samples. Second, the model needs to predict classes (i.e. speakers) from very little training data, because it is hard to obtain many samples for a new class (a new speaker). Typical prediction models need thousands of samples per target class during training, but Google Home achieves the same kind of resolution with only a handful of samples. Transfer learning is the key technique for reaching this goal.

In a corporate meeting scenario, each participant uses a client to record a small number of voice samples. The entire meeting is then recorded and uploaded to a server. On the server side, a trained model splits the conversation into segments attributed to the different speakers, which can then be transcribed into text. For the server system, the main problem to solve is generating new models quickly and cheaply, so that newly added classes can be predicted without relearning speech recognition from scratch.

Our data set

Thanks to the rapid progress of audio transcription research, several large public data sets are already available. The LibriSpeech ASR corpus is a large data set of English speech. In this work, we simply split the audio files into one-second intervals (at a sampling rate of 48000 Hz) and labeled each segment with the speaker's ID to obtain our training samples. This interval length is a hyperparameter, which we call samples per observation. Below is a visualization of 10 such observations strung together.
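The exact preprocessing code in Minutes is not reproduced here, but a minimal sketch of the slicing step could look like the following (the audio loader, file path handling, and helper names are illustrative assumptions):

```python
import numpy as np
import soundfile as sf  # any audio loader would do; soundfile is just an example

SAMPLE_RATE = 48000               # samples per second, matching the corpus
SAMPLES_PER_OBSERVATION = 48000   # one-second observations (a hyperparameter)

def slice_into_observations(path, speaker_id):
    """Split one audio file into fixed-length observations labeled with the speaker ID."""
    audio, rate = sf.read(path)
    assert rate == SAMPLE_RATE
    # Drop the trailing partial window so every observation has the same length.
    n_obs = len(audio) // SAMPLES_PER_OBSERVATION
    observations = audio[: n_obs * SAMPLES_PER_OBSERVATION].reshape(n_obs, SAMPLES_PER_OBSERVATION)
    labels = np.full(n_obs, speaker_id)
    return observations, labels
```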

After splitting the data set, we convert each observation into a spectrogram. Because image recognition is one of the fastest-moving areas of machine learning, working with spectrograms lets us leverage convolutional neural networks (CNNs) and a host of other advanced techniques.
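Minutes' own feature extraction may differ, but as a sketch, a spectrogram can be computed from a single observation with SciPy:

```python
import numpy as np
from scipy import signal

def to_spectrogram(observation, rate=48000):
    """Convert a 1-D audio observation into a 2-D log-magnitude spectrogram."""
    freqs, times, spec = signal.spectrogram(observation, fs=rate)
    # Log scaling compresses the dynamic range, which tends to help CNNs.
    return np.log(spec + 1e-10)
```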

Transfer learning

It is very easy to build a neural network that predicts well on data drawn from the same distribution as the training set. For example, a model built with the Keras library reached 97% validation accuracy on five classes after 15-25 epochs. Of course, after training, such a model cannot predict categories it has never seen.
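The original model is not reproduced here; a small Keras CNN along these lines (the input shape and layer widths are our assumptions) gives a sense of the kind of base model involved:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_base_model(input_shape, n_speakers=5):
    """A small CNN classifier over spectrograms; layer sizes are illustrative."""
    model = keras.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_speakers, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```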

Now, let's introduce transfer learning, a technique for continuing training on a new data set starting from a previously built base model (which speeds up training of subsequent models). Given our hardware constraints and the size of the training data, generating a model more complex than the one above would take considerably more time. Our goal is to reuse as much of the base model as possible, and the Keras framework makes this easy: simply freeze the parameters of some layers, resize the last layer, and train on a small amount of data.
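A minimal Keras sketch of this idea might look like the following; it is not the exact Minutes code, and the choice of which layers to freeze is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_transfer_model(base_model, n_new_speakers):
    """Reuse the trained base and retrain only a new, resized output layer."""
    # Freeze the base model's parameters so they are not re-optimized.
    for layer in base_model.layers:
        layer.trainable = False
    # Take the features just before the old output layer and attach a new head.
    features = base_model.layers[-2].output
    outputs = layers.Dense(n_new_speakers, activation="softmax")(features)
    transfer_model = keras.Model(inputs=base_model.input, outputs=outputs)
    transfer_model.compile(optimizer="adam",
                           loss="sparse_categorical_crossentropy",
                           metrics=["accuracy"])
    return transfer_model
```

Only the new output layer's weights are trainable here, which is what allows training to converge on a small amount of data for the new speakers.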


How much of a base model can be reused is captured by what we call model utilization, which you can measure by inspecting the Keras model summary.
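As an illustration, if utilization is read as the fraction of the model's parameters that are frozen and reused rather than re-optimized (our interpretation of the term), it can be computed from the same counts that model.summary() reports:

```python
from tensorflow.keras import backend as K

def utilization(model):
    """Fraction of parameters that are reused (non-trainable) rather than retrained."""
    trainable = sum(K.count_params(w) for w in model.trainable_weights)
    frozen = sum(K.count_params(w) for w in model.non_trainable_weights)
    return frozen / float(trainable + frozen)

# model.summary() prints the same trainable / non-trainable parameter totals.
```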



Utilization has a direct impact on the cost of generating subsequent transfer models, since the parameters that are not reused must be re-optimized during training.



It is worth noting that higher utilization lets us reach 85% accuracy faster: at 4%, 21%, and 47% utilization the model needs 15, 10, and 5 epochs respectively. The last part of an image-recognition CNN is usually a dense layer with a large number of parameters, and retraining those parameters drastically lowers utilization. We have found some simple techniques that effectively mitigate this (a sketch follows the list below):

Use pooling layers earlier to shrink the parameter matrices that follow.

Add extra convolutional layers to reduce the dimensionality of the downstream data.
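A minimal sketch of both ideas, with illustrative layer sizes rather than the actual Minutes architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_compact_base(input_shape, n_speakers):
    """Pool early and add convolutions so the flattened feature vector, and the
    dense head that follows it, stays small and utilization stays high."""
    model = keras.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((4, 4)),                   # pool earlier and more aggressively
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),  # extra conv shrinks downstream data
        layers.GlobalAveragePooling2D(),               # tiny feature vector into the dense head
        layers.Dense(n_speakers, activation="softmax"),
    ])
    return model
```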

Using transfer learning, we can predict new classes while significantly reducing the time required to train a model.

Accuracy on YouTube

We collected YouTube conversations, cleaned and tagged them, and obtained a supervised data set of real-world scenarios. Using the transfer learning approach described above, we were able to reach 60% accuracy on conversations with 3-4 speakers, such as SciShow. Our explanation is that the audio of real conversations is more complex than the LibriSpeech data, and that YouTube's timestamps are imprecise. For now, results are fairly good on binary problems or relatively simple cases, such as a conversation between two speakers. In addition, simple data augmentation techniques may make the models more robust to variations in audio clarity.
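One simple augmentation of that kind, adding a small amount of noise to each training observation to simulate lower clarity (our own example, not necessarily what Minutes does):

```python
import numpy as np

def add_noise(observation, noise_level=0.005):
    """Augment a raw audio observation with Gaussian noise."""
    noise = np.random.normal(0.0, noise_level, size=observation.shape)
    return observation + noise
```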

Next steps

We have released Minutes for speaker diarisation tasks. We hope to speed up speaker recognition by reusing as much of the base model as possible. Keep an eye on the Minutes Python library and the data set on our Facebook page.

Original English article:

https://medium.com/@ubclaunchpad/speaker-diarisation-using-transfer-learning-47ca1a1226f4