Speaking of voice changing, many people's earliest memory of it is Conan's bow-tie voice changer from Detective Conan. As children, they dreamed of owning such a magical device. In the Internet era, that fantasy has partly come true: with voice-changing software, we often hear players in social and game scenes speaking in a voice opposite to their real gender or age. However, this kind of voice changer typically transforms a voice into a generic type of voice, for example turning a male voice into a girl's voice. It cannot turn a voice into a specific person's voice the way Conan's device does, so it is not voiceprint conversion in the true sense.

The SoundNet audio technology R&D team's "real-time voiceprint voice change" aims to go beyond traditional voice-changing software and existing AI real-time voice-changing experiences. Through a series of techniques, including extracting speech phonemes and voiceprint features, it can transform any user's voice into the voice of a specified person, or of anyone else, in real time during audio and video interaction, realizing a true "clone" of a voice just like Conan's voice changer. Below we introduce the technical principles behind traditional mainstream voice-changing methods and behind real-time voiceprint voice change.

01 What is voiceprint voice change?

Before introducing voice change, let's review how sound is produced and perceived. When we speak, vocal organs such as the lungs, throat, and vocal tract work together to turn the words in our mind into sound signals with specific meanings. Because of differences in vocal organs, speaking habits, pronunciation, fundamental frequency, and so on, each person's voiceprint is unique, just like a fingerprint, which is why people can identify a person through the auditory system. In fact, at the perceptual level, people can easily separate the linguistic content of an utterance (the text) from the speaker's timbre information (the voiceprint).

Voice change refers to altering the timbre of an utterance so that it sounds as if another person is saying the same thing. Voiceprint voice change involves two steps: perceptual separation and speech synthesis. First, the speech recognition module in the voice-change system separates the linguistic information in the received speech from the speaker's timbre information. Then, the speech synthesis module recombines the target speaker's voiceprint with the extracted linguistic content to synthesize new speech, achieving the timbre conversion.
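The two-step structure above can be summarized in a minimal sketch. All class and method names here are hypothetical, chosen only to mirror the pipeline described in this article:

```python
# A structural sketch of the two-stage voice-conversion pipeline:
# (1) perceptual separation, (2) re-synthesis. Names are illustrative.
import numpy as np

class RecognitionModule:
    """Separates linguistic content from speaker timbre (hypothetical)."""
    def extract_content(self, speech: np.ndarray) -> np.ndarray:
        # e.g. frame-level phoneme posteriors from an ASR acoustic model
        ...

class SynthesisModule:
    """Recombines linguistic content with a target voiceprint (hypothetical)."""
    def synthesize(self, content: np.ndarray, voiceprint: np.ndarray) -> np.ndarray:
        # e.g. spectrogram prediction followed by a vocoder
        ...

def convert(speech, target_voiceprint, recognizer, synthesizer):
    content = recognizer.extract_content(speech)                # step 1: separation
    return synthesizer.synthesize(content, target_voiceprint)   # step 2: re-synthesis
```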

Having introduced the basic principle of voice change, let's look at the traditional voice-changing methods and the technical principles behind them.

1. Traditional sound effects: Early voice changers generally chained multiple audio effects together to modify the human voice along various dimensions. Common effects include pitch modulation, equalizers, formant filters, and reverb processors. Pitch modulation changes the pitch of the voice; turning a male voice into a female one, for example, requires raising the pitch, and the Minions' voices in the film Minions were created by algorithmically raising the pitch of originally male voices. Equalizers and formant filters change the timbre by reshaping the energy distribution across the frequency bands of the voice: boosting certain bands can make a voice sound brighter or clearer, while attenuating them gives it a deep, resonant character. The reverb processor changes the perceived space in which the voice sits.
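As a concrete illustration of the pitch-modulation effect, here is a minimal sketch using librosa's off-the-shelf pitch shifter; the file names and shift amount are assumptions:

```python
# A minimal sketch of the pitch-modulation effect: shifting a male voice
# up by several semitones is the classic "male to female" transformation.
import librosa
import soundfile as sf

y, sr = librosa.load("input_voice.wav", sr=None)   # placeholder file name

# Raise the pitch by 6 semitones. Note that simple pitch shifting moves
# the formants along with the fundamental, which is why effect chains
# stack formant filters and equalizers on top of it.
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=6)

sf.write("output_voice.wav", shifted, sr)
```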

But the effects-chain approach generalizes poorly: converting each person into each target timbre requires its own parameter tuning, and because everyone's pronunciation patterns differ, a single parameter set may only sound right for some utterances, which makes many of these voice-change effects very unstable. The voice-change effects used by hosts in the social and live-streaming apps mentioned at the beginning of this article mostly work this way. Although they can run in real time, because they are built from chains of traditional effects, the results are unstable, the set of available voices is very limited, and the voice cannot be converted arbitrarily into a specific person's voice.

2. AI voice-changing algorithms: The development of AI offered a way out of the tedious process of tuning traditional voice-change effects individually for each person and each voice. Early AI voice-changing algorithms were mainly based on statistical models, whose core idea is to learn a spectral mapping between the source speaker's speech and the target speech. These models need to be trained on a parallel corpus: for every sentence spoken by the source speaker, the target speaker must record a sentence with the same content, so each training sample consists of source and target speech with identical linguistic content. Although models based on this framework achieved some success in voice change, such matched data is scarce, and the approach is hard to extend effectively to voice change across many speakers.
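A toy sketch of the parallel-corpus idea: learn a frame-wise mapping from source spectral features to time-aligned target spectral features. Real systems of that era used GMMs or DNNs with explicit time alignment; the tiny MLP and random tensors below are illustrative stand-ins only:

```python
# Frame-wise spectral mapping trained on parallel (source, target) frames.
import torch
import torch.nn as nn

N_FRAMES, N_MELS = 1000, 80
source_frames = torch.randn(N_FRAMES, N_MELS)   # speaker A, time-aligned
target_frames = torch.randn(N_FRAMES, N_MELS)   # speaker B, same sentences

mapper = nn.Sequential(nn.Linear(N_MELS, 256), nn.ReLU(), nn.Linear(256, N_MELS))
optim = torch.optim.Adam(mapper.parameters(), lr=1e-3)

for step in range(100):
    optim.zero_grad()
    loss = nn.functional.mse_loss(mapper(source_frames), target_frames)
    loss.backward()
    optim.step()

# Every (source, target) frame pair must carry the same linguistic content,
# which is exactly the parallel-corpus requirement that makes data costly.
```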

In recent years, mainstream AI voice-changing algorithms have effectively solved these problems with non-parallel training frameworks, greatly broadening the application scenarios of voice change, such as timbre, emotion, and style transfer. The core idea of non-parallel training is to disentangle linguistic features from non-linguistic factors (such as timbre and pitch), and then recombine these factors to generate new speech. It does not rely on paired speech corpora, which greatly reduces the cost of data acquisition. The framework is also well suited to knowledge transfer: linguistic content and voiceprint features can be extracted with speech recognition and voiceprint recognition models pre-trained on massive data.
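The disentangle-then-recombine idea can be sketched as a content encoder and a speaker (voiceprint) encoder feeding a shared decoder. Module shapes and architectures below are illustrative assumptions, not any particular published system:

```python
# A structural sketch of the non-parallel, disentangling framework.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):        # linguistic features, speaker-independent
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.GRU(n_mels, dim, batch_first=True)
    def forward(self, mel):             # (B, T, n_mels) -> (B, T, dim)
        out, _ = self.net(mel)
        return out

class SpeakerEncoder(nn.Module):        # one voiceprint vector per utterance
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
    def forward(self, mel):             # mean-pool over time -> (B, dim)
        return self.proj(mel.mean(dim=1))

class Decoder(nn.Module):               # recombine content + timbre
    def __init__(self, c_dim=256, s_dim=128, n_mels=80):
        super().__init__()
        self.net = nn.GRU(c_dim + s_dim, n_mels, batch_first=True)
    def forward(self, content, spk):
        spk = spk.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.net(torch.cat([content, spk], dim=-1))
        return out

# Conversion: content from the source utterance, voiceprint from the target.
mel_src, mel_tgt = torch.randn(1, 120, 80), torch.randn(1, 200, 80)
converted = Decoder()(ContentEncoder()(mel_src), SpeakerEncoder()(mel_tgt))
```

Because the source and target utterances need not say the same thing, no parallel corpus is required, which is the cost advantage described above.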

With the development of deep learning, AI-based voice-changing algorithms have multiplied, and they hold significant advantages over traditional methods in both the similarity and the naturalness of the target timbre. According to the number of source and target speakers a single voice-conversion model supports, models can be divided into one-to-one, many-to-many, any-to-many, and any-to-any, where "one" denotes a single timbre and "many" a limited, closed set, meaning the model can only convert to a few specified timbres. Early academic research focused mainly on one-to-one and many-to-many architectures. Any-to-many is the model adopted by much AI voice-changing software today: in such software, any user can pick one of a dozen or so preset voices to convert into.

"Any" is an open set. Any-to-any means transforming any person's voice into any other person's voice, which represents the ultimate goal of voiceprint voice-change technology: everyone can use it to take on the voice of a specified person, or of anyone at all, achieving a true "clone" of a voice. This is also the direction that SoundNet's real-time voiceprint voice change aims for.
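The practical difference between the last two regimes is how the target timbre is obtained, as in the sketch below; `synthesize` and `speaker_encoder` are hypothetical stand-ins:

```python
# any-to-many vs any-to-any: fixed voice table vs open-set voiceprint.
import torch

def speaker_encoder(reference_mel):      # stand-in voiceprint extractor
    return reference_mel.mean(dim=0)

def synthesize(content, voiceprint):     # stand-in synthesizer
    return content + voiceprint          # placeholder recombination

# any-to-many: targets come from a fixed table learned at training time.
PRESET_VOICES = torch.randn(12, 128)     # e.g. a dozen built-in voices
def convert_any_to_many(content, preset_id):
    return synthesize(content, PRESET_VOICES[preset_id])

# any-to-any: the target voiceprint is extracted on the fly from any
# reference recording, so the set of reachable voices is open.
def convert_any_to_any(content, reference_mel):
    return synthesize(content, speaker_encoder(reference_mel))
```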

02 From any-to-many to any-to-any: the challenges real-time voiceprint voice change must overcome

With the help of AI voice-changing algorithms, mainstream voice changers can currently achieve any-to-many conversion, but research has focused mainly on offline or asynchronous scenarios, such as recording audio with voice-changing software in advance, generating speech in a specified target voice, and then sending it to the other party. Surveys show that more and more players want real-time voice change during audio and video interaction in social, live-streaming, and metaverse scenes. According to SoundNet, real-time voice change during live interaction faces several challenges:

  • Linguistic content integrity: in real-time interaction, dropped words or mispronunciations not only make the speaker hard to understand; losing a key word (such as "no") can change the meaning of the whole sentence, which is fatal to the interaction.
  • Real-time factor: the ratio of the model's processing time for a piece of audio to that audio's duration; the lower the better. For example, if it takes 1 minute to process 2 minutes of speech, the real-time factor is 1/2 = 0.5. In theory, the end-to-end real-time factor of the voice-changing engine must be below 1 to support real-time processing, and given jitter in compute, an even lower factor is needed for stable service, which places tight constraints on model size and compute (see the sketch after this list).
  • Algorithm delay: most current voice-changing algorithms need input from future speech frames to process the current frame, and the duration of that lookahead is the algorithm delay. In real-time interaction, a delay beyond roughly 200 ms becomes perceptible and greatly dampens users' willingness to participate. For example, if after one user finishes a sentence the other side must wait more than a second to hear the converted voice, many people simply will not use the feature in a chat scene.
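A quick numeric sketch of the two metrics just defined; the frame size and lookahead values are hypothetical:

```python
# Real-time factor and algorithm delay, worked through with small numbers.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; must stay below 1.0
    (with headroom for compute jitter) to sustain real-time conversion."""
    return processing_seconds / audio_seconds

print(real_time_factor(60.0, 120.0))   # 1 min to process 2 min of audio -> 0.5

# Algorithm delay: if the model needs `lookahead_frames` future frames
# before it can emit the current frame, that lookahead is audible latency.
frame_ms, lookahead_frames = 10, 20    # hypothetical values
print(frame_ms * lookahead_frames, "ms of algorithmic delay")  # 200 ms
```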

03 How does SoundNet's audio technology team solve the algorithm delay and real-time factor of audio processing, and achieve the breakthrough to any-to-any voice change?

First, SoundNet's "real-time voiceprint voice change" uses a speech recognition model to extract frame-level phoneme features and a voiceprint recognition model to extract voiceprint features; these are then passed together to a speech synthesis module that produces the spectral features of the converted speech, and finally an AI vocoder synthesizes the time-domain waveform. All three modules support streaming data processing. Streaming processing targets data whose freshness matters most and that must yield useful results within a few hundred or even tens of milliseconds of being triggered. In real-time voiceprint voice change, the streaming of phoneme and voiceprint data shows up as real-time, low-latency processing; a voice conversation must stay fluent, and one person cannot say something only to have the other hear the converted voice seconds later.
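The three-module streaming pipeline described above can be sketched as follows. All names are hypothetical; this is a structural illustration, not SoundNet's implementation:

```python
# Chunk-by-chunk streaming through the three modules: phoneme extraction,
# spectrum synthesis conditioned on a voiceprint, and vocoding.

CHUNK_MS = 20  # assumed streaming granularity

def stream_convert(audio_chunks, phoneme_model, voiceprint, synthesizer, vocoder):
    """Yield converted waveform chunks as input chunks arrive."""
    for chunk in audio_chunks:                            # e.g. 20 ms PCM frames
        phonemes = phoneme_model.process(chunk)           # frame-level phoneme features
        spectrum = synthesizer.process(phonemes, voiceprint)  # target-timbre spectrum
        yield vocoder.process(spectrum)                   # waveform out, chunk by chunk
```

The target voiceprint is computed once (from reference audio of the target speaker) and reused for every chunk, so only the per-chunk path has to meet the latency budget.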

At the neural network design level, CNN (convolutional neural network) and RNN (recurrent neural network) structures are used to extract the local and long-range temporal features of the speech signal, respectively. Speech signals are short-time stationary, so a CNN can effectively extract frame-level phoneme features, while an RNN models the features (words) that change more slowly over time. Since the pronunciation of a word lasts for hundreds of milliseconds, SoundNet uses an RNN, a network with temporal memory, to build the spectrum conversion module that models the temporal characteristics of speech.
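A minimal sketch of this CNN + RNN division of labor: 1-D convolutions capture short-time, frame-local structure, while a GRU tracks the slower, word-level dynamics. Dimensions are illustrative assumptions:

```python
# CNN front end for local phoneme structure, RNN back end for long-range timing.
import torch
import torch.nn as nn

class SpectrumConverter(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # CNN: local, frame-level patterns (speech is short-time stationary)
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # RNN: temporal memory spanning hundreds of milliseconds (word level)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, mel, state=None):     # mel: (B, T, n_mels)
        x = self.cnn(mel.transpose(1, 2)).transpose(1, 2)
        x, state = self.rnn(x, state)
        return self.out(x), state           # returned state enables streaming
```

In a low-latency system, the convolutions would additionally be made causal or given only a small lookahead, since every future frame consumed adds algorithm delay.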

The data an RNN processes is sequential: training samples are correlated across time, that is, the current output of a sequence also depends on earlier outputs, just as the words in an utterance depend on what came before. This design not only saves compute effectively but also significantly reduces algorithm delay; the current "real-time voiceprint voice change" achieves an algorithm delay of 220 ms, an industry-leading level.
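Continuing the SpectrumConverter sketch above, chunk-by-chunk inference simply carries the GRU hidden state forward, so past context never has to be recomputed; this is what sequential processing buys in compute and latency. The chunk size here is an assumption:

```python
# Streaming inference: the hidden state links one chunk to the next.
model = SpectrumConverter().eval()
state = None
with torch.no_grad():
    for _ in range(10):                   # ten consecutive chunks
        chunk = torch.randn(1, 8, 80)     # e.g. 8 mel frames per chunk (assumed)
        out, state = model(chunk, state)  # past context lives in `state`
```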

In addition, the speech recognition module is trained separately on massive data and can accurately extract frame-level phoneme features, greatly reducing wrong or dropped words after conversion and preserving the integrity of the linguistic content. Likewise, SoundNet trained the voiceprint recognition model on massive data to extract the target speaker's timbre features, which significantly improves the timbre similarity between the converted voice and the target speaker, finally realizing the any-to-any voice-changing capability.

Compared with traditional voice-changing software, real-time voiceprint voice change, with its real-time and any-to-any capabilities, will deliver greater application value in chat rooms, live streaming, games, the metaverse, and other scenes. It can enhance users' immersion and entertainment experience in these applications, and user activity, session duration, and revenue can all be expected to rise further.

For example, in a traditional chat-room scene, voice-changing software can only turn a user into a generic "cute girl" or "uncle" voice. Real-time voiceprint voice change can turn the user's voice into one resembling a celebrity's, transforming an ordinary chat room into a celebrity chat room.

In Meta Chat and other metaverse scenes, real-time voiceprint voice change can be combined with 3D spatial audio to further deepen users' sense of immersion. For example, if a social app partners with SpongeBob SquarePants and obtains the image and voice rights to the cartoon's characters, then when users control their exclusive cartoon avatars to chat, their voices can be changed into those of the corresponding characters such as SpongeBob, Patrick, and Squidward. Perceptually, it feels like stepping into the real animated world, effectively strengthening users' sense of immersion.

On this basis, real-time voiceprint voice change can be extended further to unlock the voice value of film, television, and anime IP: the voices of well-known film, television, and animation characters can all be used in real-time audio and video interaction in voice chat rooms, live-streaming studios, games, and other scenarios. For the application itself, richer entertainment experiences can increase users' in-app time, payment, and more.

At present, "real-time voiceprint voice change" is open for testing. If you would like to learn more or gain access, you can click "here" to leave your information, and we will contact you promptly for further communication.