
When it comes to voice changing, many people's earliest memory is Conan's bow-tie voice changer in "Detective Conan". As a child, I fantasized about owning such a gadget; how cool it would be. In the Internet era, that fantasy has come true. Thanks to the many voice-changing applications now available, in social and gaming scenarios we often hear players speak, through voice-changing software, in voices that contradict their gender and age. However, this kind of voice changer usually transforms a person's voice into a generic type of voice, for example turning a male voice into the voice of a cute girl. It cannot change one's voice into the voice of a specific person the way Conan's device does; that is, it cannot change the voice at the voiceprint level.

The "real-time voiceprint voice changer" developed by the sound network audio technology team will subvert the traditional voice-changing software and AI real-time voice-changing experience. The voice of any user can be transformed into the voice of a specified or any other person in real time, realizing the true "clone" of the voice like the Conan voice changer. Next, we will introduce the traditional mainstream voice-changing methods and the technical principles behind real-time voiceprint voice-changing respectively. .

01 What is voiceprint voice changing?

Before introducing voice changing, let's review how speech is produced and perceived. When we speak, the vocal organs (such as the lungs, larynx, and vocal tract) work together to turn the words corresponding to our thoughts into sound-wave signals carrying specific semantics. Because each person differs in vocal organs, language habits, pronunciation, fundamental frequency, and so on, each person's voiceprint is unique, just like a fingerprint, which is why we can identify a speaker through the auditory system. In fact, at the perceptual level, one can easily separate the linguistic content of a piece of speech (the text) from the speaker's timbre information (the voiceprint).


Voice changing replaces the timbre of a piece of speech so that it sounds as if another person is saying the same thing. Voiceprint voice changing involves two steps: perceptual separation and speech synthesis. First, the speech recognition module of the voiceprint voice-changing system separates the linguistic information in the received speech from the speaker's timbre information. Then the speech synthesis module recombines the target speaker's voiceprint with the previously extracted linguistic content into a new piece of speech, thereby transforming the timbre.
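As a mental model, the two-step pipeline can be summarized in a few lines of Python. This is a minimal sketch, not Shengwang's implementation: the frame size, feature dimensions, and the stub functions standing in for trained models are all illustrative assumptions.

```python
import numpy as np

def extract_content(audio: np.ndarray) -> np.ndarray:
    # Placeholder for a trained speech recognition front end: it would
    # return frame-level linguistic features with the timbre removed.
    return np.zeros((len(audio) // 160, 256))  # one vector per 10 ms frame at 16 kHz

def extract_voiceprint(audio: np.ndarray) -> np.ndarray:
    # Placeholder for a trained voiceprint model: one utterance-level
    # embedding describing who is speaking, not what is said.
    return np.zeros(128)

def synthesize(content: np.ndarray, voiceprint: np.ndarray) -> np.ndarray:
    # Placeholder for the synthesis module: recombine the linguistic
    # content with the target timbre into a new waveform.
    return np.zeros(content.shape[0] * 160)

def change_voice(source_audio: np.ndarray, target_audio: np.ndarray) -> np.ndarray:
    content = extract_content(source_audio)        # step 1a: what was said
    voiceprint = extract_voiceprint(target_audio)  # step 1b: who it should sound like
    return synthesize(content, voiceprint)         # step 2: same words, new timbre
```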

Having introduced the basic principle of voiceprint voice changing, let's look at the traditional voice-changing methods and the technical principles they are based on.

1. Traditional audio effectors: Early voice changers generally chained multiple audio effectors in series to modify the human voice along various dimensions. Common voice-changing effects include pitch shifters, equalizers, reverbs, and formant filters. A pitch shifter works by changing the pitch of the sound; to turn a male voice into a female voice, for example, the pitch must be raised. The voices of the Minions in the "Minions" films are produced by raising the pitch of the original male voice with a pitch-shifting algorithm. Equalizers and formant filters change the timbre by redistributing energy across the frequency bands of the voice: boosting makes the sound brighter or crisper, while cutting gives it a deep, rich character. A reverb effect changes the perceived space in which the voice is located.
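For instance, the pitch-shifting effect can be reproduced in a few lines with an off-the-shelf library. This is a hedged sketch assuming librosa and soundfile are installed; the file names and the five-semitone shift are arbitrary illustrative choices, not parameters from the article.

```python
import librosa
import soundfile as sf

# Load a (hypothetical) recording at its native sample rate.
y, sr = librosa.load("voice.wav", sr=None)

# Raise the pitch by 5 semitones: pushes a male voice toward a higher
# register, similar in spirit to the Minions effect described above.
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=5)

sf.write("voice_shifted.wav", shifted, sr)
```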

However, these effectors generalize poorly: the parameters must be re-tuned for each person's timbre to approximate a given target, and different phonemes in a language shift in different ways, so an effector tuned with one set of parameters may sound right only for certain pronunciations. This makes many voice-changing effects very unstable. The software voice changers used by many hosts in social and live-streaming scenarios, and the voice-changing effects built into entertainment sound cards, which we mentioned at the beginning of the article, mostly work this way: they chain traditional effectors and do not operate on the voiceprint. Not only is the effect unstable, the range of available voices is also very limited, and the voice cannot be transformed into that of a designated person at will.

2. AI voice-changing algorithms: The development of AI offered a way out of the tedious process of tuning traditional effectors for each person and each sound. Early AI voice-changing algorithms were mainly based on statistical models, whose core idea was to find a spectral mapping between the source speaker's voice and the target voice. Such a model has to be trained on a parallel corpus: for every sentence spoken by the source speaker, the target speaker must have a recording with the same content, so each training sample pairs source and target speech with identical linguistic content. Although models in this framework achieved some success in voice changing, such paired data is scarce, and the approach is difficult to extend to multi-speaker scenarios.

The mainstream AI voice-changing algorithms of recent years solve these problems with a non-parallel training framework, which has greatly enriched the application scenarios of voice changing, such as the transfer of timbre, emotion, and style. The core idea of non-parallel training is to decouple the linguistic features of speech from the non-linguistic factors (such as timbre and pitch), and then recombine these factors to generate new speech. It does not rely on paired speech corpora, which greatly reduces the cost of data acquisition. The framework also lends itself to knowledge transfer: speech recognition and voiceprint recognition models pre-trained on massive data can be reused to extract the linguistic content and the voiceprint features.
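The decouple-and-recombine idea can be sketched as a small encoder-decoder network. The following PyTorch sketch is illustrative only: the layer types and dimensions are assumptions, not the architecture of any particular product, and a real system would use a pre-trained ASR bottleneck and voiceprint model rather than these simple GRU encoders.

```python
import torch
import torch.nn as nn

class VoiceConversionSketch(nn.Module):
    def __init__(self, n_mels=80, content_dim=256, speaker_dim=128):
        super().__init__()
        # Content encoder: frame-level linguistic features (a real system
        # would strip timbre here, e.g. via an ASR-pretrained bottleneck).
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)
        # Speaker encoder: a single utterance-level voiceprint embedding.
        self.speaker_enc = nn.GRU(n_mels, speaker_dim, batch_first=True)
        # Decoder: recombine content frames with the target voiceprint.
        self.decoder = nn.GRU(content_dim + speaker_dim, n_mels, batch_first=True)

    def forward(self, source_mel, target_mel):
        content, _ = self.content_enc(source_mel)  # (B, T, content_dim)
        _, speaker = self.speaker_enc(target_mel)  # (1, B, speaker_dim)
        speaker = speaker[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.decoder(torch.cat([content, speaker], dim=-1))
        return out                                 # converted mel spectrogram

model = VoiceConversionSketch()
converted = model(torch.randn(2, 120, 80), torch.randn(2, 90, 80))
print(converted.shape)  # torch.Size([2, 120, 80])
```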

With the development of deep learning, AI-based voice-changing algorithms have multiplied, and compared with traditional methods they have significant advantages in similarity to the target timbre and in naturalness. According to the number of source and target speakers supported by a single voiceprint conversion model, they can be divided into one-to-one, many-to-many, any-to-many, and any-to-any, where "one" denotes a single timbre and "many" denotes a limited set of specified timbres. Early academic research focused mainly on one-to-one and many-to-many architectures. Any-to-many is the model used by much AI voice-changing software: in such software, any user can choose one of a dozen or so preset voice effects to transform into.

"Any", by contrast, is an open set: any-to-any means that anyone's voice can be transformed into the voice of anyone else. This represents the ultimate goal of voiceprint voice-changing technology, where everyone can transform into the voice of a specified person, or of any person, realizing a "clone" of the voice. It is also the direction that the "Shengwang Real-time Voiceprint Voice Changer" aims to achieve.

02 From any-to-many to any-to-any, real-time voiceprint voice changing must overcome multiple challenges

Although mainstream voiceprint voice-changing algorithms can already achieve any-to-many conversion with the help of AI, research on voiceprint voice changing has mostly focused on offline or asynchronous scenarios, such as recording a clip with voice-changing software in advance, generating converted speech in a specified target voice, and sending it to the other party. Surveys show that in social, live-streaming, and metaverse scenarios, more and more players want real-time voice-changing effects while interacting over audio and video. According to Shengwang, in real-time interaction, voiceprint voice changing faces multiple challenges:

  • Linguistic content integrity: In real-time interaction, dropped or mispronounced words not only make it very difficult for the listener to understand; the loss of a keyword (such as "no") can change the overall semantics, with fatal impact on the interaction.
  • Real-time factor: The real-time factor is the ratio of the model's processing time for a piece of audio to the audio's duration; the lower, the better. For example, if it takes 1 minute to process 2 minutes of speech, the real-time factor is 1/2 = 0.5. In theory, an end-to-end real-time factor below 1 is enough to support real-time voice changing, but to absorb compute jitter and keep the service stable, a lower factor is required, which places tight constraints on model size and compute (see the sketch after this list).
  • Algorithm delay: Most current voice-changing algorithms rely on future speech frames when processing the current frame; the duration of that lookahead is the algorithm delay. In real-time interaction, people can perceive a delay of about 200 ms, and too high a delay greatly dampens users' willingness to participate. For example, if the other party only hears the converted voice more than 1 second after a user finishes speaking, many people will simply not use the feature in chat scenarios.
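To make the two metrics concrete, here is a small illustration in Python. The real-time factor numbers are the example from the list above; the frame and lookahead durations are illustrative assumptions, not measurements of any real system.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    # Processing time divided by audio duration; must stay below 1.0
    # (with headroom for compute jitter) to keep up with live speech.
    return processing_seconds / audio_seconds

print(real_time_factor(60.0, 120.0))  # 1 min to process 2 min of audio -> 0.5

# Algorithm delay: if a model needs `lookahead_ms` of future audio to
# process each frame, the listener hears the converted voice at least
# that much later than the speaker. Values here are illustrative.
frame_ms, lookahead_ms = 10, 190
print(f"minimum algorithmic delay: {frame_ms + lookahead_ms} ms")
```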

Given this, how does the Shengwang audio technology team tackle algorithm delay and the real-time factor of audio processing, and achieve a breakthrough to any-to-any voice changing?

First of all, "SoundNet Real-time Voiceprint Voice Changer" first extracts the frame-level phoneme features of the voice through the voice recognition model, and the voiceprint recognition model extracts the voiceprint features, and then passes the two together to the speech synthesis module to synthesize the spectrum after voice change. Finally, the AI vocoder is used to synthesize the time domain waveform signal. These three modules all support streaming data processing . Streaming processing is mainly aimed at the high freshness value of data, which needs to provide faster valuable information. Usually, the processing result needs to be obtained within hundreds or even tens of milliseconds after the trigger starts. The real-time processing and low-latency of phoneme and voiceprint data are manifested in the real-time processing. People also need to ensure the smoothness of communication when using the voice-changing effect to communicate. It is not possible for one party to speak a word and the other party to hear the voice change after several seconds.


At the level of neural network design, Shengwang mainly uses CNN (convolutional neural network) and RNN (recurrent neural network) structures to extract, respectively, the local and the long-range temporal features of the speech signal. Speech is short-term stationary, so a CNN can effectively extract frame-level phoneme features, while an RNN models the features of speech that change more slowly over time, such as words, whose pronunciation generally lasts hundreds of milliseconds. Shengwang therefore uses RNN-based networks with temporal memory to build the spectrum conversion module and model the temporal characteristics of speech.
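A minimal PyTorch sketch of this CNN-plus-RNN division of labor follows. The layer sizes and kernel widths are illustrative assumptions, not Shengwang's actual network; the unidirectional GRU is used because it does not depend on far-future frames, which is what keeps such a design streamable.

```python
import torch
import torch.nn as nn

class CnnRnnEncoder(nn.Module):
    def __init__(self, n_mels=80, conv_dim=256, rnn_dim=256):
        super().__init__()
        # Local (frame-level) features: small receptive field over time,
        # matching the short-term stationarity of speech.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, conv_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(conv_dim, conv_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Long-range temporal features: a unidirectional GRU with memory
        # over hundreds of milliseconds (roughly word-length spans).
        self.rnn = nn.GRU(conv_dim, rnn_dim, batch_first=True)

    def forward(self, mel):                 # mel: (B, T, n_mels)
        x = self.conv(mel.transpose(1, 2))  # (B, conv_dim, T)
        x, _ = self.rnn(x.transpose(1, 2))  # (B, T, rnn_dim)
        return x

enc = CnnRnnEncoder()
feats = enc(torch.randn(2, 100, 80))        # 2 clips, 100 frames each
print(feats.shape)                          # torch.Size([2, 100, 256])
```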

The data an RNN processes is serialized data whose samples are correlated in time: the current output of the sequence also depends on the outputs before it, just as the words in a stretch of speech depend on the words that came before. This design not only saves compute effectively but also significantly reduces algorithm delay. The current algorithm delay of the "Shengwang Real-time Voiceprint Voice Changer" can be as low as 220 ms, which is at the leading level in the industry.

In addition, Shengwang has independently trained its speech recognition module on massive data, so it can accurately extract frame-level phoneme features, greatly reducing wrong or missing words after voice change and preserving the integrity of the linguistic content. Shengwang has likewise trained a voiceprint recognition model on massive data to extract the target speaker's timbre features, significantly improving the timbre similarity between the converted voice and the target speaker, and finally realizing the any-to-any voice-changing capability.

Compared with traditional voice-changing software, the real-time voiceprint voice changer, with its real-time and any-to-any capabilities, can play a greater role in chat rooms, live streaming, games, the metaverse, and other scenarios. It not only enhances users' immersion and entertainment experience in these scenarios, but is also expected to further improve an application's user activity, usage time, and revenue.

For example, in a traditional voice chat room, voice-changing software can only turn a user into a generic "cute girl" or "uncle". The real-time voiceprint voice changer can change a user's voice into one resembling a celebrity's, turning an otherwise ordinary chat room into a celebrity chat room.

In metaverse scenarios such as Meta Chat, real-time voiceprint voice changing can be combined with 3D spatial audio to further enhance the user's immersion. While controlling their own animated character in a chat, a user's voice can become that of the corresponding character, such as SpongeBob SquarePants, Patrick Star, or Doctor Octopus; perceptually, it feels like stepping into the real animated world, which effectively deepens immersion.

Building on this, real-time voiceprint voice changing can also further expand the audio value of film and animation IP: the voices of well-known film and animation characters can be used in real-time audio and video interaction in chat rooms, live-streaming rooms, in-game voice, and other scenarios. A richer entertainment experience can in turn improve users' in-app usage time, payment rate, and so on.

At present, "Soundnet Real-time Voiceprint Voice Changer" has been opened for open testing. If you want to further consult or access real-time voiceprint voice changer, you can click "here" to leave your information, we will contact you in time to do further communication.
