Foreword
"Voice processing" is a very important scene in the field of real-time interaction. In the " RTC Dev Meetup丨Technical Practice and Application of Voice Processing in the Field of Real-time Interaction ", technical experts from Shengwang, Microsoft and Shumei focused on this topic. related sharing.
This article is based on the sharing of content by Feng Jianyuan, an expert in the audio experience algorithm of Shengwang. Follow the official account "Shengwang Developer " and reply to the keyword " DM0428 " to download PPT materials related to the event.
01 The dilemma of real-time voice-changing algorithms based on traditional sound effects
1. What does voice changing change?
■Figure 1
Identifying a person by their voice involves many dimensions.
First of all, each person's vocal tract is different: the opening and closing of the mouth and the vibration of the vocal cords in the larynx differ acoustically from person to person, so each person's voice has a different timbre. On top of that, the way a person uses language to express themselves produces different prosody. Secondly, each person is in a different room, so the voice may carry different reverberation, which also affects recognition. In addition, singing with a changed voice may further require instrumental accompaniment and some knowledge of music theory.
Moreover, our perception of sound is also affected by psychology. For the same voice, some people may find it magnetic while others find it rough; psychological perception varies from person to person.
2. Why do people have different timbres?
So how can the voice be changed in a real-time scenario? Because the real-time requirement is high, we cannot change word choice, sentence construction, or the longer-scale prosody, but we can change the timbre produced by the vocal tract.
Take the sentence "what day is it today" ("今天是几号") as an example. Its words can be split into phonemes; "今" (jin), for instance, splits into the two phonemes "j" and "in". It is a voiced sound, produced by vocal cord vibration, and this is where each person's timbre differs the most. "是" (shi), on the other hand, is a voiceless sound, produced by the airflow through the lips and teeth, so for this word the timbre difference between people is relatively small. Why is this?
■Figure 2
Essentially, the frequency at which the vocal cords vibrate determines the pitch of the voice. Different people produce the same sound with different vocal cord vibration frequencies, so their timbres differ greatly on voiced sounds. In the example above, the voiced parts of "what day is it today" may sound quite different across speakers, while the voiceless "是" differs only slightly.
Traditional voice-changing algorithms start from this perspective. Different people have different fundamental frequencies, so the distribution of the fundamental frequency can be adjusted according to the vocal cord vibration rate. Each fundamental frequency carries its own harmonics, which are multiples of the fundamental, so the voice can be changed by transposing the pitch.
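As a concrete illustration, here is a minimal pitch-shifting sketch in Python using the open-source librosa library. The input file name is hypothetical, and this is only an example of the "transpose" operation described above, not the algorithm used by any particular product.

```python
import librosa
import soundfile as sf

y, sr = librosa.load("voice.wav", sr=16000)                 # hypothetical input file

# Shift by +/- 4 semitones: the fundamental frequency and all of its
# harmonics move together, which is the basic "transpose" operation.
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)     # toward a higher voice
y_down = librosa.effects.pitch_shift(y, sr=sr, n_steps=-4)  # toward a lower voice

sf.write("voice_up.wav", y_up, sr)
sf.write("voice_down.wav", y_down, sr)
```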
■Figure 3
When making different sounds, the degree to which the mouth opens and closes determines the resonance of the vocal tract. Different degrees of opening enhance or weaken different parts of the frequency response: frequency bands that match the resonance frequency of the mouth are enhanced, while others are weakened. This is the principle of formants. From the perspective of voiced sounds, the vibration of the vocal cords generates a fundamental frequency with corresponding harmonics, and the distribution of energy across those harmonics is determined by the shape of the vocal tract. Timbre can therefore be adjusted by changing the distribution of the fundamental frequency and the frequency response.
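To make the formant idea concrete, the sketch below boosts one frequency band of a signal with a resonant peaking filter from SciPy. The 1 kHz center frequency and the gain are arbitrary example values, and white noise stands in for speech; this is an illustration of frequency-response shaping, not a formant filter from any real product.

```python
import numpy as np
from scipy import signal

fs = 16000                               # sample rate, Hz
f0 = 1000.0                              # example "formant" center frequency, Hz
b, a = signal.iirpeak(f0, 5.0, fs=fs)    # narrow resonant band-pass around f0

x = np.random.randn(fs)                  # 1 s of white noise standing in for speech
band = signal.lfilter(b, a, x)           # isolate the band around f0
x_boosted = x + 3.0 * band               # mix it back in: energy near 1 kHz is enhanced
```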
3. Voice changing based on traditional sound effects
Traditional algorithms adjust these different dimensions of the voice through different effectors. Figure 4 shows the effectors commonly used in traditional algorithms, such as the pitch shifter. Most current pitch-shifting algorithms scale the fundamental frequency and its harmonics up or down. In this process the shift also moves the formants, so the spectral energy as a whole is shifted up or down, thereby changing the fundamental frequency of the voice.
The fundamental frequency of female voices is relatively high and that of male voices relatively low, and this method can bring the two into the same range. However, the opening and closing of the mouth also changes from word to word for male and female speakers, so changing the fundamental frequency alone cannot achieve a convincing transformation. For example, in the movie "Despicable Me", the Minions speak Spanish with the fundamental frequency raised, which turns it into a child-like voice, but the result is not natural and still needs to be adjusted with an equalizer.
In the previous example, only the fundamental frequency was changed during pitch shifting. In fact, the formants and frequency response can also be adjusted with an equalizer or a formant filter, which are key modules for shaping timbre. In addition, people trained in bel canto produce more high-frequency content and a fuller sound when singing or speaking; from this perspective, the harmonics can be enhanced with an exciter.
■Figure 4
Most traditional algorithms change the voice along different dimensions through a series chain of effectors. Commonly used software includes MorphVOX Pro and Voicemod; they ship different effects, and different character voices can be produced by adjusting parameters. Selecting a voice effect is really the superposition of different effectors. Everyone's voice still ends up different after the change, because everyone starts from a different baseline; it is difficult to make everyone sound identical, but a directional adjustment is possible.
I used the MorphVOX software to do a male-to-female and a female-to-male conversion. When a male voice is changed into a female voice with the default settings, it sounds like a Minion. When a female voice is changed into a male voice, the result sounds a bit dull, as if the speaker has a heavy nasal voice. This happens because converting a female voice to a male voice requires shifting the pitch down, which compresses the whole spectrum downward and loses a lot of high-frequency information.
It can be seen that traditional effectors all have their own defects, so the transformation is not very accurate. To get a better result, the traditional chained approach requires manual parameter tuning and still cannot achieve precise timbre transformation; if only presets are used, the result is not ideal. Moreover, turning an arbitrary voice into that of a specific designated person is very difficult and can probably only be done by professional tuning engineers.
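For illustration, here is a hedged sketch of such a series chain (a pitch shifter, a band boost standing in for an equalizer, and a simple exciter) assembled from librosa and SciPy. All parameters and the input file name are illustrative; this is not the processing chain of MorphVOX or any other product.

```python
import numpy as np
import librosa
from scipy import signal

def simple_voice_chain(y, sr, n_steps=4, eq_hz=3000.0, eq_gain=2.0, drive=4.0):
    # 1) Pitch shifter: move the fundamental and its harmonics together.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

    # 2) "Equalizer": emphasize one band by adding back its band-passed copy.
    b, a = signal.iirpeak(eq_hz, 4.0, fs=sr)
    y = y + eq_gain * signal.lfilter(b, a, y)

    # 3) "Exciter": create extra high-frequency harmonics by soft-clipping
    #    the high-passed signal and mixing a little of it back in.
    hb, ha = signal.butter(2, 4000.0, btype="highpass", fs=sr)
    y = y + 0.2 * np.tanh(drive * signal.lfilter(hb, ha, y))

    return y / (np.max(np.abs(y)) + 1e-9)    # normalize to avoid clipping

y, sr = librosa.load("voice.wav", sr=16000)  # hypothetical input file
out = simple_voice_chain(y, sr)
```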
02 ASR+TTS=VC? Exploration of the possibility of real-time voice change based on AI
1. ASR
Traditional voice changing has many limitations, so can AI-based methods improve the real-time voice-changing effect? We know that ASR technology can convert speech to text. Voice changing also needs to preserve the semantic information; adjusting only the overall spectrum may cause missing words or wrong expressions.
As shown in Figure 5, there are many recognition frameworks, such as hybrid frameworks like Kaldi. A hybrid framework combines an acoustic model, a pronunciation dictionary, and a language model to judge whether the recognized speech is reasonable under both the acoustic and the linguistic conditions.
■Figure 5
From this point of view, the hybrid pipeline is relatively interpretable and easy to build, because many ready-made pronunciation dictionaries and language models can be reused. It also has shortcomings: the decoding process is relatively complicated, and since the framework contains multiple models, a problem in any one of them causes deviations in recognition.
In response to this, many end-to-end speech recognition frameworks have appeared, such as ESPnet. From the perspective of a general model, with sufficient data an end-to-end framework can achieve better recognition results than a hybrid one. Its training process is also simpler: unlike the hybrid framework, which requires training an acoustic model and a language model and then decoding, an end-to-end framework only needs paired speech and text data and is trained directly end to end. Its shortcomings are also obvious: because it is trained end to end, it is more demanding, requiring a larger training set and more accurate annotations. In addition, different scenarios may need customization and therefore more data, and for an end-to-end model there may not be a good enough corpus available for training.
For voice changing, it is actually not necessary to recognize text; it is enough to recognize the phonemes accurately. This is the difference from a full ASR model.
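As an illustration of character-level recognition, the sketch below runs a pretrained wav2vec 2.0 model from torchaudio with a greedy CTC decoder. The model choice, the input file name, and the English-only label set are assumptions made for the example, not the ASR used in the system described here.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 fine-tuned for English ASR (character-level CTC).
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()              # ('-', '|', 'E', 'T', ...): '-' is the CTC blank

waveform, sr = torchaudio.load("speech.wav")      # hypothetical input file
waveform = waveform.mean(dim=0, keepdim=True)     # force mono
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)                # (batch, frames, num_labels)

# Greedy CTC decoding: best label per frame, collapse repeats, drop blanks.
indices = emissions[0].argmax(dim=-1).tolist()
pieces, prev = [], None
for idx in indices:
    if idx != prev:
        label = labels[idx]
        if label == "|":                          # word separator in this label set
            pieces.append(" ")
        elif label != "-":
            pieces.append(label)
    prev = idx
print("".join(pieces))
```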
2. TTS
ASR can recognize the pronunciation; to have the same content spoken with a different voice, TTS is needed. There are many ready-made frameworks for this. The process is essentially: normalize the text extracted by ASR, predict the spectrum, and then generate sound through a vocoder, as shown in Figure 6. Google's Tacotron and Microsoft's FastSpeech have implemented this and can synthesize speech with low latency. For the pipeline in Figure 6, FastSpeech can even skip the vocoder and generate speech directly, that is, go end to end from text to speech. By combining ASR and TTS in this way, voice changing can be achieved.
■Figure 6
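As a concrete example of the text → spectrum → vocoder pipeline in Figure 6, the sketch below uses the pretrained Tacotron2 + WaveRNN bundle shipped with torchaudio. This is only an illustrative stand-in under the assumption that a pretrained English model is acceptable, not the TTS used in the system described here.

```python
import torch
import torchaudio

# Character-based Tacotron2 + WaveRNN bundle pretrained on LJSpeech.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()   # text normalization / tokenization
tacotron2 = bundle.get_tacotron2()        # text tokens -> mel spectrogram
vocoder = bundle.get_vocoder()            # mel spectrogram -> waveform

text = "What day is it today?"
with torch.inference_mode():
    tokens, lengths = processor(text)
    mel, mel_lengths, _ = tacotron2.infer(tokens, lengths)
    waveforms, _ = vocoder(mel, mel_lengths)

torchaudio.save("tts_output.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)
```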
Vocoder technology in TTS is actually relatively mature. What a vocoder does is predict speech from compressed spectral information. Figure 7 shows the MUSHRA scores of different vocoders, which can be understood as a measure of the naturalness of the generated speech and used to judge whether the conversion sounds natural.
We can see that sample-by-sample vocoders such as WaveNet and WaveRNN achieve very good results in speech synthesis, not far from the naturalness of a real human voice. In addition, HiFi-GAN and MelGAN can achieve an effect similar to WaveRNN while offering good real-time performance. With the development of vocoder technology, speech generation quality has improved greatly, which is the premise for a voice-changing effector.
■Figure 7
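To show what "predicting speech from compressed spectral information" means in the simplest possible terms, the sketch below reconstructs audio from a mel spectrogram with Griffin-Lim via librosa. Griffin-Lim is only a classical baseline that the neural vocoders above would replace, and the file names are hypothetical.

```python
import librosa
import soundfile as sf

# Compute a mel spectrogram (the "compressed spectral information")...
y, sr = librosa.load("voice.wav", sr=22050)     # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# ...and invert it back to a waveform with Griffin-Lim. A neural vocoder
# (WaveNet / WaveRNN / HiFi-GAN / MelGAN) would replace exactly this step.
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("reconstructed.wav", y_rec, sr)
```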
3. VC
Figure 8 shows the basic steps of voice conversion through an ASR + TTS cascade. Speaker A speaks a sentence; ASR extracts text or phoneme information that is independent of the speaker; TTS then re-synthesizes it with Speaker B's timbre, achieving the voice change.
■Figure 8
In this pipeline, if the voice can only be changed from one specific person to another, it is a one-to-one conversion. But ASR is actually speaker-independent, so the pipeline can do any-to-one: no matter who the speaker is, the ASR model can recognize the text accurately, so anyone can be turned into Speaker B. The ultimate goal of voice changing is the any-to-any form, where a single model can turn anyone's voice into any target voice, greatly expanding the voice-changing capability.
We know that many end-to-end models, such as CycleGAN, StarGAN, or VAE-GAN, can achieve voice conversion within a limited set, so that people in the training set can be converted into each other, but that is also the limitation of their design. If the timbre of the TTS can be switched, the output voice becomes customizable and can be made into anyone's voice. So in theory, ASR + TTS = VC is achievable.
To reach the any-to-any form, we need a fast way to add a new speaker without retraining the whole model. This borrows the idea of transfer learning. Figure 9 shows a speech generation method based on transfer learning.
■Figure 9
The idea of transfer learning here is to run an embedding operation on the target timbre through a speaker encoder module to extract the timbre features, and then feed those features into the speech generation module (the synthesizer), so that speech can be generated with different timbres. This was demonstrated earlier by Google: they added a speaker encoder module to TTS to achieve different timbres. When a new speaker ID is added, only a minute or a few tens of seconds of that speaker's corpus is needed to extract the speaker's features and generate the corresponding timbre. This is essentially a transfer learning approach.
There are many different speaker encoding methods, such as i-vector, x-vector, Google's GE2E (generalized end-to-end, which is mainly a loss design), as well as Baidu's Deep Speaker, Korea's RawNet, and others. We can therefore take the pipeline apart: besides ASR and TTS, we add a speaker encoder module to match the timbre of the target speaker and achieve any-to-any conversion.
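As an example of extracting a voiceprint embedding, the sketch below uses Resemblyzer, an open-source GE2E-style speaker encoder. The library choice and the file names are assumptions made for illustration, not the encoder used in the system described here.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()                      # loads a pretrained GE2E-style model

wav_a = preprocess_wav("speaker_a.wav")       # hypothetical input files
wav_b = preprocess_wav("speaker_b.wav")

embed_a = encoder.embed_utterance(wav_a)      # 256-dim, L2-normalized embedding
embed_b = encoder.embed_utterance(wav_b)

# Embeddings are unit-length, so the dot product is the cosine similarity.
similarity = float(np.dot(embed_a, embed_b))
print(f"speaker similarity: {similarity:.3f}")
```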
03 Algorithm implementation of real-time voice changing system
1. Algorithm framework of real-time voice changing system
Next, let's look at how a real-time voice-changing system is implemented. As shown in Figure 10, a small corpus of the target speaker is fed into the voiceprint recognition (speaker encoder) module to obtain the voiceprint features, while phoneme features are extracted by the speech recognition module. The spectrum conversion module then generates the spectral features, and finally the vocoder produces the voice.
■Figure 10
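The following structural sketch mirrors the pipeline of Figure 10. The four module interfaces are hypothetical placeholders (any concrete speaker encoder, phoneme encoder, spectrum converter, and vocoder could be plugged in), so it shows the data flow rather than a production implementation.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class RealTimeVoiceChanger:
    speaker_encoder: Callable[[np.ndarray], np.ndarray]                 # target corpus -> voiceprint
    phoneme_encoder: Callable[[np.ndarray], np.ndarray]                 # audio frame   -> phoneme features
    spectrum_converter: Callable[[np.ndarray, np.ndarray], np.ndarray]  # (phonemes, voiceprint) -> mel
    vocoder: Callable[[np.ndarray], np.ndarray]                         # mel frames    -> waveform samples

    def prepare_target(self, target_corpus: np.ndarray) -> np.ndarray:
        """Run once, offline: a short corpus of the target speaker is enough."""
        return self.speaker_encoder(target_corpus)

    def process_frame(self, frame: np.ndarray, voiceprint: np.ndarray) -> np.ndarray:
        """Runs per 10-20 ms hop; every stage must fit the latency budget."""
        phonemes = self.phoneme_encoder(frame)
        mel = self.spectrum_converter(phonemes, voiceprint)
        return self.vocoder(mel)
```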
The whole pipeline actually has both offline and online parts: the voiceprint recognition module can be trained offline. The main difficulties are the following:
① Real-time voice changing must consider both computing power and real-time performance. In terms of computing power, speech recognition, spectrum conversion, and the vocoder each require on the order of GFLOPS even with the fastest algorithms in the industry, so real-time operation is only barely achievable (a rough back-of-the-envelope check is sketched after this list).
② As for real-time performance, if the voice change introduces a long delay in real-time communication, responses will feel slow, which matters even more in an RTC environment. The end-to-end delay therefore needs to be as small as possible, generally no more than 1 second; beyond 1 second the delay of the voice change becomes clearly noticeable. The end-to-end delay also includes encoding, decoding, capture, and playback on the link, so ultimately the delay of the algorithm itself cannot exceed 400 milliseconds.
③ To achieve any-to-any voice changing, whether the effect is stable and whether the word error rate is low enough is also a challenge. The word error rate of speech recognition is already decent at about 5%; a lower error rate may require a larger model or a more targeted scenario. In addition, the similarity of the converted timbre depends on whether the voiceprint recognition module can extract the timbre accurately, which is also a big challenge.
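A rough back-of-the-envelope check of point ①, with purely hypothetical numbers (the text only states that each stage costs on the order of GFLOPS):

```python
# All numbers below are hypothetical assumptions for illustration.
model_gflops_per_second_of_audio = 5.0   # compute needed to process 1 s of audio
device_sustained_gflops = 20.0           # throughput actually available to the app

real_time_factor = model_gflops_per_second_of_audio / device_sustained_gflops
print(f"real-time factor: {real_time_factor:.2f}")
# Must stay well below 1.0 for one stage, and the stages together must still
# leave room for the < 400 ms algorithm-latency budget.
```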
2. Real-time voice changing system
In fact, different speech recognition, voiceprint recognition, and vocoder frameworks can be freely combined into a complete voice-changing system. But to keep the algorithm delay within 400 milliseconds, the deployment needs careful consideration. In the example shown in Figure 11, when deploying a real-time voice-changing system, we consider whether to perform the voice change in the cloud or on the device side. The two approaches have their own advantages and disadvantages. Let's first look at the pros and cons of changing the voice in the cloud.
■Figure 11
If the voice is changed in the cloud, the speaker's audio is collected locally, passed through the APM audio processing module for noise suppression and echo cancellation, and then encoded. The encoded voice is transmitted to the server for the voice change. This leg is affected by the network, so a NetEQ module may be needed to resist weak networks, and the voice change is performed after decoding. The changed audio then has to be re-encoded and sent to the listener, who also uses NetEQ against weak networks before decoding and playback.
Compared with a device-side implementation, this adds NetEQ plus encoding and decoding on the server. Without a weak network, the added delay may be 30-40 milliseconds; with a weak network, resisting packet loss may push the added delay to 100 or 200 milliseconds, or even higher.
The advantage of cloud-based voice changing is also obvious: the compute puts relatively little pressure on the server, so a better vocoder, ASR, or spectrum conversion model can be deployed there to improve sound quality. If the device side is considered instead, the voice change can be placed at the sending end: since there may be more than one receiver, doing the voice change once at the sender saves the extra NetEQ and codec pass on the server, while the rest of the link has the same delay as a cloud deployment. From this point of view, an end-to-end call without voice changing can generally achieve a delay of about 400 milliseconds; adding the 400-millisecond algorithm delay, the end-to-end delay with voice changing stays below 800 milliseconds, which largely reduces the delay cost in real-time transmission and keeps the conversation smooth.
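Putting the figures above together (with the weak-network penalty taken as the midpoint of the quoted 100-200 ms range):

```python
# All values in milliseconds, taken from the figures quoted above.
base_end_to_end = 400        # capture, 3A, codec, transport, playback (no voice change)
algorithm_delay = 400        # upper bound for the voice-changing algorithm itself
extra_server_leg = 35        # extra NetEQ + codec for cloud voice change, good network
weak_network_extra = 150     # midpoint of the quoted 100-200 ms under packet loss

device_side_total = base_end_to_end + algorithm_delay        # ~800 ms
cloud_total_good = device_side_total + extra_server_leg      # ~835 ms
cloud_total_weak = device_side_total + weak_network_extra    # ~950 ms or more
print(device_side_total, cloud_total_good, cloud_total_weak)
```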
Device-side voice changing is limited by computing power. As mentioned earlier, the required compute is on the order of GFLOPS. Devices such as the iPhone X or newer have GPU processing chips, so real-time computation on the device is possible, but the model cannot be too large.
04 Demo display and application scenarios
■Figure 12
Next, let's look at the effect of a demo built with the pipeline described above, and its application scenarios. The demo is a conversation between a man and a woman. In virtual social interaction, the voice produced may not match the character. For example, if a girl wants to present herself as a boy but keeps her original voice, it will not match the boy's image, causing a gender mismatch. Comparing the demo before and after voice changing shows that the voice-changing effect can make the character and the voice correspond and solve this problem. In addition, the conversation was not interrupted by the voice change.
1. Application scenarios of real-time voice change
So in what scenarios will voice changing be used? In fact, it is useful both in metaverse voice chat (MetaChat) scenarios and in traditional live voice chat. For example, users may want to customize their avatar, including both the image and the voice, to enhance immersion: if you play a cute girl but speak with a deep voice, you will be recognized quickly. Another example is games, where speaking with the voice of the target character gives a stronger sense of immersion.
In addition, there are many virtual digital human scenarios. Many celebrities run live streams through customized virtual avatars; they do not need to enter the live room themselves, and the stream can be kept running around the clock through back-end operation. A representative chat-room scenario is online script-based murder mystery games (jubensha): when users play their roles, different scenes require different voices, which can be achieved by voice changing.
2. Better real-time voice changing effect
As far as the current real-time voice-changing effect is concerned, there are still many directions for optimization. Here are a few examples for discussion. First, as mentioned above, to guarantee real-time performance we only adjust the speaker's timbre. In fact, the perceived similarity could be improved by also modeling the speaker's cadence and the emotion of the delivery. So, while still meeting the real-time requirement, more exploration can be done to bring more expressive features into the voice-changing effector.
Second, as far as vocoders are concerned, we want lower computing cost and better results. Current vocoders already perform quite well in terms of naturalness, but sample-by-sample vocoders are still very demanding in computing power. Computing power and quality therefore need to be balanced to find a more suitable vocoder.
In addition, a more robust phoneme extraction module is a major problem for ASR. In noisy or complex scenes, sounds other than the human voice affect recognition. A noise reduction module can be introduced to extract the human voice, or a phoneme extraction module that is more robust to noise can be developed. When we talk, whether for ASR or voice changing, multilingual speech is also a problem, for example English mixed into Chinese, or even quoted Japanese. This is the code-switching problem, which has no perfect solution yet and is a direction for future improvement.
Ultimately, the hardware determines the upper limit of model complexity that can be deployed. With better hardware on the device, better voice-changing effects can be achieved while reducing end-to-end latency. This is also a direction worth watching.
05 Q&A session
1. Is there any way to effectively reduce the delay for real-time voice change?
For real-time voice changing, especially when interaction is required, first of all it is best to deploy on the device side, which avoids the extra delay introduced by server-side transmission and weak-network resistance. Secondly, on the algorithm side, modules such as real-time streaming ASR use look-ahead: a few future frames are needed to ensure recognition accuracy. Whether in ASR, TTS, or spectrum conversion, the amount of look-ahead has to be controlled so that real-time performance is guaranteed without hurting the result too much. In practice, the overall algorithm delay can be controlled within 400 milliseconds, roughly 100 milliseconds per module.
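A small illustration of how look-ahead turns into algorithm delay, with assumed hop size and per-module frame counts:

```python
hop_ms = 10                                                    # assumed analysis hop per frame
lookahead_frames = {"asr": 8, "conversion": 6, "vocoder": 6}   # assumed per-module look-ahead

per_module_ms = {name: n * hop_ms for name, n in lookahead_frames.items()}
total_ms = sum(per_module_ms.values()) + hop_ms                # plus one frame of buffering
print(per_module_ms, total_ms)                                 # e.g. 80 / 60 / 60 -> 210 ms
# Keeping each module's look-ahead within roughly 100 ms keeps the whole
# algorithm inside the ~400 ms budget mentioned above.
```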
2. Where can I experience the Any-to-any voice changing function? Can individuals use it?
This feature is still being polished, but a demo already exists; it will be released in the next few months for everyone to experience.
3. If a beginner wants to learn audio systematically, are there any good learning resources or learning paths to recommend?
I previously collaborated with Geek Time on the course "Getting the Audio Tech Done". The audio field may be a bit niche, but it spans many areas, and you can find resources on the Internet for systematic study. Besides audio itself, it includes audio 3A processing (the audio link processing), acoustics, and AI-based techniques.
You can start from these angles, look for relevant material, and learn them in combination; the effect will be better.
Further reading : How to "clone" voice based on real-time voiceprint voice change
Upcoming Events
"RTC Dev Meetup - Hangzhou Station", we will focus on big front-end technology, and invite technical experts from Shengwang, Ant Group and Hikvision to share with us the business structure and cross-end practice in the real-time interaction field in the big front-end era.
It's better to take action, click here to register now!