
The rise of the Metaverse has filled people with imagination about what the virtual world of the future will look like. Previously, we revealed how Shengwang's self-developed 3D spatial audio technology can faithfully simulate a real auditory experience in the virtual world and deepen players' immersion. Today, we leave the Metaverse for a moment and return to the real world to talk about how Agora Lipsync, a lip-sync technology developed by Shengwang, drives the mouth movement of a static face avatar using the speaker's speech audio signal.

Before introducing the Agora Lipsync technology, let's take a brief look at two similar technologies in the current industry:

  • Oculus Lipsync: Oculus Lipsync is a Unity integration for synchronizing an avatar's lip movements to speech. It analyzes audio input offline or in real time and predicts a set of visemes for animating the lips of an avatar or non-player character (NPC). To improve the accuracy of audio-driven facial animation, Oculus Lipsync uses a neural network model to learn the mapping between speech and phonemes: the input audio is converted into phonemes by the model, each phoneme corresponds to a specific viseme, and the lip and facial poses of the virtual character are then animated through the Unity integration (a purely illustrative sketch of the phoneme-to-viseme idea follows this list). This technology is mainly used for virtual anchors and games.
  • Facial capture technology: holographic images are used in many press conferences and events. A guest off stage wears dedicated hardware, and their body movements and mouth movements are mirrored in real time by a virtual image on the stage's big screen. To achieve lip synchronization, this approach relies on facial expression capture technology and the related hardware devices.
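To make the phoneme-to-viseme step described above more concrete, here is a purely illustrative sketch. The mapping table and names below are hypothetical examples for explanation only, not Oculus Lipsync's actual viseme set or API.

```python
# Hypothetical mapping from phonemes to viseme labels (mouth-shape classes).
# Illustrative only; not Oculus Lipsync's actual data or API.
PHONEME_TO_VISEME = {
    "AA": "open",      # as in "father": jaw open wide
    "IY": "smile",     # as in "see": lips spread
    "UW": "round",     # as in "blue": lips rounded
    "M":  "closed",    # as in "map": lips pressed together
    "F":  "dental",    # as in "fish": lower lip against upper teeth
}

def phonemes_to_visemes(phoneme_sequence):
    """Convert a recognized phoneme sequence into a viseme sequence
    that an animation system could use to pose an avatar's mouth."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phoneme_sequence]

if __name__ == "__main__":
    # Phonemes for a short utterance, produced upstream by a speech model.
    print(phonemes_to_visemes(["M", "AA", "M", "AA"]))
    # -> ['closed', 'open', 'closed', 'open']
```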

Compared with these two technologies, the core difference of Shengwang's Agora Lipsync is that it requires neither a camera nor facial expression capture. Instead, it uses a generative adversarial network, a deep learning technique, to intelligently associate the pronunciation of Chinese, English, and other languages with mouth shapes and facial expressions, driving a portrait to mimic a real person's mouth movements. It supports both 2D portrait pictures and 3D portrait models.

Next, we will focus on demystifying the technical principles behind Agora Lipsync's voice-driven mouth movement.

Generative Adversarial Networks + Model Lightweighting: Driving Portrait Mouth Movement with Speech Signals

Voice-driven mouth shape technology, as the name suggests, drives the mouth movement of a static face avatar with the speaker's voice audio signal, so that the mouth state of the generated face closely matches the speech. Driving a face image to speak from voice in real time must overcome several challenges. First, a correspondence between voice information and face information has to be found. A phoneme is the smallest unit of pronounceable speech, and each phoneme can be associated with a mouth shape; however, the same phoneme can be produced with more than one mouth state, and different people differ in facial features and speaking styles, so this is a complex one-to-many problem. Second, there are further challenges, such as whether the generated speaker face is distorted and whether the changes of the face and mouth shape over time are smooth. In addition, if the technology is used in low-latency real-time interactive scenarios, issues such as computational complexity must also be considered.


■ Figure 1: For example, for the phoneme /a/, the degree of mouth opening and closing during pronunciation is not unique

The traditional lip-sync approach can be realized by speech processing combined with face modeling, but the number of mouth shapes that can be driven by speech is often limited. Shengwang's Agora Lipsync instead generates speaker face images in real time with deep learning algorithms. Deep learning algorithms keep improving as the scale of data grows: by designing neural networks, features can be extracted from data automatically, reducing the work of hand-designing feature extractors for each problem. Deep learning has achieved brilliant results in many fields such as computer vision and natural language processing.

To realize the voice-driven face image task, we need to map a one-dimensional voice signal onto the two-dimensional pixel space of an image. Shengwang uses the Generative Adversarial Network (GAN) from deep learning. The idea of GAN comes from zero-sum game theory, and the network consists of two parts. One is the Generator, which receives random noise or other signals and generates target images. The other is the Discriminator, which judges whether an image is "real": its input is an image and its output is the probability that the image is a real one. The goal of the generator is to fool the discriminator by producing images close to real ones, while the goal of the discriminator is to tell the generator's fake images apart from real ones. The generator wants its fake images to be realistic enough that the discriminator assigns them a high probability of being real, while the discriminator wants to assign fake images a low probability. Through this dynamic game, a Nash equilibrium is finally reached. There are many examples of such adversarial processes in nature: during biological evolution, prey slowly evolve traits that deceive predators, predators in turn adjust their ability to identify prey, and the two co-evolve.
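For reference, this game can be written as the standard GAN minimax objective (the general formulation from the GAN literature, not an Agora-specific loss), where D(x) is the discriminator's probability that x is real and G(z) is the generator's output for input z:

```latex
\min_{G}\max_{D} V(D,G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes this value by scoring real images high and fake images low, while the generator minimizes it by making its fake images hard to reject.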

After the GAN-based deep neural network is trained, the generator can transform input signals into realistic images. For this purpose, Shengwang designed a deep learning model for the voice-driven image task and trained it on large-scale video corpus data, so that the model can generate a speaking face from the input speech. The model extracts features from the input signals of two different modalities, the voice and the image, to obtain the corresponding image latent vector and speech latent vector, and further learns the implicit mapping relationship between these two cross-modal latent vectors. Based on this relationship, the latent features are reconstructed into a speaker face image that matches the original audio. Besides whether the generated images look realistic, we also need to consider temporal stability and the matching degree of audio and video; for this, we designed different loss functions to constrain them during training. The entire model inference process is implemented end to end.
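The cross-modal structure described above can be illustrated with a minimal PyTorch-style sketch: encode a speech chunk and a reference face image into latent vectors, fuse them, and decode a face whose mouth matches the audio. All layer shapes, input sizes, and module names here are assumptions for illustration, not Agora's actual model.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a mel-spectrogram chunk (B, 1, 80, 16) into a latent vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, mel):
        return self.net(mel)

class ImageEncoder(nn.Module):
    """Encodes a reference face image (B, 3, 96, 96) into a latent vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, img):
        return self.net(img)

class FaceDecoder(nn.Module):
    """Decodes the fused audio+image latent back into a 96x96 face image."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, 64 * 6 * 6)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),   # 12x12
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 24x24
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 48x48
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(), # 96x96
        )
    def forward(self, z):
        x = self.fc(z).view(-1, 64, 6, 6)
        return self.net(x)

class TalkingFaceGenerator(nn.Module):
    """Generator: (reference face, speech chunk) -> face with matching mouth."""
    def __init__(self):
        super().__init__()
        self.audio_enc = AudioEncoder()
        self.image_enc = ImageEncoder()
        self.decoder = FaceDecoder()
    def forward(self, ref_img, mel):
        z = torch.cat([self.image_enc(ref_img), self.audio_enc(mel)], dim=1)
        return self.decoder(z)

if __name__ == "__main__":
    gen = TalkingFaceGenerator()
    fake = gen(torch.randn(2, 3, 96, 96), torch.randn(2, 1, 80, 16))
    print(fake.shape)  # torch.Size([2, 3, 96, 96])
```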

At the same time, Agora Lipsync is also suitable for Chinese, Japanese, German, English and other multilingual voices and people of various skin colors to meet the user experience of different countries and regions.

Figure 2 below gives a more intuitive picture of how the generative adversarial network learns, end to end, to generate speaker face avatars.

Figure 2 can be divided into 4 processes:

1. The Generator in the deep learning model receives a face image and a short piece of speech and, through feature extraction and processing inside the generator, produces a fake portrait image (Fake Image);

2. The "Real Data" in the figure refers to the video sequence used for training, from which the target image that matches the Audio is taken out. Compare the difference between the target image and the Fake Image generated by the Generator, and further update the model parameters in the generator through backpropagation according to the loss function, so that the generator can learn better and generate a more realistic Fake Image;

3. While those differences are being compared, the target image from the Real Data and the Fake Image are fed into the Discriminator, which learns to distinguish real from fake;

4. Throughout training, the generator and the discriminator compete with and learn from each other until their performance reaches a balanced state, after which the final generator produces images ever closer to the real face and mouth shape (a minimal training-loop sketch of this process follows below).
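The following PyTorch-style sketch shows the alternating updates described in steps 1 to 4: the generator maps (reference image, audio chunk) to a fake frame, and the discriminator scores frames as real or fake. The reconstruction/adversarial loss mix, the 0.01 weight, and the optimizer handling are illustrative assumptions, not Agora's actual training recipe; the discriminator is assumed to end with a sigmoid so its output is a probability.

```python
import torch
import torch.nn as nn

bce, l1 = nn.BCELoss(), nn.L1Loss()

def train_step(generator, discriminator, opt_g, opt_d, ref_img, audio, target_img):
    # Step 1: the generator produces a fake frame from image + speech.
    fake_img = generator(ref_img, audio)

    # Step 3: the discriminator learns to tell real frames from fake ones.
    opt_d.zero_grad()
    d_real = discriminator(target_img)         # frames from the real data
    d_fake = discriminator(fake_img.detach())  # generated frames
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    opt_d.step()

    # Steps 2 & 4: the generator is updated to match the target frame
    # (reconstruction loss) and to fool the discriminator (adversarial loss).
    opt_g.zero_grad()
    d_fake = discriminator(fake_img)
    g_loss = l1(fake_img, target_img) + 0.01 * bce(d_fake, torch.ones_like(d_fake))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```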


■ Figure 2: How Generative Adversarial Networks Generate Corresponding Face Images

A deep learning model can generate face images end to end, but its computation and parameter counts are often large. Because of storage and power-consumption constraints, running the algorithm in real time on limited resources remains challenging. Commonly used model lightweighting techniques include hand-designed lightweight structures, neural architecture search, knowledge distillation, and model pruning. In Agora Lipsync's voice-driven mouth shape task, the model designed by Shengwang is essentially an image generation model with a relatively large footprint. Through model lightweighting techniques, we built an end-to-end lightweight voice-driven image model: only the voice stream needs to be transmitted to drive a static image into a speaking face. While preserving the effect, the computation and parameter counts of the model are greatly reduced to meet the deployment requirements of mobile devices, so that a still face image produces mouth movement and achieves audio-visual synchronization.
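Knowledge distillation, one of the lightweighting techniques listed above, trains a small "student" generator to reproduce the output of a large, already-trained "teacher". The minimal sketch below is an illustrative recipe under that general idea, not Agora's actual lightweighting pipeline.

```python
import torch
import torch.nn as nn

def distill_step(teacher, student, opt_student, ref_img, audio, target_img):
    # The big teacher model is frozen and only used to produce reference frames.
    with torch.no_grad():
        teacher_frame = teacher(ref_img, audio)

    # The small student model is trained to match both the ground-truth frame
    # and the teacher's output.
    student_frame = student(ref_img, audio)
    loss = nn.functional.l1_loss(student_frame, target_img) + \
           nn.functional.l1_loss(student_frame, teacher_frame)

    opt_student.zero_grad()
    loss.backward()
    opt_student.step()
    return loss.item()
```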

After introducing the technical principles of Agora Lipsync, let's look at its application scenarios. Compared with Metaverse virtual worlds and real-person video social scenarios, Agora Lipsync fills a gap in voice social scenarios: without turning on the camera, users can still experience the visual presence and gameplay of live video co-hosting. It has great application value in scenarios such as chat rooms, interactive podcasts, and video conferencing.

Chat room: In a traditional chat room, users usually choose real or virtual avatars for voice chat, and topical, interesting conversation is often needed to keep up the content quality and session length of the room. By adding voice-driven mouth movement, the chat becomes more vivid and interesting in form. Players who do not want to turn on the camera can choose a good-looking or funny photo as their avatar, so that everyone can see each other's faces as if they were talking in real life even without cameras, which in turn increases players' motivation to keep chatting in the room.

Interactive podcast: Last year, interactive podcast platforms represented by Clubhouse became popular around the world. Compared with traditional chat rooms, the topics and user relationships on interactive podcast platforms are clearly different: conversations in podcast rooms mainly revolve around technology, the Internet, the workplace, entrepreneurship, the stock market, music, and other topics, and users are quite willing to upload real-life avatars. Adding voice-driven mouth movement makes chats between users more engaging and realistic.

Video conferencing: In video conferencing scenarios, users are often asked to turn on their cameras whenever possible, but for some users this is inconvenient, resulting in meetings where some participants are on video and others are voice-only. Voice-driven mouth movement of a face avatar spares users who cannot turn on the camera from embarrassment and creates the feeling that a real person is participating on video. On the other hand, with voice-driven talking faces, the conference does not need to transmit a video stream, only the voice stream; especially under weak network conditions, this both avoids frozen or delayed pictures and reduces transmission costs.

At present, Agora Lipsync mainly supports 2D portrait pictures and 3D portrait models. In the future, with continued research by Shengwang's algorithm team, the technology will be upgraded further: it will not only support cartoon avatars, but also drive the movement of the head, eyes, and other features from voice, enabling a wider range of application scenarios and value.

If you would like to learn more about or integrate Agora Lipsync, click "Read the original text" to leave your information, and we will contact you promptly for further communication.

