What is the wildly popular "Avatar"?
As the metaverse concept catches fire, the word "Avatar" has been appearing more and more often. In 2009, James Cameron's 3D science-fiction blockbuster Avatar introduced many people to the English word. What many do not know, however, is that the word was not coined by the director: it comes from Sanskrit and is an important term in Hinduism. According to the Cambridge English Dictionary, "avatar" currently has three meanings.
Definitions of "avatar" from the Cambridge Dictionary © Cambridge University Press
The word originated from the Sanskrit avatarana, composed of ava (off, down) and tarati (cross over), literally "to descend to the earth". It refers to the incarnation of a deity descending to the earth, usually specifically the god Vishnu taking human or animal form. The word entered the English vocabulary in 1784.
In 1985, Chip Morningstar and Randy Farmer used the word "Avatar" to refer to the user's on-screen persona when they designed the online role-playing game Habitat for Lucasfilm Games (LucasArts). Then in 1992, science fiction writer Neal Stephenson's novel Snow Crash described a metaverse parallel to the real world, in which every person in the real world has an Avatar; this was the first time the term appeared in mass media.
In the Internet age, programmers began to use the term Avatar widely in software systems to represent a user's image or persona, what we commonly call a profile picture or "personal show". This avatar can be a three-dimensional figure in an online game or virtual world, or the two-dimensional image commonly used in online forums and communities; either way, it is a marker that stands for the user.
From QQ Show to Avatar
Nowadays, letting users create their own avatars has become standard in all kinds of applications, and with advances in technology user avatars have evolved from ordinary 2D images to 3D ones. The milestone event came in 2017, when Apple released a new iPhone X feature, Animoji, which uses the facial-recognition sensors to detect changes in the user's facial expression while recording the user's voice with the microphone, finally generating a cute 3D animated emoji that can be shared with friends via iMessage. The first generation did not support user-defined images, only the animal cartoon avatars built into the system; the updated version (later known as Memoji) began to let users freely "pinch" a face and generate a stylized face avatar. Today this automatic face-generation capability shows up in many scenarios: from just one or a few photos, a CG model matching the user's facial features is generated automatically, though behind the scenes it depends on complex CG modeling and rendering technology.
An Avatar can also skip the expensive CG modeling and rendering pipeline altogether and "stylize" a photographed face with machine learning algorithms: a style learned from training data is automatically transferred onto and fused with the subject's original facial features, creating a stylized face Avatar that still matches the user's face.
Four technical routes for face-stylized Avatars
What is face stylization?
So-called face stylization converts a real face image into an image in a specific style, such as cartoon, anime, or oil painting, as shown in the figure below:
Broadly speaking, face stylization can be implemented through several technical routes: texture mapping, style transfer, cycle-consistent adversarial networks, and latent variable mapping.
Texture mapping
Texture mapping generally starts from a given exemplar image: an algorithm automatically transfers the exemplar's texture, pixel by pixel or patch by patch, onto the target face, forming a plausible, natural face mask that can follow the face [1].
Sample images from [1]
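To make the patch-by-patch idea concrete, below is a minimal, much-simplified sketch of exemplar-based texture transfer in Python. It is not the FaceBlit algorithm from [1] (which adds guidance channels, landmark alignment, and real-time lookup tables); everything here, including the brightness-only matching, is an illustrative assumption.

```python
import numpy as np

def texture_transfer(exemplar: np.ndarray, target: np.ndarray, patch: int = 8) -> np.ndarray:
    """Greedy patch-by-patch texture transfer (simplified illustration only).

    For each patch of the target face, copy the exemplar patch whose mean
    brightness best matches it, so the output keeps the target's structure
    but the exemplar's texture.
    """
    h = (target.shape[0] // patch) * patch
    w = (target.shape[1] // patch) * patch
    out = np.zeros((h, w, 3), dtype=exemplar.dtype)
    # Pre-cut candidate patches from the exemplar.
    eh, ew = exemplar.shape[:2]
    cands = [exemplar[y:y + patch, x:x + patch]
             for y in range(0, eh - patch + 1, patch)
             for x in range(0, ew - patch + 1, patch)]
    lum = np.array([c.mean() for c in cands])  # guidance signal: mean brightness
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            t = target[y:y + patch, x:x + patch].mean()
            out[y:y + patch, x:x + patch] = cands[int(np.argmin(np.abs(lum - t)))]
    return out

# Toy usage with random "images"; in practice these would be an artwork
# exemplar and a photographed face aligned by facial landmarks.
exemplar = np.random.rand(64, 64, 3)
target = np.random.rand(64, 64, 3)
print(texture_transfer(exemplar, target).shape)  # (64, 64, 3)
```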
Style transfer
Style transfer starts from a single style photo or a group of them. A learning-based method extracts a style code from the style images and a content code from the target face image, and the two codes together automatically generate the corresponding stylized picture [2, 3]. However, only the surface texture of the face image is changed; the structural attributes of the face are not reasonably preserved or adjusted, so no meaningful structural change of style takes place.
Sample images from [3]
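One widely used concrete instance of the "style code + content code" idea is adaptive instance normalization (AdaIN), where the per-channel mean and standard deviation of the style features serve as the style code and the normalized content features serve as the content code. The sketch below shows only the AdaIN operation on feature tensors; it is a representative example, not necessarily the exact formulation used in [2, 3].

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization on (N, C, H, W) feature maps."""
    # Per-channel statistics over the spatial dimensions.
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # Strip the content's own style, then re-dress it in the target style.
    return s_std * (content_feat - c_mean) / c_std + s_mean

# Toy usage: in a real pipeline these would be VGG feature maps of the
# content face and the style exemplar, and a trained decoder would map
# the result back to pixels.
content = torch.randn(1, 512, 32, 32)
style = torch.randn(1, 512, 32, 32)
print(adain(content, style).shape)  # torch.Size([1, 512, 32, 32])
```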
Cycle-consistent adversarial network
The cycle-consistent adversarial network approach trains with adversarial losses plus cycle-reconstruction constraints, so a stylization effect can be achieved without paired training samples. It is often used together with style transfer, i.e., the style code and the content code are extracted separately. This kind of face stylization can also model and deform the facial structure according to the target style attributes (for example, based on facial landmarks). However, because the intermediate results of the cycle (such as B in A→B→A) are not directly constrained, the final generation effect can be uncontrollable and unstable; that is, the plausibility of A→B itself cannot be guaranteed [4].
Sample images from [4]
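The following PyTorch sketch shows the loss structure that produces both the strength and the weakness described above. Tiny stand-in networks replace the real generators and discriminator, so the code runs but is purely illustrative.

```python
import torch
import torch.nn as nn

# Tiny stand-in networks so the sketch runs end to end; real CycleGANs
# use deep ResNet generators and PatchGAN discriminators.
G_ab = nn.Conv2d(3, 3, 3, padding=1)  # real face (A) -> stylized face (B)
G_ba = nn.Conv2d(3, 3, 3, padding=1)  # stylized face (B) -> real face (A)
D_b = nn.Sequential(nn.Conv2d(3, 1, 4, 2, 1), nn.AdaptiveAvgPool2d(1), nn.Flatten())

l1, bce = nn.L1Loss(), nn.BCEWithLogitsLoss()

real_a = torch.rand(4, 3, 64, 64)  # an unpaired batch of real faces

fake_b = G_ab(real_a)  # A -> B: the stylized result we actually care about
rec_a = G_ba(fake_b)   # A -> B -> A: the cycle reconstruction

# Adversarial loss: fake_b should look like domain B to the discriminator.
adv_loss = bce(D_b(fake_b), torch.ones(4, 1))
# Cycle-consistency loss: only the round trip A->B->A is constrained, which
# is exactly why the intermediate A->B result itself can drift (the
# instability noted above).
cycle_loss = l1(rec_a, real_a)

loss = adv_loss + 10.0 * cycle_loss  # 10.0 is the customary cycle weight
loss.backward()
```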
Latent variable mapping
Latent variable mapping generally starts from a pre-trained real-face generation model and fine-tunes it on a set of style pictures toward the target style, yielding a corresponding face-stylization generation model [5, 6]. An encoding network, or alternatively multi-step optimization, maps the input face image to the latent variable corresponding to that image, and the latent is then fed into the face-stylization model to obtain the stylized counterpart of the input. The optimization-based variant often achieves better results but requires heavy computation in practice. Moreover, although the mapped latent captures the global information of the face, it easily loses the fine details of the original input, so the generated result may fail to reflect identifying personal features and detailed expressions.
Sample images from [5] (from https://toonify.photos/)
Sample images from [6]
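Below is a minimal sketch of the optimization-based inversion route described above. Toy stand-in generators replace the pre-trained real-face model and its style-fine-tuned copy, and a plain MSE loss replaces the perceptual losses used in practice; all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in generators; in practice these would be a pre-trained high-resolution
# face GAN and a copy of it fine-tuned on the style images, sharing one latent space.
latent_dim = 64
G_real = nn.Sequential(nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh(), nn.Unflatten(1, (3, 32, 32)))
G_style = nn.Sequential(nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh(), nn.Unflatten(1, (3, 32, 32)))
for p in list(G_real.parameters()) + list(G_style.parameters()):
    p.requires_grad_(False)  # both generators stay frozen during inversion

target = torch.rand(1, 3, 32, 32)  # the user's photo

# Multi-step optimization: find the latent whose reconstruction through the
# *real-face* generator matches the photo.
w = torch.zeros(1, latent_dim, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = F.mse_loss(G_real(w), target)
    loss.backward()
    opt.step()

# Feed the same latent to the fine-tuned stylized generator.
stylized = G_style(w)
```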
Alibaba Cloud Video Cloud's self-developed Cartoon Smart Drawing Avatar
In 2020, Alibaba Cloud Video Cloud's self-developed Avatar was born and drew industry attention. At the Apsara (Yunqi) Conference in October 2021, the Cartoon Smart Drawing project appeared at the Alibaba Cloud developer booth; nearly 2,000 attendees rushed to experience it, making it a hit of the conference.
Alibaba Cloud's Cartoon Smart Drawing adopts the latent-variable-mapping technical route: for an input face image, it discovers the salient features (eye size, nose shape, and so on) and automatically generates a virtual image with those personal characteristics, i.e., the stylized result.
First, a model that can generate high-definition face pictures is trained in an unsupervised manner on our own massive, copyrighted high-definition face dataset. This "real-face simulator" generates, under the control of latent variables, a large number of high-definition face pictures with different facial features. A small set of collected target-style pictures (which need not correspond one-to-one with real faces) is then used to fine-tune the model into a "stylized simulator". The real-face simulator and the stylized simulator share latent variables; that is, one latent variable maps to a pair consisting of a "pseudo" face picture and its corresponding stylized picture.
By sampling a large number of latent variables, we can obtain a large number of data pairs covering different face attributes (gender, age, expression, hairstyle, glasses, etc.), which are then used to train the image-translation network; a sketch of this pair-synthesis step follows.
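A minimal sketch of the pair synthesis, again with toy stand-in simulators (the production system uses the high-resolution generators trained as described above):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two simulators; they share one latent space.
latent_dim = 64
real_sim = nn.Sequential(nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh(), nn.Unflatten(1, (3, 32, 32)))
style_sim = nn.Sequential(nn.Linear(latent_dim, 3 * 32 * 32), nn.Tanh(), nn.Unflatten(1, (3, 32, 32)))

pairs = []
with torch.no_grad():
    for _ in range(1000):                # sample many latent variables
        z = torch.randn(1, latent_dim)   # one shared latent variable
        face = real_sim(z)               # "pseudo" real face
        stylized = style_sim(z)          # its corresponding stylized picture
        pairs.append((face, stylized))

# `pairs` now forms a synthetic paired dataset for supervising the
# image-translation network.
```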
Model design
Based on the innate structure of the face (eyes, nose, etc.) and the structural differences between real faces and stylized virtual images (the eyes of a cartoon character, for instance, are often large and round), a local-region correlation module is added to the network (that is, the features of the real person's eyes and the avatar's eyes are expected to keep a definite correspondence), along with face-reconstruction constraints. The avatars generated by the trained network are thus both vivid and cute while retaining personal characteristics; a loss sketch is given below.
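Since the actual module is proprietary, the following is only a hypothetical sketch of how a reconstruction constraint and a local-region correlation term might be combined. The stand-in networks, the fixed eye crop, and the loss weight are all illustrative assumptions (a real system would crop regions using detected facial landmarks).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

translator = nn.Conv2d(3, 3, 3, padding=1)   # stand-in image-translation net
feat = nn.Conv2d(3, 8, 3, padding=1)         # stand-in feature extractor

face, style_gt = torch.rand(2, 4, 3, 64, 64)  # a synthetic (face, stylized) pair
pred = translator(face)

# Face-reconstruction constraint: the translated image should match the
# paired stylized ground truth.
recon_loss = F.l1_loss(pred, style_gt)

# Local-region correlation: crop corresponding regions (here a fixed "eye"
# box for illustration) and encourage their features to stay correlated,
# so e.g. the avatar's eyes keep a correspondence with the real eyes.
eye_box = (slice(None), slice(None), slice(16, 32), slice(12, 52))
f_real, f_pred = feat(face)[eye_box], feat(pred)[eye_box]
corr = F.cosine_similarity(f_real.flatten(1), f_pred.flatten(1)).mean()
local_loss = 1.0 - corr

loss = recon_loss + 0.5 * local_loss  # 0.5 is an illustrative weight
loss.backward()
```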
Sample results:
Avatar's future
Thanks to the rapid development of AI technology, we now have the means to produce virtual humans, but I believe all this is just the beginning. In the foreseeable future, Avatars will be the digital embodiments of the metaverse's residents, appearing more and more frequently in the virtual world, and they will become an extremely important class of digital assets there.
Finally, quoting Zuckerberg's description: "The defining quality of the metaverse is presence, which is this feeling that you're really there with another person or in another place," Mr. Zuckerberg told analysts in July. "Creation, avatars, and digital objects are going to be central to how we express ourselves, and this is going to lead to entirely new experiences and economic opportunities."
References:
[1] Aneta Texler, Ondřej Texler, Michal Kučera, Menglei Chai, and Daniel Sýkora. FaceBlit: Instant Real-time Example-based Style Transfer to Facial Videos. In Proceedings of the ACM on Computer Graphics and Interactive Techniques, 4(1), 2021.
[2] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A Neural Algorithm of Artistic Style. Journal of Vision, Vol. 16, 326, September 2016.
[3] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A Learned Representation for Artistic Style. In International Conference on Learning Representations, 2017.
[4] Kaidi Cao, Jing Liao, and Lu Yuan. CariGANs: Unpaired Photo-to-Caricature Translation. ACM Transactions on Graphics (SIGGRAPH Asia 2018).
[5] Justin N. M. Pinkney and Doron Adler. Resolution Dependent GAN Interpolation for Controllable Image Synthesis Between Domains. In NeurIPS 2020 Workshop.
[6] Guoxian Song, Linjie Luo, Jing Liu, Wan-Chun Ma, Chunpong Lai, Chuanxia Zheng, and Tat-Jen Cham. AgileGAN: Stylizing Portraits by Inversion-Consistent Transfer Learning. ACM Transactions on Graphics (SIGGRAPH 2021).
"Video Cloud Technology" Your most noteworthy audio and video technology public account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges and exchanges with first-class engineers in the audio and video field. The official account backstage reply [Technology] You can join the Alibaba Cloud Video Cloud Product Technology Exchange Group, discuss audio and video technologies with industry leaders, and get more industry latest information.