Abstract: The example-based methods introduced in this article alleviate problems such as loss of detail and failed face stylization, and produce high-quality style transfer images.

This article is shared from the Huawei Cloud Community post "Example-based Style Transfer", by the author "lemon grapefruit tea with ice".

Although neural-network-based style transfer methods can generate impressive stylized images, most current methods only learn coarse properties such as color distribution and overall structure; they handle local texture details poorly, and learning those details often introduces distortion and deformation. The example-based methods introduced in this article alleviate these problems and produce high-quality style transfer images.

Style transfer refers to preserving the content of an image while converting it to a target style. For example, in the figure below, the first row shows images in various target styles, and the second row shows the corresponding results after style conversion with the content preserved:

Note: Style refers to changes in the color and texture of an image. Some papers consider content to be a kind of style as well.

Foreword:

Most current style transfer methods are built on GANs (Generative Adversarial Networks), optimized with AdaIN (Adaptive Instance Normalization) plus a perceptual (content) loss computed with a VGG network. Classic approaches such as Pix2Pix and CycleGAN use paired data or a cycle-consistency loss to perform image translation tasks. Although neural-network-based style transfer can generate impressive stylized images, most current methods only learn coarse properties such as color distribution and overall structure; they handle local texture details poorly, and learning those details often introduces distortion and deformation. Recently, many methods have attempted to stylize human faces (U-GAT-IT, StyleGAN, and others), but these neural-network-based methods still do not perform well on styles such as oil painting and watercolor.

Below, we first introduce two effective neural-network-based style transfer methods: U-GAT-IT works better for anime-style faces, while whitebox works better for landscape images.

U-GAT-IT: UNSUPERVISED GENERATIVE ATTENTIONAL NETWORKS WITH ADAPTIVE LAYER-INSTANCE NORMALIZATION FOR IMAGE-TO-IMAGE TRANSLATION

U-GAT-IT targets image-to-image translation tasks with large shape deformation, such as real faces to anime style. The authors introduce an attention module into both the generator and the discriminator so that the model focuses on semantically important regions and ignores minor ones. They also combine Instance Normalization and Layer Normalization into Adaptive Layer-Instance Normalization (AdaLIN), which helps the attention module better control changes in shape and texture.

The overall model, shown in the figure, contains two generators G_{s->t} and G_{t->s} and two discriminators D_s and D_t. The structure diagram above shows G_{s->t} and D_t, i.e., source to target (real photo to anime); G_{t->s} and D_s work in the opposite direction.

The generator works as follows: unpaired data is fed into the generator, which extracts k feature maps via downsampling and residual blocks. An auxiliary classifier learns weights w for these k features (similar to CAM, using global average pooling and global max pooling), giving the attention feature map a = w * s. The attention feature map is then passed through fully connected layers to produce the γ and β parameters; the normalized feature map is obtained with the AdaLIN function proposed in the paper, and feeding it into the decoder yields the translated image.
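To make AdaLIN concrete, here is a minimal PyTorch sketch of the idea (an illustrative re-implementation, not the authors' code): instance-normalized and layer-normalized features are blended with a learnable ratio ρ kept in [0, 1], then modulated by the γ and β predicted from the attention feature map.

```python
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Minimal sketch of AdaLIN: a learnable blend of instance-normalized and
    layer-normalized features, modulated by gamma/beta from the attention map."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        # rho starts biased toward instance normalization and is kept in [0, 1]
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance-norm statistics: per sample, per channel
        in_mean = x.mean(dim=[2, 3], keepdim=True)
        in_var = x.var(dim=[2, 3], keepdim=True)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)
        # Layer-norm statistics: per sample, over all channels and positions
        ln_mean = x.mean(dim=[1, 2, 3], keepdim=True)
        ln_var = x.var(dim=[1, 2, 3], keepdim=True)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)
        rho = self.rho.clamp(0, 1)
        out = rho * x_in + (1 - rho) * x_ln
        # gamma/beta have shape (N, C), predicted by the fully connected layers
        return out * gamma.unsqueeze(2).unsqueeze(3) + beta.unsqueeze(2).unsqueeze(3)
```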

The discriminator is a binary classification network that produces an adversarial loss, constraining the generated images to match the distribution of the training data.

In practice, U-GAT-IT trains slowly. Although it generates some good anime-style images, because it does not use information such as facial landmarks, some of the generated images are exaggerated and do not meet the standard required for industrial applications.

whitebox: Learning to Cartoonize Using White-box Cartoon Representations
Suitable styles: real photo -> realistic target style
Unsuitable styles: abstract styles such as oil painting
Main contribution: three representations (the surface representation, the structure representation, and the texture representation) that simulate human painting behavior and form the corresponding loss functions.

The network structure is shown above; it is relatively simple, and the main contribution lies in the various losses:

1. A pre-trained VGG network extracts high- and low-level features to form the structure loss and content loss;
2. The surface representation simulates the abstract, watercolor-like surface of a painting (obtained through a filter);
3. The texture representation is similar to a sketch style and is generated by a random color-shift algorithm (see the sketch after this list);
4. The structure representation is obtained by KMeans clustering, giving a structured distribution of color patches.
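As an illustration of the texture representation in point 3, here is a minimal sketch of a random color-shift operation; the exact weighting scheme in the paper may differ, and the sampling range below is an assumption:

```python
import numpy as np

def random_color_shift(img, weight_range=(-1.0, 1.0)):
    """Sketch of a random color shift: mix the RGB channels with random
    weights so the result keeps luminance/texture but has no stable colors."""
    img = img.astype(np.float32)
    w = np.random.uniform(*weight_range, size=3)
    # Avoid a degenerate all-zero mix
    denom = np.abs(w).sum() + 1e-6
    gray = (w[0] * img[..., 0] + w[1] * img[..., 1] + w[2] * img[..., 2]) / denom
    # Replicate the single channel so the output still has three channels
    return np.stack([gray] * 3, axis=-1)
```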

Summary: This method produces good results for Hayao Miyazaki-like Japanese animation styles, but converting people or other styles leads to loss of detail. Still, the losses built from the various representations that simulate how human painters work are a valuable reference.

Face style transfer based on example synthesis

The style transfer methods introduced above can actually be grouped into one category: they all use neural networks to learn a style from a large amount of data in that style. This works well for landscape images or faces with little detail (anime), but for style images rich in detail, such methods mostly learn only the color distribution, and the generated results lose a lot of local detail. Face stylization is especially problematic: using attention alone (U-GAT-IT) still produces many failed results. Although one paper (Landmark Assisted CycleGAN for Cartoon Face Generation) uses facial landmark constraints to achieve more stable conversion, these methods still fall short for more complex style images. Example-based style transfer can alleviate these problems (loss of detail, failed face stylization, and so on).

Example-Based Synthesis of Stylized Facial Animations
Effect comparison:

The second column in the figure above shows neural-network-based stylization methods, including the whitebox method introduced above; they try to smooth the local texture of the converted image to imitate a painted look, but the results are unsatisfactory for style images with rich texture.

In this paper's method, given a richly textured facial style image S and a video frame T as input, a stylized face video frame O is produced.

The idea of the paper is quite simple: from the input style image and the key frame to be converted, a set of guide maps (Gseg, Gpos, Gapp, Gtemp) is computed, and the output is synthesized with the guided synthesis algorithm proposed by Fišer et al. in 2016 (StyLit: Illumination-Guided Example-Based Stylization of 3D Renderings). The core of the paper is the construction and meaning of the various guide maps.

Segmentation guide (Gseg):
Purpose: because different regions of the style image use different brush strokes, the face is divided into regions such as eyes, hair, eyebrows, nose, and mouth. The segmentation is obtained as follows:

To briefly explain the figure above: to obtain the three-part segmentation b of the original face a, the authors first compute a rough portrait segmentation c, then use the facial landmarks (the chin/jawline points) to obtain a closed mask e. To get the skin region f, a statistical skin-color model filters the pixels belonging to skin, giving image h. Finally, to segment the remaining facial regions (eyes, etc.), the facial landmarks are used again; to be robust to inaccurate landmarks, the relevant regions are blurred, producing the final image i.
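As a rough illustration of step e (building a closed mask from the jawline landmarks), here is a minimal OpenCV sketch; the exact construction in the paper may differ, and `chin_points` is simply assumed to be the detected jawline landmark coordinates:

```python
import cv2
import numpy as np

def chin_landmark_mask(image_shape, chin_points):
    """Sketch: fill the convex hull of the jawline landmarks to get a closed
    face mask. How the paper closes off the top of the head is not reproduced."""
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    pts = np.asarray(chin_points, dtype=np.int32)
    hull = cv2.convexHull(pts)
    cv2.fillConvexPoly(mask, hull, 255)
    return mask
```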

Positional guide (Gpos):
Pixel coordinates are normalized to [0, 1]; then, using the facial landmarks of the source and target images, a "moving least squares" deformation is applied, yielding a map of the target image's coordinates deformed relative to the source image.

Appearance guide (Gapp):
Convert to grayscale, adjust the contrast, and so on.

Temporal guide (Gtemp):
Based on earlier findings that hand-drawn sequences are temporally consistent in the low-frequency regions, the temporal guide is obtained from a blurred version of the style reference image S and the stylized output O synthesized for the previous frame.

Synthesis
After obtaining the guide maps above, the authors use the StyLit method to synthesize the stylized image. In addition, the eye and mouth regions are synthesized with an extra mask (d) that is stricter than the Gseg above, as shown below:

Results: the first row shows the style reference images, and the second row shows the converted images.

Real-Time Patch-Based Stylization of Portraits Using Generative Adversarial Network

The method described above requires a lot of preparation to synthesize a stylized image: four guide maps (Gseg, Gpos, Gapp, Gtemp) must be generated, and any failed step (landmark detection failure, segmentation failure, etc.) makes the whole conversion fail. Still, it is undeniable that the example-based synthesis described above produces stylized images of extremely high quality. The authors of this paper propose combining that method with a GAN to generate high-quality stylized images, and use the GPU to achieve faster inference.

The method is also very simple: use the approach above to generate high-quality stylized training data, then train with the usual adversarial loss, a color loss (the L1 distance between the reference stylized image and the converted image), and a perceptual loss (a pre-trained VGG extracts features from the reference stylized image and the converted image, and the L2 distance between them is taken). A sketch of this loss combination is shown below.
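Here is a rough PyTorch sketch of that loss combination; the loss weights, the VGG layer used for the perceptual term, and the adversarial formulation are assumptions for illustration, not the paper's exact settings:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class StylizationLoss(nn.Module):
    """Sketch: adversarial loss + L1 color loss against the example-based
    ground truth + L2 perceptual loss on pre-trained VGG features."""
    def __init__(self, color_weight=1.0, percep_weight=1.0):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:16]
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()
        self.adv = nn.BCEWithLogitsLoss()
        self.l1 = nn.L1Loss()
        self.l2 = nn.MSELoss()
        self.color_weight = color_weight
        self.percep_weight = percep_weight

    def forward(self, fake_logits, generated, target):
        # Generator tries to make the discriminator output "real"
        adv_loss = self.adv(fake_logits, torch.ones_like(fake_logits))
        color_loss = self.l1(generated, target)
        percep_loss = self.l2(self.vgg(generated), self.vgg(target))
        return adv_loss + self.color_weight * color_loss + self.percep_weight * percep_loss
```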

For the network structure, the authors add residual connections and residual blocks on top of prior work, as shown in the following figure:

Summary: this paper has no particularly novel techniques, but it offers a useful idea: use a better but slower method to generate a large, high-quality training set, then train the architecture above (GAN plus common style transfer losses) to obtain a style transfer method that is both better and faster.

FaceBlit: Instant Real-time Example-based Style Transfer to Facial Videos
This paper is also an example-based style transfer method. The authors compare the two papers above: the first, Example-Based Synthesis of Stylized Facial Animations, produces very high quality but requires too much preparation before synthesis, taking tens of seconds to stylize a single image; the second, Real-Time Patch-Based Stylization of Portraits Using Generative Adversarial Network, is fast on a GPU but requires long training and heavy resource consumption. Below is the authors' comparison with those two results:

Here, a is the style image, b and c are the results of the two papers above, and d is the original image.

The authors first improve the guide maps used in the first paper. That paper used four guides (segmentation, appearance, positional, and temporal); this paper compresses them into two, the positional guide Gpos and the appearance guide Gapp, and changes the algorithms that generate them so that the guides can be produced in tens of milliseconds.

Positional Guide
The first step is to obtain facial landmarks. For the style image, they are generated in advance with a pre-trained detector; for the real face image, the authors halve the resolution before running the face detector to speed up detection (the impact on accuracy is negligible). After obtaining the landmarks, the coordinate information is embedded into the three RGB channels: R stores the x coordinate and G stores y. The "moving least squares" method from the first paper is then used to compute the deformation from the source image's landmarks to the style image's landmarks. The remaining B channel stores the segmentation mask; the style image's segmentation can be generated in advance, and the source image's mask is generated using the following method:

Simply put, the landmarks are connected and an elliptical region is drawn to obtain a partial face area; the boundary of this area is then expanded by analyzing the skin color distribution, which yields the mask of the whole face.
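To illustrate the channel layout described above, here is a minimal sketch of a positional guide in which R/G store normalized pixel coordinates and B stores the face mask; the moving-least-squares warp toward the style exemplar's landmarks is deliberately omitted, and the 0-255 normalization is an assumption:

```python
import numpy as np

def make_positional_guide(height, width, face_mask):
    """Sketch of a FaceBlit-style positional guide:
    R = x coordinate, G = y coordinate, B = face segmentation mask."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    guide = np.zeros((height, width, 3), dtype=np.uint8)
    guide[..., 0] = (xs / (width - 1) * 255).astype(np.uint8)   # R: x
    guide[..., 1] = (ys / (height - 1) * 255).astype(np.uint8)  # G: y
    guide[..., 2] = (face_mask > 0).astype(np.uint8) * 255      # B: segmentation
    return guide
```

The same encoding would be computed for the style exemplar, so that matching pixels in guide space roughly corresponds to matching facial positions.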

Appearance Guide
First, the source image is converted to grayscale, and a Gaussian-blurred copy of the grayscale image is subtracted from it.
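A minimal OpenCV sketch of this appearance guide follows; the blur strength is an assumed value, not the paper's setting:

```python
import cv2

def appearance_guide(image_bgr, blur_sigma=9.0):
    """Sketch: grayscale minus its Gaussian-blurred copy keeps the
    high-frequency detail, then the result is rescaled for display."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(float)
    blurred = cv2.GaussianBlur(gray, (0, 0), blur_sigma)
    # Shift the signed difference back into the displayable 0-255 range
    guide = cv2.normalize(gray - blurred, None, 0, 255, cv2.NORM_MINMAX)
    return guide.astype("uint8")
```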

After obtaining the two guide images above, the authors construct a lookup table (using the formula given in the paper) that records the distances between pairs of coordinates.

This table is then fed into StyleBlit (a fast example-based synthesis algorithm, described below).

StyleBlit: Fast Example-Based Stylization with Local Guidance
The papers above mainly focus on constructing various guide maps, or on training existing architectures on high-quality datasets; their core stylization/synthesis algorithm is similar to the method in this paper. The method introduced here is currently the best (depending on the guide maps provided) and fastest example-based, i.e., guide-map-based, stylization synthesis method. As for how fast: according to the paper, a single-core CPU can process one-megapixel images at 10 frames per second, and an ordinary GPU can process 4K ultra-high-definition video at 100 frames per second.

Here we only briefly explain the idea of the paper; a detailed walkthrough from StyLit to StyleBlit will follow later. The idea of StyleBlit is not complicated. Suppose we have two images or two objects (2D or 3D) in different styles; the core idea of the algorithm is to paste pixel values from the source image onto the target image (via a patch-like copy). Suppose the source image's pixels are {p1, p2, ..., pn} and the target image's pixels are {q1, q2, ..., qn}; we additionally have a table whose entries are {p, q, error}, similar to the table of source/target pixel coordinates above, together with the error between the two.

Our goal is to stylize (render) the target image, so we traverse all target pixels {q1, q2, ..., qn}; for each, we find the closest pixel in the source image, compute the error between the two, and if it is below a certain threshold, copy the source pixel's color to the target image. A brute-force sketch of this basic loop is given below.
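The following Python sketch illustrates only that basic loop, matching purely in guide space; it is not the paper's optimized algorithm, and the hierarchical seed scheme described next (as well as every other speed-up that makes StyleBlit real-time) is omitted:

```python
import numpy as np

def naive_guided_transfer(style_img, style_guide, target_guide, threshold=20.0):
    """Brute-force sketch: for every target pixel, find the style pixel with
    the closest guide value and copy its color if the error is small enough."""
    h, w = target_guide.shape[:2]
    sh, sw = style_guide.shape[:2]
    style_guide_flat = style_guide.reshape(-1, style_guide.shape[-1]).astype(np.float32)
    out = np.zeros((h, w, 3), dtype=style_img.dtype)
    for y in range(h):
        for x in range(w):
            g = target_guide[y, x].astype(np.float32)
            # Guide-space distance to every style pixel (O(n) per target pixel)
            err = np.linalg.norm(style_guide_flat - g, axis=1)
            best = int(err.argmin())
            if err[best] < threshold:
                out[y, x] = style_img[best // sw, best % sw]
    return out
```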

There are of course some details. For example, the authors organize the pixels into a hierarchy of levels and compute the error from the top level downward to find the best-matching source pixels (in the paper's figure, the red points are target pixels, the black and blue points are source pixels, and the blue points are the nearest pixels at three levels).

Pseudocode for the final algorithm is given in the paper.

Results: the style reference object is a sphere, and the target object is a humanoid model.

Conclusion: as long as suitable guide maps are available, this method can quickly and efficiently generate high-quality target images, whether for 2D-to-3D transfer, face style conversion, or other tasks. Judging from the experimental results, it is better than most neural-network-based style transfer methods.

Note:
Image translation: some researchers consider image translation a broader concept than style transfer. Day-to-night conversion, line-art colorization, spring to winter, horse to zebra, 2D to 3D conversion, super-resolution reconstruction, image inpainting, stylization, and so on all belong to image-to-image translation tasks. Overall it can be summarized as converting an input image into a target image, where both conform to their specific data distributions. This article mainly covers some recent style transfer papers.

AdaIN: the idea of AdaIN is to extract content and style information from the feature maps that a pre-trained VGG network outputs for an image, and to separate the two. Normalizing the original feature map (subtracting the mean and dividing by the standard deviation) removes its style; de-normalizing it with the mean and standard deviation extracted from the style image's feature map completes the style transfer.
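For reference, the standard AdaIN formula from the original paper, where μ and σ are the per-channel mean and standard deviation of the content features x and the style features y, is:

$$
\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)
$$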
