Abstract: The example-based methods introduced in this article can alleviate the problems of detail loss and face stylization failure, and produce high-quality style transfer images.

This article is from the Huawei Cloud community post "Example-based Style Transfer", by: lemon grapefruit tea and ice.

Although neural-network-based style transfer methods can generate impressive stylized images, most current methods only learn an approximate color distribution and overall structure; they do not learn local texture details well, and they often introduce distortion and deformation. The example-based methods introduced in this article can alleviate these problems and produce high-quality stylized images.

Note: Here, style refers to properties such as the color and texture of an image. Some papers treat content itself as a form of style.

Preface

At present, most style transfer methods build on GANs (generative adversarial networks), optimized with AdaIN (Adaptive Instance Normalization) and a content loss (perceptual loss) computed with a VGG network. There are also the classic pix2pix, CycleGAN, etc., which use paired data or a cycle-consistency loss to perform image translation tasks. Although neural-network-based style transfer methods generate impressive stylized images, most current methods only learn an approximate color distribution and overall structure; they do not learn local texture details well, and they often introduce distortion and deformation. In particular, we have recently tried many methods for face stylization, including U-GAT-IT, StyleGAN, etc., and these neural-network-based methods do not work well for styles such as oil painting and watercolor.

Here are two effective neural-network-based style transfer methods. U-GAT-IT works well for converting faces to an anime style, and White-box works well for landscape images.

U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation

U-GAT-IT is suited to image-to-image translation tasks that require large deformation, such as converting real faces to an anime style. The authors add attention modules to both the generator and the discriminator, so the model focuses on semantically important regions and ignores minor ones. In addition, Adaptive Layer-Instance Normalization (AdaLIN) is proposed by combining instance normalization and layer normalization; the AdaLIN function helps the attention module better control changes in shape and texture.

The entire model structure is shown in the figure. It contains two generators, G_{s→t} and G_{t→s}, and two discriminators, D_s and D_t. The diagram above shows the structure of G_{s→t} and D_t, which handle the source-to-target direction (real face to anime); G_{t→s} and D_s handle the reverse direction.

The generator works as follows: the unpaired input is passed through downsampling and residual blocks to extract K feature maps E, and an auxiliary classifier learns a weight w for each of these K features (similar to CAM, w is obtained via global average pooling and global max pooling). The attention feature map a = w ∗ E is then computed. This feature map is fed into fully connected layers to obtain the per-channel scale and shift (γ and β), the normalized feature map is produced by the AdaLIN function proposed in the paper, and the decoder finally generates the translated image.
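To make the AdaLIN step more concrete, here is a minimal PyTorch-style sketch of the idea, written from my own reading of the U-GAT-IT formulation rather than taken from the authors' code; the class name, the initial value of ρ, and the expected shapes of γ and β are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Sketch of Adaptive Layer-Instance Normalization (AdaLIN).
    gamma/beta are predicted from the attention feature map by fully
    connected layers (as described above); rho is a learnable mixing ratio
    between Instance Norm and Layer Norm, clipped to [0, 1]."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance Norm statistics: per sample, per channel
        in_mean = x.mean(dim=[2, 3], keepdim=True)
        in_var = x.var(dim=[2, 3], keepdim=True)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)
        # Layer Norm statistics: per sample, over all channels and pixels
        ln_mean = x.mean(dim=[1, 2, 3], keepdim=True)
        ln_var = x.var(dim=[1, 2, 3], keepdim=True)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)
        rho = self.rho.clamp(0, 1)
        out = rho * x_in + (1 - rho) * x_ln
        # gamma/beta assumed to have shape (N, C); broadcast over H and W
        return out * gamma.unsqueeze(2).unsqueeze(3) + beta.unsqueeze(2).unsqueeze(3)
```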

The discriminator is a binary classification network that produces an adversarial loss, constraining the generated images to match the distribution of the training data.

In practice, U-GAT-IT trains relatively slowly. Although it generates some good anime-style images, the method does not make use of facial landmarks or similar information, so some generated images show exaggerated deformation of the face and cannot meet industrial application standards.

White-box: Learning to Cartoonize Using White-box Cartoon Representations

Suitable style: real photos → cartoon. Three representations of human painting behavior (the surface representation, the structure representation, and the texture representation) are simulated to construct the corresponding loss functions.

The network structure is simple, as shown above; the main components are the various losses:

1. High-level and low-level features are extracted from a pre-trained VGG network to form the structure loss and the content loss;

2. The surface representation imitates the smooth, watercolor-like abstraction of a painting (obtained through a filter);

3. The texture representation is similar to a sketch and is generated by a color-shift algorithm;

4. The structure representation is obtained via K-means clustering, yielding a structured distribution of flat color blocks (a rough sketch of this idea follows the list).
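To illustrate the color-block idea from item 4, here is a minimal sketch assuming an RGB image stored as a NumPy array; the function name and parameter values are my own, and the actual paper pipeline is more involved than a single clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def structure_representation(img, n_colors=8):
    """Quantize the image into a small set of flat color blocks via K-means,
    as the list above describes; only the color-block idea is shown here."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(pixels)
    # Replace every pixel with the center of its cluster -> flat color blocks
    quantized = km.cluster_centers_[km.labels_]
    return quantized.reshape(h, w, c).astype(img.dtype)
```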

Conclusion: this method produces good results for Miyazaki-style and similar Japanese animation looks, but for faces and other style conversions it loses detail. The losses built from the various representations proposed here, which mimic how human painters work, are a useful reference.

Face style transfer based on example synthesis

In fact, all the style transfer methods introduced above fall into one category: they use a neural network to learn a style from a large amount of data in a similar style. This works reasonably well for landscape images or faces with little detail (anime style). However, for style exemplars with rich information, these methods only learn an approximate color distribution, so a large amount of local detail is lost. Face stylization on its own (e.g. U-GAT-IT) still produces many failed conversions, and although Landmark Assisted CycleGAN for Cartoon Face Generation uses facial landmark constraints to achieve relatively stable conversions, these methods are still inadequate for complex styles. Here, example-based style transfer can alleviate these problems (loss of detail, face stylization failure, etc.).

Example-Based Synthesis of Stylized Facial Animations

The images in the second column above come from neural-network-based stylization methods, including the White-box method described above, which tend to smooth out the local texture of the converted image to achieve a painterly effect. As a result, for style exemplars with rich textures these methods are not satisfactory.

In this paper, a stylized face video frame O is produced from a textured face style exemplar S and a video frame T.

The method itself is fairly simple: a series of guide maps (Gseg, Gpos, Gapp, Gtemp) is built from the input style exemplar and the key frame to be converted, and then StyLit (Illumination-Guided Example-Based Stylization of 3D Renderings, proposed by Fišer et al. in 2016) performs the synthesis. The emphasis of the paper is on how each guide map is constructed and what it is for.

Segmentation guide (Gseg): because different areas of the style exemplar use different brush strokes, the face is divided into eyes, hair, eyebrows, nose, mouth, and other regions. The procedure for obtaining this map is shown below:

To briefly explain the figure above: to obtain the trimap B of the original face A, the author first obtains a rough head segmentation C, then uses the facial key points (the chin points) to obtain a closed mask E. To obtain the skin region F, a statistical skin-color model is used to select the pixels belonging to skin, giving image H. Finally, to segment the remaining facial regions (eyes and so on), the facial key points are used again; to tolerate inaccurate key points, the relevant regions are blurred, producing the final image I.
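A highly simplified sketch of the landmark-based part of this pipeline (not the authors' implementation) might look like the following. It assumes 68-point dlib-style landmarks are already available; the landmark index ranges and the blur kernel size are illustrative assumptions.

```python
import cv2
import numpy as np

def region_mask(img_shape, landmark_points, blur_ksize=31):
    """Fill the polygon spanned by a group of facial landmarks and blur it,
    so that small landmark errors do not produce hard, wrong boundaries."""
    mask = np.zeros(img_shape[:2], dtype=np.uint8)
    pts = np.array(landmark_points, dtype=np.int32)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)
    # Blurring the region is the trick mentioned above for tolerating
    # inaccurate key points.
    return cv2.GaussianBlur(mask, (blur_ksize, blur_ksize), 0)

# Example usage with 68-point landmarks (index ranges follow the common
# dlib convention and are an assumption here):
# left_eye  = region_mask(img.shape, landmarks[36:42])
# right_eye = region_mask(img.shape, landmarks[42:48])
# mouth     = region_mask(img.shape, landmarks[48:68])
```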

Positional guide (Gpos): pixel coordinates are normalized to [0, 1]; then, using the facial key points of the source and target images, a moving least squares deformation is computed to obtain the coordinate deformation of the target image relative to the source.

Appearance guide (Gapp): convert to grayscale, adjust contrast, etc.

Temporal guide (Gtemp): based on earlier findings that hand-drawn sequences are temporally consistent in low-frequency regions, the temporal guide is built from a blurred version of the style reference S and the stylized output O synthesized for the previous frame.

After the guide maps above are built, the StyLit method is used to synthesize the stylized image. In addition, the eye and mouth regions are synthesized with additional masks (d) that are stricter than the earlier Gseg, as shown below:

Results: the first row shows the style reference images, and the second row shows the converted stylized images.

Real-Time Patch-Based Stylization of Portraits Using Generative Adversarial Network

The method introduced above requires a lot of preparation to synthesize a stylized image: four guide maps (Gseg, Gpos, Gapp, Gtemp) must be generated, and failures in some steps (key point detection, segmentation, etc.) cause the style conversion to fail. It is undeniable, however, that the stylized images produced by that example-based synthesis method are of very high quality. The authors of this paper propose combining it with a GAN to generate high-quality stylized images while achieving fast inference on a GPU.

The method is also very simple: the previous method is used to generate high-quality stylized training data, and then the commonly used adversarial loss, a color loss (L1 distance between the reference stylized image and the converted image), and a perceptual loss (L2 distance between features of the reference stylized image and the converted image, extracted by a pre-trained VGG) are applied.
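As a hedged PyTorch sketch of how the color and perceptual terms might be assembled (the adversarial term would come from a separate discriminator): the VGG variant, layer cut-off, helper names, and the recent-torchvision `weights` argument are my assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VGGFeatures(nn.Module):
    """Frozen VGG16 feature extractor for the perceptual loss.
    layer_index=16 cuts roughly at relu3_3; this choice is an assumption."""
    def __init__(self, layer_index=16):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features
        self.slice = nn.Sequential(*list(vgg.children())[:layer_index]).eval()
        for p in self.slice.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.slice(x)

def stylization_losses(fake, target, vgg_features):
    """Color loss: L1 on raw pixels. Perceptual loss: L2 on VGG features."""
    color_loss = nn.functional.l1_loss(fake, target)
    perceptual_loss = nn.functional.mse_loss(vgg_features(fake), vgg_features(target))
    return color_loss, perceptual_loss
```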

For the network architecture, the authors add residual connections and residual blocks on top of earlier work, as shown in the figure below:

Summary: there is nothing especially novel in this paper overall, but it offers a useful idea: use an effective but slow method to generate a large, high-quality training set, then use the architecture above (GAN plus common style transfer losses) to obtain an effective and fast style transfer model.

FaceBlit: Instant Real-Time Example-Based Style Transfer to Facial Videos

This paper is also an example-based style transfer method. The author compares the two papers above. The first, Example-Based Synthesis of Stylized Facial Animations, produces very high quality results, but the preparation before synthesis is heavy and producing a stylized image takes tens of seconds. The second, Real-Time Patch-Based Stylization of Portraits Using Generative Adversarial Network, is fast on a GPU, but training takes a long time and consumes significant resources. Below is a comparison of the authors' results:

In the figure, (a) is the style exemplar, (b) and (c) are the results of the two papers above, and (d) is the original image.

The author builds on the first paper. Where that paper used four guide maps (segmentation guide, appearance guide, positional guide, temporal guide), the author compresses them into just two, the positional guide Gpos and the appearance guide Gapp, and changes the algorithms used to generate them, so that producing the guides takes only a few milliseconds.

Positional guide (Gpos): the first step is to obtain the facial key points. For the style exemplar, they are generated in advance with a pre-trained detector. For real face images, the author halves the resolution before feeding the image into the face detector, which speeds up detection with negligible loss of accuracy. After obtaining the key points, their coordinate information is embedded into the RGB channels: R stores x and G stores y. The moving least squares method from the first paper is then used to compute the key point deformation from the source image to the style exemplar. The remaining B channel stores the segmentation map: for the style exemplar it can be generated in advance, while the mask of the source image is generated as follows:

Put simply, a partial face region is obtained by connecting key points and drawing an elliptical region; the mask of the whole face is then obtained by analyzing the skin color distribution and expanding the boundary of this region.
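As a rough sketch of the channel layout just described (coordinates in R/G, segmentation in B), assuming the face mask has already been computed and the moving-least-squares warp is applied as a separate step; the function name and value ranges are illustrative.

```python
import numpy as np

def positional_guide(height, width, face_mask):
    """Encode normalized x into R, normalized y into G, and the face
    segmentation mask into B, as described above. The MLS warp that aligns
    these coordinates with the style exemplar's landmarks is omitted."""
    ys, xs = np.mgrid[0:height, 0:width]
    guide = np.zeros((height, width, 3), dtype=np.float32)
    guide[..., 0] = xs / (width - 1)    # R channel: x coordinate in [0, 1]
    guide[..., 1] = ys / (height - 1)   # G channel: y coordinate in [0, 1]
    guide[..., 2] = face_mask           # B channel: segmentation mask in [0, 1]
    return guide
```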

Appearance guide (Gapp): first convert the original image to grayscale, then subtract a Gaussian-blurred copy of it from the grayscale image:
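A minimal OpenCV sketch of this step, assuming a BGR input image; the kernel size is an illustrative choice, not the paper's value.

```python
import cv2

def appearance_guide(img_bgr, blur_ksize=21):
    """Appearance guide as described above: grayscale image minus its
    Gaussian-blurred copy, which keeps high-frequency appearance detail."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype("float32")
    blurred = cv2.GaussianBlur(gray, (blur_ksize, blur_ksize), 0)
    return gray - blurred
```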

After obtaining the two guide images above, the author builds a lookup table using the following formula to record the distances (errors) between coordinates.

This table is then fed into StyleBlit (a fast example-based synthesis algorithm, described below).

StyleBlit: the papers above are mainly about constructing the various guide maps, or about using an existing algorithm to produce high-quality training data; the core style transfer synthesis algorithm is essentially the one in this paper. The method introduced here is the best (given suitable guide maps) and fastest example-based, guide-driven stylization method: on a single-core CPU it processes a 1-megapixel image at 10 frames per second, and on an ordinary GPU it processes 4K ultra-HD video at 100 frames per second.

Here is a brief outline of the paper's idea; a detailed walk from StyLit to StyleBlit will come later. StyleBlit's idea is not complicated. Suppose we have two images or two objects in different styles (3D or 2D). The core idea is to paste pixel values from the source image onto the target image via some mapping (through a patch-like method). Assume the pixel set of the source image is {p1, p2, …, pn}, the pixel set of the target image is {q1, q2, …, qn}, and that we additionally have a table of {p, q, error} entries, i.e. the table of source and target pixel coordinates introduced above, together with the error between the two:

Our goal is to stylize (or render) the target image, so we traverse all target pixels {q1, q2, …, qn}; for each one we find the nearest pixel in the source image and compute the error between them. If the error is below a certain threshold, the color of the source pixel is copied to the target image.
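This core idea can be illustrated with a deliberately naive and slow sketch; the real StyleBlit gets its speed from the hierarchical, chunk-based lookup described next, so treat this only as a reading aid, not as the paper's algorithm.

```python
import numpy as np

def naive_guided_transfer(source_rgb, source_guide, target_guide, threshold):
    """Naive illustration of the guided copy idea: for every target pixel,
    find the source pixel whose guide value is closest, and copy its color
    if the guide error is below the threshold."""
    h, w = target_guide.shape[:2]
    out = np.zeros((h, w, 3), dtype=source_rgb.dtype)
    src_guides = source_guide.reshape(-1, source_guide.shape[-1]).astype(np.float32)
    src_colors = source_rgb.reshape(-1, 3)
    for y in range(h):
        for x in range(w):
            q = target_guide[y, x].astype(np.float32)
            errors = np.linalg.norm(src_guides - q, axis=1)
            best = int(np.argmin(errors))
            if errors[best] < threshold:
                out[y, x] = src_colors[best]
    return out
```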

There are some additional details. For example, the author organizes pixels into hierarchical levels and computes the error from top to bottom to find the most suitable source pixels (in the figure, the red pixel is the target pixel, the black and blue pixels are source pixels, and the blue pixels are the nearest ones at three levels):

The pseudo-code of the final algorithm is as follows:

Results: the style reference object is a sphere, and the target object is a humanoid model.

Conclusion: given a suitable guide image, the method can quickly and efficiently generate high-quality target images, for anything from 2D and 3D rendering to face style conversion. Judging from the experimental results, it outperforms most neural-network-based style transfer.

Note on image translation: some researchers consider image translation a broader concept than style transfer. Day-to-night conversion, line-drawing colorization, spring-to-winter, horse-to-zebra, 2D-to-3D conversion, super-resolution reconstruction, image inpainting, stylization, and so on all belong to the image-to-image translation task. In general, it can be summarized as transforming an input image into a target image, where both conform to their own specific data distributions. This article mainly discusses some style transfer papers I have read recently.

AdaIN: the idea of AdaIN is to take a feature map output by VGG16 and separate its content and style information. The style can be stripped from the original features by subtracting the mean and dividing by the standard deviation (normalization); style transfer is then completed by re-applying the mean and standard deviation extracted from the style image (de-normalization).
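A minimal sketch of this mean/variance swap on feature maps, assuming N×C×H×W tensors; the function name and epsilon are illustrative.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive Instance Normalization: normalize the content features with
    their own per-channel mean/std, then re-scale and re-shift them with the
    style features' per-channel mean/std, as described above."""
    c_mean = content_feat.mean(dim=[2, 3], keepdim=True)
    c_std = content_feat.std(dim=[2, 3], keepdim=True) + eps
    s_mean = style_feat.mean(dim=[2, 3], keepdim=True)
    s_std = style_feat.std(dim=[2, 3], keepdim=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean
```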
