This article was first published on my blog: Xerrors.Fun/Anime-Gan-n…

More articles are available at: xerrors.fun


Abstract

In this paper, a novel approach for transforming photos of real-world scenes into anime style images is proposed, which is a meaningful and challenging task in computer vision and artistic style transfer. The approach we propose combines neural style transfer and generative adversarial networks (GANs) to achieve this task. For this task, some existing methods have not achieved satisfactory animation results. The existing methods usually have some problems, among which the significant problems mainly include: 1) the generated images have no obvious animated style textures; 2) the generated images lose the content of the original images; 3) the parameters of the network require large memory capacity. In this paper, we propose a novel lightweight generative adversarial network, called AnimeGAN, to achieve fast animation style transfer. In addition, we further propose three novel loss functions to make the generated images have better animation visual effects. These loss functions are grayscale style loss, grayscale adversarial loss and color reconstruction loss. The proposed AnimeGAN can be easily trained end-to-end with unpaired training data. The parameters of AnimeGAN require lower memory capacity. Experimental results show that our method can rapidly transform real-world photos into high-quality anime images and outperforms state-of-the-art methods.


In short: others in the field are already working on this problem, but not doing it well; there are a few remaining problems to solve, and by proposing a new network and new loss functions, the authors achieve the best results so far.

1. Introduction

Shortcomings of existing algorithms

Existing algorithms have the following problems. These are indeed common issues in the field at the moment, and almost every paper addresses them:

These important problems mainly include: 1) the generated images have no obvious animated style textures; 2) the generated images lose the content of the original photos; 3) a large number of the parameters of the network require more memory capacity.

  1. The resulting image has no obvious animation-style textures;
  2. The generated image loses the content information of the original photo;
  3. A large number of network parameters requires more memory capacity.

What solutions do the authors propose?

To solve these problems, the authors propose a new lightweight GAN network that is smaller and faster.

The proposed AnimeGAN is a lightweight generative adversarial model with fewer network parameters and introduces Gram matrix to get more vivid style images.

The AnimeGAN proposed in this paper is a lightweight generative adversarial model with fewer network parameters, and it introduces the Gram matrix to generate images with a more vivid style. To give the generated images better visual quality, three loss functions are proposed: grayscale style loss, color reconstruction loss, and grayscale adversarial loss. In the generator network, the grayscale style loss and the color reconstruction loss make the generated image show a more obvious anime style while retaining the colors of the photo. The grayscale adversarial loss in the discriminator network makes the generated images have bright colors. The discriminator network also uses the edge-promoting adversarial loss proposed by CartoonGAN to preserve clear edges.

In addition, in order to make the generated image keep the content of the original photo, a pre-trained VGG19 is introduced as a perceptual network to compute an L1 loss between the deep perceptual features of the generated image and those of the original photo.
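As a rough illustration of this content loss (a minimal sketch, not the authors' code: it assumes torchvision's pre-trained VGG19, and the particular feature layer used, here the slice up to conv4_4, is an assumption), in PyTorch it could look like this:

```python
import torch.nn as nn
from torchvision.models import vgg19

class VGGFeatures(nn.Module):
    """Frozen VGG19 feature extractor used as the perceptual network."""
    def __init__(self, last_layer=26):  # features[:26] ends at conv4_4 (an assumption)
        super().__init__()
        self.features = vgg19(pretrained=True).features[:last_layer].eval()
        for p in self.features.parameters():
            p.requires_grad = False  # the perceptual network stays frozen

    def forward(self, x):
        return self.features(x)

def content_loss(vgg, photo, generated):
    # L1 distance between deep features of the original photo and the generated image
    return nn.functional.l1_loss(vgg(generated), vgg(photo))
```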

Similar to CartoonGAN, the pre-trained model can better extract the high-level information of images, so that the content of two images can still be compared even when their styles differ. After all, the high-level semantic information remains the same even though the styles of the two images are different. This is also why these networks do not need to train two models at the same time, as CycleGAN and DualGAN do, in order to ensure the model converges properly.

Other enhancements:

  1. An initialization training phase is performed on the generator: only the content loss Lcon(G, D) is used to pre-train the generator network G, which makes AnimeGAN’s training easier and more stable. This was proposed by CartoonGAN. (Not sure why.)
  2. The last convolution layer, with a 1×1 convolution kernel, does not use a normalization layer and is followed by a tanh nonlinear activation function; a sketch follows this list. (Not sure why.)
  3. The activation function used in each module is LReLU.
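A minimal sketch of the output layer described in point 2 (the number of input channels is an assumption):

```python
import torch.nn as nn

# last layer: 1x1 convolution, no normalization layer, followed by tanh
out_layer = nn.Sequential(
    nn.Conv2d(64, 3, kernel_size=1, stride=1, padding=0),  # 64 input channels is an assumption
    nn.Tanh(),  # maps the output image to the range [-1, 1]
)
```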

2. Our Method

The following introduces three aspects in detail: the network architecture, the loss functions, and training.

The network architecture

Before looking at AnimeGAN, take a look at CartoonGAN’s model (PyTorch implementation).

AnimeGAN is an improvement on CartoonGAN (a PyTorch implementation is also available).

The generator network can be thought of as an encoder-decoder network, made up of standard convolutions, depthwise separable convolutions, residual blocks, and up-sampling and down-sampling modules.

It can be seen from the description below that this paper borrows heavily from CartoonGAN and improves on it, avoiding the checkerboard texture. The figure below shows an experimental result from CartoonGAN’s paper, where the grid-like checkerboard texture can clearly be seen.
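As a rough sketch of two of these building blocks (not the exact AnimeGAN architecture; layer ordering, normalization, and channel counts are assumptions), a depthwise separable convolution and an interpolation-based up-sampling block could be written in PyTorch as below. Resizing first and then convolving is a common way to avoid the checkerboard artifacts that transposed convolutions tend to produce.

```python
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        # pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.LeakyReLU(0.2)  # the paper uses LReLU activations

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class UpsampleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = DepthwiseSeparableConv(in_ch, out_ch)

    def forward(self, x):
        # resize first, then convolve, instead of using a transposed convolution
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv(x)
```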

Loss function

The first part is the processing of the training data. (TODO)

The generator loss function is mainly divided into four parts. Different losses have different weight coefficients, and the author adopts 300, 1.5, 3, and 10:


$$
L(G, D)=\omega_{adv} L_{adv}(G, D)+\omega_{con} L_{con}(G, D)+\omega_{gra} L_{gra}(G, D)+\omega_{col} L_{col}(G, D)
$$
  1. Adversarial loss (adv): the adversarial loss in the generator G that drives the animation conversion process.
  2. Content loss (con): helps the generated image retain the content of the input photo.
  3. Grayscale style loss (gra): makes the generated image have a clear anime style in its textures and lines.
  4. Color reconstruction loss (col): makes the generated image keep the colors of the original photo.

For the content loss and the grayscale style loss, the pre-trained VGG19 is used as a perceptual network to extract high-level semantic features of the images. They are expressed as:


$$
L_{con}(G, D)=E_{p_i \sim S_{data}(p)}\left[\left\|VGG_l(p_i)-VGG_l(G(p_i))\right\|_1\right]
$$

$$
L_{gra}(G, D)=E_{p_i \sim S_{data}(p)}, E_{x_i \sim S_{data}(x)}\left[\left\|\operatorname{Gram}(VGG_l(G(p_i)))-\operatorname{Gram}(VGG_l(x_i))\right\|_1\right]
$$
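A minimal sketch of the Gram matrix and the grayscale style loss (here x is a grayscale anime image from the training data, and the VGG feature extractor is the same perceptual network sketched earlier):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (N, C, H, W) feature maps from the VGG perceptual network
    n, c, h, w = feat.size()
    f = feat.view(n, c, h * w)
    # channel-to-channel correlations, normalized by the feature map size
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def grayscale_style_loss(vgg, generated, anime_gray):
    # match texture/line statistics against grayscale anime images,
    # so the style is learned without copying their colors
    return F.l1_loss(gram_matrix(vgg(generated)), gram_matrix(vgg(anime_gray)))
```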

For the color loss, the author first converts the RGB channels into YUV channels, and then uses a different loss for each channel (which sounds like a good idea):


$$
L_{col}(G, D)=E_{p_i \sim S_{data}(p)}\left[\left\|Y(G(p_i))-Y(p_i)\right\|_1+\left\|U(G(p_i))-U(p_i)\right\|_H+\left\|V(G(p_i))-V(p_i)\right\|_H\right]
$$

Huber Loss is used here. What is it?
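A minimal sketch of the color reconstruction loss (the RGB-to-YUV coefficients below are the standard BT.601 ones, which is an assumption about the exact conversion used in the paper; PyTorch's smooth_l1_loss stands in for the Huber loss):

```python
import torch.nn.functional as F

def rgb_to_yuv(img):
    # img: (N, 3, H, W), values assumed to be in [0, 1]
    r, g, b = img[:, 0], img[:, 1], img[:, 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.147 * r - 0.289 * g + 0.436 * b
    v = 0.615 * r - 0.515 * g - 0.100 * b
    return y, u, v

def color_loss(photo, generated):
    yp, up, vp = rgb_to_yuv(photo)
    yg, ug, vg = rgb_to_yuv(generated)
    return (F.l1_loss(yg, yp)               # L1 on the luma channel
            + F.smooth_l1_loss(ug, up)      # Huber loss on the chroma channels
            + F.smooth_l1_loss(vg, vp))
```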

The final generator loss function L(G) can be expressed as:


$$
L(G)=\omega_{adv} E_{p_i \sim S_{data}(p)}\left[(G(p_i)-1)^2\right]+\omega_{con} L_{con}(G, D)+\omega_{gra} L_{gra}(G, D)+\omega_{col} L_{col}(G, D)
$$
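Putting the pieces together, the total generator loss can be sketched as below, reusing the loss functions from the earlier sketches; the weights are the ones quoted above (300, 1.5, 3, 10), and in code the least-squares adversarial term is applied to the discriminator's output for the generated image.

```python
import torch

W_ADV, W_CON, W_GRA, W_COL = 300.0, 1.5, 3.0, 10.0  # weights quoted above

def generator_loss(D, vgg, photo, generated, anime_gray):
    # least-squares (LSGAN-style) adversarial term on the generated images
    adv = torch.mean((D(generated) - 1.0) ** 2)
    return (W_ADV * adv
            + W_CON * content_loss(vgg, photo, generated)
            + W_GRA * grayscale_style_loss(vgg, generated, anime_gray)
            + W_COL * color_loss(photo, generated))
```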

The discriminator’s loss function not only introduces CartoonGAN’s edge-promoting adversarial loss, which pushes the images generated by AnimeGAN to have clearly reproduced edges, but also adopts a novel grayscale adversarial loss that prevents the generated images from being displayed as grayscale images. The discriminator’s loss function can be expressed as:


$$
L(D)=\omega_{adv}\left[E_{a_i \sim S_{data}(a)}\left[(D(a_i)-1)^2\right]+E_{p_i \sim S_{data}(p)}\left[(D(G(p_i)))^2\right]+E_{x_i \sim S_{data}(x)}\left[(D(x_i))^2\right]+0.1\,E_{y_i \sim S_{data}(y)}\left[(D(y_i))^2\right]\right]
$$
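A sketch of this discriminator loss in the same least-squares form; the four batches correspond to the four terms above (real anime images a, generated images G(p), grayscale anime images x for the grayscale adversarial term, and edge-smoothed anime images y for the edge-promoting term, weighted by 0.1):

```python
import torch

def discriminator_loss(D, anime, generated, anime_gray, anime_smooth, w_adv):
    real = torch.mean((D(anime) - 1.0) ** 2)   # real anime images a
    fake = torch.mean(D(generated) ** 2)        # generated images G(p)
    gray = torch.mean(D(anime_gray) ** 2)       # grayscale anime images x
    edge = torch.mean(D(anime_smooth) ** 2)     # edge-smoothed anime images y
    return w_adv * (real + fake + gray + 0.1 * edge)
```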

About the edge-promoting adversarial loss

In CartoonGAN’s paper, it is described as follows: in the standard GAN framework, the task of the discriminator D is to figure out whether the input image was produced by the generator or is a real image. However, we observe that training the discriminator D only to distinguish generated images from real cartoon images is not sufficient for transforming photos into cartoons. This is because an important characteristic of cartoon images is their clear edges, but such edge information usually makes up only a very small proportion of the whole image. Therefore, an output image with correct shading but without clearly reproduced edges may still fool the discriminator.

To make the discriminator truly edge-aware, cartoon images whose edges have been smoothed are also fed into the discriminator D as negative examples, as shown in the formula:


$$
\mathcal{L}_{adv}(G, D)=\mathbb{E}_{c_i \sim S_{data}(c)}\left[\log D(c_i)\right]+\mathbb{E}_{e_j \sim S_{data}(e)}\left[\log\left(1-D(e_j)\right)\right]+\mathbb{E}_{p_k \sim S_{data}(p)}\left[\log\left(1-D(G(p_k))\right)\right]
$$
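For comparison, a sketch of CartoonGAN's edge-promoting adversarial loss in its original log (binary cross-entropy) form, assuming a discriminator whose output is a probability (e.g. ending in a sigmoid); AnimeGAN itself uses the least-squares form shown earlier.

```python
import torch

def edge_promoting_adv_loss(D, G, cartoon, cartoon_smooth_edge, photo, eps=1e-8):
    # c: real cartoon images, e: the same cartoons with smoothed edges, p: real photos
    d_real = torch.mean(torch.log(D(cartoon) + eps))
    d_edge = torch.mean(torch.log(1.0 - D(cartoon_smooth_edge) + eps))
    d_fake = torch.mean(torch.log(1.0 - D(G(photo)) + eps))
    # the discriminator maximizes this value; the generator only minimizes the last term
    return d_real + d_edge + d_fake
```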

Training details

The proposed AnimeGAN can be easily trained end-to-end with unpaired training data. Since the GAN model is highly nonlinear, with random initialization the optimization is easily trapped in a suboptimal local minimum. CartoonGAN suggested that pre-training the generator helps speed up the convergence of the GAN. Therefore, only the content loss function Lcon(G, D) is used to pre-train the generator network G, with one initialization epoch at a learning rate of 0.0001. During AnimeGAN’s training phase, the learning rates of the generator and discriminator are 0.00008 and 0.00016, respectively. AnimeGAN is trained for 100 epochs with a batch size of 4. The Adam optimizer is used to minimize the total loss. AnimeGAN is trained on an Nvidia 1080 Ti GPU using TensorFlow.
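The training schedule described above could be sketched roughly as follows, reusing the loss sketches from earlier; the generator G, discriminator D, perceptual network vgg, and the data loaders are assumed to be defined elsewhere.

```python
import torch

# 1) initialization phase: pre-train G with the content loss only (1 epoch, lr 1e-4)
g_opt_init = torch.optim.Adam(G.parameters(), lr=1e-4)
for photo in photo_loader:
    loss = content_loss(vgg, photo, G(photo))
    g_opt_init.zero_grad()
    loss.backward()
    g_opt_init.step()

# 2) adversarial training: 100 epochs, batch size 4, lr 8e-5 (G) and 1.6e-4 (D)
g_opt = torch.optim.Adam(G.parameters(), lr=8e-5)
d_opt = torch.optim.Adam(D.parameters(), lr=1.6e-4)
for epoch in range(100):
    for photo, anime, anime_gray, anime_smooth in train_loader:
        fake = G(photo)

        d_loss = discriminator_loss(D, anime, fake.detach(), anime_gray, anime_smooth, w_adv=300.0)
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        g_loss = generator_loss(D, vgg, photo, fake, anime_gray)
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
```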

Things I don’t understand:

  1. The paper mentions that many NST algorithms match statistics of the Gram matrices computed from deep features extracted by pre-trained convolutional networks. I don’t know what that means.
  2. What is a Huber loss?

It is a combination of the mean squared error and the absolute error (see the formula below).
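For reference, the standard Huber loss with threshold $\delta$ is quadratic for small errors and linear for large ones:

$$
L_\delta(x)=\begin{cases}\frac{1}{2}x^2, & |x|\le\delta \\ \delta\left(|x|-\frac{1}{2}\delta\right), & |x|>\delta\end{cases}
$$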

  3. What is an ablation experiment?

It is essentially the controlled-variable method.

Resources

1. AnimeGAN: a novel lightweight GAN for photo animation | pdf | Tensorflow | PyTorch

2. AnimeGAN v2 | Xin Chen and Gang Liu | TensorFlow | PyTorch

4. CartoonGAN | pdf | Pytorch | Pytorch2 | Tensorflow

5. Understand the sampling and format of YUV

6. In-depth understanding of depthwise separable convolution – GiantPandaCV