
Improved style transfer

When neural style transfer was first proposed, it aroused great interest in the industry, but people soon noticed some shortcomings of the original algorithm. One limitation is that style transfer takes all of the style information, including the color and brush strokes of the entire style image, and transfers it to the entire content image. For example, in TensorFlow2 implements neural style transfer, the blue of the style image is transferred onto the trees, but sometimes we want the option of transferring only the brush strokes without the color, or of transferring the style only to specific regions, for finer control.

The neural style transfer authors developed a new algorithm to address these problems. The following figure shows an example of the control the algorithm provides and its results:

They proposed the following controls:

  1. Spatial control: controls the spatial location of the style transfer in the content and style images. This is done by applying a spatial mask to the style features before computing the Gram matrix (see the sketch after this list).
  2. Color control: can be used to preserve the colors of the content image. To do this, we convert the RGB image to a color space, such as HCL, that separates luminance from the other color channels. We then perform style transfer only on the luminance channel and merge the result with the color channels of the original content image to produce the final stylized image.
  3. Degree control: manages the granularity of the brush strokes. This is more involved because it requires running style transfer multiple times and selecting style features at different layers to compute the Gram matrices.
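As an illustration of spatial control, the sketch below applies a spatial mask to a VGG feature map before computing the Gram matrix. This is only a minimal sketch, not the original implementation: the feature tensor, the mask, and their shapes are assumptions.

```python
import tensorflow as tf

def masked_gram_matrix(features, mask):
    """Gram matrix of VGG features restricted to a spatial region.

    features: float tensor of shape (1, H, W, C), e.g. one VGG layer output.
    mask:     float tensor of shape (H, W) with values in [0, 1] marking the
              region the style should be taken from / applied to.
    """
    # Broadcast the mask over the channel dimension and zero out features
    # outside the region of interest.
    mask = tf.reshape(mask, (1, mask.shape[0], mask.shape[1], 1))
    masked = features * mask

    # Flatten the spatial dimensions and compute the (C, C) Gram matrix.
    _, h, w, c = masked.shape
    flat = tf.reshape(masked, (h * w, c))
    gram = tf.matmul(flat, flat, transpose_a=True)

    # Normalize by the number of positions covered by the mask so that
    # small regions are not penalized.
    num_positions = tf.reduce_sum(mask) * tf.cast(c, tf.float32)
    return gram / tf.maximum(num_positions, 1.0)
```

The style loss for a region is then the mean squared difference between the masked Gram matrices of the stylized image and the style image, summed over the chosen VGG layers.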

The two main themes in improving style transfer are improving speed and improving flexibility. Let's take a closer look at some variants of the classic algorithm to lay the groundwork for our next project, real-time arbitrary style transfer.

Faster style transfer via feedforward networks

Neural style transfer is based on an optimization similar to neural network training, so even with a GPU it is slow, usually taking several minutes to produce a result. This limits its use on mobile devices, so there was a real need for faster style transfer algorithms, and feedforward style transfer was born. The diagram below shows one of the first networks to adopt this architecture:

The architecture is actually simpler than the diagram above suggests. There are two networks in this architecture:

  1. A trainable convolutional network (often called the style transfer network) that transforms the input image into a stylized image. It can be implemented as an encoder/decoder architecture similar to U-Net or a VAE.
  2. A fixed convolutional network, usually a pre-trained VGG, used to measure the content and style losses.

As in the original neural style transfer, VGG is used to extract the content and style targets, but this architecture no longer optimizes the input image; instead, it trains the convolutional network to transform the content image into a stylized image. The content and style features of the stylized image are extracted by VGG, and the losses are computed and backpropagated into the trainable convolutional network. We can train it like any other CNN. At inference time, we only need a single forward pass to convert the input image into a stylized image!
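A minimal sketch of this training setup is shown below, assuming a trainable `transform_net` (the style transfer network) and precomputed Gram-matrix targets for the style image; these names, the chosen VGG layers, and the loss weights are illustrative assumptions, not the original implementation, and VGG preprocessing is omitted for brevity.

```python
import tensorflow as tf

# Fixed loss network: a pre-trained VGG whose weights are never updated.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False
content_layers = ["block4_conv2"]
style_layers = ["block1_conv1", "block2_conv1", "block3_conv1", "block4_conv1"]
loss_net = tf.keras.Model(
    vgg.input, [vgg.get_layer(n).output for n in content_layers + style_layers])

def gram_matrix(x):
    # (N, H, W, C) -> (N, C, C), normalized by the number of spatial positions.
    gram = tf.einsum("nhwc,nhwd->ncd", x, x)
    n_positions = tf.cast(tf.shape(x)[1] * tf.shape(x)[2], tf.float32)
    return gram / n_positions

optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(content_images, style_gram_targets,
               content_weight=1.0, style_weight=1e-2):
    # Content targets: VGG features of the original content images.
    content_targets = loss_net(content_images)[: len(content_layers)]
    with tf.GradientTape() as tape:
        stylized = transform_net(content_images)   # single forward pass
        feats = loss_net(stylized)
        content_feats = feats[: len(content_layers)]
        style_feats = feats[len(content_layers):]

        c_loss = tf.add_n([tf.reduce_mean(tf.square(f - t))
                           for f, t in zip(content_feats, content_targets)])
        s_loss = tf.add_n([tf.reduce_mean(tf.square(gram_matrix(f) - g))
                           for f, g in zip(style_feats, style_gram_targets)])
        loss = content_weight * c_loss + style_weight * s_loss

    # Gradients flow only into the trainable style transfer network.
    grads = tape.gradient(loss, transform_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, transform_net.trainable_variables))
    return loss
```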

This network solves the speed problem, but another problem remains: such a network can only learn a single style. We need to train a separate network for every style we want to apply, which is much less flexible than the original neural style transfer.

Controlling the transferred style features

The original neural style transfer paper did not explain why the Gram matrix can effectively capture style. Many subsequent improvements, such as feedforward style transfer, continued to use Gram matrices as the style representation. Demystifying Neural Style Transfer changed this by showing that style information is essentially represented by the activation distributions in a CNN: matching the Gram matrices of activations is equivalent to minimizing the maximum mean discrepancy (MMD) between the activation distributions. Therefore, we can perform style transfer by matching the activation distribution of the generated image with that of the style image.

Therefore, the Gram matrix is not the only way to implement style transfer. We could also use an adversarial loss, as GANs such as pix2pix do, to perform a style shift by matching the pixel distribution of generated images with that of real images. The difference is that a GAN tries to minimize the difference between pixel distributions, whereas style transfer applies the same idea to the distributions of layer activations.

Later, researchers found that style can be represented using only the mean and variance of the activations. In other words, if we feed two similarly styled images into VGG, their layer activations will have similar means and variances. Therefore, we can train a network to perform style transfer by minimizing the difference in activation means and variances between the generated image and the style image. This led to using normalization layers to control style.
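As a hedged sketch of this idea, the loss below matches only the per-channel mean and standard deviation of VGG activations between the generated image and the style image, instead of full Gram matrices; the function name and weighting are illustrative assumptions.

```python
import tensorflow as tf

def mean_std_style_loss(generated_feats, style_feats, eps=1e-5):
    """Style loss that matches per-channel activation statistics.

    Both arguments are lists of VGG feature maps of shape (N, H, W, C).
    """
    loss = 0.0
    for g, s in zip(generated_feats, style_feats):
        # Statistics are computed over the spatial dimensions of each channel.
        g_mean, g_var = tf.nn.moments(g, axes=[1, 2], keepdims=True)
        s_mean, s_var = tf.nn.moments(s, axes=[1, 2], keepdims=True)
        loss += tf.reduce_mean(tf.square(g_mean - s_mean))
        loss += tf.reduce_mean(tf.square(tf.sqrt(g_var + eps) - tf.sqrt(s_var + eps)))
    return loss
```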

Use a normalization layer to control the style

A simple and effective way to control activation statistics is to change γ and β in the normalization layer. In other words, we can change the style by using different affine transformation parameters (γ and β). Batch normalization and instance normalization use the same equation:


$$BN(x) = IN(x) = \gamma \left(\frac{x - \mu(x)}{\sigma(x)}\right) + \beta$$

The difference is that batch normalization (BN) computes the mean μ and standard deviation σ over the (N, H, W) dimensions, while instance normalization (IN) computes them over (H, W) only.
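The only difference between the two is the axes over which the statistics are computed, as the small sketch below illustrates (the tensor shape and NHWC layout are assumptions for illustration):

```python
import tensorflow as tf

x = tf.random.normal((8, 64, 64, 32))  # (N, H, W, C)

# Batch norm statistics: one mean/variance per channel, shared across the batch.
bn_mean, bn_var = tf.nn.moments(x, axes=[0, 1, 2], keepdims=True)  # shape (1, 1, 1, C)

# Instance norm statistics: one mean/variance per channel *per sample*.
in_mean, in_var = tf.nn.moments(x, axes=[1, 2], keepdims=True)     # shape (N, 1, 1, C)

gamma = tf.ones((1, 1, 1, 32))
beta = tf.zeros((1, 1, 1, 32))
bn_out = gamma * (x - bn_mean) / tf.sqrt(bn_var + 1e-5) + beta
in_out = gamma * (x - in_mean) / tf.sqrt(in_var + 1e-5) + beta
```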

However, each normalization layer has only one pair of γ and β, which limits the network to learning a single style. So how do we make one network learn multiple styles? We can use multiple sets of γ and β coefficients, with each set learning one style. This is where conditional instance normalization (CIN) comes in.

It is based on instance normalization, but has multiple pairs of γ and β. Each pair of γ and β values is trained for a particular style; in other words, they are conditioned on the style image. The equation for conditional instance normalization is as follows:


$$CIN(x; s) = \gamma^S \left(\frac{x - \mu(x)}{\sigma(x)}\right) + \beta^S$$

Suppose we have S different style images; then each normalization layer has S γ vectors and S β vectors, one per style. In addition to the content image, we also feed a one-hot encoded style label into the network. In practice, γ and β are implemented as matrices of shape (S × C). We retrieve the γ^S and β^S for a given style by multiplying the one-hot label (1 × S) with the matrix (S × C), which gives a (1 × C) vector, one value per channel.
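Below is a minimal sketch of a conditional instance normalization layer along these lines, written as a Keras layer; the layer name, argument names, and initialization are illustrative assumptions rather than the original implementation.

```python
import tensorflow as tf

class ConditionalInstanceNorm(tf.keras.layers.Layer):
    """Instance normalization with one (gamma, beta) pair per style."""

    def __init__(self, num_styles, epsilon=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.num_styles = num_styles
        self.epsilon = epsilon

    def build(self, input_shape):
        x_shape, _ = input_shape            # (features, one-hot style label)
        channels = x_shape[-1]
        # One row of gamma/beta per style: shape (S, C).
        self.gamma = self.add_weight(name="gamma",
                                     shape=(self.num_styles, channels),
                                     initializer="ones")
        self.beta = self.add_weight(name="beta",
                                    shape=(self.num_styles, channels),
                                    initializer="zeros")

    def call(self, inputs):
        x, style_onehot = inputs            # x: (N, H, W, C), label: (N, S)
        mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        normalized = (x - mean) / tf.sqrt(var + self.epsilon)

        # (N, S) x (S, C) -> (N, C): select each sample's gamma/beta by its style.
        gamma = tf.matmul(style_onehot, self.gamma)[:, None, None, :]
        beta = tf.matmul(style_onehot, self.beta)[:, None, None, :]
        return gamma * normalized + beta
```

Such a layer would be called as `ConditionalInstanceNorm(num_styles=S)([features, style_label])`, with one layer at each point in the transfer network where a plain instance normalization would otherwise sit.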

Next, we can treat γ and β as an embedding space for styles, and perform style interpolation by interpolating between different γ and β values:
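For instance, blending two learned styles reduces to mixing their γ and β rows; this is a hypothetical sketch in which the (S × C) parameter matrices are random placeholders standing in for trained CIN parameters.

```python
import tensorflow as tf

# Placeholder learned CIN parameters: one row per style, shape (S, C).
S, C = 4, 64
gamma_matrix = tf.random.normal((S, C))
beta_matrix = tf.random.normal((S, C))

# Blend styles 0 and 1 with weight alpha in [0, 1].
alpha = 0.3
gamma_mix = alpha * gamma_matrix[0] + (1 - alpha) * gamma_matrix[1]
beta_mix = alpha * beta_matrix[0] + (1 - alpha) * beta_matrix[1]
# gamma_mix / beta_mix are then used in place of gamma^S / beta^S in the CIN equation.
```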

While all of these variants made excellent progress, the network is still limited to the fixed set of styles used in training. In the next series of posts, we'll learn about and implement improvements that allow for arbitrary style transfer!

Series links

TensorFlow2 implements neural style transfer