


Depthwise separable convolution

A deeper understanding of depthwise separable convolution

My understanding is that the originally expensive computation is simplified by splitting the standard convolution into two parts. The first part is a depthwise convolution, which extracts information within each channel separately, without mixing information across channels. The second part is a pointwise convolution, a 1×1 convolution whose main job is to mix information across channels. This not only reduces the computation but still captures information along both the spatial and channel dimensions. I remember that Li Hongyi covered this part in his video on how to reduce model parameters, and his diagram is easy to follow.

# from: https://github.com/softmurata/AnimeGAN/blob/master/subnetwork.py#L156

import torch.nn as nn


class DepthWiseConv(nn.Module):

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=0, dilation=1, bias=False):
        super().__init__()
        # groups=in_channels: each input channel is convolved with its own filter
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, dilation, groups=in_channels, bias=bias)
        # 1x1 convolution (kernel_size=1, stride=1, padding=0, dilation=1, groups=1) mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, 1, 0, 1, 1, bias=bias)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x
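
To see the savings concretely, here is a quick check (a minimal sketch using the DepthWiseConv class above; the 64→128 channel sizes are arbitrary). A standard convolution needs C_in × C_out × K × K weights, while the separable version needs only C_in × K × K + C_in × C_out:

import torch
import torch.nn as nn

# standard 3x3 convolution: 128 * 64 * 3 * 3 = 73,728 weights
standard = nn.Conv2d(64, 128, kernel_size=3, bias=False)
# depthwise separable: 64 * 3 * 3 + 64 * 128 = 8,768 weights
separable = DepthWiseConv(64, 128, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 73728
print(count(separable))  # 8768

x = torch.randn(1, 64, 32, 32)
print(separable(x).shape)  # torch.Size([1, 128, 30, 30])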

Transposed convolution

In PyTorch, transposed convolution is available as torch.nn.ConvTranspose2d.

In many network architectures, we often need to perform a transformation in the opposite direction of normal convolution, that is, upsampling. Examples include generating high-resolution images and mapping low-dimensional feature maps to a higher-dimensional space, as in autoencoders or semantic segmentation. (In the latter case, semantic segmentation first extracts feature maps in the encoder and then restores the original image size in the decoder, so that every pixel of the original image can be classified.)

The traditional way to implement upsampling is to apply an interpolation scheme or other manually designed rules. Modern architectures instead tend to let the network learn the appropriate transformation by itself, without human intervention. To do that, we can use the transposed convolution.

Transposed convolution is also known in the literature as deconvolution or fractionally strided convolution. However, the name “deconvolution” is not really appropriate, because transposed convolution is not true deconvolution as defined in signal/image processing. Technically, deconvolution in signal processing is the inverse of convolution, which is not what happens here. For this reason, some authors strongly object to calling transposed convolution deconvolution; people do so mainly because it is shorter to say. We will see why calling this operation a transposed convolution is more natural and appropriate.

We can always implement a transposed convolution with a direct convolution. In the example below, we apply a transposed convolution with a 3×3 kernel to a 2×2 input that is padded with a 2×2 border of zeros, using unit stride. The upsampled output is 4×4.

Interestingly, by varying the padding and stride, we can map the same 2×2 input to different output sizes. Below, a transposed convolution is applied to the same 2×2 input (with one zero inserted between the input values and a 2×2 border of zeros, again with unit stride), producing a 5×5 output.
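
These output sizes are easy to verify in PyTorch. The following is a minimal sketch (random weights, since only the output shapes matter here); with a 3×3 kernel, padding=0 in ConvTranspose2d corresponds to the full 2-pixel zero border of the equivalent direct convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)  # the 2x2 input

# 3x3 kernel, unit stride -> 4x4 output
up1 = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0)
print(up1(x).shape)  # torch.Size([1, 1, 4, 4])

# 3x3 kernel, stride 2 (one zero inserted between input values) -> 5x5 output
up2 = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=0)
print(up2(x).shape)  # torch.Size([1, 1, 5, 5])

In general, the output size is (in − 1) × stride − 2 × padding + kernel_size.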

Looking at the transposed convolutions in the examples above helps us build some intuition. But in order to generalize, it is useful to know how it is implemented on a computer with matrix multiplication. There we can also see why “transposed convolution” is an appropriate name.

In convolution, let us define C as the convolution kernel matrix, Large as the input image, and Small as the output image. After the convolution (a matrix multiplication), we have downsampled the large image into the small one. This matrix-multiplication form of convolution is C × Large = Small.

The following example shows how this works. It flattens the input into a 16×1 vector and converts the convolution kernel into a sparse 4×16 matrix. A matrix multiplication is then performed between the sparse matrix and the flattened input, and the resulting 4×1 vector is reshaped into the 2×2 output.

Now, if we instead multiply the transpose Cᵀ (a 16×4 matrix) by the 4×1 vector Small, we obtain a 16×1 vector that reshapes into a 4×4 Large image: Cᵀ × Small = Large, as shown below. (Note that Cᵀ is not the inverse of C, so this recovers the shape of the large image, not its original values.)

As you can see, we have upsampled the small image into a large one, which is exactly what we wanted to achieve, and you can now also see where the name transposed convolution comes from. An arithmetic explanation of transposed convolution can be found at: arxiv.org/abs/1603.07…
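
This argument can be checked directly in code. The sketch below (my own illustration of the idea, not code from the article) builds the sparse 4×16 matrix C for a 3×3 kernel sliding over a 4×4 input, verifies that C × Large matches F.conv2d, and then applies Cᵀ to the small output:

import torch
import torch.nn.functional as F

kernel = torch.randn(3, 3)
large = torch.randn(4, 4)

# build the sparse 4x16 matrix C: row k holds the 3x3 kernel placed at
# output position k over the flattened 4x4 input
C = torch.zeros(4, 16)
for row, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    placed = torch.zeros(4, 4)
    placed[i:i + 3, j:j + 3] = kernel
    C[row] = placed.flatten()

# convolution as matrix multiplication: C x Large = Small (4x1 -> 2x2)
small = (C @ large.flatten()).reshape(2, 2)
reference = F.conv2d(large.view(1, 1, 4, 4), kernel.view(1, 1, 3, 3))
print(torch.allclose(small, reference.view(2, 2)))  # True

# transposed convolution: C^T x Small = Large (16x1 -> 4x4);
# C^T is not the inverse of C, so the shape is restored, not the original values
upsampled = (C.T @ small.flatten()).reshape(4, 4)
print(upsampled.shape)  # torch.Size([4, 4])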

Huber Loss

Reference: Summary of commonly used loss functions in machine learning – Zhihu

MSE loss converges quickly but is easily affected by outliers; MAE is more robust to outliers but converges slowly. Huber Loss is a loss function that combines MSE and MAE and takes the advantages of both; it is also known as Smooth Mean Absolute Error Loss. The principle is very simple: use MSE when the error is close to 0, and MAE when the error is large. The formula is as follows:


J_{huber}=\frac{1}{N} \sum_{i=1}^{N}\left[\mathbb{I}_{\left|y_{i}-\hat{y}_{i}\right| \leq \delta} \frac{\left(y_{i}-\hat{y}_{i}\right)^{2}}{2}+\mathbb{I}_{\left|y_{i}-\hat{y}_{i}\right|>\delta}\left(\delta\left|y_{i}-\hat{y}_{i}\right|-\frac{1}{2} \delta^{2}\right)\right]

In the formula, δ is a hyperparameter of Huber Loss; it marks the point where the MSE piece and the MAE piece join. The first term on the right-hand side of the equals sign is the MSE part, and the second term is the MAE part. In the MAE part, the form δ|y_i − ŷ_i| − δ²/2 ensures that the two pieces agree at the error |y − ŷ| = ±δ (both branches equal δ²/2 there), so that Huber Loss is continuous and differentiable everywhere.

The figure below shows Huber Loss with δ = 1.0. As can be seen, in the interval [−δ, δ] it is effectively MSE loss, and in the intervals (−∞, −δ) and (δ, ∞) it is MAE loss.

Characteristics of Huber Loss

Huber Loss combines MSE and MAE. When the error is close to 0, the MSE piece makes the loss function differentiable and the gradient more stable; when the error is large, the MAE piece reduces the influence of outliers and makes training more robust to them. The drawback is that it requires an additional hyperparameter δ.
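
The piecewise formula above translates directly into code. A minimal sketch in PyTorch (recent PyTorch versions also ship this as torch.nn.HuberLoss):

import torch

def huber_loss(y, y_hat, delta=1.0):
    error = torch.abs(y - y_hat)
    quadratic = 0.5 * error ** 2                # MSE branch, |error| <= delta
    linear = delta * error - 0.5 * delta ** 2   # MAE branch, |error| > delta
    return torch.where(error <= delta, quadratic, linear).mean()

y = torch.tensor([0.0, 0.0, 0.0])
y_hat = torch.tensor([0.5, 1.0, 4.0])
print(huber_loss(y, y_hat))                     # tensor(1.3750)
print(torch.nn.HuberLoss(delta=1.0)(y_hat, y))  # same value from the built-in

Note how the outlier with error 4.0 contributes only linearly (4.0 − 0.5 = 3.5) rather than quadratically (8.0), which is exactly the robustness described above.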

References

1. Understanding 12 convolution methods (including 1×1 convolution, transposed convolution, depthwise separable convolution, etc.)
2. Summary of commonly used loss functions in machine learning – Zhihu