EfficientNet was originally planned for today, but EfficientNet is based on both the SENet and MobileNet networks. So this lesson covers MobileNet first, the next lesson covers SENet, and the lesson after that covers EfficientNet. Of course, every lesson is implemented in PyTorch.

1 Background

The "Mobile" in MobileNet refers to mobile devices such as mobile phones. MobileNet is a lightweight deep neural network proposed by Google in 2017, designed specifically for mobile and embedded devices that have limited computing power but need fast, real-time inference.

2 Innovations

2.1 Depthwise separable convolution

Depthwise separable convolution replaces the traditional convolution operation, and the pooling layer is abandoned. The standard convolution is decomposed into:

  • Depthwise convolution
  • Pointwise convolution

This has the advantage of significantly reducing the number of parameters and computation.
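For a first bit of intuition, here is a quick parameter-count comparison in PyTorch (a sketch of my own, using 64 input channels and 128 output channels as an example; these numbers reappear in the worked example below):

import torch.nn as nn

# Standard 3x3 convolution: 64 -> 128 channels
std = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
# Depthwise 3x3 (groups=64: one kernel per channel) + pointwise 1x1
dw = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)
pw = nn.Conv2d(64, 128, kernel_size=1, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std))             # 73728 = 3*3*64*128
print(count(dw) + count(pw))  # 8768  = 3*3*64 + 64*128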

2.2 Standard convolution computation

Let’s first review how the cost of a standard convolution is computed. Define the notation: the feature map has height H and width W, the convolution kernel is square with side length K, M is the number of channels in the input feature map, and N is the number of channels in the output feature map.

Now, to simplify the problem, as shown in the figure above, suppose the input feature map is single-channel and the output feature map is also single-channel. Each convolution produces a scalar, and from the output feature map we can see that 9 convolutions are performed in total. Each convolution takes 9 multiplications, because every number on the convolution kernel is multiplied by the corresponding number on the input feature map (additions are not counted here). Therefore, the total is:


9 * 9 = 3 * 3 * 3 * 3 = 81

If the input feature map has 2 channels, then the convolution kernel must also have 2 channels, while the output feature map is still single-channel. The amount of computation becomes:


9 * 9 * 2 = 3 * 3 * 3 * 3 * 2 = 162

Instead of the 9 multiplications per convolution in the single-channel case, it now takes 18 multiplications to get 1 number in the output, because the number of input channels is 2. Now suppose we want a 3-channel output feature map. Then three different convolution kernels must be prepared, and all of the above operations repeated three times to get three feature maps. So the amount of computation is:


9 * 9 * 2 * 3 = 3 * 3 * 3 * 3 * 2 * 3 = 486

Now back to the general problem: the feature map has height H and width W, the convolution kernel is square with side length K, M is the number of channels in the input feature map, and N is the number of channels in the output feature map. The computation of the convolution is:


H * W * K * K * M * N

This is the formula for the computation cost of a standard convolution.
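As a sanity check, here is a small Python helper (my own sketch, not from the original lesson) that evaluates this formula and reproduces the three toy examples above; H and W are taken as the size of the output feature map, which is how the 9 convolutions were counted:

def conv_mults(H, W, K, M, N):
    # Multiplications of a standard convolution with an H x W output
    return H * W * K * K * M * N

print(conv_mults(3, 3, 3, 1, 1))  # 81
print(conv_mults(3, 3, 3, 2, 1))  # 162
print(conv_mults(3, 3, 3, 2, 3))  # 486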

2.3 Depthwise separable convolution computation

  • Depthwise separable convolution is abbreviated as DSC (Depthwise Separable Convolution).

Suppose that with a standard convolution, an input feature map of 64×7×7 needs to be transformed into an output feature map of 128×7×7 through a 3×3 convolution kernel. How much computation does this process require?


7 * 7 * 3 * 3 * 64 * 128 = 3612672

With depthwise separable convolution, the operation is split into two steps:

  1. Depthwise: the 64×7×7 input is first convolved with 3×3 convolution kernels to get a 64×7×7 feature map. Pay attention! This is a 64×7×7 feature map convolved with 3×3 kernels, not with a 64×3×3 kernel! The 64×7×7 feature map is treated as 64 separate 7×7 images, each convolved with its own 3×3 kernel.
  2. Pointwise: it is not hard to see that the depthwise step cannot integrate information across different channels, because every channel was processed separately. So in this step, 64×1×1 convolution kernels are used to integrate the information of different channels, and 128 such 64×1×1 kernels are used to produce the 128×7×7 output feature map (see the shape check after this list).
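Here is a minimal PyTorch shape check of these two steps (a sketch I added; the layer names are my own):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 7, 7)  # the 64x7x7 input, with a batch dimension

# Depthwise: groups=64 gives every input channel its own 3x3 kernel
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)
# Pointwise: 128 kernels of shape 64x1x1 mix information across channels
pointwise = nn.Conv2d(64, 128, kernel_size=1, bias=False)

out = pointwise(depthwise(x))
print(out.shape)  # torch.Size([1, 128, 7, 7])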

The final amount of calculation is:


7 * 7 * 3 * 3 * 64 + 7 * 7 * 1 * 1 * 64 * 128 = 429632

The amount of computation was reduced by more than 80 percent.
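To double-check the arithmetic (again a sketch of my own):

standard = 7 * 7 * 3 * 3 * 64 * 128                        # 3612672
separable = 7 * 7 * 3 * 3 * 64 + 7 * 7 * 1 * 1 * 64 * 128  # 429632
print(separable / standard)  # ~0.119, i.e. roughly an 88% reduction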

The decomposition process is shown as follows:

In the picture, you can see:

  • (a) represents the standard convolution process. Every convolution kernel has M channels, and there are N such kernels in total, which means the input feature map has M channels and the output feature map has N channels.
  • (b) represents the depthwise process. There are M single-channel convolution kernels in total; each of the M channels of the input feature map is convolved separately, so the output feature map also has M channels.
  • (c) represents the pointwise process. There are N convolution kernels of size 1×1 (each with M channels) to integrate the information of the different channels, and the output feature map has N channels.

2.4 Network Structure

On the left is the standard convolution block: the convolution is followed by BN and ReLU layers. Because DSC splits the convolution into two steps, it becomes the structure on the right: BN and ReLU after the depthwise convolution, and BN and ReLU again after the pointwise convolution.

It can be seen from the overall network structure:

  • Except for the first layer, which is a standard convolution layer, all the other layers are depthwise separable convolutions.
  • The Pooling layer is not used for the entire network.

3 PyTorch implementation

import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    '''Depthwise conv + Pointwise conv'''
    def __init__(self, in_planes, out_planes, stride=1):
        super(Block, self).__init__()
        # Depthwise: groups=in_planes gives each channel its own 3x3 kernel
        self.conv1 = nn.Conv2d(in_planes, in_planes, kernel_size=3,
                               stride=stride, padding=1,
                               groups=in_planes, bias=False)
        self.bn1 = nn.BatchNorm2d(in_planes)
        # Pointwise: 1x1 convolution mixes information across channels
        self.conv2 = nn.Conv2d(in_planes, out_planes, kernel_size=1,
                               stride=1, padding=0, bias=False)
        self.bn2 = nn.BatchNorm2d(out_planes)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        return out


class MobileNet(nn.Module):
    # A bare int means out_planes with stride=1;
    # a tuple (out_planes, 2) means stride=2.
    cfg = [64, (128, 2), 128, (256, 2), 256, (512, 2),
           512, 512, 512, 512, 512, (1024, 2), 1024]

    def __init__(self, num_classes=10):
        super(MobileNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(32)
        self.layers = self._make_layers(in_planes=32)
        self.linear = nn.Linear(1024, num_classes)

    def _make_layers(self, in_planes):
        layers = []
        for x in self.cfg:
            out_planes = x if isinstance(x, int) else x[0]
            stride = 1 if isinstance(x, int) else x[1]
            layers.append(Block(in_planes, out_planes, stride))
            in_planes = out_planes
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layers(out)
        out = F.avg_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out

net = MobileNet()
x = torch.randn(1, 3, 32, 32)
y = net(x)
print(y.size())
> torch.Size([1, 10])

Normally this pretrained model outputs 1024 features, but here I added a fully connected layer of 1024 -> 10 myself.
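If you need a different number of classes, the head can be changed the same way (a hypothetical example, not from the original lesson):

net = MobileNet(num_classes=100)    # build with a 100-class head
net.linear = nn.Linear(1024, 100)   # or replace the head on an existing model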

Let’s look at the network structure:

print(net)

Output results: printing the model shows the first standard convolution layer followed by the 13 depthwise separable Blocks. In the code, the cfg list at the top of MobileNet sets the number of output channels for each Block (and, for the tuple entries, a stride of 2).

Now that MobileNet is done, the next lesson implements and explains SENet in PyTorch.