Depthwise (DW) convolution and Pointwise (PW) convolution are together referred to as Depthwise Separable Convolution (see Google’s Xception). This structure is similar to a regular convolution and can be used to extract features, but it needs fewer parameters and less computation than a conventional convolution. That is why you will encounter it in lightweight networks such as MobileNet.

Conventional convolution operation

Consider a 5×5-pixel, three-channel color input image (shape 5×5×3). It passes through a convolution layer with 3×3 kernels. Assuming the number of output channels is 4, the kernel shape is 3×3×3×4 (each of the 4 kernels spans all 3 input channels), and four feature maps are output. With same padding, each feature map has the same size as the input layer (5×5); without padding, the size shrinks to 3×3.
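The shape bookkeeping above can be checked with a few lines of plain Python. This is only a sketch under stated assumptions: stride 1, square kernels, no bias terms, and a hypothetical helper name `conv_shapes` that does not come from any library.

```python
import math

def conv_shapes(height, width, in_channels, out_channels, kernel, padding):
    """Return (kernel_bank_shape, output_shape) for a stride-1 convolution.

    Assumption: square kernel, stride 1, biases ignored.
    """
    if padding == "same":
        out_h, out_w = height, width
    elif padding == "valid":
        out_h, out_w = height - kernel + 1, width - kernel + 1
    else:
        raise ValueError("padding must be 'same' or 'valid'")
    # Every output channel owns one kernel that spans all input channels.
    kernel_bank = (kernel, kernel, in_channels, out_channels)
    return kernel_bank, (out_h, out_w, out_channels)

bank, out_same = conv_shapes(5, 5, 3, 4, kernel=3, padding="same")
_, out_valid = conv_shapes(5, 5, 3, 4, kernel=3, padding="valid")

print(bank, math.prod(bank))  # (3, 3, 3, 4) 108
print(out_same)               # (5, 5, 4)
print(out_valid)              # (3, 3, 4)
```

For the 5×5×3 example, the conventional layer therefore holds 3×3×3×4 = 108 weights.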

Depthwise Separable Convolution

Depthwise Separable Convolution decomposes a complete convolution operation into two steps: Depthwise Convolution and Pointwise Convolution.

Depthwise Convolution

Unlike a conventional convolution, in Depthwise Convolution each convolution kernel is responsible for exactly one channel, and each channel is convolved by exactly one kernel. In the conventional convolution described above, every kernel operates on all channels of the input image at once. Take the same 5×5-pixel, three-channel color input image (shape 5×5×3). Depthwise Convolution is the first step applied to it, and it is carried out entirely in the two-dimensional plane. The number of convolution kernels equals the number of channels in the previous layer (channels and kernels correspond one to one). A three-channel image therefore produces three feature maps (with same padding, each is 5×5, the same size as the input layer), as shown in the figure below.

The number of feature maps after Depthwise Convolution is the same as the number of channels of the input layer, so this step cannot increase the number of feature maps. In addition, it convolves each channel of the input layer independently and so does not make use of the information that different channels carry at the same spatial position. Pointwise Convolution is therefore needed to combine these feature maps into new ones.
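A minimal pure-Python sketch of the depthwise step may make the one-kernel-per-channel rule concrete. The function names `conv2d_same` and `depthwise_conv` are hypothetical, and the sketch assumes stride 1, zero "same" padding, an odd square kernel, and no kernel flip (cross-correlation, as deep learning frameworks commonly compute it).

```python
def conv2d_same(channel, kernel):
    """Convolve one 2D channel with one 2D kernel (stride 1, zero 'same' padding)."""
    h, w = len(channel), len(channel[0])
    k = len(kernel)   # assumption: odd, square kernel
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:   # outside the image counts as zero
                        total += channel[y][x] * kernel[di][dj]
            out[i][j] = total
    return out

def depthwise_conv(image, kernels):
    """image: list of C channels; kernels: list of C 2D kernels, one per channel."""
    if len(image) != len(kernels):
        raise ValueError("depthwise convolution needs exactly one kernel per channel")
    # No mixing across channels: channel c only ever meets kernel c.
    return [conv2d_same(ch, k) for ch, k in zip(image, kernels)]

# 5x5 image with 3 channels; channel c is filled with the constant c + 1.
image = [[[float(c + 1)] * 5 for _ in range(5)] for c in range(3)]
box = [[1.0] * 3 for _ in range(3)]   # 3x3 kernel of ones
maps = depthwise_conv(image, [box, box, box])

print(len(maps), len(maps[0]), len(maps[0][0]))      # 3 5 5
print(maps[0][2][2], maps[1][2][2], maps[2][2][2])   # 9.0 18.0 27.0
```

Three channels go in and three feature maps come out, and each output depends only on its own input channel. This step uses 3×3×3 = 27 weights.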

Pointwise Convolution

Pointwise Convolution is very similar to a conventional convolution. Its kernel has size 1×1×M, where M is the number of channels of the previous layer. The convolution therefore forms a weighted combination, along the depth direction, of the feature maps from the previous step to generate a new feature map. The number of output feature maps equals the number of convolution kernels, as shown in the figure below.
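The weighted combination along the depth direction can be sketched in plain Python as follows. `pointwise_conv` is a hypothetical helper, and each 1×1×M kernel is represented simply as a list of M weights; biases are ignored.

```python
def pointwise_conv(maps, kernels):
    """maps: list of M 2D feature maps; kernels: list of N kernels, each a list of M weights.

    Returns N feature maps with the same height and width as the inputs.
    """
    h, w = len(maps[0]), len(maps[0][0])
    out = []
    for kernel in kernels:
        if len(kernel) != len(maps):
            raise ValueError("each 1x1 kernel needs one weight per input feature map")
        # At every pixel, take the weighted sum across the M input maps.
        out.append([[sum(wt * m[i][j] for wt, m in zip(kernel, maps))
                     for j in range(w)]
                    for i in range(h)])
    return out

# Three 5x5 input maps filled with the constants 1, 2 and 3.
maps = [[[float(c + 1)] * 5 for _ in range(5)] for c in range(3)]
kernels = [
    [1.0, 0.0, 0.0],   # copies map 0
    [0.0, 1.0, 0.0],   # copies map 1
    [0.0, 0.0, 1.0],   # copies map 2
    [1.0, 1.0, 1.0],   # sums all three maps
]
out = pointwise_conv(maps, kernels)

print(len(out))       # 4
print(out[3][0][0])   # 6.0
```

Four kernels give four output maps, so this is the step that can change the number of channels. For M = 3 inputs and 4 outputs it uses 1×1×3×4 = 12 weights.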

In view of the above two cases, my understanding is as follows:

The number of kernels equals the number of channels we finally output, and each kernel has as many channels as the input image. Suppose the image has three color channels and we want 100 output channels after convolution. Then we need 100 kernels, each a 3-channel 3×3 kernel; that is, 100 kernels of shape [3,3,3] are convolved with an input of shape [image_height,image_width,3]. Assuming padding="same", the final output is [image_height,image_width,100]. This is how ordinary convolution is calculated.

Depthwise separable convolution:

We attach one convolution kernel to each channel. Assuming the input is [image_height,image_width,100], we have 100 convolution kernels, each of shape [3,3]. Assuming padding="same", we still get a feature map of [image_height,image_width,100]; this step is the Depthwise Convolution. Then we convolve these 100 feature maps with K kernels of shape [1,1,100]. Each [1,1,100] kernel holds 100 weights, one per feature map: every feature map is multiplied by its own weight, and the 100 weighted maps are summed position by position to give one new feature map. Repeating this with each of the K kernels finally forms K different feature maps, so the output is [image_height,image_width,K].
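To see why this decomposition is cheaper, the weight counts of the two approaches can be compared directly. The sketch below ignores biases, and the helper names and the value K = 64 are assumptions chosen only for illustration.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a conventional convolution: one k x k x in_channels kernel per output channel."""
    return kernel * kernel * in_channels * out_channels

def separable_conv_params(kernel, in_channels, out_channels):
    """Weights in a depthwise separable convolution (depthwise step plus pointwise step)."""
    depthwise = kernel * kernel * in_channels   # one k x k kernel per input channel
    pointwise = in_channels * out_channels      # out_channels kernels of shape 1 x 1 x in_channels
    return depthwise + pointwise

# The 5x5x3 example: 3 input channels, 4 output channels, 3x3 kernels.
print(standard_conv_params(3, 3, 4))    # 108
print(separable_conv_params(3, 3, 4))   # 39

# The 100-channel example with a hypothetical K = 64 output channels.
K = 64
print(standard_conv_params(3, 100, K))   # 57600
print(separable_conv_params(3, 100, K))  # 7300
```

In both cases the separable version needs far fewer weights, and the gap widens as the number of channels grows.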