Original article by Heart of the Machine

Author: Jiang Siyuan

Last week, Geoffrey Hinton et al. published the much-discussed NIPS paper "Dynamic Routing Between Capsules", which has since been read and implemented by many researchers and developers. In this article, Heart of the Machine explains the architecture and processes proposed in the paper in detail, walks through a TensorFlow implementation of CapsNet based on a widely discussed GitHub project, and provides commented code for the main architecture.
This is the third GitHub project from Heart of the Machine, and its goal is to explain CapsNet's network architecture and implementation. To explain CapsNet, we start from the convolutional layer and the convolution mechanism, describing the process and output of the convolution operation from an engineering-practice perspective, which is very helpful for understanding how the Capsule layer processes its input. Building on that understanding of the Capsule layer, we then explain the CapsNet architecture recently proposed by Geoffrey Hinton et al. Finally, we test and explain the model based on the implementation by naturomics.

Heart of the Machine GitHub project address: https://github.com/jiqizhixin/ML-Tutorial-Experiment



Convolutional layer and convolutional mechanism

This section is intended for readers who are not yet familiar with the convolution mechanism, because the first two layers of CapsNet are essentially traditional convolution operations. If you are already familiar with basic convolution operations, you can skip this section and go straight to the structure and processes of the Capsule layer. You can also refer to the following Heart of the Machine articles:

  • From Beginner to Master: A Beginner's Guide to Convolutional Neural Networks
  • An Introduction to Convolutional Neural Networks
  • An Overview of Convolution Structures in Deep Learning
  • A Machine's Perspective: A Long Read Demystifying Image Processing and Convolutional Neural Network Architectures
  • Understanding Convolution in Deep Learning
To explain convolutional neural networks, we should first understand why convolution performs better on images than fully connected networks. The general architectures of fully connected and convolutional networks are shown below:

Each neuron (or unit) in one layer of a fully connected network is connected to every neuron in the next layer, and the strength of each connection is controlled by a corresponding weight; all of these connection weights are what the fully connected network learns. As the figure above shows, a convolutional neural network is also organized layer by layer, but whereas neurons in two adjacent layers of a fully connected network are all connected to one another (so the neurons of each layer can conveniently be arranged in a single column to display the connection structure), only some of the neurons between two layers of a convolutional network are connected. To show the dimensions of the neurons in each layer, we generally organize the nodes of each convolutional layer into a three-dimensional tensor.

The biggest problem with applying fully connected networks to images is that there are too many parameters (weights) between layers, mainly because every neuron in one layer is connected to every neuron in the next. If a fully connected network with a single 500-unit hidden layer (784×500×10) is used to classify MNIST handwritten digits, the number of parameters is 28×28×500 + 500×10 + 500 + 10 = 397,510, which greatly limits how deep the network can be made.

In a convolutional network, each unit is connected to only some of the units in the next layer. Generally, the units of each convolutional layer can be organized into a three-dimensional tensor, i.e., a matrix extended along a third dimension (depth). For example, the input layer for the CIFAR-10 data set can be organized as a 32×32×3 tensor, where 32×32 is the size of the image in pixels and 3 is the number of RGB color channels.



Convolution layer

The convolutional layer tries to analyze each small patch of the input in greater depth in order to obtain features with a higher level of abstraction. Generally, the tensor of neuron nodes processed by a convolutional layer becomes deeper, that is, the number of units grows along the third dimension.

The figure below shows how a convolution kernel (or filter) transforms a sub-tensor of nodes in the current layer into a node tensor in the next layer with length and width 1 and arbitrary depth. In the figure, the input is a 32×32×3 tensor, and the small cuboid in the middle is the convolution kernel, which is typically 3×3 or 5×5 in its first two dimensions. To compute the products, the third dimension of the convolution kernel must equal the depth of the input it processes (here the third dimension of the input tensor, 3). The depth of the rightmost cuboid is 5, meaning five convolution kernels were used to perform the convolution. The five kernels have different weights, but within one feature map the same kernel, and hence the same weights, are applied everywhere. Therefore, each of the five output feature maps in the figure is produced by a single convolution kernel; that is, each feature map shares its weights.



The convolution operation

In case readers new to the topic are not familiar with the specifics, let us walk through the convolution operation. The figure below shows the computation in detail. First, our input is a 5×5×3 tensor, i.e., x[:, :, 0:3]. Second, we have two 3×3 convolution kernels, W0 and W1, whose third dimension must equal the third dimension of the input tensor, which is why a kernel is usually described by only two dimensions. Finally, the convolution outputs a 3×3×2 tensor, where o[:, :, 0] is the output of the first kernel W0 and o[:, :, 1] is the output of the second kernel W1. Because padding is applied to the input tensor (a border of zeros is added around each channel of the input image) and the kernel moves with a stride of 2, each kernel's output has spatial dimensions 3×3 (that is, (7−3)/2 + 1 = 3).

In the figure above, the kernel is multiplied element-wise with the corresponding patch of the input tensor, the products are summed, and a bias term is added to give the corresponding value in the output tensor. For example, kernel W0 convolves the input tensor (whose depth of 3 can be regarded as the image's RGB channels), and each of the kernel's three slices multiplies the corresponding slice of the input. The element-wise product of w0[:, :, 0] with the nine elements in the upper-left patch of x[:, :, 0] sums to 1; similarly, the product of w0[:, :, 1] with the upper-left patch of x[:, :, 1] sums to −1, and the product of w0[:, :, 2] with the upper-left patch of x[:, :, 2] sums to 0. The sum of these three values plus the bias term b0 equals the upper-left element of the rightmost output tensor o[:, :, 0], i.e. 1 − 1 + 0 + 1 = 1.

As the kernel moves by one stride, we can compute the next element of the output. Note that as the kernel slides over the input tensor, its weights stay the same; that is, the layer shares weights: o[:, :, 0] and o[:, :, 1] are each produced by one shared set of weights. We emphasize weight sharing here not only because it is a core property of the convolutional layer, but also because it will help us understand CapsNet's PrimaryCaps layer later.

Many other properties of convolutional networks have not been covered here. For example, max pooling takes the largest value within a pooling window to represent the features of that region and thereby reduces the size of the output tensor, while the Inception module processes the input tensor with several groups of convolution kernels in parallel and then concatenates the resulting tensors along the depth dimension to form a deeper output tensor. You can read Heart of the Machine's articles on convolution for more details. Finally, we provide a simple implementation to show the computation of the convolution operation:
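The original snippet is not reproduced here; the following is a minimal TensorFlow 2.x sketch in the same spirit (the tensor shapes, variable names, and random values are ours), convolving a 5×5×3 input with two 3×3 kernels at stride 2 with zero padding, then average-pooling the result:

```python
import tensorflow as tf

# A random 5x5 image with 3 channels, shaped [batch, height, width, channels] as conv2d expects.
x = tf.random.normal([1, 5, 5, 3])

# Two 3x3 kernels (W0 and W1), shaped [height, width, in_channels, out_channels].
w = tf.random.normal([3, 3, 3, 2])
b = tf.constant([1.0, 0.0])          # one bias per kernel

# Convolve with stride 2; 'SAME' zero-pads the input so the output is 3x3x2, as in the figure.
conv = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME') + b
print(conv.shape)                    # (1, 3, 3, 2)

# 2x2 average pooling with stride 2 further shrinks the spatial dimensions.
pool = tf.nn.avg_pool2d(conv, ksize=2, strides=2, padding='SAME')
print(pool.shape)                    # (1, 2, 2, 2)
```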

This code performs convolution and average pooling and prints the resulting tensors.



Capsule layer and dynamic routing

This part explains the general principles of the Capsule layer and the dynamic Routing mechanism. It is based on our understanding of Hinton's original paper and draws on views from Zhihu user SIY.Z and from Debarko De, among others. Further references are provided at the end of this article; readers may consult them to learn more.

Previously we saw that convolution reduces the number of parameters through weight sharing and local connectivity. In addition, sharing the kernel weights makes the response to content in the image independent of its position. For example, a CIFAR-10 image is 32×32×3; a convolutional layer composed of 16 convolution kernels of size 5×5 (that is, with output depth 16) has 5×5×3×16 + 16 = 1,216 parameters. But these convolutional units are too simple to represent complex concepts.

For example, when images are rotated, deformed, or oriented differently, a CNN by itself cannot handle them well. This can of course be addressed by adding differently transformed copies of the same image during training. In a CNN, each layer understands the image at a very fine granularity, because the receptive field of a convolution kernel is usually small (pixel-level, such as 3×3 or 5×5), so a convolutional layer always captures local features and information. When we combine low-level features into complex and abstract features, we may need pooling to reduce the size of the output tensor or feature map, and this actually loses some information, such as positional information.

Equivariant mapping, by contrast, would help a CNN understand transformations such as rotation or scaling and adjust itself accordingly, so that attribute information such as position in the image is not lost. The CapsNet proposed by Geoffrey Hinton et al. replaces scalar outputs with vectors, so more information can be preserved. In our view, Capsule's use of vectors as inputs and outputs is one of the highlights of the paper.



Capsule layer

In the paper, Geoffrey Hinton introduces Capsule as follows: "A Capsule is a group of neurons whose input and output vectors represent the instantiation parameters of a specific type of entity (that is, the probability that a certain object, concept, or entity with certain attributes is present). We use the length of the input and output vectors to represent the probability that the entity exists, and the direction of the vector to represent the instantiation parameters (i.e., certain graphical attributes of the entity). Capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level Capsules. When multiple predictions agree (the paper uses dynamic routing to make the predictions agree), a higher-level Capsule becomes active."

The activations of the neurons within a Capsule represent the various properties of a particular entity present in the image. These properties can include many different kinds of instantiation parameters, such as pose (position, size, orientation), deformation, velocity, reflectance, color, texture, and so on. The length of the input/output vector represents the probability that the entity is present, so its value must lie between 0 and 1.

To achieve this compression and serve as the Capsule-level activation function, Hinton et al. used a nonlinear function called "squashing". This nonlinear function ensures that short vectors are shrunk to almost zero length, while long vectors are compressed to a length close to but not exceeding 1. Here is the expression for the nonlinear function:
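$$\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\,\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$$

(Equation 1 in the original paper.)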



where v_j is the output vector of Capsule j and s_j is the weighted sum of the vectors that all Capsules in the layer below output to Capsule j in the current layer; in short, s_j is Capsule j's input vector. The nonlinear function can be split into two factors, $\frac{\|\mathbf{s}_j\|^2}{1+\|\mathbf{s}_j\|^2}$ and $\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$: the first scales the length of the input vector s_j, and the second is the unit vector of s_j. This nonlinearity preserves the direction of the input vector while compressing its length into the interval [0, 1): v_j approaches 0 as s_j goes to the zero vector, and approaches length 1 as the length of s_j goes to infinity. The nonlinear function can therefore be regarded as a compression and redistribution of vector lengths, and also as a way of "activating" the input vector to produce the output vector.

As mentioned above, the input vector to a Capsule is the analogue of the scalar input to a neuron in a classical neural network, and the computation of this vector corresponds to the propagation and connection pattern between two Capsule layers. The computation of the input vector has two stages: linear combination and Routing. This process can be expressed by the following formulas:
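$$\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}, \qquad \hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i$$

(Equation 2 in the original paper.)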

Here $\hat{\mathbf{u}}_{j|i}$ is a linear transformation of u_i. This is analogous to how, in an ordinary fully connected network, the outputs of the neurons in one layer are connected with different strengths to a neuron in the next layer. The difference is that in a Capsule network each node is a group of neurons and therefore produces a vector: $\hat{\mathbf{u}}_{j|i}$ is the prediction vector obtained by multiplying the output vector of the i-th Capsule in the layer below by the corresponding weight matrix W_ij (a matrix rather than a scalar element). $\hat{\mathbf{u}}_{j|i}$ can also be understood as the prediction that the i-th Capsule in the previous layer makes for the j-th Capsule in the next layer.

After the $\hat{\mathbf{u}}_{j|i}$ are determined, the second stage uses Routing to compute the output node's input s_j; this process involves dynamic Routing, which iteratively updates c_ij. Routing yields the input s_j of the next Capsule layer, and putting s_j through the "squashing" nonlinearity gives that layer's output. We will focus on the Routing algorithm shortly; with it, the entire Capsule layer and the propagation between layers is complete.

So the propagation and distribution between layers can be divided into two parts: the first is the linear combination of u_i into $\hat{\mathbf{u}}_{j|i}$, and the second is the Routing process from $\hat{\mathbf{u}}_{j|i}$ to s_j. If the propagation process is still unclear, readers can look at the following diagram of propagation between two Capsule layers, drawn according to our understanding of the process:

Capsule hierarchy diagram



As shown above, the diagram depicts the Capsule hierarchy and the dynamic Routing process. The lower layer has two Capsule units u_i, which propagate to a higher layer with four Capsule units v_j. u_1 and u_2 are vectors, each produced by a Capsule unit containing a group of neurons; each is multiplied by a different weight matrix W_ij to obtain the prediction vector $\hat{\mathbf{u}}_{j|i}$. For example, multiplying u_1 by W_12 gives the prediction vector $\hat{\mathbf{u}}_{2|1}$. Each prediction vector is then multiplied by the corresponding "coupling coefficient" c_ij and passed to a particular Capsule in the next layer. The input s_j of a next-layer Capsule unit is the weighted sum over all its possible inputs, that is, the sum of the products of prediction vectors and coupling coefficients. Putting the resulting s_j into the "squashing" nonlinearity gives the output vector v_j of that Capsule unit. Finally, the product of the output vector v_j and the corresponding prediction vector $\hat{\mathbf{u}}_{j|i}$ is used to update the coupling coefficient c_ij; this iterative update does not require backpropagation.



Dynamic Routing algorithm

According to Hinton's idea, finding the best processing path is equivalent to processing the image correctly, so adding the Routing mechanism to Capsules means finding a set of coefficients c_ij that make the prediction vectors $\hat{\mathbf{u}}_{j|i}$ most consistent with the output vector v_j, i.e., the output that best agrees with the inputs; in that case we have found the best path.

As stated in the original paper, c_ij are coupling coefficients that are iteratively updated and determined by the dynamic Routing process. The coupling coefficients between Capsule i and all Capsules in the next layer sum to 1, that is, c_11 + c_12 + c_13 + c_14 = 1 in the diagram above. In addition, the coupling coefficients are determined by a "routing softmax" whose logits b_ij are initialized to 0; the softmax that produces the coupling coefficient c_ij is:
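$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

(Equation 3 in the original paper.)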

b_ij depends on the position and type of the two Capsules but not on the current input image. We can iteratively refine the coupling coefficients by measuring the agreement between the current output v_j of each Capsule j in the layer above and the prediction vector $\hat{\mathbf{u}}_{j|i}$ of Capsule i below. The paper simply measures this agreement with the scalar (inner) product $a_{ij} = \mathbf{v}_j \cdot \hat{\mathbf{u}}_{j|i}$, which is then used in the Routing process to update the coupling coefficients.

The Routing process is the update procedure shown on the right-hand side of the diagram above: we compute the product of v_j and $\hat{\mathbf{u}}_{j|i}$, add it to the existing b_ij to update b_ij, then use softmax(b_ij) to update c_ij and thereby refine the next layer's Capsule input s_j. When a new v_j is produced, c_ij can be updated again; iterating in this way, we update these coefficients directly by computing the agreement between inputs and outputs, without backpropagation.

A more detailed view of the Routing update is given by the algorithm described below:

For all Capsules i in layer l and Capsules j in layer l+1, b_ij is initialized to zero. Then, for r iterations: first compute c_i from b_i with the softmax, then use c_ij and $\hat{\mathbf{u}}_{j|i}$ to compute s_j and v_j, and finally update b_ij with the newly computed v_j before entering the next iteration to update c_ij. The Routing algorithm converges very easily; three iterations are generally enough to achieve good results.
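Readers who prefer code to pseudocode may find the following minimal NumPy sketch of the routing loop helpful; the function names, the toy shapes, and the random prediction vectors are ours, not the paper's:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink short vectors toward length 0 and long vectors toward length 1."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors u_hat_{j|i}, shape [num_caps_in, num_caps_out, dim_out]."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                           # logits b_ij, initialized to 0
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # routing softmax over j
        s = (c[:, :, None] * u_hat).sum(axis=0)               # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)                                         # v_j = squash(s_j)
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)          # b_ij += agreement v_j . u_hat_{j|i}
    return v

# Toy usage: 1152 input capsules predicting 10 output capsules of dimension 16.
u_hat = 0.01 * np.random.randn(1152, 10, 16)
print(dynamic_routing(u_hat).shape)   # (10, 16)
```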



CapsNet architecture

Hinton et al. implemented a simple CapsNet architecture consisting of two convolutional layers and one fully connected layer. The first is an ordinary convolutional layer, while the second convolution essentially prepares the input for the Capsule layer; because its output is a set of vectors, its dimensionality is one higher than that of an ordinary convolutional layer. Finally, 10 vectors v_j are constructed from the vector inputs via Routing, and the length of each vector directly represents the probability of a particular class.

Here is the overall architecture of CapsNet:

The first convolutional layer uses 256 convolution kernels of size 9×9 with stride 1 and a ReLU activation. No padding is used, so the output tensor is 20×20×256. The 9×9 receptive field of CapsNet's kernels is larger than the usual 3×3 or 5×5, because a larger receptive field captures more information when the number of CNN layers is small. The number of weights in this layer is 9×9×256 + 256 = 20,992.

The second convolutional layer then builds the tensor structure that serves as the input to the Capsule layer. From the figure above we can see that the tensor produced by the second convolution has dimensions 6×6×8×32, so how should we understand it? Zhihu user Yunmeng Juke gave a vivid and intuitive explanation. As mentioned in the previous section, if we first consider 32 convolution kernels (32 channels) of size 9×9 convolving with stride 2, the result is the familiar 6×6×32 tensor, which is equivalent to 6×6×1×32.

Since an ordinary convolution outputs one scalar at a time, while PrimaryCaps must output vectors of length 8, the 6×6×1×32 tensor of ordinary convolution needs to become a four-dimensional 6×6×8×32 tensor. We can in fact regard the second convolutional layer as performing 8 Conv2D operations with different weights on the 20×20×256 input tensor, each Conv2D using 32 convolution kernels of size 9×9 with stride 2.

Since each such convolution produces a 6×6×1×32 tensor, 8 of them are produced in total, and these 8 tensors (i.e., the 8 components of the Capsule input vectors) are combined along the third dimension to form a 6×6×8×32 tensor. From this we can see that PrimaryCaps is like an ordinary convolutional layer of depth 32, except that each position in each feature map holds a vector of length 8 instead of a scalar, as in the sketch below.
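The following is a minimal TensorFlow 2.x sketch of this "8 parallel Conv2D" view (the function name `build_primary_caps` and the explicit 8-way loop are our own illustration; real implementations, including naturomics', typically fuse the 8 convolutions into a single 256-channel convolution):

```python
import tensorflow as tf

def squash(s, axis=-1, eps=1e-8):
    sq_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    return sq_norm / (1.0 + sq_norm) * s / tf.sqrt(sq_norm + eps)

def build_primary_caps(conv1_out):
    """conv1_out: [batch, 20, 20, 256], the output of the first convolutional layer.
    Returns [batch, 1152, 8]: 6*6*32 capsules, each a vector of length 8."""
    components = []
    for _ in range(8):
        # Each of the 8 vector components is an ordinary 9x9, stride-2 convolution with 32 channels.
        components.append(
            tf.keras.layers.Conv2D(32, kernel_size=9, strides=2, padding='valid')(conv1_out))
    caps = tf.stack(components, axis=3)              # [batch, 6, 6, 8, 32]
    caps = tf.transpose(caps, [0, 1, 2, 4, 3])       # [batch, 6, 6, 32, 8]
    caps = tf.reshape(caps, [-1, 6 * 6 * 32, 8])     # 1152 capsules of length 8
    return squash(caps)                              # squash along the length-8 capsule dimension

print(build_primary_caps(tf.random.normal([1, 20, 20, 256])).shape)   # (1, 1152, 8)
```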

Furthermore, combined with the definition of Capsule given by Hinton et al., a Capsule is equivalent to a group of ordinary neurons encapsulated together to form a new unit. In the CapsNet architecture discussed in this paper, eight convolutional units are encapsulated together to form a new Capsule unit. The convolutions in the PrimaryCaps layer do not use an activation function such as ReLU; instead, they prepare vector-valued inputs to be passed to the Capsule units of the next layer.

Within PrimaryCaps, the component maps of each vector share convolution weights; that is, every position of a given 6×6 map is obtained with the same 9×9 kernel. The number of parameters in this convolutional layer is therefore 9×9×256×8×32 + 8×32 = 5,308,672, where the second term, 8×32, is the number of bias parameters.

The third layer, DigitCaps, performs propagation and Routing updates on the vector outputs of the second layer. The second layer outputs a total of 6×6×32 = 1,152 vectors, each of dimension 8; that is, layer i has 1,152 Capsule units in total. The third layer j has 10 standard Capsule units, and the output vector of each has 16 elements. Because the previous layer has 1,152 Capsule units, there are 1,152×10 weight matrices W_ij, each of dimension 8×16. When each u_i is multiplied by the corresponding W_ij to obtain the prediction vectors, there are 1,152×10 coupling coefficients c_ij, and the corresponding weighted sums give 10 input vectors of size 16×1. Feeding each input vector into the squashing nonlinearity gives the final output vector v_j, whose length represents the probability of being identified as that class. The shape bookkeeping is sketched below.
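To make the shape bookkeeping concrete, here is a small NumPy sketch on random data (the variable names are ours; W is written as 16×8 acting on column vectors, i.e. the transpose of the 8×16 convention in the text):

```python
import numpy as np

num_in, num_out = 1152, 10      # PrimaryCaps units i and DigitCaps units j
dim_in, dim_out = 8, 16
# One transformation matrix per (i, j) pair.
W = 0.01 * np.random.randn(num_in, num_out, dim_out, dim_in)
u = np.random.randn(num_in, dim_in)                      # PrimaryCaps output vectors u_i
u_hat = np.einsum('ijab,ib->ija', W, u)                  # prediction vectors u_hat_{j|i}
print(u_hat.shape)   # (1152, 10, 16) -- exactly the input the routing sketch above expects
```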

The parameters between the PrimaryCaps layer and the DigitCaps layer fall into two groups, W_ij and c_ij. The number of W_ij parameters is 6×6×32×10×8×16 = 1,474,560, the number of c_ij parameters is 6×6×32×10×16 = 184,320, and in addition there should be 2×1,152×10 = 23,040 bias parameters, though the original paper does not spell these out. Summing these up, by our count the three-layer CapsNet has 20,992 + 5,308,672 + 1,474,560 + 184,320 + 23,040 = 7,011,584 parameters in total, not counting the fully connected reconstruction network that follows. (Don't blame me if I made a mistake.)
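As a quick sanity check on this arithmetic, the following lines simply sum the per-layer counts listed above (using this article's own conventions for counting c_ij and the biases):

```python
conv1        = 9 * 9 * 256 + 256               # first convolutional layer
primary_caps = 9 * 9 * 256 * 8 * 32 + 8 * 32   # PrimaryCaps weights + biases
digit_W      = 6 * 6 * 32 * 10 * 8 * 16        # W_ij
digit_c      = 6 * 6 * 32 * 10 * 16            # c_ij, as counted in the text above
digit_bias   = 2 * 1152 * 10                   # bias terms
print(conv1 + primary_caps + digit_W + digit_c + digit_bias)   # 7011584
```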



Loss function and optimization

Now that we know that the length of a DigitCaps output vector gives the probability of a class, how do we construct a loss function so that the whole network can be updated iteratively from it? The coupling coefficients c_ij are updated by agreement-based Routing and do not need to be updated from the loss function, but the other convolutional parameters of the network and the W_ij within the Capsules do need to be. In general, we can update these parameters with standard backpropagation on the loss. In the original paper, the authors adopt a margin loss of the kind commonly used with SVMs; its expression is as follows:
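$$L_c = T_c\,\max\!\big(0,\; m^+ - \|\mathbf{v}_c\|\big)^2 + \lambda\,(1 - T_c)\,\max\!\big(0,\; \|\mathbf{v}_c\| - m^-\big)^2$$

In the paper, m⁺ = 0.9, m⁻ = 0.1, and the down-weighting factor λ for absent classes is 0.5.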

Here c indexes the classes, T_c is the indicator for class c (1 if class c is present, 0 otherwise), m+ is the upper margin, and m− is the lower margin. The magnitude ‖v_c‖ is the L2 norm of the vector.

Because the length of the instantiation vector indicates whether the entity the Capsule represents is present, we want the top-level Capsule for class k (the output of the DigitCaps layer in this paper's CapsNet) to have a long output vector if and only if a handwritten digit of class k appears in the image. To allow for multiple digits in one image, a separate margin loss is given to each Capsule representing a digit class k.

Having built the loss function, we can happily use backpropagation.



Reconstruction and representation

Reconstruction means using the predicted class to rebuild the actual image that the class represents. For example, the model in the previous part predicts that an image belongs to a certain class, and the reconstruction network then rebuilds the predicted class information into an image.

Earlier we assumed that a Capsule's vector can represent an instance, so if we feed that vector into a subsequent reconstruction network, it should be able to rebuild a complete image. Hinton et al. therefore used an additional reconstruction loss to encourage the DigitCaps layer to encode the input digit image. The following diagram shows the architecture of the whole reconstruction network:

During training we mask all output vectors except the output vector of the correct Capsule, and then use that vector to reconstruct the handwritten digit image. The output vector of the DigitCaps layer is fed into a decoder consisting of three fully connected layers, constructed as shown in the figure above. The loss for this process is the Euclidean distance between the pixels output by the final FC sigmoid layer and the pixels of the original image. Hinton et al. also scale the reconstruction loss down by a factor of 0.0005 so that it does not dominate the margin loss during training.
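A minimal sketch of such a decoder, assuming TensorFlow 2.x (the function names and the way the masked vector is passed in are our simplification; the 512- and 1024-unit layer sizes follow the paper):

```python
import tensorflow as tf

def build_decoder(masked_digitcaps):
    """masked_digitcaps: [batch, 10 * 16] DigitCaps outputs with every capsule
    except the target one zeroed out (the masking described above)."""
    h = tf.keras.layers.Dense(512, activation='relu')(masked_digitcaps)
    h = tf.keras.layers.Dense(1024, activation='relu')(h)
    recon = tf.keras.layers.Dense(784, activation='sigmoid')(h)   # 28x28 pixel intensities
    return recon

def reconstruction_loss(images, recon):
    """Squared Euclidean distance between reconstructed and original pixels,
    scaled by 0.0005 so it does not dominate the margin loss."""
    flat = tf.reshape(images, [-1, 784])
    return 0.0005 * tf.reduce_sum(tf.square(recon - flat))
```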

Reconstructing from the Capsule output vectors not only improves the model's accuracy but also its interpretability, because we can modify one or several components of the vector to be reconstructed and observe how the reconstructed image changes, which helps us understand what the Capsule layer's output represents.

The above is the CapsNet architecture constructed in the paper. Hinton et al. also describe many experimental results and findings; interested readers can refer to the later sections of the paper.



CapsNet TensorFlow implementation

The code on GitHub first defines a class for building the two Capsule layers of CapsNet; the overall architecture then calls the objects and methods of this class to build the PrimaryCaps layer, the DigitCaps layer, and the inference process. We take images from MNIST and feed them in batches into the model: a batch of images first passes through the three-layer CapsNet to produce 10 class vectors of 16 elements each, where the length of each class vector is the probability that the input image belongs to that class. One vector is then fed into the reconstruction network to rebuild the image it represents.
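The annotated code itself is not reproduced in this article; as a rough, self-contained stand-in (assuming TensorFlow 2.x, with our own function names and a single-convolution PrimaryCaps shortcut rather than naturomics' exact code), the following sketch wires the pieces described above into one forward pass:

```python
import tensorflow as tf

def squash(s, axis=-1, eps=1e-8):
    """Squashing nonlinearity applied along the capsule-vector axis."""
    sq_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    return sq_norm / (1.0 + sq_norm) * s / tf.sqrt(sq_norm + eps)

def capsnet_forward(images, routing_iters=3):
    """images: [batch, 28, 28, 1] MNIST batch.
    Returns the [batch, 10, 16] DigitCaps vectors and the [batch, 10] class probabilities."""
    # Conv1: 256 kernels of 9x9, stride 1, ReLU -> [batch, 20, 20, 256]
    conv1 = tf.keras.layers.Conv2D(256, 9, strides=1, activation='relu')(images)
    # PrimaryCaps, built here as one 9x9 stride-2 convolution with 8*32 = 256 channels,
    # reshaped into 1152 capsules of length 8 (equivalent in spirit to the 8 parallel Conv2Ds above).
    primary = tf.keras.layers.Conv2D(8 * 32, 9, strides=2)(conv1)        # [batch, 6, 6, 256]
    primary = squash(tf.reshape(primary, [-1, 6 * 6 * 32, 8]))           # [batch, 1152, 8]
    # DigitCaps prediction vectors u_hat_{j|i} = W_ij u_i -> [batch, 1152, 10, 16]
    W = tf.Variable(tf.random.normal([1152, 10, 16, 8], stddev=0.05))
    u_hat = tf.einsum('ijab,nib->nija', W, primary)
    # Dynamic routing between the 1152 PrimaryCaps (i) and the 10 DigitCaps (j).
    b = tf.zeros_like(u_hat[..., 0])                                     # logits b_ij, [batch, 1152, 10]
    for _ in range(routing_iters):
        c = tf.nn.softmax(b, axis=2)                                     # coupling coefficients c_ij
        s = tf.einsum('nij,nijd->njd', c, u_hat)                         # s_j = sum_i c_ij * u_hat
        v = squash(s)                                                    # [batch, 10, 16]
        b = b + tf.einsum('nijd,njd->nij', u_hat, v)                     # agreement update
    probs = tf.norm(v, axis=-1)                                          # vector length = class probability
    return v, probs

# Toy run on random data.
v, probs = capsnet_forward(tf.random.normal([2, 28, 28, 1]))
print(v.shape, probs.shape)   # (2, 10, 16) (2, 10)
```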

This is only the main code of the network; for the full code please see the naturomics GitHub repository, or the Heart of the Machine GitHub repository, where we have uploaded the code with annotations that we hope will help beginners understand CapsNet's process and architecture. Below is the main computation graph of the CapsNet we defined above, i.e., the static computation graph in TensorFlow:

We also trained for about 30,000 iterations, but since we used a CPU, we reduced the batch size to 8 to lighten the computation of a single iteration. The following shows the losses during training, with the margin loss at the top and the reconstruction loss and total loss below:

Finally, here are two images reconstructed from the corresponding output vectors of the DigitCaps layer:

We have only made a preliminary exploration of CapsNet, and it still holds many possibilities. For example, it should be able to capture a great deal of image information in vector form; will this advantage translate into extraordinary representational power on other large data sets or on three-dimensional image data? The second layer, PrimaryCaps, has a very large number of parameters and resembles a set of convolution structures laid out in parallel to produce vectors (similar to the Inception module, but much wider); can we further reduce the number of parameters through some form of sharing? In addition, the benefit of the current Routing process is not yet convincing, at least on the MNIST data set, where it mainly demonstrates that the concept works; can we find a more efficient Routing algorithm? And can Capsules be extended to other neural network structures such as recurrent or gated units? These may be our questions for now; going forward, time will give us the answers.



You are welcome to leave comments; this article will be updated and revised on the Heart of the Machine website.



References

  • Original paper: Dynamic Routing Between Capsules (https://arxiv.org/abs/1710.09829)
  • Zhihu discussion: https://www.zhihu.com/question/67287444/answer/251241736
  • naturomics implementation (TensorFlow): https://github.com/naturomics/CapsNet-Tensorflow
  • XifengGuo implementation (Keras): https://github.com/XifengGuo/CapsNet-Keras
  • leftthomas implementation (PyTorch): https://github.com/leftthomas/CapsNet
This article is an original piece by Heart of the Machine; for reprinting, please contact this official account for authorization.