Recently, across several topics I follow (small datasets, unsupervised and semi-supervised learning, image segmentation, SOTA models), I keep running into one concept: the twin network. So today I read through the relevant classic papers and blog posts and then built a simple example to strengthen my understanding.

I want to split this introduction to twin networks into two parts. This first part covers the model theory, the basics, and the loss functions specific to twin networks; the next article will show how to implement a simple twin network in code.

1. The origin of the name

The twin network is also known as the Siamese network (Siamese Net). Siam is the old name of Thailand, so "Siamese" originally just meant "Thai". Why does Siamese now mean "twin" or "conjoined" in English? It comes from a story:

In the 19th century, a pair of conjoined twins was born in Thailand (then Siam). The medicine of the time could not separate them, so the two lived their whole lives joined together. In 1829 they were discovered by a British merchant and went on to perform with circuses in many parts of the world. In 1839 they came to North Carolina in the United States, and they eventually became American citizens. On April 13, 1843, the brothers, Chang and Eng, married a pair of sisters, and between them the two families had about twenty children. When the sisters quarreled, the brothers would take turns living with each wife for three days at a time. In 1874 one brother died of illness and the other followed soon after, both at the age of 62. Their conjoined liver is still preserved at the Mütter Museum in Philadelphia. "Siamese twins" has since become synonymous with conjoined twins, and the brothers helped draw attention to this condition around the world.

2. Model Structure

Here are a few points to understand:

  • Network1 and Network2 share weights, to use the technical term. Put plainly, the two networks are really one network, so in code you only need to build a single network;
  • For ordinary tasks, each sample is passed through the model to obtain a prediction pred, the loss is computed between pred and the ground truth, and the gradient follows from that. The twin network changes this structure. Suppose the task is image classification: I feed image A into the model and get an output pred1, then feed image B into the model and get another output pred2, and then compute the loss between pred1 and pred2. Normally one forward pass yields one loss, but in a Siamese net the model has to run twice to yield one loss.
  • My personal feeling is that an ordinary task measures an absolute distance, from the sample to the label, whereas a twin network measures a relative distance, from sample to sample.
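To make the weight sharing concrete, here is a minimal sketch in plain Python (no deep learning framework, so it runs anywhere). The weight matrix `W` and the helper names `embed` and `siamese_forward` are made up for illustration; a real backbone would be a CNN or similar. The point is only that one set of weights is applied to both inputs and the comparison happens between the two outputs.

```python
import math

# Hypothetical shared weights: a single 2x3 linear layer used by both branches.
W = [[0.5, -1.0, 0.25],
     [1.0, 0.0, -0.5]]

def embed(x):
    """Map an input vector to an embedding using the shared weights W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def siamese_forward(x1, x2):
    """Run the same network twice and return the distance between the outputs."""
    pred1 = embed(x1)  # first pass
    pred2 = embed(x2)  # second pass, same weights
    return math.dist(pred1, pred2)

d_same = siamese_forward([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
d_diff = siamese_forward([1.0, 2.0, 3.0], [3.0, 0.0, -1.0])
print(d_same, d_diff)
```

Identical inputs give a distance of zero, and different inputs give a positive distance; training would adjust `W` so that this distance reflects similarity.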

2.1 Uses of twin networks

Siamese Net measures the relationship between two inputs, that is, whether two samples are similar or not.

One such task: in 1993 a paper at NIPS titled "Signature Verification Using a 'Siamese' Time Delay Neural Network" was used to verify signatures on checks in the United States, that is, to verify that the signature on a check matches the signature the bank has on file. Even back then the paper was already using convolutional networks for verification… I wasn't even born yet.

Later, in 2010, Hinton published "Rectified Linear Units Improve Restricted Boltzmann Machines" at ICML, which applied the idea to face verification and achieved good results. The input is two faces, and the output is "same" or "different".

As you might expect, twin networks can also be used for classification tasks. In my opinion, the twin network is not a network architecture in the sense that ResNet is; it is a network framework, and I can plug ResNet in as the backbone of the twin network.

The backbone of a twin network can be a CNN, and it can equally be an LSTM, which makes semantic similarity analysis of text possible.

There was a question pair contest on Kaggle a while ago, which asked whether two questions were asking the same thing. The top-1 solution used a Siamese net, the twin network structure.

More recently there also seem to be visual tracking algorithms based on Siamese networks. I have not studied them yet and will read the paper when I get the chance: Fully-Convolutional Siamese Networks for Object Tracking. I'll leave that as a to-do for now.

2.2 Pseudo-twin network

Here is a question: the twin network looks like two networks, but because the weights are shared it is really one. Suppose we actually give it two separate networks; could one be an LSTM and the other a CNN, so that we can compare the similarity of inputs from different modalities?

Yes, and this is called a pseudo-siamese network. One input is text and the other is a picture, and the task is to judge whether the text describes the picture; or one input is a short headline and the other a long article, and the task is to judge whether the article matches the headline. (A savior for those of us who were perennially off-topic in high school Chinese composition. If I later tell my teacher that this algorithm says my essay was not off-topic and ask whether they would like to read it again, will the teacher beat me to death?)
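As a rough sketch of the pseudo-siamese idea, the two branches below have separate weights and even different input sizes, but both map into the same 2-dimensional embedding space so that their outputs can be compared. Everything here (`W_A`, `W_B`, `linear`, `pseudo_siamese_distance`) is a hypothetical stand-in; in practice the branches would be, say, an LSTM and a CNN.

```python
import math

# Hypothetical branch A (e.g. the "text" side): 3-d input -> 2-d embedding.
W_A = [[1.0, 0.0, -1.0],
       [0.5, 0.5, 0.0]]
# Hypothetical branch B (e.g. the "image" side): 4-d input -> 2-d embedding.
W_B = [[0.0, 1.0, 0.0, 1.0],
       [1.0, 0.0, -1.0, 0.0]]

def linear(weights, x):
    """Apply a weight matrix to an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def pseudo_siamese_distance(x_a, x_b):
    """Embed each input with its own branch, then compare in the shared space."""
    return math.dist(linear(W_A, x_a), linear(W_B, x_b))

d_match = pseudo_siamese_distance([2.0, 1.0, 0.0], [1.0, 1.0, -0.5, 1.0])
d_mismatch = pseudo_siamese_distance([2.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0])
print(d_match, d_mismatch)
```

The only structural difference from the twin network is that the two branches no longer share parameters; the loss is still computed on the distance between the two embeddings.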

However, the code in this article and the next is all based on the Siamese network, and the backbone is a CNN working on images.

2.3 Triplets

Now that there is a twin network, there is naturally also a triplet version, called the Triplet Network (Deep Metric Learning Using Triplet Network). It is said to work better than the Siamese network. I wonder whether there are quadruplets and quintuplets too.

3. Loss function

Softmax with cross-entropy is the usual choice for classification tasks, but it has been argued that models trained this way are not good at "inter-class" discrimination and break down immediately under adversarial example attacks. I'll leave adversarial attacks as a topic for a later post. Put simply, take face recognition: each person is a class. If you ask a model to handle thousands of classes while each class has very little data, you can imagine how hard that training is.

To solve this problem, twin networks have two loss functions that are relatively classical:

  • Contrastive Loss
  • Triplet Loss

3.1 Contrastive Loss

  • Dimensionality Reduction by Learning an Invariant Mapping

Now we know:

  • Image 1 is passed through the model to get pred1
  • Image 2 is passed through the model to get pred2
  • The loss is computed from pred1 and pred2

A calculation formula like this is given in the paper:
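For reference, the general form in Hadsell, Chopra and LeCun (2006) can be written as below. Here $G_W$ is the network with weights $W$, $D_W$ is the Euclidean distance between the two outputs, $Y$ is the pair label, the superscript $i$ marks the $i$-th training pair out of $P$ pairs, and $L_S$, $L_D$ are the partial losses for similar and dissimilar pairs.

```latex
D_W(\vec{X}_1, \vec{X}_2) = \bigl\lVert G_W(\vec{X}_1) - G_W(\vec{X}_2) \bigr\rVert_2

\mathcal{L}(W) = \sum_{i=1}^{P} L\bigl(W, (Y, \vec{X}_1, \vec{X}_2)^i\bigr)

L\bigl(W, (Y, \vec{X}_1, \vec{X}_2)^i\bigr)
  = (1 - Y)\, L_S\bigl(D_W^i\bigr) + Y\, L_D\bigl(D_W^i\bigr)
```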

First of all, the pred1 and pred2 produced by the model are vectors. The process amounts to extracting features from the images with a CNN to obtain a latent vector, much like an encoder.

Then we compute the Euclidean distance between the two vectors, which (if the model is trained correctly) reflects how related the two input images are. We feed in two images at a time, and we need to know in advance whether they belong to the same class or not. That plays the role of a label, which is the Y in the formula above. If the two images are of the same class, Y is 0; if not, Y is 1.

Similar to the binary cross entropy loss function, we need to pay attention to:

  • If Y=0, the loss is (1 − Y) L_S(D_W^i).
  • If Y=1, the loss is Y L_D(D_W^i).
  • L_S and L_D are the partial loss functions for similar and dissimilar pairs. In the paper, each of them carries a constant factor of 0.5.
  • The superscript i indexes the i-th sample pair; it is not a power. The squaring appears in the exact form of the loss: both the paper and the commonly used Contrastive Loss use the square of the Euclidean distance.
  • For label 1 (different), we naturally want the Euclidean distance between pred1 and pred2 to be as large as possible. But how large is large enough? Training drives the loss smaller, so the distance has to be turned into something to minimize. The answer is to add a margin as the largest distance that still matters: if the distance between pred1 and pred2 is greater than the margin, the two samples are considered far enough apart and their loss is 0. This is written as max(margin − distance, 0).
  • I read the W in the formula above as the weights of the neural network, and \vec{X}_1 as the original input image.

So the loss function looks like this:
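Written out, the exact loss from Hadsell, Chopra and LeCun (2006) is the following, where $m > 0$ is the margin and $D_W$ is the Euclidean distance between the two outputs:

```latex
L(W, Y, \vec{X}_1, \vec{X}_2)
  = (1 - Y)\,\tfrac{1}{2}\,(D_W)^2
  + Y\,\tfrac{1}{2}\,\bigl\{\max(0,\; m - D_W)\bigr\}^2
```

With $Y = 0$ only the first term is active and pulls similar pairs together; with $Y = 1$ only the second term is active and pushes dissimilar pairs apart until they are at least $m$ away.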

To sum up, note that for two different images a margin has to be set: if their distance is less than the margin, a loss is computed, and if their distance is greater than the margin, the loss is 0.
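A small worked example may help here. The sketch below evaluates the contrastive loss for a single pair in plain Python, using the form with the 0.5 factor from the paper; the function name and the margin value of 2.0 are choices made for this illustration.

```python
def contrastive_loss(distance, y, margin=2.0):
    """Contrastive loss for one pair.

    distance: Euclidean distance between the two embeddings
    y: 0 if the pair is of the same class, 1 if it is different
    """
    similar_term = 0.5 * distance ** 2
    dissimilar_term = 0.5 * max(margin - distance, 0.0) ** 2
    return (1 - y) * similar_term + y * dissimilar_term

same_close = contrastive_loss(0.5, y=0)   # same class: small distance, small loss
diff_close = contrastive_loss(0.5, y=1)   # different but inside the margin: penalized
diff_far = contrastive_loss(3.0, y=1)     # different and beyond the margin: no loss
print(same_close, diff_close, diff_far)
```

The third case shows the role of the margin: once a dissimilar pair is farther apart than the margin, it no longer contributes to the loss.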

3.2 Contrastive Loss in PyTorch

import torch
import torch.nn.functional as F


# Custom Contrastive Loss
class ContrastiveLoss(torch.nn.Module):
    """Contrastive loss function.

    Based on: http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
    """

    def __init__(self, margin=2.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, output1, output2, label):
        # Euclidean distance between each pair of embeddings in the batch
        euclidean_distance = F.pairwise_distance(output1, output2)
        # label == 0 (same class): penalize the squared distance
        # label == 1 (different): penalize only distances inside the margin;
        # torch.clamp(..., min=0.0) plays the role of max(margin - distance, 0)
        loss_contrastive = torch.mean(
            (1 - label) * torch.pow(euclidean_distance, 2)
            + label * torch.pow(torch.clamp(self.margin - euclidean_distance, min=0.0), 2)
        )
        return loss_contrastive

The only thing that needs explaining is probably torch.nn.functional.pairwise_distance, which computes the Euclidean distance between corresponding rows of two tensors, for example:

import torch
import torch.nn.functional as F
a = torch.Tensor([[1, 2], [3, 4]])
b = torch.Tensor([[10, 20], [30, 40]])
F.pairwise_distance(a, b)

The output is approximately tensor([20.1246, 45.0000]).

Then check by hand that these numbers are Euclidean distances: sqrt((1 − 10)^2 + (2 − 20)^2) = sqrt(405) ≈ 20.1246, and sqrt((3 − 30)^2 + (4 − 40)^2) = sqrt(2025) = 45.

No problem.

3.3 Triplet Loss

  • FaceNet: A Unified Embedding for Face Recognition and Clustering

This paper proposes FaceNet and uses Triplet Loss. As the name says, Triplet Loss is a loss over triplets, which we will now go through in detail.

  • Triplet Loss definition: minimize the distance between an anchor and a positive sample with the same identity, and maximize the distance between the anchor and a negative sample with a different identity. This is the natural loss function for the triplet network. Three samples are fed in at the same time: one image, a second image of the same class, and a third image of a different class.
  • Triplet Loss's goal: make features with the same label as close together as possible in the embedding space, and features with different labels as far apart as possible. At the same time, to keep the sample features from collapsing into a very small region, for two positive examples of the same class and one negative example, the negative must be farther from the anchor than the positive by at least the margin. As shown below:

So how do we construct the loss function? Given what we want:

  • The Euclidean distance between the Anchor and Positive embeddings should be as small as possible;
  • The Euclidean distance between the Anchor and Negative embeddings should be as large as possible.

So we expect the following inequality to hold:
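In the notation of the FaceNet paper, with $f(\cdot)$ the embedding, $x_i^a$, $x_i^p$, $x_i^n$ the anchor, positive and negative of the $i$-th triplet, $\alpha$ the margin and $\mathcal{T}$ the set of all triplets, the constraint and the resulting loss are:

```latex
\bigl\lVert f(x_i^a) - f(x_i^p) \bigr\rVert_2^2 + \alpha
  < \bigl\lVert f(x_i^a) - f(x_i^n) \bigr\rVert_2^2,
  \qquad \forall \bigl(f(x_i^a), f(x_i^p), f(x_i^n)\bigr) \in \mathcal{T}

L = \sum_{i}^{N} \Bigl[
      \bigl\lVert f(x_i^a) - f(x_i^p) \bigr\rVert_2^2
    - \bigl\lVert f(x_i^a) - f(x_i^n) \bigr\rVert_2^2
    + \alpha \Bigr]_+
```

The loss is simply the amount by which a triplet violates the inequality, clipped at zero by $[\cdot]_+ = \max(\cdot, 0)$.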

  • Put simply, the distance between anchor and positive should be smaller than the distance between anchor and negative, and the gap should be at least α. My reading is that T here is the set of triplets. A single dataset can usually yield a very large number of triplets, so my personal feeling is that this approach is typically used when there are many classes and little data per class; otherwise the number of triplets explodes.
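The following plain-Python sketch illustrates both points: the loss for a single triplet on squared Euclidean distances, and how quickly the number of possible triplets grows. The function names, the margin of 0.5 and the toy vectors are assumptions made for this example only.

```python
def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss for one (anchor, positive, negative) triple of embeddings."""
    pos_dist = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    neg_dist = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    # Zero once the negative is farther than the positive by at least the margin
    return max(pos_dist - neg_dist + margin, 0.0)

def count_triplets(num_classes, samples_per_class):
    """Number of ordered (anchor, positive, negative) triplets in a balanced dataset."""
    anchors = num_classes * samples_per_class
    positives = samples_per_class - 1                   # same class, not the anchor itself
    negatives = (num_classes - 1) * samples_per_class   # any sample of another class
    return anchors * positives * negatives

anchor, positive = [0.0, 0.0], [1.0, 0.0]
easy_loss = triplet_loss(anchor, positive, [3.0, 0.0])  # negative is far away
hard_loss = triplet_loss(anchor, positive, [1.0, 0.5])  # negative is almost as close as the positive
n_triplets = count_triplets(10, 10)
print(easy_loss, hard_loss, n_triplets)
```

Even this tiny dataset of 10 classes with 10 samples each already yields tens of thousands of triplets, which is why triplets are usually sampled rather than enumerated.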

3.4 Triplet Loss in Keras

Here is the code for Triplet Loss in Keras:

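from keras import backend as K  # assumes the Keras 2.x backend API


def triplet_loss(y_true, y_pred):
    """Loss function of Triplet Loss."""
    # y_pred holds the anchor, positive and negative embeddings (128-d each)
    # concatenated along the last axis; y_true is not used.
    anc, pos, neg = y_pred[:, 0:128], y_pred[:, 128:256], y_pred[:, 256:]

    # Squared Euclidean distance
    pos_dist = K.sum(K.square(anc - pos), axis=-1, keepdims=True)
    neg_dist = K.sum(K.square(anc - neg), axis=-1, keepdims=True)
    # TripletModel.MARGIN is the margin constant on the model class (defined elsewhere)
    basic_loss = pos_dist - neg_dist + TripletModel.MARGIN

    loss = K.maximum(basic_loss, 0.0)

    print("[INFO] model - triplet_loss shape: %s" % str(loss.shape))
    return loss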

References:

[1] Momentum Contrast for Unsupervised Visual Representation Learning, 2019, Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick

[2] Dimensionality Reduction by Learning an Invariant Mapping, 2006, Raia Hadsell, Sumit Chopra, Yann LeCun