Learning Similarity Conditions Without Explicit Supervision

Address: arxiv.org/pdf/1908.08…

Code address: github.com/rxtan2/Lear…

This article was first published at mp.weixin.qq.com/s/WHnYNvWWk…

Contact information:

GitHub: github.com/ccc013

Zhihu column: Machine learning and Computer Vision, AI paper notes

Paper notes

1. Introduction

Outfit compatibility depends on a variety of similarity conditions, such as color, type, and shape. By learning condition-specific embeddings, a model can capture these different notions of similarity, but this practice is limited by its reliance on explicit labels: such models cannot generalize to categories they have never seen.

This paper therefore proposes to treat the different similarity conditions and attributes as latent variables and, under weak supervision only, learn the corresponding feature subspaces, as shown in the figure below, which contrasts this method with some previous work.

Previous work required user-defined labels to learn the feature subspaces for different similarities, such as a tops-and-pants subspace or a pants-and-shoes subspace. The method in this paper needs no such explicit labels to learn these subspaces.

This paper proposes a Similarity Condition Embedding Network (SCE-NET) that jointly learns different similarity conditions in a unified vector space. The overall architecture is shown below:

  • Each image is passed through a CNN and mapped into a unified vector space.
  • The core of the network is a set of parallel similarity-condition masks $C_1, C_2, \cdots, C_M$, which are learned together with the conditional weight branch shown in the diagram.
  • The conditional weight branch can be seen as an attention mechanism that dynamically weights each condition mask according to the pair of objects being compared.

The contributions of this paper are summarized as follows:

  • An SCE-NET model is proposed that learns rich features for different similarity conditions from images, without explicit category or attribute supervision.
  • The proposed SCE-NET model also transfers to new categories and attributes in zero-shot tasks.
  • More importantly, we show that a dynamic weighting mechanism is indispensable in helping a weakly supervised model learn representations of different similar concepts.

2. Method

This section introduces the proposed SCE-NET model, which treats different similarity conditions and attributes as latent variables under weak supervision in order to learn the corresponding feature subspaces.

The input image is first passed through a CNN to extract features, denoted $g(x; \theta)$, where $x$ is the input image and $\theta$ the model parameters. The network consists of two main parts:

  • A set of masks for parallel similarity conditions;
  • A conditional weight branch

These are described in the next two sections; the third section then covers variants of the conditional weight branch for different input forms.

2.1 Learning similarity conditions

A key component of the model is a set of M parallel similarity-condition masks, denoted $C_1, C_2, \cdots, C_M$, each of dimension D, where M is chosen empirically on held-out data.

These masks map the image features into secondary semantic subspaces that encode different similarity substructures.

Let $C_j$ denote the mask for each similarity condition and $V_i$ the feature of image $x_i$; the masking operation is then the element-wise product $E_{ij} = V_i \odot C_j$.

Stacking these gives an output of dimension $M \times D$; letting $O = [E_{i1}, \cdots, E_{iM}]$ denote the output of the mask operation, the final embedding is the weighted combination $E_i = O^{\top} w$.

Here w is a weight vector with dimension M, which is computed by the conditional weight branch.
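The masking and weighting described above can be sketched in a few lines of NumPy (a minimal sketch, assuming element-wise masks as in related conditional-similarity work; the toy sizes D=4, M=3 are illustrative only):

```python
import numpy as np

def apply_condition_masks(v_i, masks, w):
    """Apply M parallel similarity-condition masks to one image feature.

    v_i   : (D,)   image feature from the CNN
    masks : (M, D) learnable condition masks C_1..C_M
    w     : (M,)   condition weights from the weight branch (sums to 1)
    """
    # O[j] = E_ij = V_i ⊙ C_j  -> shape (M, D)
    O = masks * v_i[None, :]
    # Final embedding is the weighted combination E_i = O^T w -> shape (D,)
    return O.T @ w

# Toy example with D=4, M=3
rng = np.random.default_rng(0)
v = rng.normal(size=4)
C = rng.normal(size=(3, 4))
w = np.array([0.5, 0.3, 0.2])
E = apply_condition_masks(v, C, w)
print(E.shape)  # (4,)
```

Because the masks act element-wise, each condition can suppress or emphasize different dimensions of the shared embedding, carving out its own subspace.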

2.2 Conditional weight branch

Instead of pre-defining a set of similar conditions, this article chooses to use a conditional weight branch to let the model automatically determine the conditions to learn.

The conditional weight branch determines the relevance of each condition mask for a given pair of objects being compared. For an image pair $(x_i, x_j)$, their CNN features are combined as follows:

Here Concat denotes the concatenation operation shown in the earlier architecture diagram. After CNN feature extraction and concatenation, the result enters the conditional weight branch, which consists of several fully connected layers with ReLU activations, followed by a Softmax layer that produces the M-dimensional weight vector $w$.
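The Concat → FC → ReLU → FC → Softmax pipeline just described can be sketched as follows (a toy NumPy version; the layer count and sizes D=4, hidden=8, M=3 are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def condition_weight_branch(v_i, v_j, W1, b1, W2, b2):
    """Toy conditional weight branch: Concat -> FC -> ReLU -> FC -> Softmax.

    v_i, v_j : (D,) CNN features of the two images being compared
    Returns an M-dimensional weight vector (non-negative, sums to 1).
    """
    h = np.concatenate([v_i, v_j])        # Concat, shape (2D,)
    h = np.maximum(0.0, W1 @ h + b1)      # fully connected + ReLU
    return softmax(W2 @ h + b2)           # M-dimensional softmax output

# Toy sizes D=4, H=8, M=3; the weights here are random stand-ins
rng = np.random.default_rng(1)
D, H, M = 4, 8, 3
w = condition_weight_branch(
    rng.normal(size=D), rng.normal(size=D),
    rng.normal(size=(H, 2 * D)), np.zeros(H),
    rng.normal(size=(M, H)), np.zeros(M),
)
print(w.shape)  # (3,)
```

The Softmax makes the output a proper distribution over the M condition masks, which is what lets the branch act as an attention mechanism over conditions.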

Triplet loss is a common way to learn complex similarity relationships. We define a triplet $\{x_i, x_j, x_k\}$, where $x_i$ is the anchor and $x_j, x_k$ are the positive and negative samples, i.e., under some latent condition c, samples that are semantically similar and dissimilar to $x_i$, respectively. The triplet loss is computed as follows:

where $d(E_i, E_j)$ is the Euclidean distance and the margin $\mu$ is a hyperparameter.
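Concretely, the standard margin-based triplet loss on the embeddings looks like this (the margin value 0.3 is just an example, not the paper's setting):

```python
import numpy as np

def triplet_loss(E_i, E_j, E_k, mu=0.3):
    """Margin-based triplet loss with Euclidean distance.

    E_i: anchor embedding, E_j: positive, E_k: negative, mu: margin.
    Zero when the negative is already mu farther away than the positive.
    """
    d_pos = np.linalg.norm(E_i - E_j)
    d_neg = np.linalg.norm(E_i - E_k)
    return max(0.0, d_pos - d_neg + mu)

anchor = np.array([0.0, 0.0])
pos = np.array([0.1, 0.0])   # close to the anchor
neg = np.array([1.0, 1.0])   # far from the anchor
print(triplet_loss(anchor, pos, neg))  # 0.0 — margin already satisfied
```

Minimizing this pushes positives inside, and negatives outside, a margin-sized band around the anchor in the conditioned embedding space.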

In addition, an L1 loss on the similarity-condition masks encourages sparsity and disentanglement, and an L2 loss constrains the learned image features. The final objective of the whole model is as follows:
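Based on the description above, the overall objective combines the triplet loss with the two regularizers; a plausible reconstruction (the trade-off weights $\lambda_1, \lambda_2$ are assumed names here, not taken from the paper) is:

```latex
L_{\text{total}} = L_{\text{triplet}}
  + \lambda_1 \sum_{m=1}^{M} \lVert C_m \rVert_1
  + \lambda_2 \lVert g(x;\theta) \rVert_2
```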

2.3 Variants of SCE-NET

In addition to image-only input, this paper also experiments with other input forms, including:

  1. Text features: the model can also take a pair of texts, representing category labels or textual descriptions of the images. Each sentence is encoded with pre-trained word vectors, and the resulting features are fed to the conditional weight branch as follows:

  2. Visual-text features: for a given pair of image features $(V_i, V_j)$ and their text features $(T_i, T_j)$, the input to the conditional weight branch is as follows:

There are other ways to combine text and image features, such as concatenating them and mapping them into the same vector space, but the direct product above worked best in these experiments.
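One plausible reading of the "direct product" fusion described above is an element-wise (Hadamard) product of each image feature with its own text feature, followed by concatenation of the two fused vectors (this pairing is an assumption for illustration, not a detail confirmed by the text):

```python
import numpy as np

def weight_branch_input(v_i, v_j, t_i, t_j):
    """Fuse visual and text features for the conditional weight branch.

    Each image feature is combined with its own text feature by an
    element-wise product, then the two fused vectors are concatenated.
    """
    return np.concatenate([v_i * t_i, v_j * t_j])

v1, v2 = np.ones(3), np.full(3, 2.0)
t1, t2 = np.array([1.0, 0.0, 2.0]), np.array([0.5, 1.0, 0.0])
print(weight_branch_input(v1, v2, t1, t2))  # [1. 0. 2. 1. 2. 0.]
```

Unlike concatenation followed by a learned projection, the element-wise product adds no parameters, which may explain why it is the most efficient variant.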

3. Experiments

Three datasets are used in the experiments: Maryland-Polyvore, Polyvore-Outfits, and UT-Zappos50K. The first two provide benchmarks for two tasks, outfit compatibility prediction and fill-in-the-blank, while the third evaluates the model's ability to recognize attributes of different strengths.

3.1 Datasets

  • Maryland-Polyvore: this dataset contains 21,799 outfits from the social networking site Polyvore. We use the split provided by the authors: 17,316 outfits for training, 1,407 for validation, and 3,076 for testing.

  • Polyvore-Outfits: a larger dataset than the previous one, with 53,306 outfits for training, 10,000 for testing, and 5,000 for validation, also collected from Polyvore. Unlike Maryland-Polyvore, this dataset additionally provides clothing-category labels and associated text descriptions.

  • UT-Zappos50K: a dataset of 50,000 shoe images with annotations. We adopt the triplets sampled in Conditional Similarity Networks according to four conditions: shoe type, shoe gender, heel height, and shoe closure mechanism. Each condition yields 200,000 training, 20,000 validation, and 40,000 test triplets. When training our model, however, the triplets from all conditions are pooled into a single training set.

3.2 Experimental Details

For the two Polyvore datasets, a ResNet-18 is used to extract image features, with embedding size 64. Text descriptions are represented with HGLMM Fisher vectors of word2vec features, reduced to 6,000 dimensions with PCA. In addition, two losses, VSE and Sim, are added:

  • VSE: a visual-semantic loss that pulls the image features of an object in the triplet closer to its corresponding text features;
  • Sim: a loss that pulls similar image features, or similar text features, closer to each other.

Therefore, the final loss is as follows:

For the third dataset, since triplets are used as input, the input to the conditional weight branch is as follows:

corresponding to the input images $\{x_i, x_j, x_k\}$, respectively.
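For the triplet case, a natural extension of the pairwise Concat input is to concatenate all three CNN features (a sketch under that assumption; the actual fusion in the paper may differ):

```python
import numpy as np

def triplet_weight_input(v_i, v_j, v_k):
    """Input to the conditional weight branch for a triplet.

    Assumes the three CNN features (anchor, positive, negative) are
    concatenated, mirroring the image-pair variant's Concat operation.
    """
    return np.concatenate([v_i, v_j, v_k])

# Toy features with D=2 give a 6-dimensional branch input
x = triplet_weight_input(np.ones(2), np.zeros(2), np.full(2, 3.0))
print(x)  # [1. 1. 0. 0. 3. 3.]
```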

3.3 Experimental Results

The results on the two Polyvore datasets are shown below; the baselines are a Siamese network and the type-aware embedding network model:

Experiments varying the number of conditions M, shown below, indicate that on the Polyvore datasets the model performs best with only five conditions.