Core issues:

Cluster-based unsupervised ReID training depends on the pseudo-labels produced by clustering, and those pseudo-labels in turn affect the next round of clustering, iterating back and forth. This strategy has two problems: 1. it is difficult to determine the number of cluster centers; 2. incorrect pseudo-labels assigned early in training are reinforced by later iterations.

As shown in Figure (a), A and B are the same person, but A and C are close to each other in the feature space. Because the pseudo-labels are initialized by clustering, A is grouped with C, and the distance between A and B further increases rather than decreases in the trained feature space.

Solution:

Abandon the clustering strategy: instead of assigning each unlabeled image an absolute category, mine the relationships between unlabeled images as soft constraints so that similar images obtain similar feature representations. This paper adopts softened (relaxed) labels. Unlike traditional one-hot labels, which force an image into a single category, labels are treated as a distribution, and each image is encouraged to be associated with multiple related categories.
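As a toy illustration (the numbers below are made up, not from the paper): a one-hot label puts all probability mass on one class, while a softened label spreads mass over related classes.

```python
import torch

num_classes = 5

# Traditional one-hot label: the image belongs to class 2 and nothing else.
one_hot = torch.zeros(num_classes)
one_hot[2] = 1.0  # tensor([0., 0., 1., 0., 0.])

# Softened label: most mass on the image's own class, the rest spread over
# classes believed to contain the same person (here classes 0 and 4).
soft = torch.zeros(num_classes)
soft[2] = 0.6          # lambda-like weight on the ground-truth class
soft[[0, 4]] = 0.2     # (1 - lambda) / k shared by k = 2 related classes

print(one_hot.sum().item(), soft.sum().item())  # both 1.0: valid distributions
```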

The proposed method:

This paper first introduces the Baseline model, and then adjusts it to obtain the final model.

Baseline: Initialization with Hard Labels

Given a training set $X = \{x_1, x_2, \dots, x_N\}$, where $x_i$ is an unlabeled person image, the index $i$ of $x_i$ is taken as its initial label $y_i$. Each image in the training set therefore forms a separate class containing only itself.

A classification model is used to extract the feature of each image, and the features are stored in a dictionary structure.
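A minimal sketch of such a dictionary as a feature memory bank. The momentum-style update is an assumption (the text above only specifies that normalized per-image features are stored):

```python
import torch
import torch.nn.functional as F

class FeatureMemory:
    """Dictionary of one feature vector per training image (class = index)."""

    def __init__(self, num_images: int, feat_dim: int, momentum: float = 0.5):
        # One slot per image; row i caches the feature of image i.
        self.bank = torch.zeros(num_images, feat_dim)
        self.momentum = momentum  # update rate; an assumed design choice

    @torch.no_grad()
    def update(self, indices: torch.Tensor, feats: torch.Tensor):
        feats = F.normalize(feats, dim=1)                 # L2-normalize new features
        mixed = self.momentum * self.bank[indices] + (1 - self.momentum) * feats
        self.bank[indices] = F.normalize(mixed, dim=1)    # keep rows at unit norm
```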

After that, the feature of each image is normalized to obtain $v = \frac{\phi(\theta; x)}{\|\phi(\theta; x)\|}$, and the probability that each image belongs to class $i$ is computed as


$$p(y_i \mid x, V) = \frac{\exp(V_i^T v / \tau)}{\sum_{j=1}^{N} \exp(V_j^T v / \tau)}$$

where $V$ denotes the cache dictionary, $V_i$ denotes the stored feature of class $i$, and $\tau$ is a temperature parameter.
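A minimal sketch of this probability computation (the temperature value 0.1 is a placeholder, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def class_probabilities(v: torch.Tensor, V: torch.Tensor, tau: float = 0.1):
    """p(y_i | x, V): softmax over similarities to every cached class feature.

    v:   (feat_dim,) L2-normalized feature of the query image.
    V:   (N, feat_dim) cache dictionary, one row per class/image.
    tau: temperature; smaller values sharpen the distribution.
    """
    logits = V @ v / tau          # V_i^T v / tau for every class i
    return F.softmax(logits, dim=0)

# Toy usage with random unit-norm features.
N, d = 8, 4
V = F.normalize(torch.randn(N, d), dim=1)
v = F.normalize(torch.randn(d), dim=0)
p = class_probabilities(v, V)
print(p.sum())  # ~1.0
```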

The loss function is defined as follows:


$$L = -\sum_{j=1}^{N} \log\big(p(y_j \mid x_i, V)\big)\, t(y_j)$$

where $t(y_j)$ is the empirical distribution over class labels: its probability is set to 1 for the ground-truth (GT) class and 0 otherwise.
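Since $t(y_j)$ is one-hot here, the sum collapses to the negative log-probability of the image's own class. A minimal sketch:

```python
import torch

def baseline_loss(p: torch.Tensor, gt_index: int) -> torch.Tensor:
    """Cross-entropy with a one-hot target: only the GT term survives."""
    return -torch.log(p[gt_index])

p = torch.tensor([0.1, 0.7, 0.2])        # toy class probabilities
print(baseline_loss(p, gt_index=1))      # -log(0.7) ≈ 0.357
```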

Model Learning with Softened Similarity

To find images of the same person, for each sample this paper selects the images with the smallest dissimilarity, where dissimilarity is defined as the Euclidean distance between two image features.

Then, for each image $x_i$ in the training set, the $k$ images closest to it are taken as its reliable image set $X^{reliable} = \{x_i^1, x_i^2, \dots, x_i^k\}$.
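A minimal sketch of selecting the reliable set (assuming all features are precomputed; excluding the image itself is an implementation detail assumed here):

```python
import torch

def reliable_set(features: torch.Tensor, i: int, k: int) -> torch.Tensor:
    """Indices of the k images closest to image i by Euclidean distance.

    features: (N, feat_dim) matrix of all training-image features.
    """
    dists = torch.cdist(features[i : i + 1], features).squeeze(0)  # (N,)
    dists[i] = float("inf")                 # exclude the image itself
    return dists.topk(k, largest=False).indices
```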

Rather than treating the images in the reliable set as belonging to exactly the same class, the classification network described above is used to learn similarity in a softer way during training. The network is expected not only to predict the GT category of each image, but is also allowed to predict the image as one of its reliable categories. Therefore, the authors set the reliable labels in the target label distribution to non-zero values. The target label distribution for $x_i$ should be

$$t(y_j) = \begin{cases} \lambda, & j = i \\ \frac{1-\lambda}{k}, & x_j \in X^{reliable} \\ 0, & \text{otherwise} \end{cases}$$

where $\lambda$ is a hyperparameter that balances the GT label against the reliable labels. When $\lambda = 1$, the model reduces to the baseline model; when $\lambda$ is too small, the model may fail to predict the GT class.

The loss function then becomes:


$$L = -\lambda \log\big(p(y_i \mid x_i, V)\big) - \frac{1-\lambda}{k}\sum_{j=1}^{k}\log\big(p(y_i^j \mid x_i, V)\big)$$
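A minimal sketch of this loss for a single image (the value lam=0.6 is a placeholder, not from the paper):

```python
import torch

def softened_loss(p: torch.Tensor, gt: int, reliable: torch.Tensor,
                  lam: float = 0.6) -> torch.Tensor:
    """-lambda * log p(gt) - (1-lambda)/k * sum_j log p(reliable_j).

    p:        (N,) class probabilities for one image.
    gt:       the image's own index (its GT class).
    reliable: indices of its k reliable neighbours.
    lam:      lambda, balancing the GT term against the reliable terms.
    """
    k = reliable.numel()
    gt_term = -lam * torch.log(p[gt])
    soft_term = -(1 - lam) / k * torch.log(p[reliable]).sum()
    return gt_term + soft_term
```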

Similarity Estimation with Auxiliary Information

Part similarity exploration

A common strategy is to split the feature map into $p$ horizontal stripes and obtain part features by average pooling over each stripe.
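A minimal sketch of this stripe pooling, assuming a (C, H, W) feature map; $p = 6$ stripes is a common part-based ReID choice assumed here, not quoted from the paper:

```python
import torch

def part_features(feat_map: torch.Tensor, p: int = 6) -> torch.Tensor:
    """Split a CNN feature map into p horizontal stripes, average-pool each.

    feat_map: (C, H, W) backbone output for one image.
    Returns a (p, C) tensor: one pooled feature per stripe.
    """
    stripes = feat_map.chunk(p, dim=1)                  # p chunks along height
    return torch.stack([s.mean(dim=(1, 2)) for s in stripes])
```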

Cross-camera encouragement

This paper also incorporates whether two images come from the same camera into their similarity. Generally, images of the same person taken by different cameras are more dissimilar, i.e. farther apart, than images taken by the same camera. Therefore, when two images come from the same camera, a camera distance penalty is added, which gives images of the same person from other cameras a better chance of entering the reliable set.

Overall dissimilarity

The final distance between two images is defined by combining the part dissimilarities with the cross-camera term.
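As a hedged sketch of such a combined distance, assuming the per-part Euclidean distances are summed and a constant penalty `mu` is added for same-camera pairs (both choices are assumptions, not the paper's exact formula):

```python
import torch

def overall_dissimilarity(parts_a: torch.Tensor, parts_b: torch.Tensor,
                          cam_a: int, cam_b: int, mu: float = 0.1):
    """Combine part distances with a same-camera penalty.

    parts_a, parts_b: (p, C) part features, e.g. from part_features() above.
    cam_a, cam_b:     camera IDs of the two images.
    mu:               penalty weight; its value and placement are assumptions.
    """
    part_dist = (parts_a - parts_b).norm(dim=1).sum()   # sum of per-part L2
    camera_penalty = mu if cam_a == cam_b else 0.0      # push same-camera pairs apart
    return part_dist + camera_penalty
```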

Implementation details

The authors use ResNet-50 as the CNN backbone and initialize the model with weights pre-trained on ImageNet.
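A minimal sketch of that backbone setup with torchvision (the identity classification head and the 256×128 input size are common ReID choices assumed here, not details from the paper):

```python
import torch
import torchvision

# ResNet-50 pre-trained on ImageNet as the feature extractor; the classifier
# head is replaced with identity so the model outputs pooled 2048-d features.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()

with torch.no_grad():
    feats = backbone(torch.randn(2, 3, 256, 128))  # typical ReID input size
print(feats.shape)  # torch.Size([2, 2048])
```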

The experimental results

The experimental results will be reviewed after covering recent unsupervised ReID papers, together with a comparison of the respective methods and their performance.

Reference

  • [1] Unsupervised Person Re-identification via Softened Similarity Learning