CVPR2018: Unsupervised Cross-dataset Person Re-identification by Transfer Learning of Spatio-temporal Patterns

This paper can be downloaded on arXiv. It is the first CCF Class A paper of our lab. This method is called TFusion.

Code: github.com/ahangchen/T…


  • The solution targets Person Reid across datasets
  • It belongs to unsupervised learning
  • The method is multi-modal data fusion + transfer learning
  • Experimentally, it outperforms all unsupervised Person Reid methods, approaches supervised methods, and even beats supervised methods on some datasets

This article explains CVPR2018 TFusion

Please credit the author (The Dream of Tea) when reprinting.

Task

Person re-identification is an image retrieval problem: given a set of probe images, for each probe image we want to find the image in the candidate gallery that most likely belongs to the same pedestrian.


Person re-identification datasets are captured by a network of surveillance cameras, with a detection algorithm used to crop out pedestrians for matching. In these datasets faces are too blurry to serve as matching features. Moreover, because the cameras shoot from different angles, the same person may be captured from the front, side, or back with very different visual appearances, making this a hard image matching problem. There are many commonly used datasets, which can be found on this website.

Related Work

There are several common solutions to the pedestrian re-identification problem:

Visual pedestrian re-identification

Such methods usually extract features from pedestrian images and measure the distance between the features to decide whether two images show the same person.

Supervised learning


Such methods require pedestrian images with pedestrian ID labels (person1, person2, etc.). A model is trained to extract image features, and the similarity between each probe image and each gallery image is computed from the distance between their features (cosine distance, Euclidean distance, etc.). The gallery images are then sorted by similarity: the higher an image ranks, the more likely it is to show the same person.
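As a toy sketch of this retrieval step (the feature extraction itself is abstracted away; the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def rank_gallery(probe_feat, gallery_feats):
    """Rank gallery images by cosine similarity to a probe feature vector."""
    probe = probe_feat / np.linalg.norm(probe_feat)
    gallery = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = gallery @ probe                      # cosine similarity to each gallery image
    order = np.argsort(-sims)                   # most similar first
    return order, sims[order]
```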

Work in this area is represented by TOMM2017: A Discriminatively Learned CNN Embedding for Person Re-identification. The basic image classifier we use is based on this paper and implemented in Keras, which will be discussed in detail later.

Unsupervised learning

Prior to CVPR2018, the only officially published unsupervised cross-dataset work in the Person Reid domain was UMDL from CVPR2016: Unsupervised Cross-dataset Transfer Learning for Person Re-identification. Based on dictionary learning, it learns a cross-dataset invariant dictionary on multiple source datasets and transfers it to the target dataset. Its accuracy, however, remains low.

Pedestrian re-identification with camera topology

Cameras are separated by certain distances and pedestrians move at bounded speeds, so the travel time of pedestrians between cameras follows certain patterns. For example, if cameras A and B are 10 meters apart and people walk at about 2 m/s, then two pictures captured by A and B within 1 s of each other cannot show the same person. We can therefore use camera topology constraints to improve re-identification accuracy.

However, such methods often have the following drawbacks:

  • Some methods need to know the camera topology in advance (distance between AB cameras)
  • Some methods can infer the camera topology from the image data taken, but the image needs to be labeled (whether it is the same person).
  • Even if the camera topology is deduced, the fusion result with the image is still poor

Transfer learning

Transfer learning is now a common practice in deep learning: pretrain on the source dataset and fine-tune on the target dataset so that the model fits the target scenario. Representative papers include the UMDL mentioned above and Deep Transfer Learning for Person Re-identification. However, most current transfer learning requires labels, and unsupervised transfer learning performs poorly, so there is still a lot of room for improvement.

For more on Person Reid, check out some of the research I wrote on my blog

Motivation

  • Do existing person re-identification datasets contain spatio-temporal information? If so, does it exhibit spatio-temporal regularities?
  • How can we mine the spatio-temporal information and build a spatio-temporal model without labels telling us whether two spatio-temporal points belong to the same pedestrian?
  • How do we fuse two weak classifiers? Boosting is a supervised fusion algorithm; what about the unsupervised case?
  • How to conduct effective transfer learning in the absence of labels?

This leads to our three innovations:

  • Unsupervised spatio-temporal modeling
  • Spatio-temporal image model fusion based on Bayesian inference
  • Transfer Learning based on Learning to Rank

Next, we’ll take a closer look at our approach.

Space-time model

Spatio-temporal patterns in data sets

The so-called space-time model is the distribution of migration time of pedestrians between two given cameras in a camera network.

We went through all the Reid datasets and found three with spatio-temporal information: Market1501, GRID, and DukeMTMC-ReID. DukeMTMC-ReID was released in the second half of 2017, so the paper does not include experiments on it. Market1501 is a relatively large Person Reid dataset and GRID is a relatively small one; both have six cameras (GRID only has data from six cameras, although it nominally includes eight).

For example, the spatio-temporal information of a picture in Market1501 is written in the image file name:


0007_c3s3_077419_03.jpg:

  • 0007 is the person ID,
  • c3 means it was taken by camera 3, which is the spatial information,
  • s3 means the third time sequence (GRID and DukeMTMC do not have this sequence information; in Market1501, videos from different sequences have different starting times, while videos from different cameras in the same sequence have similar starting times),
  • 077419 is the frame number, which is the temporal information.

One thing I do want to gripe about: spatio-temporal information is actually very easy to save. As long as you record when an image was taken and by which camera, the spatio-temporal information can be kept and used effectively. I hope that, as multi-modal data fusion gets more attention, dataset builders will pay more attention to the information that could be saved.

First, using the ground-truth pedestrian labels in Market1501, we compute the transfer time of every image pair (via their corresponding spatio-temporal point pairs) in the training set. Here we visualize the distribution of the time pedestrians need to travel from camera 1 to the other cameras.


As you can see, the peak position differs for different target cameras. From camera 1 to camera 1, the pairs mostly come from consecutive frames of a single camera, so the peak is concentrated near zero. From camera 1 to camera 2, the peak is concentrated near -600, meaning that most people move one way between cameras 2 and 1, and so on. This indicates that there are significant spatio-temporal regularities available in this dataset.

Unsupervised space-time model construction

For convenience, we call this transfer time difference "delta".

If we could count all the deltas in a dataset, then given a new delta (computed from the two spatio-temporal points of two new images), we could use maximum likelihood estimation: the frequency of deltas within a window (say, 100 frames) around the new delta (= number of deltas in that window / total number of deltas) serves as the probability of observing this time difference, i.e., the probability that the two spatio-temporal points were produced by the same person.
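As a rough illustration of this estimate (the 100-frame window and the function name are assumptions for the sketch, not the authors' code), it boils down to a frequency count:

```python
import numpy as np

def estimate_delta_probability(deltas, new_delta, window=100):
    """MLE-style estimate: fraction of observed deltas within +/- window frames of new_delta."""
    deltas = np.asarray(deltas)
    in_window = np.abs(deltas - new_delta) <= window
    return in_window.sum() / len(deltas)

# toy usage with deltas collected for one camera pair
deltas = [-620, -590, -610, -605, 40, -580]
print(estimate_delta_probability(deltas, -600))  # 5 of 6 deltas fall in the window
```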

But! The problem is that we usually don't have pedestrian label data for the target scene!

So we thought,

  • Can we decide whether two spatio-temporal points belong to the same person based on whether their two corresponding images belong to the same person?
  • Whether two images belong to the same person is just a binary image-matching problem, which we can handle with a visual model,
  • However, such a visual model usually needs labels for training, and a visual model trained without labels is usually weak
  • It doesn’t matter if the visual model is weak! We believe that when combined with the space-time model, it can become a powerful classifier! Have faith!
  • As long as we can construct the spatio-temporal model without supervision, combined with the weak image classifier, because of the spatio-temporal information, we can certainly beat other unsupervised models!

With this idea, the implementation is very natural:

  • We pre-train a convolutional neural network on other datasets (which is why this is a cross-dataset task),
  • then use this convolutional neural network to extract features on the target dataset,
  • compute feature similarity with the cosine distance,
  • treat the top 10 most similar images as the same person,
  • and use this “same person” information plus maximum likelihood estimation to construct the spatio-temporal model (a sketch follows below)
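A minimal sketch of this construction, assuming features have already been extracted by the pre-trained CNN (the array layout, binning, and top-10 handling are illustrative, not the authors' exact implementation):

```python
import numpy as np
from collections import defaultdict

def build_spatiotemporal_model(features, cams, frames, top_k=10, bin_size=100):
    """Build per-camera-pair histograms of time differences (deltas),
    using each image's top-k visual matches as pseudo 'same person' pairs."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T                      # cosine similarity between all pairs
    np.fill_diagonal(sim, -np.inf)               # exclude self-matches

    hist = defaultdict(lambda: defaultdict(int))
    for i in range(len(features)):
        for j in np.argsort(-sim[i])[:top_k]:    # top-k most similar images to image i
            delta = frames[j] - frames[i]
            hist[(cams[i], cams[j])][delta // bin_size] += 1
    return hist   # unnormalized counts approximating Pr(delta | ci, cj, "same person")
```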

For the image classifier, we use Liang Zheng's Siamese network. Their source code is in MATLAB; I reproduced it in Keras:


The maximum likelihood estimate for the space-time model can be seen here

Observant readers will note that this image classifier is trained on other datasets. Because the data distributions differ in feature space, the classifier is weak on the target dataset: the top-10 matches contain many wrong sample pairs, so the constructed spatio-temporal model deviates from the real one.


As you can see, the constructed model deviates somewhat from the real model, but the peak positions are roughly the same, so it should work to some extent; still, we want the constructed model to be as close to the real model as possible.

So we started thinking

  • What causes the model to deviate? The wrong sample pairs.
  • How do we remove the influence of the wrong samples? Can we isolate the wrong sample pairs? What if there are no labels?
  • (Flash of inspiration) Aren't the wrong sample pairs essentially like pairs picked at random? Can we randomly pick sample pairs and compute a random delta distribution?
  • Subtract the random delta distribution from the estimated delta distribution; what remains, produced by correct pedestrian transfers, is the real delta distribution.

So we visualized a random delta distribution


You can see,

  • It really is different from both the estimated model and the real model
  • There is more jitter

This random time-difference distribution also shows a certain central tendency, which actually reflects the sampling time distribution. For example, most pictures from camera 1 and camera 2 were taken in one time period, while most pictures from camera 3 were taken in other time periods.

Since the time-difference frequency histogram has so much jitter, we apply mean filtering over a window of time differences, and we truncate: probabilities are floored at a minimum probability value, and time differences are capped at a maximum time-difference value.
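A small sketch of this smoothing and truncation (the window size, probability floor, and delta cap are illustrative assumptions):

```python
import numpy as np

def smooth_and_truncate(delta_hist, window=5, min_prob=1e-6, max_bins=500):
    """Mean-filter a delta histogram and clamp it to a bounded, strictly positive range."""
    hist = np.asarray(delta_hist, dtype=float)[:max_bins]   # cap the maximum time difference
    kernel = np.ones(window) / window
    smoothed = np.convolve(hist, kernel, mode='same')       # mean filtering
    probs = smoothed / max(smoothed.sum(), 1e-12)           # normalize to probabilities
    return np.maximum(probs, min_prob)                      # floor at a minimum probability
```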

Next, how do you filter out the wrong model from the estimated model? How do you combine the space-time model with the image model?

Model fusion based on Bayesian inference

First, the fusion of the spatio-temporal model and the image model: we have a visual similarity Pv and a spatio-temporal probability Pst. The intuitive idea is that the joint score could be Pv * Pst, and if we want to suppress the random distribution Prandom, we can divide by it: Pv * Pst / Prandom.

Does that look like a conditional probability formula? So we started deriving (warning: lots of formulas ahead):

Let's take stock of our resources: we have a weak image classifier that can extract visual features vi and vj from two images, and we have the two corresponding spatio-temporal points. The spatial feature is the pair of camera IDs ci and cj, and the temporal feature is the time difference ∆ij between when the two images were taken. Suppose the person IDs of the two images are Pi and Pj. Our goal is to find the probability that, given these features, the two images belong to the same person:

Pr(Pi=Pj | vi, vj, ci, cj, ∆ij)   (paper Eq. 6)

By the conditional probability formula P(A|B) = P(B|A) * P(A) / P(B), we get

Pr(Pi=Pj | vi, vj, ci, cj, ∆ij)

= Pr(vi, vj, ci, cj, ∆ij | Pi=Pj) * Pr(Pi=Pj) / Pr(vi, vj, ci, cj, ∆ij)

Based on the assumption that the spatio-temporal distribution and the image distribution are independent (people who look alike do not necessarily move alike), we can factor the first term and obtain

= Pr(vi, vj | Pi=Pj) * Pr(ci, cj, ∆ij | Pi=Pj) * Pr(Pi=Pj) / Pr(vi, vj, ci, cj, ∆ij)

Pr(Pi=Pj) is a hard term to obtain, so let's try to eliminate it.

Exchange order first (commutative law of multiplication)

= Pr(vi, vj | Pi=Pj) * Pr(Pi=Pj) * Pr(ci, cj, ∆ij | Pi=Pj) / Pr(vi, vj, ci, cj, ∆ij)

By the conditional probability identity P(A|B) * P(B) = P(B|A) * P(A),

= Pr(Pi=Pj | vi, vj) * Pr(vi, vj) * Pr(ci, cj, ∆ij | Pi=Pj) / Pr(vi, vj, ci, cj, ∆ij)

You can see

  • Pr(Pi=Pj | vi, vj) can be understood as the probability, judged from visual feature similarity, that the two pictures show the same person
  • Pr(ci, cj, ∆ij | Pi=Pj) is the probability that the same person makes this movement between the two spatio-temporal points

Using the independence assumption between the spatio-temporal distribution and the image distribution again, the denominator also factorizes:

= Pr(Pi=Pj | vi, vj) * Pr(vi, vj) * Pr(ci, cj, ∆ij | Pi=Pj) / (Pr(vi, vj) * Pr(ci, cj, ∆ij))

The Pr(vi, vj) terms cancel, giving

= Pr(Pi=Pj | vi, vj) * Pr(ci, cj, ∆ij | Pi=Pj) / Pr(ci, cj, ∆ij)

That is

= visual similarity * the probability that the same person makes such a movement / the probability that any two spatio-temporal points produce such a movement

This is Eq. (7) of the paper, and it is exactly our initial conjecture: Pv * Pst / Prandom.

It looks like we are getting closer to quantities we actually have, but:

  • We don't have the ideal visual similarity Pr(Pi=Pj | vi, vj) of two images; we only have our image classifier's visual similarity Pr(Si=Sj | vi, vj),
  • We can't compute the true probability that the same person produces this movement, Pr(ci, cj, ∆ij | Pi=Pj); we can only estimate the spatio-temporal probability based on the visual classifier, Pr(ci, cj, ∆ij | Si=Sj),
  • We do have the probability Pr(ci, cj, ∆ij) of any two spatio-temporal points in the dataset.

So we use Pr(ci, cj, ∆ij | Si=Sj) and Pr(ci, cj, ∆ij) as approximations and get

= Pr(Si=Sj | vi, vj) * Pr(ci, cj, ∆ij | Si=Sj) / Pr(ci, cj, ∆ij)

You can already see from this how we do the fusion; in fact most of our experiments use this approximation formula.

In terms of implementation: first estimate the two spatio-temporal models, compute the image similarity, and then plug them into the formula for the fusion score. See GitHub for details.
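A minimal sketch of this fusion score under the approximation above (it assumes the spatio-temporal models are stored as smoothed, normalized per-bin probabilities as in the earlier sketches; this is not the repository's exact code):

```python
def fusion_score(visual_sim, ci, cj, delta, same_model, random_model, bin_size=100):
    """Approximate fusion score Pv * Pst / Prandom.

    same_model[(ci, cj)][bin]:   estimated Pr(delta | ci, cj, Si=Sj) from top-k visual matches
    random_model[(ci, cj)][bin]: estimated Pr(delta | ci, cj) from randomly sampled pairs
    """
    b = delta // bin_size
    p_st = same_model[(ci, cj)].get(b, 1e-6)      # small floor avoids zero probabilities
    p_rand = random_model[(ci, cj)].get(b, 1e-6)
    return visual_sim * p_st / p_rand
```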

But is this approximation valid? Let's do the error analysis (lots of derivation; if you're not interested you can skip ahead to the second diagram without affecting the rest, but working through it makes the argument more rigorous).

In fact, the error is introduced by the image classifier. Suppose the image classifier's error rate when it judges two images to be the same person is Ep, and its error rate when it judges two images to be different people is En. Then:

Ep = Pr(Pi≠Pj | Si=Sj)   (paper Eq. 1)

En = Pr(Pi=Pj | Si≠Sj)   (paper Eq. 2)

The relationship between Pr(Pi=Pj | vi, vj) and Pr(Si=Sj | vi, vj) can then be written as:

Pr(Pi=Pj | vi, vj)

= Pr(Pi=Pj | Si=Sj) * Pr(Si=Sj | vi, vj) + Pr(Pi=Pj | Si≠Sj) * Pr(Si≠Sj | vi, vj)

= (1 - Ep) * Pr(Si=Sj | vi, vj) + En * (1 - Pr(Si=Sj | vi, vj))

= (1 - Ep - En) * Pr(Si=Sj | vi, vj) + En   (paper Eq. 8)

Similarly, we derive the relationship between Pr(ci, cj, ∆ij | Pi=Pj) and Pr(ci, cj, ∆ij | Si=Sj) (this cannot be derived directly in the same way as the visual similarity, because the causal direction is different):

Pr(ci, cj, ∆ij | Si=Sj)

= Pr(ci, cj, ∆ij | Pi=Pj) * Pr(Pi=Pj | Si=Sj) + Pr(ci, cj, ∆ij | Pi≠Pj) * Pr(Pi≠Pj | Si=Sj)

= Pr(ci, cj, ∆ij | Pi=Pj) * (1 - Ep) + Pr(ci, cj, ∆ij | Pi≠Pj) * Ep

You can also get

Pr(ci, cj, ∆ij | Si≠Sj)

= Pr(ci, cj, ∆ij | Pi=Pj) * En + Pr(ci, cj, ∆ij | Pi≠Pj) * (1 - En)

Combining the two equations above and eliminating Pr(ci, cj, ∆ij | Pi≠Pj), we get

Pr(ci, cj, ∆ij | Pi=Pj)

= (1 - Ep - En)^(-1) * [(1 - En) * Pr(ci, cj, ∆ij | Si=Sj) - Ep * Pr(ci, cj, ∆ij | Si≠Sj)]   (paper Eq. 5)

There is one new quantity here, Pr(ci, cj, ∆ij | Si≠Sj): the probability of this spatio-temporal point pair given that the image classifier judges the two images to be different people. It is also easy to compute: we take the point pairs ranked outside the visual-similarity top 10, collect their time differences, and build a spatio-temporal model from them.

We substitute two approximations (Formula 5 and 8) into formula 7,

You can get

Pr(Pi=Pj | vi, vj, ci, cj, ∆ij)

= (M1 + En/(1 - En - Ep)) * ((1 - En)*M2 - Ep*M3) / Pr(ci, cj, ∆ij)   (paper Eq. 9)

Among them,

M1 = Pr(Si=Sj | vi, vj), the visual similarity

M2 = Pr(ci, cj, ∆ij | Si=Sj), the spatio-temporal model built from pairs judged to be the same person

M3 = Pr(ci, cj, ∆ij | Si≠Sj), the spatio-temporal model built from pairs judged to be different people

The denominator Pr(ci, cj, ∆ij) is the random spatio-temporal probability model

All four quantities can be computed from the unlabeled target dataset together with the image classifier. Moreover, when En = Ep = 0 (i.e., the image classifier is perfectly accurate), the formula reduces to the approximate form used above:

Pr(Si=Sj | vi, vj) * Pr(ci, cj, ∆ij | Si=Sj) / Pr(ci, cj, ∆ij)

At this point, can we use Eq. 9 to compute the fusion score? Not yet; there is one problem in Eq. 9: Ep and En are unknown!

To compute Ep and En properly, we would have to label the target dataset, run the image classifier over it, and count its mistakes. Instead, we replace Ep and En with two constants α and β, so the whole approximation of the model hinges on these two constants.

In the experiments related to Table1,2,3,4 and Fig6 in the paper, α=β=0, and in Fig5, we set other constants to check the sensitivity of the model to this approximation
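For completeness, here is a hedged sketch of the generalized score of Eq. 9 with the constants α and β standing in for Ep and En (with α=β=0 it reduces to the simple ratio above; the variable names are illustrative):

```python
def fusion_score_general(visual_sim, p_st_same, p_st_diff, p_rand, alpha=0.0, beta=0.0):
    """Paper Eq. 9 with constants alpha (for Ep) and beta (for En).

    visual_sim: M1 = Pr(Si=Sj | vi, vj)
    p_st_same:  M2 = Pr(ci, cj, delta | Si=Sj)
    p_st_diff:  M3 = Pr(ci, cj, delta | Si!=Sj)
    p_rand:     Pr(ci, cj, delta)
    """
    m1 = visual_sim + beta / (1.0 - beta - alpha)
    m23 = (1.0 - beta) * p_st_same - alpha * p_st_diff
    return m1 * m23 / max(p_rand, 1e-12)
```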


It can be seen that although accuracy drops when α and β are large, it still stays at a reasonable level. Comparing against the accuracy of the pure image classifier also shows that the fusion model is always more accurate than the pure image classifier.

You may have noticed that α+β is kept below 1. This is because the Ep+En of the fused classifier is smaller than the Ep+En of the image classifier only when Ep+En<1 and α+β<1. In plain terms, fusion reduces error only when the image classifier is not hopelessly bad and the approximation parameters are reasonable, and only when the fusion model is more accurate than the single image model is the fusion meaningful. The proof of this theorem is in the appendix of the paper; if you are interested, email me privately for the appendix, since it is too long to show here.

We thus obtain a multi-modal data fusion method backed by conditional-probability reasoning, which we call Bayesian fusion.

Take a look at the resulting space-time map:


Take another look at how strong the fused model is statistically:

| Source | Target | Vision only (Rank-1 / 5 / 10) | Spatio-temporal fusion (Rank-1 / 5 / 10) |
|---|---|---|---|
| CUHK01 | GRID | 10.70 / 20.20 / 23.80 | 30.90 / 63.70 / 79.10 |
| VIPeR | GRID | 9.70 / 17.40 / 21.50 | 28.40 / 65.60 / 80.40 |
| Market1501 | GRID | 17.80 / 31.20 / 36.80 | 49.60 / 81.40 / 88.70 |
| GRID | Market1501 | 20.72 / 35.39 / 42.99 | 51.16 / 65.08 / 70.04 |
| VIPeR | Market1501 | 24.70 / 40.91 / 49.52 | 56.18 / 71.50 / 76.48 |
| CUHK01 | Market1501 | 29.39 / 45.46 / 52.55 | 56.53 / 70.22 / 74.64 |

As you can see,

  • Direct migration across data sets is really poor
  • After fusion, Rank1 is two to four times more accurate

It shows that this fusion method is really effective.

Transfer Learning based on Learning to Rank

As mentioned above, the image classifier is weak. Although the fusion already works well (at that point we were honestly tempted to submit a NIPS paper as-is), in theory the fusion would get even better if the image classifier improved. Now that we have a strong fusion classifier, can we use it to label images in the target dataset and in turn train the image classifier?

A common unsupervised learning routine is to split image pairs into positive and negative samples (pseudo-labels) according to their fusion scores and feed them to the image classifier for training.


We also tried this, but the negative samples in the dataset far outnumber the positive ones: the fusion classifier correctly classifies many negatives but very few positives, and it misclassifies many positives, so the wrong labels dominate. Training worked poorly, and even hard-mining tricks did not help.

So we thought,

  • We can't provide correct 0/1 labels, so the classifier would learn from many wrong 0/1 labels
  • Can we instead provide soft labels and let the classifier regress a score between two samples rather than learn hard binary labels?
  • This is an image retrieval problem; can we borrow some learning methods from information retrieval to accomplish this task?

Naturally, Learning to Rank came to mind

Ranking

  • Problem definition: Given an object, find the results that are most relevant to it, in order of relevance
  • Common methods:
  • Point-wise: computes an absolute score for each result and sorts by score
  • Pair-wise: for every two results, determines which one scores higher and sorts by these relative scores
  • List-wise: enumerates all permutations and takes the one with the highest overall score as the ranking result

The list-wise overall score often requires many complicated conditions to compute, which may not suit our scenario, so list-wise is excluded; point-wise and pair-wise are both usable. For point-wise, the score can be given directly by the fusion score. For pair-wise, a pair of positive and negative samples gives two scores, and learning their relative score amounts to a triplet loss, so we adopt the pair-wise approach in the experiments.

Pair-wise Ranking

  • Given sample xi, its ranking score is oi,
  • Given sample xj, its ranking score is oj,
  • Define oij = oi - oj. If oij > 0, xi ranks higher than xj.
  • To make the ranking probabilistic, define Pij = e^oij / (1 + e^oij) as the probability that xi ranks higher than xj.
  • For any permutation of length n, knowing the n-1 adjacent probabilities P(i, i+1) is enough to derive the ranking probability of any two items
  • For example, given Pik and Pkj, we get Pij = e^(oik + okj) / (1 + e^(oik + okj)), where oik = ln(Pik / (1 - Pik)) (a quick numerical check follows this list)
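Here is that check, purely illustrative:

```python
import math

def rank_prob(o_i, o_j):
    """Pij = e^(oi - oj) / (1 + e^(oi - oj)): probability that item i ranks above item j."""
    o_ij = o_i - o_j
    return math.exp(o_ij) / (1.0 + math.exp(o_ij))

# deriving P13 from the adjacent probabilities P12 and P23
o1, o2, o3 = 2.0, 1.0, 0.5
p12, p23 = rank_prob(o1, o2), rank_prob(o2, o3)
o12 = math.log(p12 / (1 - p12))               # recover o1 - o2 from P12
o23 = math.log(p23 / (1 - p23))
p13 = math.exp(o12 + o23) / (1 + math.exp(o12 + o23))
assert abs(p13 - rank_prob(o1, o3)) < 1e-9    # matches the direct computation
```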

RankNet: Pair-wise Learning to Rank

RankNet is a pair-wise Learning to Rank approach that uses a neural network to learn the mapping from two input samples (plus one query sample) to their ranking probability (as defined above).

Mapping this to our problem:

  • Given a query image A and two candidate images B and C,
  • the similarity Sab between A and B predicted by the network is the absolute ranking score of B, and the similarity Sac between A and C is the absolute ranking score of C.

The specific network (Keras implementation) is visualized as follows:


  • The input is three images; features are extracted from each with Resnet52 and flattened
  • After flattening, a Lambda layer plus a fully connected layer computes a weighted distance between the feature vectors, yielding score1 and score2
  • score1, score2, and the true score are used to compute a cross-entropy loss.
  • The probability of B ranking higher than C is then:

Pbc = e^obc / (1 + e^obc) = e^(Sab - Sac) / (1 + e^(Sab - Sac))

  • Pbc is fitted to the true ranking probability, and the regression loss is the cross entropy between the predicted probability and the true probability:

C(obc) = -P'bc * ln(Pbc) - (1 - P'bc) * ln(1 - Pbc)

The network implementation is quite simple; the main trouble is in constructing the sample triplets.
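For intuition, here is a hedged Keras-style sketch of such a triplet ranking network (the layer sizes, the stand-in embedding instead of the full Resnet backbone, and the squared-difference distance are illustrative assumptions, not the authors' exact architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_ranknet(feat_dim=2048):
    # shared embedding applied to all three inputs (stand-in for the CNN backbone + flatten)
    emb = tf.keras.Sequential([layers.Dense(512, activation='relu')])

    query = layers.Input(shape=(feat_dim,), name='query')
    cand_b = layers.Input(shape=(feat_dim,), name='candidate_b')
    cand_c = layers.Input(shape=(feat_dim,), name='candidate_c')
    q, b, c = emb(query), emb(cand_b), emb(cand_c)

    # weighted distance between the query and each candidate -> two ranking scores
    dist = layers.Dense(1)   # learns a weighting over the feature differences
    score_b = dist(layers.Lambda(lambda t: tf.square(t[0] - t[1]))([q, b]))
    score_c = dist(layers.Lambda(lambda t: tf.square(t[0] - t[1]))([q, c]))

    # Pbc = sigmoid(score_b - score_c); trained with cross entropy against the target rank label
    p_bc = layers.Activation('sigmoid')(layers.Subtract()([score_b, score_c]))
    model = Model([query, cand_b, cand_c], p_bc)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```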

Transfer Learning to rank

The whole Learning to rank process is shown in the figure below


We use the fusion classifier to score image pairs in the target dataset and construct triplets as input to RankNet: Si is the query image, Sj is sampled from the images whose fusion similarity to Si ranks in top1-top25, and Sk is sampled from those ranking in top25-top50. Feeding these to RankNet lets part of the Resnet52 convolutional layers properly learn the visual features of the target scene.
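A sketch of how such triplets might be constructed from the fusion ranking (the rank cut-offs follow the text above; the sampling itself is an illustrative assumption):

```python
import numpy as np

def build_triplets(fusion_scores, n_per_query=10, rng=np.random.default_rng(0)):
    """fusion_scores[i, j]: fusion score between query image i and image j.
    For each query Si, draw Sj from ranks 1-25 and Sk from ranks 25-50."""
    triplets = []
    for i in range(fusion_scores.shape[0]):
        order = np.argsort(-fusion_scores[i])    # most similar first
        order = order[order != i]                # drop the query itself
        for _ in range(n_per_query):
            j = rng.choice(order[:25])           # pseudo-positive: top1-top25
            k = rng.choice(order[25:50])         # pseudo-negative: top25-top50
            triplets.append((i, int(j), int(k)))
    return triplets
```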

Learning to Rank effect

| Source | Target | Vision only (Rank-1 / 5 / 10) | Spatio-temporal fusion (Rank-1 / 5 / 10) |
|---|---|---|---|
| CUHK01 | GRID | 17.40 / 33.90 / 41.10 | 50.90 / 78.60 / 88.30 |
| VIPeR | GRID | 18.50 / 31.40 / 40.50 | 52.70 / 81.70 / 89.20 |
| Market1501 | GRID | 22.30 / 38.10 / 47.20 | 60.40 / 87.30 / 93.40 |
| GRID | Market1501 | 22.38 / 39.25 / 48.07 | 58.22 / 72.33 / 76.84 |
| VIPeR | Market1501 | 25.23 / 41.98 / 50.33 | 59.17 / 73.49 / 78.62 |
| CUHK01 | Market1501 | 30.58 / 47.09 / 54.60 | 60.75 / 74.44 / 79.25 |

Compared with the effect before Learning to Rank, the accuracy has been improved, especially on GRID data sets.

Comparison with SOA supervised methods

On the one hand, we applied the unsupervised cross-dataset algorithm above to the GRID and Market1501 datasets and compared it with the current best methods. On the other hand, we also tested a supervised version, meaning the source dataset is the same as the target dataset, e.g. GRID pre-training -> GRID spatio-temporal fusion. The results are as follows:

  • GRID

| Method | Rank-1 |
|---|---|
| JLML | 37.5 |
| TFusion (unsupervised) | 60.4 |
| TFusion (supervised) | 64.1 |

Because the spatio-temporal pattern in this dataset is very pronounced (the correct time differences are concentrated in a very narrow range), a large number of misclassified results can be filtered out, so the accuracy even beats all supervised methods.

  • Market1501

| Method | Rank-1 |
|---|---|
| S-CNN | 65.88 |
| DLCE | 79.5 |
| SVDNet | 82.3 |
| JLML | 88.8 |
| TFusion (unsupervised) | 60.75 |
| TFusion (supervised) | 73.13 |

On Market1501, our unsupervised method approaches the 2016 supervised method (our image classifier is just a Resnet52), and our supervised version surpasses it. Although it does not match the 2017 supervised methods, combining our approach with better image classifiers should yield better results.

Comparison with SOA unsupervised methods

We asked the UMDL authors for their code and reproduced the following cross-dataset transfer experiments:

| Method | Source | Target | Rank-1 |
|---|---|---|---|
| UMDL | Market1501 | GRID | 3.77 |
| UMDL | CUHK01 | GRID | 3.58 |
| UMDL | VIPeR | GRID | 3.97 |
| UMDL | GRID | Market1501 | 30.46 |
| UMDL | CUHK01 | Market1501 | 29.69 |
| UMDL | VIPeR | Market1501 | 30.34 |
| TFusion | Market1501 | GRID | 60.4 |
| TFusion | CUHK01 | GRID | 50.9 |
| TFusion | VIPeR | GRID | 52.7 |
| TFusion | GRID | Market1501 | 58.22 |
| TFusion | CUHK01 | Market1501 | 59.17 |
| TFusion | VIPeR | Market1501 | 60.75 |

The results of UMDL transferred to Market1501 are similar to those reproduced by hehefan and Liang Zheng at the University of Technology Sydney, so our reproduction is reliable.

As you can see, unsupervised TFusion runs over UMDL.

More detailed experimental results can be examined in the paper.

Multiple iterative transfer learning


Reviewing the whole architecture: we use the image classifier to estimate the spatio-temporal model and obtain the fusion model, then use the fusion model in turn to improve the image classifier, and the improved image classifier can further improve the fusion model, forming a closed loop. In theory, iterating this loop should keep strengthening both classifiers until we obtain a powerful image classifier for the target scene. We made several iterative attempts:


From the current experimental results, the first round of transfer learning brings a large improvement, while later rounds bring relatively small gains. Viewed optimistically, this is fast convergence; viewed pessimistically, although the image classifier improves, we never saw it surpass the fusion classifier, so there is probably something more to dig into here.

Afterword.


Research, visualization, hunting for ideas, hunting for datasets, running experiments, debugging, tuning parameters, and writing the paper: it took nine months to write this CVPR paper, which is also the first CCF Class A paper of our lab, a hard-won piece of pioneering work. We are continuing our exploration in the Person Reid field: we are building a camera network based on Raspberry Pi and constructing our own dataset, and on this basis we will carry out a series of studies on pedestrian detection, multi-modal data fusion, lightweight deep models, distributed collaborative terminals, video hashing, image indexing, and so on. Welcome to follow my GitHub and keep an eye on our lab's blog.

You've read this far and still haven't given my GitHub a star!