NVIDIA open-sourced DG-Net a few days ago. Let's review this CVPR 2019 oral paper.

The paper, Joint Discriminative and Generative Learning for Person Re-Identification, is joint work by NVIDIA, the University of Technology Sydney (UTS), and the Australian National University (ANU), presented as an oral at CVPR 2019. Training deep models usually requires a large amount of annotated data, which is difficult and costly to collect and label. The authors explore using generated data to assist person re-identification (re-ID) training: by generating high-quality pedestrian images and coupling the generator with the re-ID model, both the quality of the generated images and the re-ID accuracy are improved. Paper: arxiv.org/abs/1904.07… Bilibili video: www.bilibili.com/video/av514… Tencent Video: v.qq.com/x/page/t086…


Code: github.com/NVlabs/DG-N…

Why: what are the pain points of previous work?

  • Generating high-quality pedestrian images is hard. Some previous work produced pedestrian images of relatively low quality, mainly in two respects: 1) realism: the generated pedestrians are not realistic enough, the images are blurry, and the backgrounds look fake; 2) extra annotation: additional labels such as human skeletons or attributes are needed to assist generation [3, 4].
  • If these low-quality generated images are used to train the re-ID model, they introduce a bias relative to the original dataset. Therefore, previous work either used the generated images only to regularize the network, treating them as outlier or pseudo-labeled classes [1, 2]; or trained an extra model on the generated images and fused it with the original model; or did not use the generated images for re-ID training at all.
  • At the same time, because annotation is costly, person re-ID training sets (such as Market-1501 and DukeMTMC-reID) typically contain only around 20,000 images, far smaller than ImageNet and similar datasets, and overfitting remains a poorly solved problem.

What does the paper propose and solve?

  • High-quality pedestrian images can be generated without any additional annotation (pose, attributes, keypoints, etc.). By swapping the extracted codes of two pedestrian images, their appearances can be exchanged. These appearance variations come from real images in the training set rather than from random noise.

  • No part matching is needed to boost re-ID performance. Simply showing the model more training samples improves its accuracy. Given N real images, we first generate N x N training images by exchanging codes and use them to train the re-ID model (see the sketch after this list). (In the corresponding figure of the paper, the first row and first column are real inputs; the rest are generated images.)

  • Training forms a loop: the generated images are fed to the re-ID model to learn good pedestrian features, and the features extracted by the re-ID model are in turn fed to the generative module to improve the quality of the generated images.
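As a rough illustration of the N x N augmentation mentioned above, here is a minimal sketch. The module names (`appearance_enc`, `structure_enc`, `decoder`) are placeholders for the generator components, not the exact API of the official DG-Net repository.

```python
import torch

def generate_nxn(images, appearance_enc, structure_enc, decoder):
    """Combine every appearance code with every structure code.

    Given a list of N real images (each a CHW tensor), return an N x N
    grid of generated images where entry (i, j) keeps the appearance of
    image i and the structure (pose, background) of image j.
    """
    with torch.no_grad():
        app_codes = [appearance_enc(img.unsqueeze(0)) for img in images]
        str_codes = [structure_enc(img.unsqueeze(0)) for img in images]
        grid = [[decoder(a, s) for s in str_codes] for a in app_codes]
    return grid  # grid[i][j]: appearance of image i, structure of image j
```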

How does the paper achieve this?

  • Feature definitions: the paper first defines two codes. One is the appearance code, the other is the structure code. The appearance code is related to the pedestrian's identity, while the structure code captures low-level visual information such as pose and background.

  • Generative module:
  1. Same-identity reconstruction: the appearance codes of different photos of the same person should be the same. We can therefore use a self-reconstruction loss (similar to an auto-encoder), and a positive sample of the same identity can also be used to reconstruct the image. Both use a pixel-level L1 loss.

  2. Cross-identity generation: this is the most critical part. Given two input images of different people, we swap their appearance and structure codes to generate two new images. The corresponding losses are a GAN loss that keeps the outputs realistic, and appearance/structure code reconstruction losses that require the swapped codes to be recoverable from the generated images (a code sketch of these two objectives follows this section).
  There is no random noise in the network, so all variation in the generated images comes from the training set itself, which keeps them close to the original training distribution.

  • Re-ID module: for real images we still use the identity classification cross-entropy loss. For generated images we use two losses. One is L_{prime}: the baseline model trained on real data serves as a teacher that provides soft labels for the generated images, and we minimize the KL divergence between the re-ID model's prediction and the teacher's (a sketch of this loss is given below). The other is L_{fine}, which mines the fine-grained details that are preserved after the appearance of an image has been changed (see the paper for details).
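To make the two generative objectives in the "Generative module" items above concrete, here is a minimal PyTorch-style sketch. `E_app`, `E_str`, `G`, and `D` are hypothetical stand-ins for the appearance encoder, structure encoder, decoder, and discriminator; the loss weights and GAN formulation in the official code may differ.

```python
import torch.nn.functional as F

def generative_losses(x_a, x_b, E_app, E_str, G, D):
    """Same-identity reconstruction and cross-identity swap losses (sketch)."""
    a_a, a_b = E_app(x_a), E_app(x_b)   # appearance codes
    s_a, s_b = E_str(x_a), E_str(x_b)   # structure codes

    # 1) Same-ID reconstruction: rebuild each image from its own codes
    #    with a pixel-level L1 loss (auto-encoder style).
    loss_recon = F.l1_loss(G(a_a, s_a), x_a) + F.l1_loss(G(a_b, s_b), x_b)

    # 2) Cross-ID generation: swap the codes to synthesize two new images,
    #    keep them realistic, and require the swapped codes to be
    #    recoverable from the generated images.
    x_ab = G(a_a, s_b)                  # appearance of x_a, structure of x_b
    x_ba = G(a_b, s_a)
    loss_gan = -(D(x_ab).mean() + D(x_ba).mean())        # generator-side GAN loss
    loss_code = (F.l1_loss(E_app(x_ab), a_a) + F.l1_loss(E_str(x_ab), s_b)
                 + F.l1_loss(E_app(x_ba), a_b) + F.l1_loss(E_str(x_ba), s_a))
    return loss_recon, loss_gan, loss_code
```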

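For the teacher-student part of L_{prime} described above, a hedged sketch is given below. `reid_model` and `teacher_model` are hypothetical classifiers that output identity logits; the exact weighting and temperature used in the paper may differ, and L_{fine} is omitted here.

```python
import torch
import torch.nn.functional as F

def reid_losses(reid_model, teacher_model, x_real, y_real, x_gen):
    """Cross-entropy on real images plus a KL soft-label loss on
    generated images (in the spirit of L_prime)."""
    # Real images: standard identity classification loss.
    loss_ce = F.cross_entropy(reid_model(x_real), y_real)

    # Generated images: the fixed baseline (teacher) provides soft labels,
    # and the re-ID model (student) minimizes the KL divergence to them.
    with torch.no_grad():
        soft_label = F.softmax(teacher_model(x_gen), dim=1)
    log_pred = F.log_softmax(reid_model(x_gen), dim=1)
    loss_kl = F.kl_div(log_pred, soft_label, reduction="batchmean")
    return loss_ce + loss_kl
```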
Results:

  • Qualitative results:
  1. Generated examples on three datasets show that the method is relatively robust to occlusion and large illumination changes.

  2. Appearance interpolation. Does the network simply memorize what the generated images should look like? We gradually interpolate the appearance code and observe that the appearance changes smoothly and gradually (a code sketch of this interpolation appears after the results lists).

  3. Failure cases: unusual patterns such as logos cannot be recovered in the generated images.

  • Quantitative results:
  1. We compare the realism (FID) and diversity (SSIM) of the generated images; lower FID is better, higher SSIM is better (a sketch of the FID computation appears right after this list).

  2. Re-ID results on multiple datasets (Market-1501, DukeMTMC-reID, MSMT17, CUHK03-NP).
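For reference, a minimal sketch of how FID can be computed from Inception features of real and generated images (feature extraction is omitted; this is the standard Fréchet distance formula, not code from the DG-Net repository):

```python
import numpy as np
from scipy import linalg

def fid_from_features(feat_real, feat_gen):
    """Fréchet Inception Distance between two (N, D) feature arrays.

    Lower FID means the generated feature distribution is closer to the
    distribution of real images.
    """
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)

    # The matrix square root can pick up a small imaginary part numerically.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```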

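The appearance-interpolation experiment in the qualitative results amounts to linearly blending two appearance codes while keeping a structure code fixed. A minimal sketch, reusing the hypothetical modules from the earlier snippets:

```python
import torch

def interpolate_appearance(x_a, x_b, x_struct, E_app, E_str, G, steps=8):
    """Generate images whose appearance morphs from x_a to x_b while the
    structure (pose, background) of x_struct stays fixed."""
    with torch.no_grad():
        a1, a2 = E_app(x_a), E_app(x_b)
        s = E_str(x_struct)
        return [G((1.0 - t) * a1 + t * a2, s)
                for t in torch.linspace(0.0, 1.0, steps)]
```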


Attached: Video Demo

Bilibili video backup: www.bilibili.com/video/av514… Tencent Video backup: v.qq.com/x/page/t086…

Finally, thank you for reading. Since this work is still at an early and exploratory stage, some issues may not have been considered thoroughly. If anything is unclear, we welcome your suggestions and discussion. Thank you!

References

[1] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In ICCV, 2017.

[2] Y. Huang, J. Xu, Q. Wu, Z. Zheng, Z. Zhang, and J. Zhang. Multi-pseudo regularized label for generated samples in person re-identification. TIP, 2018.

[3] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y.-G. Jiang, and X. Xue. Pose-normalized image generation for person re-identification. In ECCV, 2018.

[4] Y. Ge, Z. Li, H. Zhao, G. Yin, X. Wang, and H. Li. FD-GAN: Pose-guided feature distilling GAN for robust person re-identification. In NIPS, 2018.

Author’s brief introduction

The first author of this paper, Zhedong Zheng, is a PhD student in the School of Computer Science at the University of Technology Sydney (UTS) and is expected to graduate in June 2021. This paper is a result of his internship at NVIDIA.

Zheng has published eight papers so far. One of them, an ICCV 2017 spotlight, has been cited more than 300 times and was the first work to use GAN-generated images to assist pedestrian feature learning. A TOMM journal paper of his was selected by Web of Science as a 2018 highly cited paper, with over 200 citations. He also contributed a widely used open-source baseline for person re-identification, which has received more than 1,000 stars on GitHub.

Other authors of the paper include NVIDIA Research's video expert Xiaodong Yang, face-recognition expert Zhiding Yu (SphereFace, Large-Margin Softmax), person re-ID expert Dr. Liang Zheng, Prof. Yi Yang (three CVPR oral papers this year), and Jan Kautz, VP of Research at NVIDIA.

Author homepage: zdzheng.xyz/