FAIR’s Kaiming He and collaborators recently published a study on unsupervised learning from video: A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning, accepted to CVPR 2021. The core idea of the paper is to apply recent unsupervised image-representation learning methods to unsupervised training on video. The experimental work is so extensive that only companies like Facebook and Google have the resources to run it at such scale.

The paper selects four unsupervised learning methods: MoCo, BYOL, SimCLR, and SwAV. MoCo and SimCLR are contrastive learning methods that require negative samples, while BYOL and SwAV rely only on positive samples. Along another axis, MoCo and BYOL both use a momentum encoder, while SimCLR and SwAV do not. Two of the four methods were proposed by Facebook (MoCo and SwAV), and the other two, SimCLR and BYOL, by Google.
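To make the with-versus-without-negatives distinction concrete, here is a minimal sketch of the two loss families; this is my illustration in PyTorch, not the paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k, temperature=0.1):
    """Contrastive loss (MoCo/SimCLR style): each query's positive is its
    matching key; every other key in the batch serves as a negative."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                       # (N, N) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)      # positives on the diagonal
    return F.cross_entropy(logits, labels)

def byol_loss(p, z):
    """Positive-only loss (BYOL style): pull the online prediction p toward
    the target projection z; no negatives are involved."""
    p = F.normalize(p, dim=1)
    z = F.normalize(z, dim=1)
    return 2 - 2 * (p * z).sum(dim=1).mean()
```

In MoCo the keys additionally come from a momentum encoder and a queue of past keys, while in SimCLR the negatives are simply the other samples in the batch.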

These four methods were originally designed for unsupervised training on images. Compared with images, videos have one additional temporal dimension, so the methods extend naturally to unsupervised learning on video. Whether for image or video classification, the goal of unsupervised pre-training is to learn invariant features: for images, the methods above feed different augmentations of the same image into an encoder network to learn features that are invariant to those augmentations. For video, besides the transformations of the image content itself, the temporal dimension offers another axis: the paper samples different clips from the same video as positives (which can be regarded as an augmentation unique to video), in the hope of learning temporally persistent features. SlowFast R-50 is chosen as the encoder. The figure below shows three different clips extracted from one video.
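As a rough illustration of this clip-based positive-pair construction, here is a minimal sketch; the function name, the fixed clip length, and the toy tensor sizes are my assumptions, not details from the paper:

```python
import torch

def sample_clip(video, clip_len=16, start=None):
    """video: (C, T, H, W) tensor; returns a contiguous clip of clip_len
    frames. A stand-in for the paper's clip sampling, not its actual code."""
    C, T, H, W = video.shape
    if start is None:
        start = torch.randint(0, T - clip_len + 1, (1,)).item()
    return video[:, start:start + clip_len]

# Two clips drawn from the SAME video form a positive pair; spatial
# augmentations (crop, flip, color jitter) are then applied independently
# to each clip before both go through the encoder.
video = torch.randn(3, 300, 64, 64)   # ~10 s at 30 fps, small size for the sketch
clip_a, clip_b = sample_clip(video), sample_clip(video)
```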

If only one clip is extracted per video, learning depends solely on transformations of the image content, which is clearly not enough for video classification. The experiments also show that more clips help: as the table below shows, performance improves for all four methods as the number of clips increases, indicating that learning spatiotemporal persistence within a video is important for unsupervised learning.

Sampling positives with a larger timespan between them is also more effective, which is understandable: just as harder image augmentations tend to work better, a larger difference between the two clips produces harder positives, which benefits learning. However, in a long video, clips far apart in time may differ semantically; according to the paper's experiments, this has little influence on the results (random crops in image classification can also change semantics, for example when a crop lands on the background, yet training appears to tolerate such noise). As the table below shows, on the IG-Curated-1M dataset performance only begins to degrade when the timespan exceeds 60s, and on IG-Uncurated-1M it degrades only slightly even when the timespan exceeds 600s.
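To make the timespan knob concrete, one way to bound the gap between the two positive clips is sketched below; this helper is hypothetical and for illustration only, not the released implementation:

```python
import torch

def sample_positive_pair(num_frames, clip_len=16, max_timespan=None, fps=30):
    """Pick start frames for two positive clips, optionally capping the
    temporal gap between them (max_timespan in seconds)."""
    last = num_frames - clip_len
    start_a = torch.randint(0, last + 1, (1,)).item()
    if max_timespan is None:
        start_b = torch.randint(0, last + 1, (1,)).item()
    else:
        max_gap = int(max_timespan * fps)
        lo = max(0, start_a - max_gap)
        hi = min(last, start_a + max_gap)
        start_b = torch.randint(lo, hi + 1, (1,)).item()
    return start_a, start_b
```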

Across the four methods, the experimental results show no dramatic differences, but MoCo and BYOL perform slightly better than SimCLR and SwAV. As noted above, the former two use a momentum encoder, which keeps the model's outputs as consistent as possible across iterations; perhaps that matters even more for video classification, though the paper offers no specific explanation. Since video training demands more resources, very large batch sizes are hard to use (64×8 = 512 in this paper); could that explain SimCLR's slightly worse results? Many variables are entangled here, and the question may need further study.
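For reference, the momentum encoder shared by MoCo and BYOL boils down to an exponential moving average (EMA) of the online encoder's weights; a minimal sketch, where the momentum m = 0.999 is MoCo's common default rather than a value from this paper:

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(online, target, m=0.999):
    """EMA update: theta_k <- m * theta_k + (1 - m) * theta_q. The target
    (momentum) encoder receives no gradients, so its outputs drift slowly
    and stay consistent across training iterations."""
    for p_q, p_k in zip(online.parameters(), target.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

online_encoder = nn.Linear(8, 4)                  # stand-in for SlowFast R-50
momentum_encoder = copy.deepcopy(online_encoder)  # starts as a frozen copy
for p in momentum_encoder.parameters():
    p.requires_grad = False
momentum_update(online_encoder, momentum_encoder)
```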

When the unsupervised representations are transferred to downstream tasks, they can even outperform supervised pre-training on some datasets. For example, a BYOL model pre-trained on K400 (240K videos) without labels and then fine-tuned on AVA and SSv2 outperforms the model pre-trained on K400 with supervision and fine-tuned on the same two datasets.
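The downstream protocol amounts to reusing the pre-trained backbone and attaching a fresh task head; a generic sketch, where the backbone, dimensions, and checkpoint path are placeholders rather than the paper's setup:

```python
import torch
import torch.nn as nn

# Placeholder backbone standing in for SlowFast R-50 after unsupervised
# pre-training; in practice the weights would be loaded from a checkpoint,
# e.g. backbone.load_state_dict(torch.load("k400_byol.pth"))  (hypothetical path).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(1024, 256))

head = nn.Linear(256, 174)             # e.g. SSv2 has 174 action classes
model = nn.Sequential(backbone, head)

# Linear probe: freeze the backbone and train only the head;
# full fine-tuning instead leaves every parameter trainable.
for p in backbone.parameters():
    p.requires_grad = False
```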

There are many more experiments; see the paper for details: arxiv.org/pdf/2104.14…

Through a large number of experiments, this paper demonstrates the effectiveness of unsupervised learning for video classification. As stated at the end of the paper, there is still room for improvement:

We observed that linear readout on Kinetics is a good indicator of the performance on other datasets and that unsupervised pre-training can compete with the supervised counterpart on several datasets, but there is room for improvement. We hope that our baselines will foster research and provide common ground for future comparisons.

Recommended reading

CPVT: A convolution implicitly encodes positional information

DETR: Object detection based on Transformers

MoCo V3: I’m not who you think I am!

The application of Transformer in semantic segmentation

ViT: Transformer is All You Need!

PVT: Pyramid Vision Transformer as a backbone for dense prediction tasks!

FixRes: Beating SOTA on the ImageNet dataset twice

How can Transformer break into CV and kill CNN?

Try MoCo instead of ImageNet pre-training!
