
1. Original title of the paper:

Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[C]//Advances in neural information processing systems. 2014: 568-576

2. Main Contributions:

1. A two-stream convolutional network model is proposed, consisting of a spatial network and a temporal network.

2. A model that performs well on stacked multi-frame dense optical flow is proposed, despite limited training data. (This again refers to the two-stream convolutional network.)

3. Multi-task learning is proposed: the model is trained on two different action-classification datasets simultaneously, which increases the amount of training data and improves results.

3. Two-stream model structure

As shown in the figure above, the model is divided into two parts: the spatial stream takes a single frame as input, while the temporal stream takes the optical flow of multiple frames as input, and the two streams are fused late, after the softmax. The paper proposes two fusion methods: one averages the softmax scores; the other trains a multi-class SVM classifier, using the L2-normalized softmax scores as its input.
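The averaging variant of late fusion can be sketched in a few lines. This is a minimal illustration, not the paper's code; the random scores stand in for the two networks' actual outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical class scores from each stream for one clip (UCF-101: 101 classes).
spatial_scores = softmax(np.random.randn(101))
temporal_scores = softmax(np.random.randn(101))

# Late fusion: average the two softmax distributions, then take the argmax.
fused = (spatial_scores + temporal_scores) / 2.0
prediction = int(np.argmax(fused))
```

The SVM alternative would instead treat the concatenated softmax scores of both streams as a feature vector and train a multi-class classifier on top.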

3.1 Optical flow convolutional network

L+1 consecutive frames are used to compute L optical flow fields (all precomputed before training starts); each flow field has an x-direction and a y-direction component, giving 2L input channels. In the stacked input, channel 2k-1 holds the x-component and channel 2k the y-component of the k-th flow field. (The paper compares different values of L and finds that 10 works best.)
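The channel layout above can be made concrete with a small sketch. It assumes `flows` is a list of L precomputed flow fields, each of shape H x W x 2 (x-displacement in channel 0, y-displacement in channel 1); the function name is my own, not the paper's:

```python
import numpy as np

def stack_flows(flows):
    """Interleave L flow fields into a 2L-channel input:
    channel 2k-1 (1-based) = x-component, channel 2k = y-component."""
    L = len(flows)
    H, W, _ = flows[0].shape
    stacked = np.empty((H, W, 2 * L), dtype=np.float32)
    for k, flow in enumerate(flows, start=1):
        stacked[:, :, 2 * k - 2] = flow[:, :, 0]  # 0-based index for channel 2k-1
        stacked[:, :, 2 * k - 1] = flow[:, :, 1]  # 0-based index for channel 2k
    return stacked

L = 10  # the value the paper found to work best
flows = [np.random.randn(224, 224, 2).astype(np.float32) for _ in range(L)]
inp = stack_flows(flows)  # shape (224, 224, 20)
```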

Bidirectional optical flow: at time t, forward flow is computed over frames t to t+L/2 and backward flow over frames t to t-L/2 (the backward flow computes displacement in the opposite temporal direction; I am not clear on its practical significance), so the input is still 2L channels.
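Structurally, the bidirectional variant just splits the channel budget between the two directions. A minimal sketch, assuming L/2 forward and L/2 backward flow fields of shape H x W x 2 each (the helper name is hypothetical):

```python
import numpy as np

def stack_bidirectional(forward_flows, backward_flows):
    """Concatenate L/2 forward and L/2 backward flow fields along the
    channel axis, yielding the same 2L-channel input as the unidirectional case."""
    all_flows = list(forward_flows) + list(backward_flows)
    return np.concatenate([f.astype(np.float32) for f in all_flows], axis=2)

half = 5  # L/2 with L = 10
fwd = [np.random.randn(224, 224, 2) for _ in range(half)]
bwd = [np.random.randn(224, 224, 2) for _ in range(half)]
inp = stack_bidirectional(fwd, bwd)  # shape (224, 224, 20)
```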

Mean flow subtraction: in general, the optical flow between two frames captures not only the motion of objects in the image but also the motion of the camera. Since the model should learn only object motion, the paper proposes computing the mean of each flow field and subtracting it point by point, reducing the influence of camera motion on training.
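This amounts to removing each channel's spatial mean from the stacked input. A minimal numpy sketch (my own helper, assuming the 2L-channel stacked input described above):

```python
import numpy as np

def subtract_mean_flow(stacked):
    """Subtract each channel's spatial mean, approximating removal of the
    global displacement contributed by camera motion. stacked: H x W x 2L."""
    return stacked - stacked.mean(axis=(0, 1), keepdims=True)

x = np.random.randn(224, 224, 20).astype(np.float32)
y = subtract_mean_flow(x)
# Every channel of y now has (near-)zero mean.
```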

The network input is a fixed size: 224x224x2L. (Note that L here counts optical flow fields, not original frames.)

Datasets: UCF-101 and HMDB-51

4. Multi-task training

Too small a dataset leads to overfitting. To avoid this, the paper adds two softmax layers on a shared network: one for UCF-101 and one for HMDB-51. Each has its own loss function, and the sum of the two losses is used for training.
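The summed multi-task loss can be sketched as follows. This is an illustration with random logits standing in for a shared network's two heads (101-way for UCF-101, 51-way for HMDB-51); the labels are arbitrary placeholders:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, label):
    """Negative log-likelihood of the true class under the softmax."""
    return -np.log(softmax(logits)[label])

rng = np.random.default_rng(0)
ucf_logits = rng.standard_normal(101)   # head 1: UCF-101 classes
hmdb_logits = rng.standard_normal(51)   # head 2: HMDB-51 classes

# Each sample contributes to the loss of its own dataset's head;
# the two losses are summed and backpropagated through the shared layers.
loss = cross_entropy(ucf_logits, label=3) + cross_entropy(hmdb_logits, label=7)
```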

5. Evaluation

5.1 Three training strategies for the spatial network:

Training from scratch on UCF-101.

Pre-training on ILSVRC-2012, then fine-tuning on UCF-101.

Freezing the pre-trained network and training only the last layer.

The third method turned out to work best. (Dropout is used here to prevent overfitting.)

5.2 For the temporal network, the paper mainly tests the effect of different values of L, compares the trajectory-tracking mode (which I skipped above, as it is not the focus of the paper) against optical-flow stacking, and tests whether subtracting the mean flow helps.

The conclusions: L = 10 works best; subtracting the mean flow improves results, though not by much; and trajectory stacking performs worse than optical-flow stacking.

5.3 The paper compares unidirectional versus bidirectional optical flow, fusion by averaging versus fusion by SVM, the two-stream model versus traditional recognition methods, and training with versus without multi-task learning. Conclusions: multi-task learning is effective; for fusing the two networks, the SVM classifier works better than averaging; and bidirectional optical flow turns out to bring no benefit in the case of ConvNet fusion.

Conclusion: this method outperforms traditional methods.

This article comes from the public account CV Technical Guide.
