
Paper: Zhang B, Wang L, Wang Z, et al. Real-time Action Recognition with Enhanced Motion Vector CNNs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 2718-2726.

Background: The two-stream convolutional network requires optical flow, and computing optical flow is time-consuming: it takes about 60 ms per frame on a K40 GPU, which severely limits real-time performance. This paper was proposed to address that bottleneck.

Prerequisite knowledge:

In image and video compression, the H.264 standard works roughly as follows: the first frame of a group is encoded by intra-frame compression, producing an I-frame, while subsequent frames are encoded by inter-frame compression, producing P-frames and B-frames. A fully encoded frame is called an I-frame. A frame encoded only as the difference from a previous reference frame is called a P-frame: a P-frame records how pixel blocks have moved relative to an earlier frame, so the next image is reconstructed from the previous frame plus the P-frame, and the whole video is recovered by repeating this process. A frame encoded with reference to both the preceding and following frames is called a B-frame; B-frames are widely used and achieve higher compression, since they record differences with respect to frames on both sides. The motion vectors used in this paper come from the inter-coded (P) frames, i.e., they describe block-level pixel displacements between frames. I-frames contain no motion information, so when an I-frame is encountered the paper reuses the motion information of the previous (P) frame instead.
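To make the role of motion vectors concrete, here is a minimal NumPy sketch of block-based motion compensation, the mechanism behind P-frames. The block size, array shapes, and function name are illustrative assumptions for this sketch, not the paper's code.

```python
import numpy as np

def motion_compensate(prev_frame, motion_vectors, block_size=16):
    """Reconstruct the predicted part of a P-frame from the previous frame.

    prev_frame:     (H, W) grayscale reference frame.
    motion_vectors: (H // block_size, W // block_size, 2) array; each entry
                    (dy, dx) says where the block's content sits in the
                    reference frame. Shapes and block size are illustrative.
    """
    H, W = prev_frame.shape
    pred = np.zeros_like(prev_frame)
    for by in range(H // block_size):
        for bx in range(W // block_size):
            dy, dx = motion_vectors[by, bx]
            # Top-left corner of this block in the current frame.
            y, x = by * block_size, bx * block_size
            # Source location in the reference frame, clipped to the image.
            sy = int(np.clip(y + dy, 0, H - block_size))
            sx = int(np.clip(x + dx, 0, W - block_size))
            pred[y:y + block_size, x:x + block_size] = \
                prev_frame[sy:sy + block_size, sx:sx + block_size]
    return pred  # the decoder adds the residual to obtain the final frame
```

The per-block (dy, dx) fields are what the paper calls motion vectors; they come for free from the decoder rather than being recomputed, which is the source of the speed advantage.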

Main contributions:

  • The motion vector CNN (MV-CNN) is trained with the help of an optical flow CNN (OF-CNN), combining the accuracy of optical flow with the real-time speed of motion vectors.
  • Three training methods are proposed to transfer knowledge from optical flow to motion vectors and improve the real-time performance of the model.

Datasets: UCF-101, THUMOS-14

Model: Two-stream convolutional network for video action recognition

Main ideas:

A motion vector describes the displacement of relatively large blocks of pixels, while optical flow is computed per pixel. As a result, recognition with motion vectors alone is not very accurate, but they are fast to obtain and well suited to real-time use; optical flow is the opposite: accurate but slow, and therefore poorly suited to real-time use. The main content of this paper is three training methods that combine the advantages of OF-CNN and MV-CNN while avoiding their respective weaknesses; in short, the two complement each other. The network structure of OF-CNN and MV-CNN is identical.
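For intuition about why motion vectors are coarser, here is a small NumPy sketch that expands a block-level motion-vector field into a dense per-pixel field with the same layout as an optical-flow map. The macroblock size and shapes are assumptions for illustration, not the paper's exact preprocessing.

```python
import numpy as np

def mv_to_dense(motion_vectors, block_size=16):
    """Expand a block-level motion-vector field to per-pixel resolution.

    motion_vectors: (H // block_size, W // block_size, 2) array of (dy, dx)
                    per macroblock. The result has shape (H, W, 2), the same
                    layout as a dense optical-flow map, but every pixel inside
                    a block shares one vector, which is why MV maps look coarse.
    """
    dense = np.repeat(np.repeat(motion_vectors, block_size, axis=0),
                      block_size, axis=1)
    return dense
```

True optical flow can assign a different vector to every pixel; this nearest-neighbor expansion makes the loss of fine motion detail in motion vectors obvious.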

Teacher Initialization (TI): Since both optical flow and motion vectors describe local motion (optical flow at the pixel level, motion vectors at the level of pixel blocks), they are intrinsically related. Based on this, the paper proposes to use the trained parameters of OF-CNN to initialize MV-CNN, and then fine-tune MV-CNN until it converges (a minimal sketch of this step is given below).

Supervision Transfer (ST): Since the first method may lose the semantic information brought from OF-CNN during fine-tuning, the second method uses the FC-layer outputs of OF-CNN to supervise the training of MV-CNN. The inspiration comes from knowledge distillation; however, knowledge distillation usually focuses on making a small model learn the behaviour of a large model, whereas here the two networks have the same structure, so the goal is for MV-CNN to learn the behaviour of OF-CNN rather than to compress MV-CNN into a smaller model.
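A minimal PyTorch-style sketch of the TI step, assuming both networks share the same architecture class and that a trained OF-CNN checkpoint is available; the network class, layer sizes, and file name are hypothetical stand-ins, not the paper's implementation.

```python
import torch
from torch import nn

# Hypothetical temporal network; OF-CNN and MV-CNN share this architecture.
class TemporalCNN(nn.Module):
    def __init__(self, num_classes=101, in_channels=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(96, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Teacher Initialization: copy the trained OF-CNN weights into MV-CNN,
# then fine-tune MV-CNN on motion-vector inputs until convergence.
of_cnn = TemporalCNN()
of_cnn.load_state_dict(torch.load("of_cnn.pth"))   # assumed checkpoint path
mv_cnn = TemporalCNN()
mv_cnn.load_state_dict(of_cnn.state_dict())        # works because the two nets are identical
```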

So how does supervision transfer work here? The method is as follows:

(1) Apply softmax to the output of the last FC layer: T_n(O) = softmax(T_{n−1}(O)) and S_n(V) = softmax(S_{n−1}(V)), where T denotes the teacher (OF-CNN), S denotes the student (MV-CNN), O and V are the optical-flow and motion-vector inputs, and n−1 indexes the last FC layer. Softmax maps the FC outputs to class probabilities. To teach the representation learned by OF-CNN to MV-CNN, a cross-entropy loss is used to make T_n(O) and S_n(V) as close as possible.

(2) Use a temperature Temp to soften the FC outputs, which is essentially the same as converting hard labels into soft labels: P_T = softmax(T_{n−1}/Temp), P_S = softmax(S_{n−1}/Temp).

(3) Besides the supervision loss, the gap between the output of MV-CNN and the ground truth must also be minimized (Formula 3). The two losses are combined by a weighted sum, which serves as the final loss function (Formula 4); a sketch of this combined loss follows this list.
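Below is a minimal PyTorch-style sketch of this supervision-transfer loss, assuming the weighted-sum form described above. The function name is mine; the weight w = Temp² and Temp = 2 follow the paper's experimental settings, while attaching the weight to the soft-target term follows the common distillation convention and should be treated as an assumption here.

```python
import torch
import torch.nn.functional as F

def supervision_transfer_loss(student_logits, teacher_logits, labels,
                              temp=2.0, w=4.0):
    """Weighted sum of (a) cross-entropy between the temperature-softened
    teacher and student distributions and (b) ordinary cross-entropy between
    the student output and the ground-truth labels (w = temp**2)."""
    # Soft targets from the teacher (OF-CNN) and student (MV-CNN).
    p_teacher = F.softmax(teacher_logits / temp, dim=1)
    log_p_student = F.log_softmax(student_logits / temp, dim=1)
    soft_loss = -(p_teacher * log_p_student).sum(dim=1).mean()

    # Standard supervised loss against the ground truth.
    hard_loss = F.cross_entropy(student_logits, labels)

    return w * soft_loss + hard_loss
```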

Third method, Combination (TI+ST): in simple terms, OF-CNN is trained first, its parameters are transferred to MV-CNN as in the first method, and training is then supervised as in the second method (a combined sketch is given below).
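Putting the pieces together, here is a minimal sketch of one TI+ST training step. It reuses the hypothetical TemporalCNN and supervision_transfer_loss from the sketches above, so the same caveats apply: names, checkpoint path, and optimizer settings are illustrative assumptions.

```python
import torch

# TI: the student starts from the teacher's weights (see the earlier sketch).
teacher = TemporalCNN()
teacher.load_state_dict(torch.load("of_cnn.pth"))  # assumed checkpoint path
teacher.eval()
student = TemporalCNN()
student.load_state_dict(teacher.state_dict())

optimizer = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)

def train_step(mv_input, of_input, labels):
    """One ST training step: the frozen OF-CNN supervises the MV-CNN."""
    with torch.no_grad():
        teacher_logits = teacher(of_input)      # teacher sees optical flow
    student_logits = student(mv_input)          # student sees motion vectors
    loss = supervision_transfer_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```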

Evaluation:

1. The paper compares, on the two datasets (UCF-101 and THUMOS-14), the accuracy of the MV-CNN trained from scratch with that of the EMV-CNN obtained with each of the three training methods.


Conclusion: Using the enhanced motion vectors (i.e., the three training methods) reduces the accuracy loss caused by feeding motion vectors instead of optical flow to the temporal CNN, while the speed is greatly improved. The proposed methods also improve the generalization ability of MV-CNN.

2. The speed of computing motion vectors, optical flow, and the full pipeline was tested on the two datasets.


Conclusion: Real-time performance is very good; with the third method, the model processes more than 390 frames per second.

3. With w set to Temp² and Temp set to 1, 2, and 3, the paper compares the three values; the differences are small, with accuracies of 78.9%, 79.3%, and 79.2% respectively. Temp = 2 is used in the subsequent experiments.

4. Finally, since the model in this paper is based on the two-stream architecture, a spatial CNN is added to form EMV+RGB-CNN, which is tested for real-time performance on the two datasets and on different devices, and compared in accuracy and speed with other state-of-the-art models.

Conclusion: The accuracy is slightly lower than that of the original two-stream model, but the speed is nearly 30 times higher; compared with the other models, both accuracy and real-time performance are better.
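The text does not specify how the two streams are combined, so below is a minimal sketch of the common late-fusion scheme (a weighted average of the two streams' softmax scores), offered only as an illustration of what "EMV+RGB-CNN" could look like; the fusion weights are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def fuse_two_streams(rgb_logits, emv_logits,
                     spatial_weight=1.0, temporal_weight=2.0):
    """Late fusion of the spatial (RGB) and temporal (EMV) streams:
    a weighted average of class-probability scores per video."""
    scores = (spatial_weight * F.softmax(rgb_logits, dim=1)
              + temporal_weight * F.softmax(emv_logits, dim=1))
    return scores.argmax(dim=1)  # predicted action class per video
```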

This article comes from the Technical Summary series of the public account CV Technical Guide. Feel free to point out any mistakes or unreasonable points in the comments.
