Author | Rohit Ghosh
Translation | Zhang Jianxin
Editing | Debra
AI Frontline introduction: In this article, the author summarizes the literature on video action recognition, explains what action recognition is and why it is so difficult, gives an overview of the solutions, and summarizes the relevant papers.







Medical images such as MRI and CT scans (three-dimensional images) are very similar to videos: both stack two-dimensional spatial information along a third dimension. Much like diagnosing an anomaly from a 3D image, recognizing an action in a video requires capturing contextual information from the entire video rather than from each frame alone.


Figure 1: Left, an example head CT scan. Right, a sample video from an action recognition dataset. The z-axis of the CT volume plays a role analogous to the time dimension of the video.

In this article, I will summarize the literature on video action recognition. The post is divided into three parts:

  • What is action recognition and why is it hard

  • Overview of solutions

  • Summary of papers

What is action recognition?

The action recognition task involves identifying different actions in a video clip (a sequence of two-dimensional frames), where the action may or may not be performed throughout the entire video. This seems like a natural extension of image classification: perform image classification on each frame of the video and then aggregate the per-frame predictions. Despite the success of deep learning architectures in image classification (ImageNet), progress in architectures for video classification and representation learning has been much slower.

Why is action recognition so hard?

1. Huge computational cost. A simple 2D convolutional network for 101-class classification has only about 5M parameters, whereas the same architecture inflated to a 3D structure grows to about 33M parameters. Training a 3D convolutional network (3D ConvNet) on UCF101 takes 3 to 4 days, and about two months on Sports-1M, which makes extensive architecture search difficult and prone to overfitting [1].

2. Capturing long-term context. Action recognition involves capturing spatio-temporal context across frames. In addition, the captured spatial information has to be compensated for camera motion. Even strong spatial object detection is not enough, because motion carries far more fine-grained information. For robust prediction, both local motion context and global motion context must be captured. Take the videos shown in Figure 2: a powerful image classifier can recognize the human and the water in both videos, but it cannot tell apart the temporal, periodic motion characteristics that distinguish freestyle from breaststroke.





Figure 2: Above, freestyle; below, breaststroke. Capturing temporal motion is the key to distinguishing these two seemingly similar cases. Note also how the camera angle suddenly changes in the middle of the freestyle video.

Designing an architecture that captures spatio-temporal information involves evaluating several options that are non-trivial and expensive to assess. For example, some of the design choices are:

  • One network that captures spatio-temporal information vs. two separate networks, one for temporal and one for spatial information

  • Fusing predictions across multiple clips

  • End-to-end training vs. separate feature extraction and classification

UCF101 and Sports1M have long been the most popular benchmark datasets. Exploring reasonable architectures on Sports1M is extremely expensive. For UCF101, although the number of frames is comparable to ImageNet, the high spatial correlation among video frames means the actual diversity available for training is far smaller. Moreover, given the similarity of topics (sports) across both datasets, generalizing benchmarked architectures to other tasks remained a problem. This has recently been addressed with the introduction of the Kinetics dataset [2].


Example UCF-101 illustration. Source (http://www.thumos.info/)

It should be noted that anomaly detection in 3D medical images does not involve all of the challenges mentioned here. Action recognition and medical imaging differ in the following ways:

For medical imaging, temporal context may not be as important as it is for action recognition. For example, detecting a massive intracranial hemorrhage from a head CT scan requires relatively little context across slices: a massive bleed can be detected from a single slice. In contrast, detecting lung nodules from a chest CT scan does involve capturing 3D context, because on a two-dimensional slice a nodule looks like a round object, just as bronchi and blood vessels do. Only by capturing the 3D context can the spherical nodule be distinguished from the cylindrical vessel.

For action recognition, most research architectures rely on pre-trained 2D convolutional networks as a starting point to achieve markedly better convergence. For medical imaging, no such pre-trained networks are available.

Solution overview

Before deep learning, most traditional computer vision (CV) algorithms for action recognition could be broken into the following three broad steps:

  1. Local high-dimensional visual features describing a region of the video are extracted, either densely [3] or at a sparse set of interest points [4][5].

  2. The extracted features are combined into a fixed-size video-level description. A popular variant of this step is the bag of visual words, used to encode features at the video level.

  3. A classifier, such as an SVM or a random forest (RF), is trained on the bag-of-visual-words representation to produce the final prediction.

Among the algorithms that use shallow hand-crafted features in step 1, improved Dense Trajectories (iDT) [6] is the state of the art. Meanwhile, 3D convolution, applied to action recognition in 2013, did not help much [7]. Soon afterwards, in 2014, two groundbreaking research papers were published that form the backbone of all the papers discussed in this article. Their main difference is the design choice around how to combine spatio-temporal information.

Scheme 1: single-stream network

In this June 2014 paper (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42455.pdf), Karpathy et al. explore several ways to fuse temporal information from consecutive frames using a 2D convolutional network pre-trained on images [8].


Figure 3: Fusion approaches. Source: Karpathy et al. [8]

As Figure 3 shows, consecutive frames of the video are the input in all settings. Single Frame uses a single-frame architecture and fuses information from all frames at the final stage. Late Fusion uses two networks with shared parameters, spaced 15 frames apart, and combines the predictions at the end. Early Fusion convolves over 10 frames in the first layer. Slow Fusion fuses at multiple stages and is a balance between early and late fusion. For the final prediction, multiple clips are sampled from the entire video and their prediction scores are averaged.
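To make the fusion options concrete, here is a minimal sketch of the early-fusion idea, assuming PyTorch: a short stack of frames is folded into the channel dimension of the very first convolution so that layer mixes temporal information. The layer widths and class name are illustrative assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Minimal early-fusion classifier: T RGB frames are stacked along the
    channel axis so the very first convolution mixes temporal information.
    Layer widths are illustrative, not those of Karpathy et al."""

    def __init__(self, num_frames=10, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, clip):
        # clip: (batch, T, 3, H, W) -> fold time into channels
        b, t, c, h, w = clip.shape
        x = clip.reshape(b, t * c, h, w)
        x = self.features(x).flatten(1)
        return self.classifier(x)

logits = EarlyFusionNet()(torch.randn(2, 10, 3, 112, 112))  # -> (2, 101)
```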

Despite extensive experiments, the authors found that the results were significantly worse than those of existing algorithms based on hand-crafted features. There were several reasons for this failure:

  • The learned spatio-temporal features did not capture motion features.

  • The dataset was relatively less diverse, making it hard to learn such detailed features.

Scheme 2: two-stream network

In this pioneering June 2014 paper by Simonyan and Zisserman (https://arxiv.org/pdf/1406.2199.pdf), the authors build on the failures of the earlier Karpathy et al. paper. Given how hard it is for deep architectures to learn motion features, they explicitly model motion in the form of stacked optical flow vectors. So instead of a single network for spatial context, the architecture has two separate networks: one for spatial context (pre-trained) and one for motion context. The input to the spatial network is a single frame of the video. The authors experimented with the input to the temporal network and found that bidirectional optical flow stacked across 10 consecutive frames worked best. The two streams are trained separately and combined using an SVM. The final prediction is obtained as in the previous paper, by averaging the prediction scores of multiple sampled clips. A rough sketch of this two-stream setup follows Figure 4.


Figure 4: Two-stream architecture. Source (https://arxiv.org/pdf/1406.2199.pdf)
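The sketch below illustrates the two-stream idea, assuming PyTorch/torchvision: a spatial stream on a single RGB frame and a temporal stream on a stack of optical flow fields (2 channels per flow frame), with class scores averaged at the end. The ResNet-18 backbones and score averaging are stand-ins for the paper's networks and SVM fusion.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    """Rough two-stream sketch: a spatial stream on one RGB frame and a
    temporal stream on stacked optical flow. Score averaging stands in for
    the SVM fusion used in the original paper."""

    def __init__(self, flow_frames=10, num_classes=101):
        super().__init__()
        self.spatial = models.resnet18(weights=None)   # in practice: ImageNet-pretrained
        self.spatial.fc = nn.Linear(self.spatial.fc.in_features, num_classes)

        self.temporal = models.resnet18(weights=None)
        # Replace the stem so it accepts 2 * flow_frames flow channels.
        self.temporal.conv1 = nn.Conv2d(2 * flow_frames, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)
        self.temporal.fc = nn.Linear(self.temporal.fc.in_features, num_classes)

    def forward(self, rgb_frame, flow_stack):
        # rgb_frame: (B, 3, H, W); flow_stack: (B, 2*L, H, W)
        s = self.spatial(rgb_frame).softmax(dim=1)
        t = self.temporal(flow_stack).softmax(dim=1)
        return (s + t) / 2  # averaged class scores

net = TwoStreamNet()
scores = net(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
```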

Although this method improves on the single-stream approach by explicitly capturing local temporal motion, it still has some drawbacks:

  1. Since video-level predictions are obtained by averaging the prediction scores of sampled clips, long-range temporal information is still missing from the learned features.

  2. Because the training clips are sampled uniformly from the video, they suffer from the false-label-assignment problem: every clip is assumed to carry the video's label, which is at odds with the fact that the action may occupy only a small portion of the whole video.

  3. The method involves pre-computing optical flow vectors and storing them separately. Moreover, the two streams are trained separately, so end-to-end training is still some way off.

Summary of papers

The papers that follow are, in one way or another, evolutions of these two papers (single-stream and two-stream):

  1. LRCN

  2. C3D

  3. Conv3D & Attention

  4. TwoStreamFusion

  5. TSN

  6. ActionVlad

  7. HiddenTwoStream

  8. I3D

  9. T3D

The recurring themes across these papers can be summarized as follows; all of them are improvisations on top of these basic ideas.


Recurring themes across the papers. Source: https://arxiv.org/pdf/1705.07750.pdf

For each paper, I list its main contributions and explain them. I also show its benchmark score on UCF101-split1 (http://crcv.ucf.edu/data/UCF101.php).

LRCN

Long-term Recurrent Convolutional Networks for visual recognition and description

Submitted by Donahue et al., 17 November 2014

Arxiv link: https://arxiv.org/abs/1411.4389

Main contributions:
  • Builds on previous work by using an RNN rather than a stream-based design

  • Extends the encoder-decoder architecture to video representations

  • Proposes an end-to-end trainable architecture for action recognition

Explanation:

In earlier work, Ng et al. [9] had explored the idea of putting LSTMs on top of separately trained feature maps to see whether they could capture temporal information from clips. Unfortunately, they concluded that temporal pooling of convolutional features was more effective than stacking LSTMs on top of trained feature maps. In the present paper, the authors build on the same idea of using LSTM blocks (the decoder) after convolutional blocks (the encoder), but train the entire architecture end to end. They also compare RGB and optical flow as inputs and find that a weighted combination of the prediction scores from both inputs works best.


Figure 5: Left, LRCN for action recognition. Right, the generic LRCN architecture for all tasks. Source (https://arxiv.org/pdf/1411.4389.pdf)

Algorithm:

During training, 16-frame clips are sampled from the video. The architecture is trained end to end with either RGB frames or the optical flow of a 16-frame clip as input. The final prediction for each clip is the average of the per-time-step predictions, and the video-level prediction is the average of the clip predictions.
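A minimal LRCN-style sketch in PyTorch, assuming a ResNet-18 frame encoder as a stand-in for the paper's CNN: each frame is encoded, an LSTM runs over the per-frame features, and per-time-step class scores are averaged over the clip.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LRCN(nn.Module):
    """Minimal LRCN-style sketch: a 2D CNN encodes each frame, an LSTM runs
    over the per-frame features, and per-time-step class scores are averaged.
    Backbone and sizes are illustrative, not those of Donahue et al."""

    def __init__(self, num_classes=101, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # in practice: ImageNet-pretrained
        backbone.fc = nn.Identity()
        self.encoder = backbone                    # frame -> 512-d feature
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clip):
        # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)                  # (B, T, hidden)
        logits = self.classifier(out)              # one prediction per time step
        return logits.mean(dim=1)                  # average over time steps

scores = LRCN()(torch.randn(2, 16, 3, 224, 224))   # 16-frame clips -> (2, 101)
```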

Benchmark (UCF101-split1):


My comments:

Although the authors propose an end-to-end trainable framework, some drawbacks remain:

  • Incorrect label assignment, because clips inherit the label of the whole video

  • Inability to capture long-range temporal information

  • The use of optical flow means flow features must be pre-computed separately

In a later paper [10], Varol et al. tried to remedy the limited temporal range by using a lower spatial resolution and longer clips (60 frames), which significantly improved performance.

C3D


Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran et al.

Submitted on 02 December 2014

Arxiv link: https://arxiv.org/pdf/1412.0767



Main contributions:
  • Uses 3D convolutional networks as feature extractors

  • Extensive search for the best 3D convolutional kernels and architectures

  • Uses deconvolutional layers to interpret the model's decisions

Explanation:

The authors build on the work of Karpathy et al. (single stream), but instead of using 2D convolutions across frames they apply 3D convolutions to the video volume. The idea is to train these networks on Sports1M and then use them (or an ensemble of networks with different temporal depths) as feature extractors for other datasets. Their finding: a simple linear classifier, such as an SVM, on top of the ensemble of extracted features performed better than the best existing algorithms, and the model worked even better when hand-crafted features like iDT were used in addition.


Differences between the C3D paper and the single-stream paper. Source (https://arxiv.org/pdf/1412.0767)

Another interesting part of this paper is the use of deconvolutional layers (http://blog.qure.ai/notes/visualizing_deep_learning) to interpret the model's decisions. The authors find that the network focuses on spatial appearance in the first few frames and tracks motion in the subsequent frames.

Algorithm:

During training, five 2-second clips are randomly extracted from each video, with the action label of the whole video used as the ground truth for each clip. At test time, 10 clips are randomly sampled and their prediction scores averaged to obtain the final prediction. A toy 3D ConvNet sketch follows the figure below.


3D convolution applied to a spatio-temporal cube.
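To make the 3D-convolution idea concrete, here is a toy 3D ConvNet in PyTorch. It is far shallower than the real C3D; only the use of 3x3x3 kernels over a (channels, time, height, width) volume follows the paper, and the depths and widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyC3D(nn.Module):
    """Toy 3D ConvNet in the spirit of C3D: all kernels are 3x3x3 and operate
    on a (channels, time, height, width) volume. Much shallower than C3D."""

    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # keep early temporal depth
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, clip):
        # clip: (B, 3, T, H, W), e.g. a 16-frame crop
        x = self.features(clip).flatten(1)
        return self.classifier(x)

logits = TinyC3D()(torch.randn(2, 3, 16, 112, 112))   # -> (2, 101)
```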

Benchmark (UCF101-split1):


My comments:

Long-range temporal modeling is still a problem. Moreover, training such a huge network is computationally expensive, especially for medical imaging, where pre-training on natural images does not help much.

Remark:

Around the same time, Sun et al. [11] introduced the idea of factorized 3D convolutional networks (FSTCN), where the authors explored decomposing a 3D convolution into a spatial 2D convolution followed by a temporal 1D convolution. The 1D convolution, placed after the 2D convolutional layer, was implemented as a 2D convolution over the temporal and channel dimensions. FSTCN also achieved respectable results on the UCF101 splits. A rough sketch of the factorization follows the figure below.


The factorization of 3D convolution proposed by FSTCN. Source (https://arxiv.org/pdf/1510.00562.pdf)
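The sketch below illustrates the general factorization, assuming PyTorch: a spatial (1, k, k) convolution followed by a temporal (k, 1, 1) convolution. The paper's actual layer layout (a 2D convolution over the time and channel dimensions) differs in detail, so treat this as an illustrative decomposition, not the FSTCN implementation.

```python
import torch
import torch.nn as nn

class FactorizedSTConv(nn.Module):
    """One factorized spatio-temporal block: a 2D spatial convolution
    (1 x k x k) followed by a 1D temporal convolution (k x 1 x 1)."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (B, C, T, H, W)
        return self.relu(self.temporal(self.relu(self.spatial(x))))

y = FactorizedSTConv(3, 64)(torch.randn(1, 3, 16, 112, 112))  # (1, 64, 16, 112, 112)
```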

Conv3D & Attention


Describing Videos by Exploiting Temporal Structure

Yao et al.

Submitted on 25 April 2015

Arxiv link: https://arxiv.org/abs/1502.08029



Main contributions:
  • A novel 3D CNN-RNN encoder-decoder architecture that captures local spatio-temporal information

  • An attention mechanism within the CNN-RNN encoder-decoder framework to capture the global context

Explanation:

Although this paper does not deal with action recognition directly, it was a landmark paper for video representations. In it, the authors use a 3D CNN + LSTM as the base architecture for a video description task. On top of this base, they use a pre-trained 3D CNN to improve the results.

Algorithm:

The setup is almost identical to the encoder-decoder architecture described for LRCN, with two differences:

  1. Instead of passing features from the 3D CNN straight to the LSTM, the 3D CNN feature maps of a clip {v1, v2, ..., vn} are concatenated with stacked 2D feature maps of the same frames to enrich each frame's representation. Note: the 2D and 3D CNNs used here are pre-trained, not trained end to end as in LRCN.

  2. Instead of averaging the temporal feature vectors across all frames, a weighted average is used to combine them. The attention weights are computed from the LSTM output at each time step (see the sketch after the figure below).


The Attention mechanism for action recognition. Source (https://arxiv.org/abs/1502.08029)
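The sketch below shows the core of such attention-weighted temporal pooling in PyTorch: per-frame scores are computed from the decoder (LSTM) state, normalized with a softmax, and used to weight the frame features instead of plain averaging. The scoring function and dimensions are simplified assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Soft attention over per-frame features: a score per frame is computed
    from the current decoder state, softmax-normalized, and used as weights
    in place of uniform temporal averaging."""

    def __init__(self, feat_dim=512, state_dim=256):
        super().__init__()
        self.score = nn.Linear(feat_dim + state_dim, 1)

    def forward(self, frame_feats, decoder_state):
        # frame_feats: (B, N, feat_dim); decoder_state: (B, state_dim)
        n = frame_feats.size(1)
        state = decoder_state.unsqueeze(1).expand(-1, n, -1)
        scores = self.score(torch.cat([frame_feats, state], dim=-1))  # (B, N, 1)
        weights = scores.softmax(dim=1)
        return (weights * frame_feats).sum(dim=1)  # attention-weighted context

pool = TemporalAttentionPool()
ctx = pool(torch.randn(2, 20, 512), torch.randn(2, 256))  # -> (2, 512)
```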

Benchmark:


My comments:

This was one of the landmark papers of 2015, introducing the attention mechanism for video representations for the first time.

TwoStreamFusion


Convolutional Two-Stream Network Fusion for Video Action Recognition

Feichtenhofer et al.

Submitted on 22 April 2016

Arxiv link: https://arxiv.org/abs/1604.06573



Main contributions:
  • Long-range temporal modeling through better long-range losses

  • A novel multi-level fusion architecture

Explanation:

In this paper, the authors use the basic two-stream architecture with two novel approaches and improve performance without significantly increasing the number of parameters. They explore the efficacy of two main ideas:

  1. Fusing the spatial and temporal streams (how and when). For a task that discriminates between brushing hair and brushing teeth, the spatial network can capture the spatial dependency (hair or teeth) in a video, while the temporal network can capture the presence of periodic motion at each spatial location. It is therefore important to map a spatial feature at a particular facial region to the corresponding region of the temporal feature map. To achieve this, the two networks need to be fused at an early level, so that responses at the same pixel position are put in correspondence, rather than being fused at the end as in the basic two-stream architecture.

  2. Combining the outputs of the temporal network across time frames, so that long-range temporal dependencies are also modeled.

Algorithm: Almost identical to the two-stream architecture, except for:

  1. As shown in the figure below, the outputs of the conv_5 layers of both streams are fused by conv + pooling. There is another fusion at the final layer. The fused output is used for the spatio-temporal loss.


Possible strategies for fusing the spatial and temporal streams. The strategy on the right performed better. Source (https://arxiv.org/abs/1604.06573)

2. For temporal fusion, the outputs of the temporal network, stacked across time, are fused by conv + pooling and used for the temporal loss. The fused architecture is shown below, followed by a sketch of the fusion step.



The two-stream fusion architecture has two paths, one for step 1 and one for step 2. Source (https://arxiv.org/abs/1604.06573)
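A minimal sketch of the spatial fusion step, assuming PyTorch: the conv5 feature maps of the two streams are concatenated channel-wise and mixed by a learned convolution plus pooling, so responses at the same pixel location are combined. The paper's full scheme also stacks feature maps across time and fuses them with 3D conv + pooling; the channel sizes here are placeholders.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Sketch of conv fusion: stack the conv5 maps of the spatial and temporal
    streams along channels and mix them with a learned convolution."""

    def __init__(self, channels=512):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, spatial_map, temporal_map):
        # both: (B, C, H, W) conv5 outputs of the two streams
        x = torch.cat([spatial_map, temporal_map], dim=1)
        return self.pool(torch.relu(self.fuse(x)))

fused = ConvFusion()(torch.randn(1, 512, 14, 14), torch.randn(1, 512, 14, 14))
```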

Benchmark (UCF101-split):


My comments:

The authors argue that the TwoStreamFusion approach is superior because its performance exceeds that of C3D without the extra parameters that C3D uses.

TSN


Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Wang et al.

Submitted on 02 August 2016

Arxiv link: https://arxiv.org/abs/1608.00859


Main contributions:

  • An effective solution for long-range temporal modeling

  • Establishes good practices for training deep action recognition networks, such as batch normalization, dropout, and pre-training

Explanation:

In this paper, the authors refine the two-stream architecture to produce state-of-the-art results. There are two main differences from the original two-stream paper:

  1. They suggest sampling clips sparsely across the whole video, rather than sampling them at random, to better model long-range signals.

  2. For the final video-level prediction, the authors explore several strategies. The best one is:

    1. Combine the scores of the temporal and spatial streams (and of any other stream, if other input modalities are used) separately, by averaging over the snippets

    2. Combine the final spatial and temporal scores with a weighted average, and apply a softmax over all classes

Another important part of the paper addresses the overfitting problem (caused by the small dataset sizes) and establishes now-popular techniques such as batch normalization, dropout, and pre-training to counter it. The authors also evaluate two new input modalities besides optical flow: warped optical flow and RGB differences.

Algorithm:

During training and prediction, a video is divided into K segments of equal duration, and one snippet is sampled at random from each of the K segments. The rest of the pipeline is similar to the two-stream architecture, apart from the changes described above. A sketch of the sparse sampling and score consensus follows the figure below.


Temporal segment network architecture. Source (https://arxiv.org/pdf/1608.00859.pdf)
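A minimal sketch of TSN's sparse sampling and segmental consensus, written as plain PyTorch helpers. Snippet extraction and the backbone networks are omitted, and the function names are mine, not the paper's.

```python
import torch

def sample_segment_indices(num_frames, k=3):
    """TSN's sparse sampling: split the video into K equal segments and draw
    one random frame index from each segment (snippet length ignored here)."""
    seg_len = num_frames // k
    return [i * seg_len + int(torch.randint(seg_len, (1,))) for i in range(k)]

def segmental_consensus(snippet_scores):
    """Average the per-snippet class scores (the consensus TSN found to work
    best) and apply a softmax over classes for the video-level prediction."""
    return snippet_scores.mean(dim=0).softmax(dim=-1)

indices = sample_segment_indices(num_frames=300, k=3)    # one index per segment
video_scores = segmental_consensus(torch.randn(3, 101))  # (101,) class probabilities
```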

Benchmark (UCF101-split1):


My comments:

This paper attempts to tackle two big challenges in action recognition, overfitting due to small datasets and long-range temporal modeling, and the results are very good. However, the need to pre-compute optical flow and the related input modalities remains a problem.

ActionVLAD


ActionVLAD: Learning Spatio-temporal Aggregation for Action Classification

Girdhar et al.

Submitted on April 10, 2017

Arxiv link: https://arxiv.org/pdf/1704.02895.pdf


Main contributions:

  • A learnable video-level aggregation of features

  • An end-to-end trainable model that uses this aggregation to capture long-range dependencies

Explanation:

The most notable contribution of this paper is the use of a learnable feature aggregation (VLAD) instead of ordinary aggregation with max-pooling or average-pooling. The aggregation technique is similar to a bag of visual words: there are multiple learned anchor points (say c1, ..., cK), i.e. vocabulary words, that represent K typical action (or sub-action) related spatio-temporal features. The output of each stream in the two-stream architecture is encoded in terms of these K "action word" features: at any given spatial or temporal location, each feature is the difference between the output and the corresponding anchor point.


ActionVLAD: a bag of action-based visual "words". Source (https://arxiv.org/pdf/1704.02895.pdf)

Max or average pooling represents the entire distribution of feature points by a single descriptor, which can be sub-optimal for representing a whole video composed of multiple sub-actions. Instead, the video-level aggregation proposed in this paper represents the entire distribution of descriptors over multiple sub-actions by splitting the descriptor space into K cells and pooling inside each cell.


Max or average pooling works well for similar features but does not adequately capture the full distribution of features. ActionVLAD clusters the appearance and motion features and aggregates their residuals from the nearest cluster centers. Source (https://arxiv.org/pdf/1704.02895.pdf)

Algorithm:

Everything is pretty much the same as in the two-stream architecture, except for the use of the ActionVLAD pooling layer. The authors experiment with placing the layer at multiple levels and find that putting the ActionVLAD layer after the last conv layer, with late fusion of the streams, works best.
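Below is a simplified NetVLAD-style pooling layer in PyTorch to illustrate the aggregation: features are softly assigned to learnable "action word" centers and the residuals to each center are summed and normalized. The paper's exact normalization details and its placement within the two-stream network are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADPool(nn.Module):
    """NetVLAD-style pooling: features are softly assigned to K learnable
    centers and the residuals to each center are summed, giving a K x D
    descriptor per video. Simplified relative to ActionVLAD."""

    def __init__(self, dim=512, num_clusters=64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim) * 0.01)
        self.assign = nn.Linear(dim, num_clusters)

    def forward(self, feats):
        # feats: (B, N, D) descriptors pooled over all positions and frames
        soft = self.assign(feats).softmax(dim=-1)            # (B, N, K)
        residuals = feats.unsqueeze(2) - self.centers        # (B, N, K, D)
        vlad = (soft.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                     # intra-normalize
        return F.normalize(vlad.flatten(1), dim=-1)          # (B, K*D)

desc = VLADPool()(torch.randn(2, 100, 512))                   # -> (2, 64*512)
```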

Benchmark (UCF101-split1):


My comments:

The use of VLAD as an effective pooling technique was proven long ago. Extending it to an end-to-end trainable framework made this technique very robust and state-of-the-art for most action recognition tasks as of early 2017.

HiddenTwoStream


Hidden Two-Stream Convolutional Networks for Action Recognition

Zhu et al.

Submitted on 2 April 2017

Arxiv link: https://arxiv.org/abs/1704.00389



Main contributions:
  • A novel architecture that generates optical flow input on the fly using a separate network

Explanation:

The use of optical flow in the two-stream architecture forces the flow to be pre-computed for every sampled frame beforehand, which hurts both storage and speed. This paper advocates an unsupervised architecture that generates the optical flow for a stack of frames on the fly.

Optical flow estimation can be treated as an image reconstruction problem. Given a pair of adjacent frames I1 and I2 as input, a convolutional neural network generates a flow field V. Using the predicted flow field V and I2, I1 can then be reconstructed as I1' by inverse warping, and the network is trained to minimize the difference between I1 and its reconstruction.
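A minimal sketch of the photometric reconstruction loss used to train such a flow generator, assuming PyTorch: I2 is warped back towards I1 with the predicted flow via grid sampling, and the reconstruction error is penalized. The smoothness and multi-scale terms used in the paper are omitted.

```python
import torch
import torch.nn.functional as F

def photometric_loss(img1, img2, flow):
    """Warp img2 back to img1 with the predicted flow and penalize the
    reconstruction error. img1, img2: (B, 3, H, W); flow: (B, 2, H, W) in pixels."""
    b, _, h, w = img1.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert the flow from pixels to normalized offsets and shift the grid.
    offset = torch.stack([flow[:, 0] * 2 / (w - 1), flow[:, 1] * 2 / (h - 1)], dim=-1)
    warped = F.grid_sample(img2, grid + offset, align_corners=True)
    return (warped - img1).abs().mean()

loss = photometric_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                        torch.zeros(1, 2, 64, 64))
```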

Algorithm:

The authors explore multiple strategies and architectures for generating optical flow with the highest possible FPS and the fewest parameters, while sacrificing as little accuracy as possible. The final architecture is the same as the two-stream architecture, with the following changes:

  1. The temporal stream now has an optical-flow generation network (MotionNet) stacked onto the usual temporal stream architecture, so the input to the temporal stream is raw frames rather than pre-computed optical flow.

  2. There is an additional multi-level loss for the unsupervised training of MotionNet. The authors also show a performance gain from using TSN-style fusion instead of the conventional two-stream architecture.


HiddenTwoStream: MotionNet generates optical flow on the fly. Source (https://arxiv.org/pdf/1704.00389.pdf)

Benchmark (UCF101-split1):


My comments:

The main contribution of this paper is the improvement in prediction speed and the associated cost reduction. By generating flow automatically, the authors remove the dependence on slower traditional methods of computing optical flow.

I3D


Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Carreira et al.

Submitted on 22 May 2017

Arxiv link: https://arxiv.org/abs/1705.07750



Main contributions:
  • Combines 3D models within a two-stream architecture while exploiting 2D pre-training

  • The Kinetics dataset, for future benchmarking and greater diversity of action datasets

Explanation:

This paper takes off from where C3D left. Instead of a single 3D network, the authors use two different 3D networks, one for each stream of the two-stream architecture. In addition, to take advantage of pre-trained 2D models, the authors repeat the pre-trained 2D weights along the third (temporal) dimension. The input to the spatial stream now consists of frames stacked along the time dimension, rather than the single frame of the basic two-stream architecture.
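The weight "inflation" trick can be sketched as follows in PyTorch: a pre-trained 2D kernel is repeated along the new time axis and rescaled, so that a "boring" video of identical frames produces the same activations as the 2D network did on a single image. Stride and padding handling here are simplified assumptions.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d, time_kernel=3):
    """Bootstrap a 3D convolution from a 2D one (the I3D trick): repeat the
    2D kernel along the time axis and divide by its length."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_kernel, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(time_kernel // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        w2d = conv2d.weight                      # (out, in, k, k)
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
                            / time_kernel)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3))
out = conv3d(torch.randn(1, 3, 16, 112, 112))    # -> (1, 64, 16, 56, 56)
```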

Algorithm:

The same as the basic two-stream architecture, but with a 3D network for each stream.

Benchmark (UCF101-split1):


My comments:

The main contribution of this paper is the demonstration of the benefit of using pre-trained 2D convolutional networks. The open-sourced Kinetics dataset is the paper's other crucial contribution.

T3D


Temporal 3D ConvNets: A new architecture and transfer learning algorithm for video classification

Diba et al.

Submitted on 22 November 2017

Arxiv link: https://arxiv.org/abs/1711.08200



Main contributions:
  • An architecture that combines temporal information across variable depths

  • A new architecture and technique for supervised transfer learning from a pre-trained 2D network to a 3D network

Explanation: The authors extend the work done on I3D, but propose a single-stream architecture based on a 3D DenseNet, with multi-depth Temporal Transition Layers (TTLs) stacked after the dense blocks to capture information at different temporal depths. The multi-depth pooling is achieved with pooling kernels of varying temporal sizes. A rough sketch of a TTL-style block follows the figure below.


The TTL layer alongside the rest of the T3D DenseNet architecture. Source (https://arxiv.org/abs/1711.08200)
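The sketch below captures the spirit of a Temporal Transition Layer in PyTorch: parallel 3D convolutions with different temporal kernel depths are applied to the same input and their outputs concatenated, so the block sees several temporal extents at once. The specific depths and channel widths are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TemporalTransitionLayer(nn.Module):
    """TTL-style block: parallel 3D convolutions with different temporal
    kernel depths, concatenated, followed by a spatial transition pooling."""

    def __init__(self, in_ch=128, branch_ch=64, depths=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, branch_ch, kernel_size=(d, 3, 3),
                      padding=(d // 2, 1, 1))
            for d in depths
        ])
        self.pool = nn.AvgPool3d(kernel_size=(1, 2, 2))   # spatial transition

    def forward(self, x):
        # x: (B, C, T, H, W); each branch keeps T, varies temporal receptive field
        out = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return self.pool(out)

y = TemporalTransitionLayer()(torch.randn(1, 128, 16, 28, 28))  # (1, 192, 16, 14, 14)
```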

The authors also devise a new technique for supervised transfer learning from a pre-trained 2D convolutional network to T3D. The 2D network and T3D are fed frames and clips respectively, which may or may not come from the same video. The architecture is trained to predict 0/1 (same video or not), and the error from this prediction is back-propagated through the T3D network, thereby effectively transferring knowledge.


Supervised transfer learning. Source (https://arxiv.org/abs/1711.08200)

Algorithm:

The architecture is basically a 3D modification of DenseNet [12], with variable temporal pooling added.

Benchmark (UCF101-split1):


My comments:

Although the final results do not improve on I3D, much of that can be attributed to the much smaller model footprint compared to I3D. The novel contribution of this paper is its supervised transfer learning technique.

Original author:

Rohit Ghosh

Original English article:

http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review

[1] ConvNet Architecture Search for Spatiotemporal Feature Learning by Du Tran et al.

[2] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

[3] Action recognition by dense trajectories by Wang et al.

[4] On space-time interest points by Laptev

[5] Behavior recognition via sparse spatio-temporal features by Dollar et al.

[6] Action Recognition with Improved Trajectories by Wang et al.

[7] 3D Convolutional Neural Networks for Human Action Recognition by Ji et al.

[8] Large-scale Video Classification with Convolutional Neural Networks by Karpathy et al.

[9] Beyond Short Snippets: Deep Networks for Video Classification by Ng et al.

[10] Long-term Temporal Convolutions for Action Recognition by Varol et al.

[11] Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks by Sun et al.

[12] Densely Connected Convolutional Networks by Huang et al.