From: blog.aistudyclub.com

With the rise of short-video apps, short videos now account for more than 50% of Internet traffic, so how to classify them has become a real problem. Convolutional networks can already classify images effectively with high accuracy. Can neural networks classify video as well? The answer is yes, and this article walks you through a video classification task using the ResNet3D network.

This article interprets and reproduces the ResNet3D papers. The ResNet3D network comes mainly from the following two papers:

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Would Mega-Scale Datasets Further Enhance Spatiotemporal 3D CNNs?

Project address: github.com/kenshohara/…

1. The goal

Start with the goal of Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?. We already know that CNNs have achieved great success in computer vision: given a large image dataset such as ImageNet, a CNN can reach high accuracy. Now suppose the 2D convolution kernels of an existing CNN are changed to 3D kernels. Are the existing video datasets sufficient to train such a network? The paper uses the ResNet3D network to verify exactly this.

The second paper, Would Mega-Scale Datasets Further Enhance Spatiotemporal 3D CNNs?, builds on the first; its goal is to verify whether mega-scale datasets can further enhance the performance of 3D CNNs.

2. Main findings

In the first paper, the authors found that the ResNet family of models only converges well on a dataset as large as Kinetics400. On UCF-101, HMDB-51, and ActivityNet the models overfit, making it difficult to train them to high accuracy. On Kinetics400, the accuracy gains become slight as model depth grows, for example beyond ResNet152. This shows that, given a large-scale dataset, a 3D CNN can be trained effectively. With a model well trained on a large-scale dataset in hand, we can fine-tune it on small-scale datasets and get good results.

Building on the first paper, the second paper fuses different large-scale datasets into various combined mega-scale datasets and trains ResNet3D networks of different depths on them. The saved weights can then serve as pre-training weights for fine-tuning on small-scale datasets, improving accuracy to varying degrees across datasets and models.

3. Model structure

In general, the convolution kernels of a ResNet used for image classification slide a window only over the 2D image and compute feature maps; the kernel shape is typically [out_channels, in_channels, W, H]. In a video classification task, the network takes a sequence of frames as input, such as 16 or 32 frames. This adds a time dimension T to the original W and H dimensions, and the kernel shape becomes [out_channels, in_channels, T, W, H]. The kernel then not only slides on the 2D plane but also moves along the third dimension T, extracting features correlated across frames. This requires transforming the 2D ResNet into a 3D ResNet network. ResNet3D leaves the original ResNet architecture unchanged, replacing the convolution kernels in each block with Conv3D layers and the pooling layers with 3D pooling layers; otherwise the overall network structure follows the original ResNet stage by stage.
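To make this substitution concrete, here is a minimal sketch of a 3D residual basic block, assuming PaddlePaddle 2.x (the framework used in the reproduction below). It illustrates the Conv2D-to-Conv3D replacement and the 5D input shape; it is not the repo's actual block implementation.

import paddle
import paddle.nn as nn

class BasicBlock3D(nn.Layer):
    """A 2D ResNet basic block with Conv3D/BatchNorm3D substituted in."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv3D(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1)
        self.bn1 = nn.BatchNorm3D(out_channels)
        self.conv2 = nn.Conv3D(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1)
        self.bn2 = nn.BatchNorm3D(out_channels)
        self.relu = nn.ReLU()
        # 1x1x1 convolution on the shortcut when the shape changes
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv3D(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm3D(out_channels))

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

# A 16-frame clip of 112x112 RGB images: [batch, channels, T, W, H]
clip = paddle.randn([2, 3, 16, 112, 112])
print(BasicBlock3D(3, 64)(clip).shape)  # [2, 64, 16, 112, 112]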

4. Training methods

In the first paper, 16-frame clips of RGB images 112 pixels wide and high were used as input samples for training on the Kinetics400 dataset. For data augmentation, images were randomly cropped at five different scales, and random horizontal flips were added. Cross entropy was used as the loss function with an SGD optimizer: the regularization coefficient was set to 0.001, the momentum coefficient to 0.9, and the initial learning rate to 0.1. If the validation loss did not decrease after 10 epochs, the learning rate was divided by 10. When fine-tuning on a small dataset, the SGD learning rate was set to 0.001 and the regularization coefficient to 1e-5. Through several experiments the authors also concluded that, during training, better results are obtained by freezing the first four blocks and fine-tuning only the weights of the fifth block and the fully connected layer.

The second paper adjusted the training strategy of the first. Because the models being fine-tuned were pre-trained on mega-scale datasets, the other parameters were kept unchanged, but the initial learning rate of the SGD optimizer was set to 0.003, while the regularization and momentum coefficients remained 0.001 and 0.9.
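As a rough illustration of these hyperparameters, here is a sketch assuming PaddlePaddle 2.x; the placeholder model below stands in for a ResNet3D instance, and the papers' actual training code may organize this differently.

import paddle
import paddle.nn as nn

model = nn.Linear(8, 8)  # placeholder standing in for a ResNet3D model

# Pre-training (first paper): lr 0.1, divided by 10 when the validation
# loss stops decreasing for 10 epochs; momentum 0.9; L2 coefficient 0.001.
scheduler = paddle.optimizer.lr.ReduceOnPlateau(
    learning_rate=0.1, mode='min', factor=0.1, patience=10)
optimizer = paddle.optimizer.Momentum(
    learning_rate=scheduler, momentum=0.9, weight_decay=0.001,
    parameters=model.parameters())

# Fine-tuning (second paper): fixed lr 0.003, same momentum and L2 coefficient.
finetune_optimizer = paddle.optimizer.Momentum(
    learning_rate=0.003, momentum=0.9, weight_decay=0.001,
    parameters=model.parameters())

# After each validation epoch: scheduler.step(val_loss)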

5. Experimental results

The papers obtained results by training a variety of model structures on a variety of combined mega-scale datasets (see the papers for the full tables). The conditions and target of this reproduction are as follows:

Pre-trained model: ResNet50 weights pre-trained on the K+M datasets

Dataset: UCF-101

Target: top-1 accuracy of 92.9%

6. Reproduction of the paper

Project address: github.com/txyugood/Pa…

1. Download and extract the dataset. Dataset address: aistudio.baidu.com/aistudio/da…

mkdir dataset/
unzip UCF-101.zip -d dataset/

2. Convert video to picture

mkdir /home/aistudio/dataset/UCF-101-jpg/
cd Paddle-ResNets/ && python generate_video_jpgs.py /home/aistudio/dataset/UCF-101 /home/aistudio/dataset/UCF-101-jpg/ ucf101

3. Convert the PyTorch pre-trained model

pip install torch==0.4.1
cd Paddle-ResNets/model/ && python convert_to_paddle.py

4. Train the network. Clip accuracy is verified during training and the model with the highest clip accuracy is saved. The final clip accuracy is 90%.

Training methods:

  • During training, four reader processes are started; the dataset is divided into four blocks with batch_size of 128, and data is read asynchronously for model training.

  • Each training iteration takes about 2.3 seconds in total, of which data reading takes about 1.8 seconds.

  • CPU usage is close to 100% and memory usage is in the 80-90% range.

  • GPU usage is around 50%, indicating that the training bottleneck is the CPU-side data reading. Expanding CPU and memory would further improve training speed.

  • Optimizer: Momentum, with learning rate 0.003, momentum 0.9, and L2Decay 0.001.

  • The data augmentation method is RandomResizedCrop.

  • Following the method in the paper, conv1 through conv4 of the ResNet50 network are frozen and only the conv5 and fc layers are trained (see the sketch after this list).
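Putting these points together, here is a hypothetical sketch of the training setup, assuming PaddlePaddle 2.x; the sublayer name prefixes 'conv5' and 'fc' and the helper function are illustrative and depend on how the repo's model actually names its layers.

import paddle

def build_finetune_setup(model, train_dataset):
    # Four asynchronous reader workers with batch_size 128, as noted above.
    loader = paddle.io.DataLoader(train_dataset, batch_size=128,
                                  shuffle=True, num_workers=4)
    # Freeze every stage except conv5 and the fc head.
    trainable = []
    for name, param in model.named_parameters():
        if name.startswith('conv5') or name.startswith('fc'):
            trainable.append(param)
        else:
            param.stop_gradient = True  # frozen: no gradient is computed
    # Momentum optimizer: lr 0.003, momentum 0.9, L2Decay 0.001.
    optimizer = paddle.optimizer.Momentum(
        learning_rate=0.003, momentum=0.9,
        weight_decay=paddle.regularizer.L2Decay(0.001),
        parameters=trainable)
    return loader, optimizer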

cd Paddle-ResNets/ && python train.py

5. Validate the network. Following the paper, each video is split into clips of 16 frames, and the average over all clips of a video is taken as the video's classification result.

Finally, a val.json file is generated for computing top-1 accuracy. This is consistent with the logic of the PyTorch version of the code.
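The clip-averaging logic can be sketched as follows, assuming a model that returns per-class logits for a [1, 3, 16, H, W] clip tensor; this illustrates the idea rather than reproducing test.py.

import paddle
import paddle.nn.functional as F

def classify_video(model, clips):
    """Average softmax scores over all 16-frame clips of one video."""
    model.eval()
    with paddle.no_grad():
        scores = [F.softmax(model(clip), axis=-1) for clip in clips]
        mean_score = paddle.mean(paddle.concat(scores, axis=0), axis=0)
    return int(paddle.argmax(mean_score))  # predicted class for the video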

cd Paddle-ResNets/ && python test.py

6. Compute the accuracy. The top-1 accuracy is 93.55%, higher than the 92.9% reported in the paper.

cd Paddle-ResNets/ && python eval_accuracy.py
load ground truth
number of ground truth: 3783
load result
number of result: 3783
calculate top-1 accuracy
top-1 accuracy: 0.9355009251916468
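For reference, the top-1 computation boils down to a comparison like the sketch below; the flat {video_id: label} JSON layout and the ground_truth.json file name are hypothetical, and eval_accuracy.py may structure val.json differently.

import json

def top1_accuracy(result_path, ground_truth_path):
    with open(result_path) as f:
        predictions = json.load(f)    # assumed: {video_id: predicted_label}
    with open(ground_truth_path) as f:
        ground_truth = json.load(f)   # assumed: {video_id: true_label}
    correct = sum(predictions[vid] == label
                  for vid, label in ground_truth.items())
    return correct / len(ground_truth)

# e.g. print(top1_accuracy('val.json', 'ground_truth.json'))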