The Euro 2021 football tournament has captured the hearts of millions of fans. Behind the highlight videos, and less visibly, AI technology is reshaping how the sports video industry produces content.

At the recent Conference on Computer Vision and Pattern Recognition (CVPR 2021), the International Challenge on Activity Recognition (ActivityNet) workshop, one of the most influential events in video understanding, announced the results of its competitions. The competitions attracted well-known companies such as Baidu, Alibaba, ByteDance, Tencent, and Huawei, as well as universities and institutes in China and abroad, including Tsinghua University, Peking University, Stanford University, the Massachusetts Institute of Technology, and the Chinese Academy of Sciences. In the SoccerNet-v2 football video understanding challenge, the first in the world aimed at comprehensive understanding of football match video, Baidu Research won both tasks by a wide margin.

Baidu won both tasks

Video link:

Baidu-ai-ar-1512380202189-8487.bj.bcebos.com/%E8%B6%B3%E…

The SoccerNet-v2 dataset, the largest for football video understanding, includes 500 match videos from Europe's five major leagues and the Champions League between 2014 and 2017, totaling 764 hours of video and 300,000 manual annotations. It has become an important benchmark by which international AI teams measure their ability to understand football video.

The SoccerNet-v2 challenge comprised two tasks: action spotting and replay grounding. Action spotting means finding key events in a football broadcast and determining the moment each one occurs. The events fall into 17 categories, covering goals, penalties, free kicks, red cards, yellow cards, and corner kicks, as well as fouls, offsides, shots on target, and shots off target, which viewers cannot always identify immediately. Some events are not even shown directly on camera and must be inferred from context, which further challenges video action recognition and event detection.
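To make the action spotting task concrete, the following minimal sketch scores a set of predicted events against annotated ones: a prediction counts as correct when an annotation of the same category lies within a time tolerance. The `Event` class, the `match_spots` helper, the greedy matching rule, and the 5-second tolerance are all illustrative assumptions; this is not the official challenge metric, which is a mean average precision computed by the organizers' own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    label: str          # event category, e.g. "goal" or "corner"
    time_s: float       # timestamp in seconds from the start of the video
    score: float = 1.0  # detector confidence (ignored for annotations)


def match_spots(predictions, ground_truth, tolerance_s=5.0):
    """Greedy one-to-one matching (an assumed rule, for illustration only).

    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(ground_truth)
    true_pos = 0
    # Visit the most confident predictions first so they get first pick.
    for pred in sorted(predictions, key=lambda e: e.score, reverse=True):
        candidates = [
            g for g in unmatched
            if g.label == pred.label
            and abs(g.time_s - pred.time_s) <= tolerance_s
        ]
        if candidates:
            closest = min(candidates, key=lambda g: abs(g.time_s - pred.time_s))
            unmatched.remove(closest)  # each annotation is matched at most once
            true_pos += 1
    false_pos = len(predictions) - true_pos
    false_neg = len(unmatched)
    return true_pos, false_pos, false_neg


# Toy data: one goal spotted 2.5 s late, one spurious goal, one corner 20 s off.
truth = [Event("goal", 754.0), Event("corner", 1210.0)]
preds = [
    Event("goal", 756.5, 0.9),
    Event("goal", 900.0, 0.4),
    Event("corner", 1230.0, 0.7),
]
result = match_spots(preds, truth, tolerance_s=5.0)
```

With these toy values, only the first goal prediction falls within the tolerance, so the corner annotation is counted as missed.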

Replay grounding means matching replay footage in a football broadcast to the original event. An exciting moment is often followed by several replays, which may come hundreds of seconds after the original event and are often shot from a different camera angle. Matching a replay clip to the original footage therefore tests the ability to understand video over very long time spans.
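One simple way to picture replay grounding is as a similarity search: compare a feature vector for the replay with feature vectors for every earlier moment of the broadcast and pick the closest one. The sketch below does this with cosine similarity on toy two-dimensional vectors. It is a nearest-neighbor illustration only, not the method used by any competition entry, and the `cosine` and `ground_replay` helpers are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_replay(replay_feature, timeline_features, replay_start_idx):
    """Return the index of the earlier timeline step most similar to the replay.

    Only steps before the replay begins are searched, because the original
    event must precede its replay.
    """
    best_idx, best_sim = None, -math.inf
    for idx in range(replay_start_idx):
        sim = cosine(replay_feature, timeline_features[idx])
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx


# Toy timeline with one feature vector per time step.
timeline = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [1.0, 1.0]]
replay = [0.8, 0.2]
matched_idx = ground_replay(replay, timeline, replay_start_idx=3)
```

A plain nearest-neighbor search like this ignores the change of camera angle and the long temporal gap that make the real task hard, which is why learned models are used in practice.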

The VidPress team at Baidu Research, which focuses on algorithm research and application innovation, won both tasks, a demonstration of its technical strength. Its system uses a two-stage approach: a feature extractor first extracts features from the football video, and those features are then fed to a task-specific module in the second stage for action spotting or replay grounding.

System pipeline for action spotting and replay grounding

In the feature extraction stage, the team reasoned that feature extractors fine-tuned on football video would benefit both downstream tasks, action spotting and replay grounding. It therefore fine-tuned five pre-trained feature extractors on SoccerNet-v2 data: TPN, GTA, VTN, irCSN, and I3D-SLOW. All five are among the best-performing video classification models of recent years and rank highly on the standard Kinetics-400 benchmark.

Building on these five models, the team designed several fine-tuning strategies that make full use of the data and developed new feature extraction methods. The features that each extractor produces for a football video are concatenated and normalized, yielding a representation that is highly expressive for football match video and a solid foundation for the downstream tasks.
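The concatenate-and-normalize step can be sketched as follows. The article does not say which normalization was used, so L2 normalization of the concatenated vector is an assumption here, as are the `fuse_features` helper and the toy two-dimensional features (real extractors emit vectors with hundreds or thousands of dimensions).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def fuse_features(per_model_features):
    """Concatenate per-extractor features for one clip, then L2-normalize.

    per_model_features maps an extractor name to its feature vector.
    Names are sorted so the concatenation order is deterministic.
    """
    fused = []
    for name in sorted(per_model_features):
        fused.extend(per_model_features[name])
    return l2_normalize(fused)


# Toy features for a single clip from two of the five extractors.
clip_features = {"TPN": [3.0, 0.0], "I3D-SLOW": [0.0, 4.0]}
fused = fuse_features(clip_features)
```

Normalizing after concatenation keeps the fused vector on a common scale; normalizing each extractor's output separately before concatenating is an equally plausible reading of the description.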

Both action spotting and replay grounding use a Transformer architecture. Transformers have a clean, standardized structure, large model capacity, and strong scalability, and they adapt to computer vision, natural language, and other domains. In these two tasks the Transformer models the temporal structure of visual semantic features precisely, and it outperforms the Siamese network of the baseline algorithm in both learning capacity and training speed. During training, mixup data augmentation is applied for action spotting, which uses the training data more efficiently and reduces overfitting. For replay grounding, the change in model structure cut training time to one eighth of the original.
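Mixup, in its standard form, trains on convex combinations of pairs of examples and their labels, with the mixing weight drawn from a Beta distribution. The sketch below shows that general technique on toy feature vectors and one-hot labels. The `alpha` value, the `mixup` helper, and the three example categories are assumptions; the article does not describe how the team configured it.

```python
import random


def mixup(features_a, labels_a, features_b, labels_b, alpha=0.2, rng=random):
    """Blend two training examples with a weight lam ~ Beta(alpha, alpha).

    Features and labels are mixed with the same weight, so the blended
    label stays a valid distribution when both inputs are one-hot.
    """
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(features_a, features_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(labels_a, labels_b)]
    return mixed_x, mixed_y, lam


rng = random.Random(0)      # seeded so the example is repeatable
goal_x, goal_y = [0.2, 0.9, 0.1], [1.0, 0.0, 0.0]   # one-hot label: "goal"
foul_x, foul_y = [0.7, 0.1, 0.4], [0.0, 0.0, 1.0]   # one-hot label: "foul"
mixed_x, mixed_y, lam = mixup(goal_x, goal_y, foul_x, foul_y, alpha=0.2, rng=rng)
```

A small `alpha` concentrates `lam` near 0 or 1, so most blended examples stay close to one of the two originals.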

Combining these visual semantic features with Transformer structures customized for each task, Baidu Research achieved a significant lead in the final results. In action spotting, it raised the average mAP from the baseline's 52.54% to 74.84%, a gain of 22.3 percentage points, nearly twice that of the second-place team. In replay grounding, it raised the average mAP from the baseline's 40.75% to 71.90%, a gain of 31.15 percentage points, and finished 8 percentage points ahead of the second-place score of 63.91%.
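The gains quoted above are differences in percentage points, which can be checked directly from the reported scores. The short script below recomputes them; the helper name `gain_pp` is made up for this example, and the figures are the ones given in the text.

```python
def gain_pp(reference, result):
    """Difference between two percentage scores, in percentage points."""
    return round(result - reference, 2)


# Scores as reported in the text (average mAP, in percent).
spotting_gain = gain_pp(52.54, 74.84)    # baseline -> Baidu, first task
grounding_gain = gain_pp(40.75, 71.90)   # baseline -> Baidu, second task
grounding_lead = gain_pp(63.91, 71.90)   # second place -> Baidu, second task

print(spotting_gain, grounding_gain, grounding_lead)
```

The lead over second place in the second task comes out at 7.99 points, which the text rounds to 8.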

Baidu Research's strong showing in this competition rests on algorithmic expertise accumulated from working with large-scale video data.

The technology has high practical value and can be applied to sports video at scale. By intelligently analyzing an entire match, it can segment clips of goals, shots, and fouls accurately and in real time, without manual intervention.

Based on this capability, the team has developed and deployed several application tools.

The first is an industry-leading tool for generating custom football highlights. Given a player and a match, it automatically generates a reel of the player's highlight moments and slow-motion replays. The system has been deployed on more than 400 football player and team pages on Baidu Baike.

Enter a player name and a match name to generate video highlights of the player

Second, by combining text semantics with video understanding, the team built a platform that converts image-and-text football match reports to video in one click. Given the content of a live text commentary, or the address of the live commentary page, it automatically aggregates and generates the corresponding video, making match reports faster to produce and easier to follow.

Automatically generate the corresponding clips and video from live text commentary

In addition, the team built an intelligent video production pipeline based on image scene recognition. It can quickly analyze an uploaded long video, detect whether a goal occurs, pinpoint the moment of the goal, and edit the clip automatically.

Upload a match video to automatically identify goals and generate goal clips
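Once goal moments have been located, the editing step reduces to choosing clip boundaries around each timestamp. The sketch below is one plausible way to do that, not a description of Baidu's pipeline: the padding values, the helper names `clip_window` and `merge_windows`, and the sample timestamps are all assumptions.

```python
def clip_window(event_time_s, video_length_s, before_s=10.0, after_s=5.0):
    """Clip boundaries around an event, clamped to the video's extent."""
    start = max(0.0, event_time_s - before_s)
    end = min(video_length_s, event_time_s + after_s)
    return start, end


def merge_windows(windows):
    """Merge overlapping (start, end) windows so no footage is cut twice."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous clip: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical detected goal times (seconds) in a 2700-second half.
goal_times = [4.0, 1500.0, 1508.0, 2698.0]
clips = merge_windows([clip_window(t, 2700.0) for t in goal_times])
```

The resulting `(start, end)` pairs could then be handed to any video cutting tool; the two detections eight seconds apart end up in a single clip.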

Building on its continued work in intelligent video technology, Baidu Research launched VidPress, an intelligent tool that converts image-and-text articles to video, at the beginning of 2020. It is the industry's first general-purpose, large-scale automatic video production technology. VidPress supports one-click import of articles and article links, automatically produces video with voice-over, subtitles, and footage, and reduces the time spent collecting, sorting, and matching material. VidPress is now a core capability of the Baidu Brain intelligent creation platform. It powers intelligent video production for media organizations such as People's Daily, automatically generates highlight videos of thousands of players for Miaodong Baike, and provides one-click video generation for end users of platforms such as Baijiahao and Haokan Video. Drawing on integrated capabilities in natural language processing, knowledge graphs, vision, and speech, the Baidu Brain intelligent creation platform offers creators tools that support the whole news production process, including planning, gathering, editing, reviewing, and publishing, and improves content production efficiency across the board.

As video becomes pervasive, every industry is placing new demands on video applications, experience, and efficiency. AI is inevitably the driving force behind the shift to intelligent video. Baidu will continue to pursue breakthroughs and iterate in these fields, and to bring applications and products into practice, to help power the development and transformation of the video industry.

The Baidu AI developer community (ai.baidu.com/forum) gives developers across the country a place to communicate, share, and get questions answered, so that they no longer have to “fight alone” in research and development and can find better technical solutions through ongoing discussion. If you want to try AI technologies and explore application scenarios, join the Baidu AI community. All your ideas about AI can take shape here.
