Temporal Action Localization aims to detect action instances in untrimmed video streams and output their beginning and end timestamps. It is closely related to Temporal Action Proposal Generation.
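For concreteness, the output of such a system can be pictured as a list of scored temporal segments. The sketch below uses a hypothetical record type for illustration, not any benchmark's official format.

```python
from dataclasses import dataclass

@dataclass
class ActionInstance:
    """One detected action in an untrimmed video (hypothetical schema)."""
    label: str        # action class, e.g. "long jump"
    start_sec: float  # beginning timestamp in seconds
    end_sec: float    # end timestamp in seconds
    score: float      # detection confidence in [0, 1]

# A model's output for one video is simply a list of such instances:
predictions = [
    ActionInstance("long jump", start_sec=12.4, end_sec=18.9, score=0.87),
    ActionInstance("long jump", start_sec=41.0, end_sec=47.2, score=0.55),
]
```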
These leaderboards are used to track progress in Temporal Action Localization
Use these libraries to find Temporal Action Localization models and implementations
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles for designing effective ConvNet architectures for action recognition in videos and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the HMDB51 (69.4%) and UCF101 (94.2%) datasets. We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices (Models and code at https://github.com/yjxiong/temporal-segment-networks).
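The sparse sampling strategy can be pictured with the short sketch below: split the whole video into equal-length segments and draw one snippet from each, so supervision covers the full action rather than one short contiguous clip. This is a minimal illustration under assumed frame counts; the authors' actual implementation lives in the linked repository.

```python
import random

def sparse_sample(num_frames: int, num_segments: int = 3) -> list[int]:
    """TSN-style sparse temporal sampling: divide the video into
    num_segments equal segments and pick one random frame index
    from each segment."""
    seg_len = num_frames // num_segments
    return [i * seg_len + random.randrange(seg_len) for i in range(num_segments)]

print(sparse_sample(num_frames=300, num_segments=3))  # e.g. [87, 154, 261]
```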
A new spatiotemporal convolutional block "R(2+1)D" is designed which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
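A minimal PyTorch sketch of the (2+1)D factorization follows: a full t x d x d 3D convolution is replaced by a 1 x d x d spatial convolution and a t x 1 x 1 temporal convolution with a nonlinearity in between. The intermediate width mirrors the paper's parameter-matching heuristic, but batch normalization and the residual structure of the full block are omitted here.

```python
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """(2+1)D factorization of a t x d x d 3D convolution (minimal sketch)."""
    def __init__(self, in_ch, out_ch, t=3, d=3):
        super().__init__()
        # Intermediate width chosen so the parameter count roughly matches
        # the full 3D convolution it replaces.
        mid = (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)
        self.spatial = nn.Conv3d(in_ch, mid, kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2), bias=False)
        self.temporal = nn.Conv3d(mid, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.spatial(x)))

x = torch.randn(2, 3, 8, 112, 112)
y = R2Plus1DBlock(3, 64)(x)  # -> (2, 64, 8, 112, 112)
```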
A novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data.
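The core spatial step of such a model can be sketched as a normalized graph convolution over the joint adjacency matrix. The 3-joint toy skeleton, the normalization choice, and the omission of ST-GCN's partition strategies and temporal convolution are all simplifications for illustration.

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    """Minimal spatial graph convolution over skeleton joints: mix features
    along the normalized adjacency, then transform channels. ST-GCN stacks
    this kind of layer with temporal convolutions."""
    def __init__(self, in_ch, out_ch, adjacency: torch.Tensor):
        super().__init__()
        # Symmetrically normalize A + I so repeated propagation stays stable.
        a = adjacency + torch.eye(adjacency.size(0))
        d = a.sum(dim=1).pow(-0.5)
        self.register_buffer("a_norm", d[:, None] * a * d[None, :])
        self.linear = nn.Linear(in_ch, out_ch)

    def forward(self, x):  # x: (batch, time, joints, channels)
        x = torch.einsum("vw,btwc->btvc", self.a_norm, x)  # spatial mixing
        return self.linear(x)                              # channel transform

# Toy 3-joint chain skeleton (0-1-2); real skeletons have ~18-25 joints.
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
feat = torch.randn(4, 16, 3, 2)           # (batch, frames, joints, xy coords)
out = SkeletonGraphConv(2, 64, A)(feat)   # -> (4, 16, 3, 64)
```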
An effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts "local to global" fashion and significantly improves the state-of-the-art temporal action detection performance.
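The "local to global" pattern can be illustrated as below: pick locally peaked start and end probabilities, then pair every start with every later end to form candidate proposals. The peak and threshold rules here are simplified assumptions, and the real BSN additionally scores each candidate with a learned confidence.

```python
def generate_proposals(start_prob, end_prob, threshold=0.5, max_len=100):
    """BSN-style sketch: local boundary peaks -> global start/end pairing."""
    def peaks(p):
        return [i for i in range(len(p))
                if p[i] > threshold
                or (0 < i < len(p) - 1 and p[i] > p[i - 1] and p[i] > p[i + 1])]
    starts, ends = peaks(start_prob), peaks(end_prob)
    return [(s, e) for s in starts for e in ends if 0 < e - s <= max_len]

start_p = [0.1, 0.8, 0.2, 0.1, 0.1, 0.1]
end_p   = [0.1, 0.1, 0.1, 0.2, 0.9, 0.1]
print(generate_proposals(start_p, end_p))  # -> [(1, 4)]
```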
This work proposes an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously, and can achieve state-of-the-art temporal action detection performance.
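The dense (duration, start) enumeration behind boundary matching can be pictured with the sketch below, which fills a confidence map from a precomputed actionness signal. The real BMN instead learns this map end-to-end through its boundary-matching layer; the averaging here is only an illustrative stand-in.

```python
import torch

def bm_candidate_map(actionness: torch.Tensor, max_duration: int) -> torch.Tensor:
    """Illustrative boundary-matching map: entry [d, s] holds a crude
    confidence for the proposal starting at snippet s with duration d + 1,
    here just the mean actionness inside the span."""
    T = actionness.numel()
    bm = torch.zeros(max_duration, T)
    for d in range(max_duration):
        for s in range(T - d):
            bm[d, s] = actionness[s : s + d + 1].mean()
    return bm

scores = torch.tensor([0.1, 0.2, 0.9, 0.95, 0.8, 0.1])
print(bm_candidate_map(scores, max_duration=4))
```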
This work uses Long Short-Term Memory networks to learn representations of video sequences and evaluates the representations by fine-tuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets.
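A minimal sketch of the supervised fine-tuning setup: an LSTM runs over precomputed per-frame features and a linear head classifies the video from the final hidden state. The feature dimension, hidden size, and use of the last state are illustrative choices; the cited work studies unsupervised pretraining of such encoders.

```python
import torch
import torch.nn as nn

class LSTMVideoClassifier(nn.Module):
    """Classify a video from a sequence of per-frame feature vectors."""
    def __init__(self, feat_dim=2048, hidden=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames):            # frames: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(frames)   # h_n: (1, batch, hidden)
        return self.head(h_n[-1])         # logits: (batch, num_classes)

clips = torch.randn(4, 16, 2048)          # 4 clips of 16 frame features
logits = LSTMVideoClassifier()(clips)     # -> (4, 101) for UCF-101
```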
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
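The annotation structure can be pictured as below. The field names are illustrative rather than the official schema, but they show how one person at one keyframe can carry several action labels at once.

```python
from dataclasses import dataclass

@dataclass
class AvaAnnotation:
    """One spatio-temporal action label (hypothetical field names)."""
    video_id: str
    timestamp: float          # keyframe time within the 15-minute clip, seconds
    box: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    action_id: int            # one of the 80 atomic action classes
    person_id: int            # links boxes of the same person over time

# The same person at the same keyframe frequently carries several labels:
rows = [
    AvaAnnotation("video_0001", 902.0, (0.08, 0.06, 0.95, 0.98), 12, 0),
    AvaAnnotation("video_0001", 902.0, (0.08, 0.06, 0.95, 0.98), 79, 0),
]
```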
The proposed temporal segment network (TSN) framework models long-range temporal structure with a new segment-based sampling and aggregation scheme, and won the video classification track at the ActivityNet challenge 2016 among 24 teams.
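The aggregation half of the scheme, the segmental consensus, can be sketched as below. Average pooling is one of the consensus functions the TSN work evaluates; it is used here purely for illustration.

```python
import torch

def segmental_consensus(snippet_logits: torch.Tensor) -> torch.Tensor:
    """Aggregate per-snippet class scores into one video-level prediction
    by averaging; other consensus functions (max, weighted) fit the same slot."""
    # snippet_logits: (num_segments, num_classes)
    return snippet_logits.mean(dim=0)

logits = torch.randn(3, 101)                         # 3 snippets, 101 classes
video_class = segmental_consensus(logits).argmax()   # video-level prediction
```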
A new approach for reasoning globally in which a set of features are globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed.
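One way to picture this coordinate-space-to-interaction-space flow is the sketch below: softly assign every spatial position to a few global nodes, let the nodes interact, then project node features back onto the grid. The node count, the 1x1-convolution assignment, and the single linear interaction step are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GlobalReasoningUnit(nn.Module):
    """Sketch of global reasoning: aggregate grid features into a small set
    of interaction-space nodes, reason among them, and project back."""
    def __init__(self, channels, num_nodes=8):
        super().__init__()
        self.assign = nn.Conv2d(channels, num_nodes, kernel_size=1)
        self.interact = nn.Linear(channels, channels)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        attn = self.assign(x).flatten(2).softmax(dim=-1)   # (B, N, H*W)
        feats = x.flatten(2)                               # (B, C, H*W)
        nodes = torch.einsum("bnl,bcl->bnc", attn, feats)  # aggregate to nodes
        nodes = torch.relu(self.interact(nodes))           # relational reasoning
        out = torch.einsum("bnl,bnc->bcl", attn, nodes)    # project back
        return x + out.view(b, c, h, w)                    # residual update

y = GlobalReasoningUnit(64)(torch.randn(2, 64, 14, 14))    # -> (2, 64, 14, 14)
```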