3260 papers • 126 benchmarks • 313 datasets
Action localization is the task of finding the spatial and temporal coordinates of an action in a video. An action localization model identifies the frames in which an action starts and ends and returns the (x, y) coordinates of the actor within each frame; these coordinates change as the person or object performing the action moves.
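To make the expected output concrete, below is a minimal, hypothetical sketch (in Python) of how such a result could be represented: a temporal span plus per-frame bounding boxes that shift as the actor moves. The class and field names are illustrative and not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ActionInstance:
    """One localized action: a temporal span plus a per-frame box that
    follows the actor as it moves (illustrative structure, not any
    specific library's API)."""
    label: str                                            # e.g. "jumping"
    start_frame: int                                      # first frame the action is visible
    end_frame: int                                        # last frame the action is visible
    boxes: Dict[int, Tuple[float, float, float, float]]   # frame -> (x1, y1, x2, y2)
    score: float                                          # detector confidence

# Example: the box drifts to the right as the actor moves across the scene.
jump = ActionInstance(
    label="jumping",
    start_frame=120,
    end_frame=150,
    boxes={120: (0.30, 0.40, 0.45, 0.90), 150: (0.42, 0.38, 0.57, 0.88)},
    score=0.87,
)
```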
(Image credit: Papersgraph)
These leaderboards are used to track progress in Action Localization.
No benchmarks available.
Use these libraries to find Action Localization models and implementations.
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
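As a hedged sketch of how such dense spatio-temporal annotations are typically consumed, the code below groups AVA-style CSV rows into per-person label lists; the exact column order (video_id, timestamp, x1, y1, x2, y2, action_id, person_id, with boxes normalized to [0, 1]) is an assumption to verify against the official documentation.

```python
import csv
from collections import defaultdict

def load_ava_annotations(csv_path):
    """Group AVA-style rows into per-(video, timestamp, person) label lists,
    since one person often carries several atomic action labels at once.
    Assumed row layout: video_id, timestamp, x1, y1, x2, y2, action_id,
    person_id, with box coordinates normalized to [0, 1]."""
    labels = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id, ts, x1, y1, x2, y2, action_id, person_id = row
            key = (video_id, float(ts), int(person_id))
            box = tuple(map(float, (x1, y1, x2, y2)))
            labels[key].append((box, int(action_id)))
    return labels
```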
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos; it outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
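As an illustration of the idea rather than the authors' exact implementation, the sketch below computes a MIL-NCE style objective in PyTorch: each clip is paired with a bag of candidate positive narrations instead of a single aligned caption, which is how the loss absorbs the misalignment of narrated video.

```python
import torch

def mil_nce_loss(video_emb, text_emb, pos_mask):
    """Simplified MIL-NCE style objective (after Miech et al., 2020).
    video_emb: (B, D) clip embeddings; text_emb: (N, D) caption embeddings;
    pos_mask: (B, N) float mask, 1 where caption j is a candidate positive
    for clip i (the "bag"), 0 otherwise. Negatives are all other pairs
    in the batch; the real implementation differs in details."""
    sim = video_emb @ text_emb.t()            # (B, N) similarity scores
    exp_sim = torch.exp(sim)
    pos = (exp_sim * pos_mask).sum(dim=1)     # sum over the positive bag
    denom = exp_sim.sum(dim=1)                # positives + negatives
    return -torch.log(pos / denom).mean()
```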
This work introduces a new laparoscopic dataset, CholecT40, consisting of 40 videos from the public dataset Cholec80 in which all frames have been annotated using 128 triplet classes, and proposes a trainable 3D interaction space that captures the associations between the triplet components.
YOWO is a single-stage architecture with two branches that extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in a single evaluation; it is currently the fastest state-of-the-art architecture for the spatio-temporal action localization task.
The key idea is to randomly hide patches in a training image, forcing the network to seek other relevant parts when the most discriminative part is hidden; this obtains superior performance compared to previous methods for weakly supervised object localization on the ILSVRC dataset.
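The hiding step itself is simple to reproduce; below is a hedged sketch of the augmentation, with an illustrative grid size and hiding probability (the original work fills hidden patches with the dataset mean pixel value rather than zeros).

```python
import numpy as np

def hide_patches(image, grid_size=4, hide_prob=0.5, fill_value=0.0):
    """Hide-and-Seek style augmentation sketch: divide the image into a
    grid and independently hide each cell with probability hide_prob, so
    the network cannot rely only on the most discriminative part.
    image: (H, W, C) array; grid_size/hide_prob/fill_value are illustrative."""
    out = image.copy()
    h, w = image.shape[:2]
    ph, pw = h // grid_size, w // grid_size
    for i in range(grid_size):
        for j in range(grid_size):
            if np.random.rand() < hide_prob:
                out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = fill_value
    return out
```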
This work proposes a weakly supervised temporal action localization algorithm for untrimmed videos using convolutional neural networks; it attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet 1.3 despite only weak supervision.
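A common post-processing step in this weakly supervised setting is to threshold a per-frame class activation sequence into temporal segments; the sketch below shows that step with illustrative parameters, not the paper's exact configuration.

```python
import numpy as np

def activations_to_segments(activation, threshold=0.5, fps=25.0):
    """Threshold a 1D per-frame class activation sequence and merge
    consecutive above-threshold frames into (start_time, end_time)
    proposals in seconds. 'threshold' and 'fps' are illustrative values."""
    keep = np.asarray(activation) >= threshold
    segments, start = [], None
    for t, flag in enumerate(keep):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start / fps, t / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(keep) / fps))
    return segments
```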
This work proposes TriDet, a one-stage framework with a Trident-head that models the action boundary via an estimated relative probability distribution around the boundary, and designs a decoupled feature pyramid network with separate feature pyramids to incorporate rich spatial context from the large model for localization.
An Actor-Context-Actor Relation Network (ACAR-Net) is designed which builds upon a novel High-order Relation Reasoning Operator and an Actor-Context Feature Bank to enable indirect relation reasoning for spatio-temporal action localization.
The proposed ACtion Tubelet detector (ACT-detector) takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores, built on anchor cuboids; it outperforms state-of-the-art methods in frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in particular at high overlap thresholds.