3260 papers • 126 benchmarks • 313 datasets
Action localization is the task of finding the spatial and temporal coordinates of an action in a video. An action localization model identifies the frames in which an action starts and ends and returns the (x, y) coordinates of the actor within each frame; these coordinates change as the person or object performing the action moves.
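To make the expected output concrete, below is a minimal, hypothetical sketch (in Python) of how such a result could be represented: a temporal span plus per-frame bounding boxes that shift as the actor moves. The class and field names are illustrative and not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ActionInstance:
    """One localized action: a temporal span plus a per-frame box that
    follows the actor as it moves (illustrative structure, not any
    specific library's API)."""
    label: str                                            # e.g. "jumping"
    start_frame: int                                      # first frame the action is visible
    end_frame: int                                        # last frame the action is visible
    boxes: Dict[int, Tuple[float, float, float, float]]   # frame -> (x1, y1, x2, y2)
    score: float                                          # detector confidence

# Example: the box drifts to the right as the actor moves across the scene.
jump = ActionInstance(
    label="jumping",
    start_frame=120,
    end_frame=150,
    boxes={120: (0.30, 0.40, 0.45, 0.90), 150: (0.42, 0.38, 0.57, 0.88)},
    score=0.87,
)
```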
(Image credit: Papersgraph)
These leaderboards are used to track progress in Action Localization.
No benchmarks available.
Use these libraries to find Action Localization models and implementations.
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
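As a hedged sketch of how such dense spatio-temporal annotations are typically consumed, the code below groups AVA-style CSV rows into per-person label lists; the exact column order (video_id, timestamp, x1, y1, x2, y2, action_id, person_id, with boxes normalized to [0, 1]) is an assumption to verify against the official documentation.

```python
import csv
from collections import defaultdict

def load_ava_annotations(csv_path):
    """Group AVA-style rows into per-(video, timestamp, person) label lists,
    since one person often carries several atomic action labels at once.
    Assumed row layout: video_id, timestamp, x1, y1, x2, y2, action_id,
    person_id, with box coordinates normalized to [0, 1]."""
    labels = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id, ts, x1, y1, x2, y2, action_id, person_id = row
            key = (video_id, float(ts), int(person_id))
            box = tuple(map(float, (x1, y1, x2, y2)))
            labels[key].append((box, int(action_id)))
    return labels
```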
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos; it outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
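As an illustration of the idea rather than the authors' exact implementation, the sketch below computes a MIL-NCE style objective in PyTorch: each clip is paired with a bag of candidate positive narrations instead of a single aligned caption, which is how the loss absorbs the misalignment of narrated video.

```python
import torch

def mil_nce_loss(video_emb, text_emb, pos_mask):
    """Simplified MIL-NCE style objective (after Miech et al., 2020).
    video_emb: (B, D) clip embeddings; text_emb: (N, D) caption embeddings;
    pos_mask: (B, N) float mask, 1 where caption j is a candidate positive
    for clip i (the "bag"), 0 otherwise. Negatives are all other pairs
    in the batch; the real implementation differs in details."""
    sim = video_emb @ text_emb.t()            # (B, N) similarity scores
    exp_sim = torch.exp(sim)
    pos = (exp_sim * pos_mask).sum(dim=1)     # sum over the positive bag
    denom = exp_sim.sum(dim=1)                # positives + negatives
    return -torch.log(pos / denom).mean()
```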
This work introduces a new laparoscopic dataset, CholecT40, consisting of 40 videos from the public dataset Cholec80 in which all frames have been annotated using 128 triplet classes, and proposes a trainable 3D interaction space that captures the associations between the triplet components.
YOWO is a single-stage architecture with two branches that extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in a single evaluation; it is currently the fastest state-of-the-art architecture for the spatio-temporal action localization task.
The key idea is to randomly hide patches in a training image, forcing the network to seek other relevant parts when the most discriminative part is hidden; this obtains superior performance compared to previous methods for weakly supervised object localization on the ILSVRC dataset.
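The hiding step itself is simple to reproduce; below is a hedged sketch of the augmentation, with an illustrative grid size and hiding probability (the original work fills hidden patches with the dataset mean pixel value rather than zeros).

```python
import numpy as np

def hide_patches(image, grid_size=4, hide_prob=0.5, fill_value=0.0):
    """Hide-and-Seek style augmentation sketch: divide the image into a
    grid and independently hide each cell with probability hide_prob, so
    the network cannot rely only on the most discriminative part.
    image: (H, W, C) array; grid_size/hide_prob/fill_value are illustrative."""
    out = image.copy()
    h, w = image.shape[:2]
    ph, pw = h // grid_size, w // grid_size
    for i in range(grid_size):
        for j in range(grid_size):
            if np.random.rand() < hide_prob:
                out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = fill_value
    return out
```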
This work proposes a weakly supervised temporal action localization algorithm for untrimmed videos using convolutional neural networks; it attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet 1.3 despite only weak supervision.
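A common post-processing step in this weakly supervised setting is to threshold a per-frame class activation sequence into temporal segments; the sketch below shows that step with illustrative parameters, not the paper's exact configuration.

```python
import numpy as np

def activations_to_segments(activation, threshold=0.5, fps=25.0):
    """Threshold a 1D per-frame class activation sequence and merge
    consecutive above-threshold frames into (start_time, end_time)
    proposals in seconds. 'threshold' and 'fps' are illustrative values."""
    keep = np.asarray(activation) >= threshold
    segments, start = [], None
    for t, flag in enumerate(keep):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start / fps, t / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(keep) / fps))
    return segments
```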
This work proposes TriDet, a one-stage framework with a Trident-head that models the action boundary via an estimated relative probability distribution around the boundary, and designs a decoupled feature pyramid network with separate feature pyramids to incorporate rich spatial context from the large model for localization.
An Actor-Context-Actor Relation Network (ACAR-Net) is designed which builds upon a novel High-order Relation Reasoning Operator and an Actor-Context Feature Bank to enable indirect relation reasoning for spatio-temporal action localization.
The proposed ACtion Tubelet detector (ACT-detector) takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores, built on anchor cuboids; it outperforms state-of-the-art methods in frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in particular at high overlap thresholds.