3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in spatio-temporal action localization.
Use these libraries to find spatio-temporal action localization models and implementations.
The Actor-Context-Actor Relation Network (ACAR-Net) builds upon a novel High-order Relation Reasoning Operator and an Actor-Context Feature Bank to enable indirect relation reasoning for spatio-temporal action localization.
The proposed ACtion Tubelet detector (ACT-detector) takes a sequence of frames as input and, based on anchor cuboids, outputs tubelets, i.e., sequences of bounding boxes with associated scores. It outperforms the state-of-the-art methods for frame-mAP and video-mAP on the J-HMDB and UCF-101 datasets, in particular at high overlap thresholds.
This technical report introduces the winning solution to the spatio-temporal action localization track, AVA-Kinetics Crossover, in the ActivityNet Challenge 2020. The solution is based on the Actor-Context-Actor Relation Network and outperforms other entries by a large margin.
This paper proposes a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. It also introduces a Markov chain model that adds cues successively.
This work models spatio-temporal relations to capture the interactions between human actors, relevant objects, and scene elements that are essential for differentiating similar human actions, and shows that the proposed Actor-Centric Relation Network (ACRN) outperforms alternative approaches that capture relation information.
This paper shows that a naive temporal-aware variant of a common action detection baseline does not work on video-based human-object interactions (HOIs) due to a feature-inconsistency issue, and proposes a simple yet effective architecture utilizing temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
This work proposes fast and efficient keypoint-based bounding box prediction to spatially localize actions, which eliminates the need for a two-stream architecture, and introduces a tube-linking algorithm that maintains the temporal continuity of action tubes in the presence of occlusions.
This paper designs a region-based pretext task which requires the model to transform instance representations from one view to another, guided by context features, and introduces a simple network design that successfully reconciles the simultaneous learning process of both holistic and local representations.
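Several of the summaries above refer to tubelets or action tubes (sequences of bounding boxes over frames) and to evaluation at overlap thresholds. As a rough illustration only, the sketch below computes a spatio-temporal overlap between two tubes using one commonly used definition: temporal IoU multiplied by the mean spatial IoU over the shared frames. The names `Box`, `Tube`, `box_iou`, and `tube_iou` are hypothetical and not taken from any of the listed papers, and individual benchmarks may define the overlap differently.

```python
from typing import Dict, Tuple

# Assumed representations (illustrative, not from any listed paper):
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner coordinates
Tube = Dict[int, Box]                    # frame index -> bounding box


def box_iou(a: Box, b: Box) -> float:
    """Spatial intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(a: Tube, b: Tube) -> float:
    """Temporal IoU times mean spatial IoU over the frames both tubes cover."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0  # no temporal overlap at all
    temporal_iou = len(shared) / len(set(a) | set(b))
    mean_spatial = sum(box_iou(a[f], b[f]) for f in shared) / len(shared)
    return temporal_iou * mean_spatial


if __name__ == "__main__":
    pred = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
    gt = {1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
    # The tubes share one of three frames, with identical boxes on that frame.
    print(tube_iou(pred, gt))
```

Under this definition, a predicted tube would count as a correct detection at a given threshold only if its `tube_iou` with a ground-truth tube of the same class meets that threshold, which is why high thresholds reward precise localization in both space and time.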