Next action anticipation is defined as observing frames 1, ..., T and predicting the action that begins after a gap of T_a seconds. Note that the action to be predicted starts after the T_a-second gap and is therefore never seen in the observed frames. Here, T_a = 1 second.
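For concreteness, below is a minimal Python sketch of this observation protocol. The function and parameter names (`make_anticipation_sample`, `fps`, `gap_s`) are illustrative assumptions, not part of any benchmark's API.

```python
# Minimal sketch of the next-action-anticipation protocol described above.
# All names here are illustrative assumptions, not a benchmark's API.

def make_anticipation_sample(frames, action_start_idx, fps=30, gap_s=1.0):
    """Return the observable clip for anticipating the upcoming action.

    frames           : list of video frames (any per-frame representation)
    action_start_idx : index of the first frame of the action to predict
    fps              : frames per second of the video
    gap_s            : anticipation gap T_a in seconds (here 1 second)
    """
    gap_frames = int(round(gap_s * fps))
    # Observation ends T_a seconds before the action starts, so the
    # action itself is never visible in the observed frames.
    obs_end = action_start_idx - gap_frames
    assert obs_end > 0, "not enough video before the action to observe"
    return frames[:obs_end]

# Example: a 10 s video at 30 fps where the next action starts at frame 240.
video = list(range(300))
observed = make_anticipation_sample(video, action_start_idx=240)
print(len(observed))  # 210 frames: everything up to 1 s before the action
```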
(Image credit: Papersgraph)
These leaderboards are used to track progress in Action Anticipation
Use these libraries to find Action Anticipation models and implementations
No subtasks available.
This paper introduces EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Participants narrated their own videos after recording, so the annotations reflect true intention, and ground-truth labels were crowd-sourced from these narrations.
This work tackles the problem with an architecture that anticipates actions at multiple temporal scales, using two LSTMs to summarize the past and formulate predictions about the future, together with a novel Modality ATTention mechanism that learns to weight modalities adaptively.
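Below is a minimal sketch of adaptive modality weighting in the spirit of this idea; the two-modality setup, dimensions, and scoring network are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Sketch of adaptive modality weighting (in the spirit of Modality
    ATTention). Each modality (e.g. RGB, optical flow) produces its own
    prediction; a small network scores the modalities from their summary
    vectors, and the final prediction is the softmax-weighted sum.
    Dimensions and the scoring MLP are illustrative assumptions.
    """

    def __init__(self, feat_dim, n_modalities):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim * n_modalities, 128),
            nn.ReLU(),
            nn.Linear(128, n_modalities),
        )

    def forward(self, summaries, predictions):
        # summaries:   (batch, n_modalities, feat_dim) past summary per modality
        # predictions: (batch, n_modalities, n_classes) per-modality scores
        w = self.score(summaries.flatten(1)).softmax(dim=-1)  # (batch, n_mod)
        return (w.unsqueeze(-1) * predictions).sum(dim=1)     # (batch, n_classes)

fused = ModalityAttention(feat_dim=512, n_modalities=2)(
    torch.randn(4, 2, 512), torch.randn(4, 2, 100))
```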
Thorough experimental evaluation shows that the hallucination task indeed improves performance on action recognition, action quality assessment, and dynamic scene recognition, and can enable deployment in resource-constrained scenarios such as limited computing power or lower bandwidth.
Rolling-Unrolling LSTM is contributed, a learning architecture to anticipate actions from egocentric videos that achieves competitive performance on ActivityNet with respect to methods not based on unsupervised pre-training, and generalizes to the tasks of early action recognition and action recognition.
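The following is a minimal sketch of the rolling/unrolling idea under stated assumptions (a single modality, arbitrary layer sizes); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class RollingUnrolling(nn.Module):
    """Sketch of the rolling/unrolling idea: one LSTM 'rolls' over the
    observed features to summarize the past; a second LSTM 'unrolls'
    from that state, producing a prediction at each anticipation step.
    Layer sizes and the single-modality setup are assumptions.
    """

    def __init__(self, feat_dim, hidden, n_classes):
        super().__init__()
        self.rolling = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.unrolling = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, feats, n_unroll=4):
        # feats: (batch, T_obs, feat_dim) observed clip features
        _, state = self.rolling(feats)       # summarize the past
        last = feats[:, -1:, :]              # re-feed the latest observation
        outs = []
        for _ in range(n_unroll):            # one step per anticipation time
            out, state = self.unrolling(last, state)
            outs.append(self.classifier(out[:, -1]))
        return torch.stack(outs, dim=1)      # (batch, n_unroll, n_classes)

scores = RollingUnrolling(1024, 512, 100)(torch.randn(2, 14, 1024))
```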
This work addresses questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework, and shows that state-of-the-art results in both next-action and dense anticipation can be achieved with simple techniques such as max-pooling and attention.
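A minimal sketch of such multi-granular aggregation is given below; the span lengths, dimensions, and single attention layer are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiGranularAggregation(nn.Module):
    """Sketch of multi-granular temporal aggregation: max-pool the
    observed features over several temporal spans (short 'recent'
    context vs. longer 'spanning' context), then attend over the
    pooled summaries. Span lengths and sizes are assumptions.
    """

    def __init__(self, feat_dim, spans=(2, 5, 10), n_classes=100):
        super().__init__()
        self.spans = spans
        self.attn = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, feats):
        # feats: (batch, T, feat_dim); max-pool the last `s` steps per span
        pooled = torch.stack(
            [feats[:, -s:, :].max(dim=1).values for s in self.spans], dim=1
        )                                      # (batch, n_spans, feat_dim)
        w = self.attn(pooled).softmax(dim=1)   # attention over granularities
        return self.classifier((w * pooled).sum(dim=1))

out = MultiGranularAggregation(512)(torch.randn(3, 12, 512))
```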
A new action anticipation method that achieves high prediction accuracy even when only a very small fraction of a video sequence has been observed; it develops a multi-stage LSTM architecture that leverages context-aware and action-aware features, and introduces a novel loss function that encourages the model to predict the correct class as early as possible.
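As a rough illustration of an "early correct prediction" objective, here is a time-weighted cross-entropy sketch; the linear weighting is an illustrative choice, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def early_anticipation_loss(stepwise_logits, target):
    """Sketch of a loss that encourages correct predictions as early as
    possible: cross-entropy is computed after every observation step, and
    a weight that grows with time makes late mistakes costlier than early
    ones, so a model that is right early and stays right pays the least.
    The linear weighting is an assumption, not the paper's loss.

    stepwise_logits : (T, batch, n_classes) predictions after each step
    target          : (batch,) ground-truth action labels
    """
    T = stepwise_logits.shape[0]
    weights = torch.linspace(1.0 / T, 1.0, T)  # later steps weigh more
    losses = torch.stack(
        [F.cross_entropy(stepwise_logits[t], target) for t in range(T)]
    )
    return (weights * losses).sum() / weights.sum()

loss = early_anticipation_loss(torch.randn(8, 4, 100),
                               torch.randint(0, 100, (4,)))
```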
A Reinforced Encoder-Decoder (RED) network that takes multiple history representations as input and learns to anticipate a sequence of future representations, with a reinforcement module designed to encourage the system to make correct predictions as early as possible.
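Below is a minimal sketch of the encoder-decoder part of this idea (the reinforcement reward is omitted); all sizes and names are illustrative assumptions, not the RED implementation.

```python
import torch
import torch.nn as nn

class EncoderDecoderAnticipator(nn.Module):
    """Sketch of the encoder-decoder idea behind RED: encode the history
    of visual representations, then decode a *sequence* of anticipated
    future representations, each of which can also be classified into an
    action. The reinforcement module is omitted; sizes are assumptions.
    """

    def __init__(self, feat_dim, hidden, n_classes):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.to_feat = nn.Linear(hidden, feat_dim)   # regress future features
        self.to_class = nn.Linear(feat_dim, n_classes)

    def forward(self, history, n_future=4):
        _, h = self.encoder(history)      # summarize the history
        step = history[:, -1:, :]         # seed with the last observation
        feats, scores = [], []
        for _ in range(n_future):
            out, h = self.decoder(step, h)
            step = self.to_feat(out)      # anticipated future representation
            feats.append(step)
            scores.append(self.to_class(step))
        return torch.cat(feats, 1), torch.cat(scores, 1)

f, s = EncoderDecoderAnticipator(1024, 512, 30)(torch.randn(2, 16, 1024))
```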
This work proposes a solution to the problem of pedestrian action anticipation at the point of crossing, using a novel stacked RNN architecture in which information collected from various sources, both scene dynamics and visual features, is gradually fused into the network at different levels of processing.
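A minimal sketch of such gradual fusion in a stacked RNN follows; the choice of three sources, their ordering, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StackedFusionRNN(nn.Module):
    """Sketch of gradual fusion in a stacked RNN: each level of the stack
    receives the previous level's output concatenated with one new
    information source, so sources are fused at different depths rather
    than all at once. Sources, ordering, and sizes are assumptions.
    """

    def __init__(self, dims, hidden, n_classes):
        super().__init__()
        # dims: feature size of each source; one source fused per level
        in_sizes = [dims[0]] + [hidden + d for d in dims[1:]]
        self.levels = nn.ModuleList(
            nn.GRU(s, hidden, batch_first=True) for s in in_sizes
        )
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, sources):
        # sources: list of (batch, T, dim_i) tensors, one per source
        out, _ = self.levels[0](sources[0])
        for rnn, src in zip(self.levels[1:], sources[1:]):
            out, _ = rnn(torch.cat([out, src], dim=-1))  # fuse next source
        return self.classifier(out[:, -1])  # e.g. crossing vs. not crossing

srcs = [torch.randn(2, 15, d) for d in (512, 64, 4)]
pred = StackedFusionRNN((512, 64, 4), 128, 2)(srcs)
```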