3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in Egocentric Activity Recognition
Use these libraries to find Egocentric Activity Recognition models and implementations
This paper proposes a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds.
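A minimal sketch of the idea, assuming a clip-level feature from a short-term video model and a bank of features precomputed over the whole video; the scaled dot-product attention, the function name, and all tensor shapes are assumptions used for illustration, not the paper's exact feature bank operator.

```python
import torch

def feature_bank_attention(clip_feat, bank, dim=512):
    """Augment a short-term clip feature by attending over a long-term
    feature bank (simplified sketch, not the paper's exact operator).

    clip_feat: (B, D)    feature of the current 2-5 s clip
    bank:      (B, T, D) features precomputed over the entire video
    """
    q = clip_feat.unsqueeze(1)                                           # (B, 1, D) query
    attn = torch.softmax(q @ bank.transpose(1, 2) / dim ** 0.5, dim=-1)  # (B, 1, T) weights
    context = (attn @ bank).squeeze(1)                                   # (B, D) long-term context
    return torch.cat([clip_feat, context], dim=-1)                       # fused short + long-term feature

# usage: clip_feat from a 3D CNN on the current clip, bank from the same
# backbone run offline over the whole video
fused = feature_bank_attention(torch.randn(2, 512), torch.randn(2, 100, 512))
```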
The primary empirical finding is that pre-training at very large scale (over 65 million videos), even though the videos come from noisy social media and are labeled only with hashtags, substantially improves the state of the art on three challenging public action recognition datasets.
This work tackles the problem of action anticipation by proposing an architecture that anticipates actions at multiple temporal scales: two LSTMs summarize the past and formulate predictions about the future, while a novel Modality ATTention mechanism learns to weigh modalities adaptively.
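A toy sketch of the adaptive modality weighting, assuming per-modality summaries of the past (e.g. produced by the two LSTMs) are already available; the ModalityAttention class, its sizes, and the softmax-weighted fusion of per-modality predictions are assumptions for illustration, not the paper's exact MATT module.

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Toy adaptive modality weighting: score each modality's summary of the
    past and fuse per-modality predictions with the resulting softmax weights
    (illustrative only; names and sizes are assumptions)."""
    def __init__(self, feat_dim, num_modalities, num_classes):
        super().__init__()
        self.score = nn.Linear(feat_dim * num_modalities, num_modalities)
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_modalities)]
        )

    def forward(self, summaries):  # summaries: (B, M, D), one summary per modality
        b, m, d = summaries.shape
        weights = torch.softmax(self.score(summaries.reshape(b, m * d)), dim=-1)           # (B, M)
        preds = torch.stack([h(summaries[:, i]) for i, h in enumerate(self.heads)], dim=1)  # (B, M, C)
        return (weights.unsqueeze(-1) * preds).sum(dim=1)                                   # (B, C) fused scores

# usage with three modality summaries per video
matt = ModalityAttention(feat_dim=256, num_modalities=3, num_classes=10)
scores = matt(torch.randn(4, 3, 256))
```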
This work introduces an effective probabilistic approach to integrate human gaze into spatiotemporal attention for egocentric activity recognition by representing the locations of gaze fixation points as structured discrete latent variables to model their uncertainties.
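A simplified sketch of treating the gaze location as a discrete latent variable over a feature grid, drawn with a Gumbel-softmax sample so the location stays stochastic during training; the GazeLatentAttention class and the pooling step are assumptions for illustration, not the paper's structured probabilistic model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeLatentAttention(nn.Module):
    """Illustrative sketch: model the gaze fixation as a discrete latent
    variable over the feature grid, sample it with Gumbel-softmax to retain
    its uncertainty, and pool features at the sampled location."""
    def __init__(self, channels):
        super().__init__()
        self.gaze_logits = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat, tau=1.0):  # feat: (B, C, H, W) frame feature map
        b, c, h, w = feat.shape
        logits = self.gaze_logits(feat).flatten(1)              # (B, H*W) categorical logits
        sample = F.gumbel_softmax(logits, tau=tau, hard=True)   # one-hot gaze location
        attn = sample.view(b, 1, h, w)
        return (attn * feat).sum(dim=(2, 3))                    # (B, C) gaze-pooled feature

pooled = GazeLatentAttention(channels=512)(torch.randn(2, 512, 7, 7))
```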
This work collects RGB-D video sequences comprising more than 100K frames of 45 daily hand action categories involving 26 different objects in several hand configurations, and observes clear benefits from using hand pose as a cue for action recognition compared to other data modalities.
The proposed method is appropriate for the representation of high-dimensional features such as those extracted from convolutional neural networks (CNNs) and results in highly discriminative features which can be linearly classified.
An end-to-end trainable deep neural network model for egocentric activity recognition is proposed that surpasses, by up to 6 percentage points in recognition accuracy, the currently best-performing method, which relies on strong supervision from hand segmentation and object locations during training.
This paper proposes LSTA, a mechanism that focuses on features from spatially relevant parts while attention is tracked smoothly across the video sequence, achieving state-of-the-art performance on four standard benchmarks.
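A heavily simplified sketch of spatial attention that is tracked across frames, assuming per-frame CNN feature maps as input; the RecurrentSpatialAttention class and the momentum-based smoothing are illustrative assumptions, not the actual LSTA recurrent cell.

```python
import torch
import torch.nn as nn

class RecurrentSpatialAttention(nn.Module):
    """Simplified recurrent spatial attention: per frame, an attention map over
    the feature grid is computed, smoothed with the previous frame's map, and
    used to pool features (illustrative sketch, not the LSTA cell)."""
    def __init__(self, channels, momentum=0.5):
        super().__init__()
        self.to_score = nn.Conv2d(channels, 1, kernel_size=1)
        self.momentum = momentum

    def forward(self, feats):  # feats: (B, T, C, H, W) per-frame feature maps
        b, t, c, h, w = feats.shape
        prev = feats.new_full((b, 1, h, w), 1.0 / (h * w))        # uniform initial attention
        pooled = []
        for i in range(t):
            score = self.to_score(feats[:, i])                    # (B, 1, H, W)
            attn = torch.softmax(score.flatten(2), dim=-1).view(b, 1, h, w)
            attn = self.momentum * prev + (1 - self.momentum) * attn  # smooth tracking over time
            prev = attn
            pooled.append((attn * feats[:, i]).sum(dim=(2, 3)))   # (B, C) attended feature
        return torch.stack(pooled, dim=1)                         # (B, T, C)

# usage on feature maps from a 2D CNN backbone applied per frame
out = RecurrentSpatialAttention(channels=512)(torch.randn(2, 8, 512, 7, 7))
```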
This work proposes a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets, and demonstrates the importance of audio in egocentric vision, on a per-class basis, for identifying both actions and the objects being interacted with.
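A rough sketch of binding modalities within a temporal window, assuming per-modality feature streams (e.g. RGB, flow, audio) are precomputed; the TemporalBindingFusion class, the random per-modality offsets, and the MLP fusion are assumptions for illustration rather than the paper's exact temporal binding network.

```python
import random
import torch
import torch.nn as nn

class TemporalBindingFusion(nn.Module):
    """Fuse modalities sampled within a temporal binding window: each modality
    contributes a feature from a (possibly different) offset inside the shared
    window, and the concatenation is fused with an MLP (illustrative sketch)."""
    def __init__(self, feat_dim, num_modalities, fused_dim):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(feat_dim * num_modalities, fused_dim), nn.ReLU()
        )

    def forward(self, streams, window):  # streams: (B, M, T, D) per-modality features over time
        b, m, t, d = streams.shape
        start = random.randint(0, t - window)
        # each modality picks its own offset inside the shared binding window
        offsets = [start + random.randint(0, window - 1) for _ in range(m)]
        picked = torch.cat([streams[:, i, offsets[i]] for i in range(m)], dim=-1)  # (B, M*D)
        return self.fuse(picked)                                                   # (B, fused_dim)

tbf = TemporalBindingFusion(feat_dim=256, num_modalities=3, fused_dim=512)
bound = tbf(torch.randn(2, 3, 16, 256), window=4)
```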
The "Ego-Exo" framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.