3260 papers • 126 benchmarks • 313 datasets
Action Segmentation is a challenging problem in high-level video understanding. In its simplest form, it aims to temporally segment an untrimmed video and label each segment with one of a set of pre-defined action labels. The resulting segmentation can further serve as input to downstream applications such as video-to-text and action localization. Source: TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation
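Concretely, the output is usually a frame-wise label sequence, which is equivalent to a list of (start, end, action) segments. A minimal sketch of that conversion in plain Python (the labels and frame rate here are purely illustrative):

```python
from itertools import groupby

def frames_to_segments(frame_labels, fps=30):
    """Merge consecutive identical frame-wise labels into
    (start_sec, end_sec, label) segments."""
    segments, t = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        segments.append((t / fps, (t + n) / fps, label))
        t += n
    return segments

# e.g. a 6-frame clip labelled frame by frame at 2 fps
print(frames_to_segments(["pour", "pour", "stir", "stir", "stir", "pour"], fps=2))
# [(0.0, 1.0, 'pour'), (1.0, 2.5, 'stir'), (2.5, 3.0, 'pour')]
```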
These leaderboards are used to track progress in Action Segmentation
Use these libraries to find Action Segmentation models and implementations
A class of temporal models that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection, capturing action compositions, segment durations, and long-range dependencies, and training over an order of magnitude faster than competing LSTM-based recurrent networks.
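One common instantiation of this idea is a stack of dilated 1D convolutions over pre-computed frame features, so the receptive field grows exponentially with depth and covers long-range temporal context. A minimal PyTorch sketch (layer sizes, class count, and the residual wiring are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class DilatedTCN(nn.Module):
    """Stack of dilated 1D convolutions with residual connections;
    the dilation doubles per layer, so the receptive field grows
    exponentially and can capture long-range dependencies."""
    def __init__(self, in_dim=2048, hidden=64, num_classes=19, num_layers=10):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, hidden, kernel_size=1)
        self.layers = nn.ModuleList(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2 ** i, dilation=2 ** i)
            for i in range(num_layers)
        )
        self.out = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):                     # x: (batch, in_dim, num_frames)
        h = self.inp(x)
        for conv in self.layers:
            h = h + torch.relu(conv(h))       # residual dilated block
        return self.out(h)                    # frame-wise class logits

logits = DilatedTCN()(torch.randn(1, 2048, 600))   # shape (1, 19, 600)
```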
This work proposes a new learning approach, MIL-NCE, that addresses the misalignments inherent in narrated videos and outperforms all published self-supervised approaches on these tasks, as well as several fully supervised baselines.
This work designs a group-aware attention module, which can be easily plugged into existing AQA methods, to enrich the clip-wise representations based on contextual group information and achieves state-of-the-art on the LOGO dataset.
A multi-stage architecture for the temporal action segmentation task that achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
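The multi-stage idea is that the first stage predicts frame-wise class probabilities from the input features and each subsequent stage refines the previous stage's output. A simplified PyTorch sketch (the per-stage module and all sizes are placeholders, not the exact MS-TCN architecture):

```python
import torch
import torch.nn as nn

class MultiStage(nn.Module):
    """Chain several frame-wise models: the first stage sees frame features,
    each later stage refines the previous stage's class probabilities."""
    def __init__(self, make_stage, in_dim, num_classes, num_stages=4):
        super().__init__()
        self.first = make_stage(in_dim, num_classes)
        self.rest = nn.ModuleList(
            make_stage(num_classes, num_classes) for _ in range(num_stages - 1)
        )

    def forward(self, x):                      # x: (batch, in_dim, num_frames)
        outputs = [self.first(x)]
        for stage in self.rest:
            outputs.append(stage(torch.softmax(outputs[-1], dim=1)))
        return outputs                         # per-stage logits, all supervised

# `make_stage` could be any frame-wise model; here a tiny conv head for brevity
make_stage = lambda c_in, c_out: nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)
model = MultiStage(make_stage, in_dim=2048, num_classes=19)
final_logits = model(torch.randn(1, 2048, 600))[-1]   # final-stage predictions
```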
The ASRF, a framework for temporal action segmentation, is proposed; it divides the task into frame-wise action classification and action boundary regression, then refines the frame-level action class hypotheses using the predicted action boundaries.
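The boundary-based refinement can be pictured as cutting the timeline at the predicted boundaries and assigning a single class to every frame within each resulting span. A simplified NumPy sketch of that idea (the threshold and the relabeling rule are assumptions for illustration, not the paper's exact procedure):

```python
import numpy as np

def refine_with_boundaries(frame_probs, boundary_prob, thresh=0.5):
    """frame_probs: (T, C) frame-wise class probabilities.
    boundary_prob: (T,) predicted probability that a frame starts a new segment.
    Returns frame labels smoothed within the boundary-delimited spans."""
    T = frame_probs.shape[0]
    cuts = [0] + [t for t in range(1, T) if boundary_prob[t] > thresh] + [T]
    labels = np.empty(T, dtype=int)
    for s, e in zip(cuts[:-1], cuts[1:]):
        labels[s:e] = frame_probs[s:e].sum(axis=0).argmax()  # one label per span
    return labels
```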
A global-to-local search scheme that uses a global search to find coarse receptive field combinations and an expectation-guided iterative local search to refine those combinations effectively.
VideoCLIP is presented, a contrastive approach to pre-train a unified model for zero-shot video and text understanding without using any labels on downstream tasks, achieving state-of-the-art performance that surpasses prior work and in some cases even outperforms supervised approaches.
This work proposes to find better receptive field combinations through a global-to-local search scheme, with an expectation-guided iterative local search to refine the combinations effectively.
This paper introduces a unified framework for video action segmentation via sequence-to-sequence (seq2seq) translation that covers both the fully supervised and the timestamp-supervised setups; the timestamp-supervised setting is handled with a proposed constrained k-medoids algorithm that generates pseudo-segmentations.
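With timestamp supervision, only one annotated frame per action instance is available, and full frame-wise pseudo-labels must be derived from it. As a deliberately naive stand-in for the paper's constrained k-medoids, the sketch below simply assigns each frame to its nearest annotated timestamp, just to illustrate what a pseudo-segmentation is (this is not the paper's algorithm):

```python
import numpy as np

def pseudo_segmentation(num_frames, timestamps, labels):
    """timestamps: sorted frame indices, one per action instance.
    labels: the action label annotated at each timestamp.
    Returns frame-wise pseudo-labels by nearest-timestamp assignment
    (a naive substitute for the paper's constrained k-medoids)."""
    ts = np.asarray(timestamps)
    frames = np.arange(num_frames)[:, None]            # (T, 1)
    nearest = np.abs(frames - ts[None, :]).argmin(1)   # (T,)
    return [labels[i] for i in nearest]

print(pseudo_segmentation(10, [1, 6], ["cut", "mix"]))
# ['cut', 'cut', 'cut', 'cut', 'mix', 'mix', 'mix', 'mix', 'mix', 'mix']
```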