Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video given a natural language query. Different levels of supervision are used for this task: 1) weak supervision: a video-level action category set; 2) semi-weak supervision: a video-level action category set plus action annotations at several timestamps; 3) full supervision: action category and action interval annotations for all actions in the untrimmed video.
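A minimal sketch (in Python, with illustrative names only, not tied to any benchmark toolkit) of how training examples under the three supervision levels could be structured:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TSGExample:
    video_id: str
    query: str                                # natural language sentence
    # Full supervision: start/end of the target moment, in seconds.
    interval: Optional[Tuple[float, float]] = None
    # Semi-weak supervision: a few annotated timestamps inside the moment.
    timestamps: Optional[List[float]] = None
    # Weak supervision: only a video-level action category set.
    categories: Optional[List[str]] = None
```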
A Mutual Matching Network (MMN) is presented to directly model the similarity between language queries and video moments in a joint embedding space, suggesting that metric learning is still a promising method for temporal grounding via capturing the essential cross-modal correlation in a joint embedding space.
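A minimal sketch of the metric-learning idea described above: matched moment/query pairs are pulled together and mismatched pairs pushed apart in a shared embedding space. The symmetric InfoNCE-style loss and the temperature are assumptions, not MMN's actual formulation:

```python
import torch
import torch.nn.functional as F

def mutual_matching_loss(moment_emb, query_emb, tau=0.07):
    """Cross-modal metric learning in a joint embedding space (a sketch in the
    spirit of MMN, not the authors' implementation). moment_emb and query_emb
    are (B, D); row i of each is a matched moment/query pair."""
    m = F.normalize(moment_emb, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    sim = m @ q.t() / tau                     # (B, B) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    # Mutual matching: contrast in both directions (moment->query, query->moment).
    return F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)
```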
A novel semantic conditioned dynamic modulation mechanism that leverages the sentence semantics to modulate the temporal convolution operations for better correlating and composing the sentence-relevant video contents over time.
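One way to read "sentence-conditioned modulation of temporal convolutions" is a FiLM-style scale-and-shift computed from the sentence embedding; the sketch below is that reading under assumed shapes, not the paper's code:

```python
import torch
import torch.nn as nn

class SentenceModulatedConv(nn.Module):
    """Sketch of sentence-conditioned modulation of a temporal convolution:
    the sentence embedding produces per-channel scale/shift applied to the
    conv response."""
    def __init__(self, channels, sent_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.to_scale = nn.Linear(sent_dim, channels)
        self.to_shift = nn.Linear(sent_dim, channels)

    def forward(self, video_feats, sent_emb):
        # video_feats: (B, C, T), sent_emb: (B, sent_dim)
        h = self.conv(video_feats)
        gamma = self.to_scale(sent_emb).unsqueeze(-1)   # (B, C, 1)
        beta = self.to_shift(sent_emb).unsqueeze(-1)    # (B, C, 1)
        return torch.relu(gamma * h + beta)
```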
A series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models.
A novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism, and significantly outperforms the state of the art on three public datasets, demonstrating the effectiveness of the proposed localization framework.
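A sketch of biaffine span scoring under assumed shapes (frame features of size D, a score for every start/end pair with start ≤ end); the projection heads and masking details are assumptions:

```python
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    """Scores every (start, end) index pair with a biaffine form."""
    def __init__(self, dim):
        super().__init__()
        self.start_proj = nn.Linear(dim, dim)
        self.end_proj = nn.Linear(dim, dim)
        self.W = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) -> scores: (B, T, T); entry (i, j) rates span i..j
        s = self.start_proj(frame_feats)                # (B, T, D)
        e = self.end_proj(frame_feats)                  # (B, T, D)
        scores = torch.einsum('bid,de,bje->bij', s, self.W, e)
        # Keep only valid spans with start <= end (upper triangle incl. diagonal).
        mask = torch.ones_like(scores).triu()
        return scores.masked_fill(mask == 0, float('-inf'))
```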
Temporal sentence grounding aims to detect the most salient moment corresponding to the natural language query from untrimmed videos. As labeling the temporal boundaries is labor-intensive and subjective, weakly-supervised methods have recently received increasing attention. Most of the existing weakly-supervised methods generate the proposals by sliding windows, which are content-independent and of low quality. Moreover, they train their model to distinguish positive visual-language pairs from negative ones randomly collected from other videos, ignoring the highly confusing video segments within the same video. In this paper, we propose Contrastive Proposal Learning (CPL) to overcome the above limitations. Specifically, we use multiple learnable Gaussian functions to generate both positive and negative proposals within the same video that can characterize the multiple events in a long video. Then, we propose a controllable easy-to-hard negative proposal mining strategy to collect negative samples within the same video, which can ease the model optimization and enables CPL to distinguish highly confusing scenes. The experiments show that our method achieves state-of-the-art performance on Charades-STA and ActivityNet Captions datasets. The code and models are available at https://github.com/minghangz/cpl.
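A sketch of the learnable-Gaussian proposal idea: each (center, width) pair in (0, 1) yields a soft per-frame proposal mask. The parameterization below is an assumption; CPL's actual implementation lives at the repository linked above:

```python
import torch

def gaussian_proposal_weights(centers, widths, num_frames):
    """Turns learnable (center, width) pairs in (0, 1) into per-frame proposal
    weights, one soft mask per proposal."""
    # centers, widths: (B, K); returns (B, K, T) frame weights.
    t = torch.linspace(0, 1, num_frames, device=centers.device)   # (T,)
    diff = t.view(1, 1, -1) - centers.unsqueeze(-1)               # (B, K, T)
    weights = torch.exp(-0.5 * (diff / widths.unsqueeze(-1).clamp(min=1e-3)) ** 2)
    # Normalize so each proposal mask peaks at 1.
    return weights / weights.max(dim=-1, keepdim=True).values
```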
This study investigates a recently proposed glance-supervised temporal sentence grounding task, which requires only a single frame annotation (referred to as a glance annotation) for each query, and proposes a Dynamic Gaussian prior based Grounding framework with Glance annotation that outperforms state-of-the-art weakly supervised methods by a large margin and narrows the performance gap to fully supervised methods.
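For illustration, a glance annotation can be derived from a fully annotated interval by keeping a single timestamp inside it; the uniform sampling below is an assumption about that setup, not the paper's procedure:

```python
import random

def to_glance_annotation(start: float, end: float) -> float:
    """Reduces a full interval annotation to a glance annotation:
    a single timestamp known to lie inside the target moment."""
    return random.uniform(start, end)
```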
Two novel methods are proposed: a TwinNet structure that enables the model to learn about upcoming events; and a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames that are relevant to the query.
This paper proposes to artificially merge clips to train for temporal grounding in a contrastive manner using text-conditioned attention; this Clip Merging (CliMer) approach is shown to be effective when compared with a high-performing TSG method.
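A minimal sketch of the clip-merging idea: two clips are concatenated into one training video, and the grounding target for the first clip's caption becomes the corresponding segment of the merged video. Plain concatenation and the normalized span are assumptions, not the CliMer recipe:

```python
import torch

def merge_clips(feats_a, feats_b):
    """Artificially merges two clips into one training video; the grounding
    target for clip A's caption is the first segment of the merged video."""
    # feats_a: (Ta, D), feats_b: (Tb, D)
    merged = torch.cat([feats_a, feats_b], dim=0)       # (Ta + Tb, D)
    total = merged.size(0)
    # Normalized target span for the query that described clip A.
    target = (0.0, feats_a.size(0) / total)
    return merged, target
```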
A boundary-aligned moment detection transformer, equipped with a dual-pathway decoding process that refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively, enabling precise refinement of moment predictions.
To learn a moderately coupled Gaussian mixture capturing diverse events, this work newly proposes a pull-push learning scheme using pulling and pushing losses, each of which plays a role opposite to the other.
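A sketch of how two opposing losses over K Gaussian component masks could be set up: the pulling term rewards overlap so components jointly cover an event, while the pushing term caps overlap so they stay diverse. The cosine-overlap measure and the margin are assumptions, not the paper's formulation:

```python
import torch

def pull_push_losses(weights, margin=0.3):
    """Opposing pull/push objectives over K component masks (weights: (K, T)).
    Minimizing `pull` raises pairwise overlap; minimizing `push` penalizes
    overlap above the margin, keeping components diverse."""
    w = weights / weights.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    overlap = w @ w.t()                                  # (K, K) cosine overlap
    K = overlap.size(0)
    off_diag = overlap[~torch.eye(K, dtype=torch.bool)]  # all pairs i != j
    pull = (1.0 - off_diag).mean()
    push = torch.relu(off_diag - margin).mean()
    return pull, push
```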