This paper designs an adaptive cross-attention layer with dummy tokens and uses a moment-adaptive saliency detector to exploit each video clip’s degree of text engagement. It validates the superiority of CG-DETR with state-of-the-art results on various benchmarks for both moment retrieval and highlight detection.
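
To make the idea of cross-attention with dummy tokens more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that dummy tokens are appended to the text tokens as extra attention keys, so that a clip unrelated to the query can place its attention on the dummies instead of the text, and that only text tokens contribute values to the output. The function names (`cross_attention_with_dummies`, `softmax`, `dot`), the absence of learned projections, and the use of the summed text attention as a "text engagement" score are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_with_dummies(clips, text_tokens, dummy_tokens):
    """Hypothetical sketch: clips attend over text tokens plus dummy tokens.

    clips, text_tokens, dummy_tokens are lists of d-dimensional vectors.
    Returns (outputs, engagement), where engagement[i] is the share of
    clip i's attention that landed on real text tokens.
    """
    d = len(text_tokens[0])
    n_text = len(text_tokens)
    # Assumption: dummies act only as keys, never as values.
    keys = text_tokens + dummy_tokens

    outputs, engagement = [], []
    for clip in clips:
        # Scaled dot-product scores against text and dummy keys together.
        weights = softmax([dot(clip, k) / math.sqrt(d) for k in keys])
        text_weights = weights[:n_text]
        # Aggregate text values only; weight absorbed by dummies is dropped.
        out = [
            sum(w * tok[j] for w, tok in zip(text_weights, text_tokens))
            for j in range(d)
        ]
        outputs.append(out)
        engagement.append(sum(text_weights))
    return outputs, engagement


if __name__ == "__main__":
    clips = [[4.0, 0.0], [0.0, 4.0]]   # first clip aligns with the text token
    text_tokens = [[1.0, 0.0]]
    dummy_tokens = [[0.0, 1.0]]
    outputs, engagement = cross_attention_with_dummies(
        clips, text_tokens, dummy_tokens
    )
    print(engagement)
```

In this toy setup the first clip is aligned with the text token and keeps most of its attention on text, while the second clip is aligned with the dummy token and its text engagement is close to zero. A per-clip score of this kind is one plausible signal a saliency detector could build on, though the summary above does not specify how CG-DETR computes it.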