[1] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
[2] CoVR: Learning Composed Video Retrieval from Web Video Captions
[3] Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking
[4] Vision-by-Language for Training-Free Compositional Image Retrieval
[5] Llama 2: Open Foundation and Fine-Tuned Chat Models
[6] Visual Instruction Tuning
[7] Zero-Shot Composed Image Retrieval with Textual Inversion
[8] CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
[9] LLaMA: Open and Efficient Foundation Language Models
[10] Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
[11] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
[12] Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
[13] Learning Video Representations from Large Language Models
[14] Fine-tuned CLIP Models are Efficient Video Learners
[15] InstructPix2Pix: Learning to Follow Image Editing Instructions
[16] EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
[17] LAION-5B: An open large-scale dataset for training next generation image-text models
[18] Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
[19] CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment
[20] TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
[21] X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
[22] FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
[23] Effective conditioned and composed image retrieval combining CLIP-based features
[24] A CLIP-Hitchhiker's Guide to Long Video Retrieval
[25] Flamingo: a Visual Language Model for Few-Shot Learning
[26] ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
[27] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[28] Align and Prompt: Video-and-Language Pre-training with Entity Prompts
[29] Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
[30] FILIP: Fine-grained Interactive Language-Image Pre-Training
[31] VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
[32] TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
[33] Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models
[34] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
[35] CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
[36] Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
[37] CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback
[38] Dual Compositional Learning in Interactive Image Retrieval
[39] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
[40] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
[41] RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network
[42] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
[43] Learning Transferable Visual Models From Natural Language Supervision
[44] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[45] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
[46] SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval
[47] TRACE: Transform Aggregate and Compose Visiolinguistic Representations for Image Search with Text Feedback
[48] Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval
[49] Modality-Agnostic Attention Fusion for visual search with text feedback
[50] Image Search With Text Feedback by Visiolinguistic Attention Learning
[51] Language Models are Few-Shot Learners
[52] Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training
[53] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
[54] Speech2Action: Cross-Modal Supervision for Action Recognition
[55] CurlingNet: Compositional Learning between Images and Text for Fashion IQ Data
[56] End-to-End Learning of Visual Representations From Uncurated Instructional Videos
[57] UNITER: UNiversal Image-TExt Representation Learning
[58] VL-BERT: Pre-training of Generic Visual-Linguistic Representations
[59] HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
[60] Composing Text and Image for Image Retrieval - an Empirical Odyssey
[61] A Corpus for Reasoning about Natural Language Grounded in Photographs
[62] Representation Learning with Contrastive Predictive Coding
[63] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[64] Hierarchical Neural Story Generation
[65] Dialog-based Interactive Image Retrieval
[66] Decoupled Weight Decay Regularization
[67] Billion-Scale Similarity Search with GPUs
[68] VQA: Visual Question Answering
[69] Microsoft COCO: Common Objects in Context
[70] BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions
[71] CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval
[72] Luminosoinsight/wordfreq: v2.2
Our code, datasets, and models are publicly available at https
Cordelia Schmid (Fellow, IEEE) received the MS degree in computer science from the University of Karlsruhe, and the Doctorate degree in computer science from the Institut National Polytechnique de Grenoble.
This article has supplementary downloadable material.
Aerial view of forest && Aerial view autumn forest -> Change season to autumn
Aerial view of a sailboat anchored in the Mediterranean Sea