[1] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
[2] CoVR: Learning Composed Video Retrieval from Web Video Captions
[3] Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking
[4] Vision-by-Language for Training-Free Compositional Image Retrieval
[5] Llama 2: Open Foundation and Fine-Tuned Chat Models
[6] Visual Instruction Tuning
[7] Zero-Shot Composed Image Retrieval with Textual Inversion
[8] CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
[9] LLaMA: Open and Efficient Foundation Language Models
[10] Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
[11] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
[12] Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
[13] Learning Video Representations from Large Language Models
[14] Fine-tuned CLIP Models are Efficient Video Learners
[15] InstructPix2Pix: Learning to Follow Image Editing Instructions
[16] EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
[17] LAION-5B: An open large-scale dataset for training next generation image-text models
[18] Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
[19] CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment
[20] TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
[21] X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
[22] FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
[23] Effective conditioned and composed image retrieval combining CLIP-based features
[24] A CLIP-Hitchhiker's Guide to Long Video Retrieval
[25] Flamingo: a Visual Language Model for Few-Shot Learning
[26] ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
[27] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[28] Align and Prompt: Video-and-Language Pre-training with Entity Prompts
[29] Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
[30] FILIP: Fine-grained Interactive Language-Image Pre-Training
[31] VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
[32] TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
[33] Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models
[34] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
[35] CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
[36] Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
[37] CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback
[38] Dual Compositional Learning in Interactive Image Retrieval
[39] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
[40] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
[41] RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network
[42] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
[43] Learning Transferable Visual Models From Natural Language Supervision
[44] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
[45] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
[46] SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval
[47] TRACE: Transform Aggregate and Compose Visiolinguistic Representations for Image Search with Text Feedback
[48] Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval
[49] Modality-Agnostic Attention Fusion for visual search with text feedback
[50] Image Search With Text Feedback by Visiolinguistic Attention Learning
[51] Language Models are Few-Shot Learners
[52] Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training
[53] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
[54] Speech2Action: Cross-Modal Supervision for Action Recognition
[55] CurlingNet: Compositional Learning between Images and Text for Fashion IQ Data
[56] End-to-End Learning of Visual Representations From Uncurated Instructional Videos
[57] UNITER: UNiversal Image-TExt Representation Learning
[58] VL-BERT: Pre-training of Generic Visual-Linguistic Representations
[59] HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
[60] Composing Text and Image for Image Retrieval - an Empirical Odyssey
[61] A Corpus for Reasoning about Natural Language Grounded in Photographs
[62] Representation Learning with Contrastive Predictive Coding
[63] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[64] Hierarchical Neural Story Generation
[65] Dialog-based Interactive Image Retrieval
[66] Decoupled Weight Decay Regularization
[67] Billion-Scale Similarity Search with GPUs
[68] VQA: Visual Question Answering
[69] Microsoft COCO: Common Objects in Context
[70] BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions
[71] CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval
[72] Luminosoinsight/wordfreq: v2.2
Our code, datasets, and models are publicly available at https
Cordelia Schmid (Fellow, IEEE) received the MS degree in computer science from the University of Karlsruhe, and the Doctorate degree in computer science from the Institut National Polytechnique de Grenoble.
This article has supplementary downloadable material.
Aerial view of forest && Aerial view autumn forest -> Change season to autumn
Aerial view of a sailboat anchored in the Mediterranean Sea