1. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
2. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
3. HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
4. VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
5. InternVideo: General Video Foundation Models via Generative and Discriminative Learning
6. UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
7. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
8. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
9. OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
10. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
11. Clover: Towards A Unified Video-Language Alignment and Fusion Model
12. X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
13. Visual Question Answering: From Theory to Application
14. Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
15. LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
16. Revealing Single Frame Bias for Video-and-Language Learning
17. GIT: A Generative Image-to-text Transformer for Vision and Language
18. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
19. CoCa: Contrastive Captioners are Image-Text Foundation Models
20. Flamingo: a Visual Language Model for Few-Shot Learning
21. DeiT III: Revenge of the ViT
22. Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)
23. All in One: Exploring Unified Video-Language Pre-Training
24. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
25. GroupViT: Semantic Segmentation Emerges from Text Supervision
26. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
27. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
28. End-to-end Generative Pretraining for Multimodal Video Captioning
29. Bridging Video-text Retrieval with Multiple Choice Questions
30. A ConvNet for the 2020s
31. Align and Prompt: Video-and-Language Pre-training with Entity Prompts
32. Co-training Transformer with Videos and Images Improves Action Recognition
33. FLAVA: A Foundational Language And Vision Alignment Model
34. Grounded Language-Image Pre-training
35. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
36. SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
37. Scaling Up Vision-Language Pretraining for Image Captioning
38. VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
39. Florence: A New Foundation Model for Computer Vision
40. Swin Transformer V2: Scaling Up Capacity and Resolution
41. Masked Autoencoders Are Scalable Vision Learners
42. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
43. CLIP4Caption: CLIP for Video Caption
44. Pix2seq: A Language Modeling Framework for Object Detection
45. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
46. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
47. CBNet: A Composite Backbone Network Architecture for Object Detection
49. TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
50. Scaling Vision Transformers
51. MERLOT: Multimodal Neural Script Knowledge Models
52. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
53. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
54. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
55. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
56. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
57. ViViT: A Video Vision Transformer
58. MoViNets: Mobile Video Networks for Efficient Video Recognition
59. Learning Transferable Visual Models From Natural Language Supervision
60. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
61. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
62. Is Space-Time Attention All You Need for Video Understanding?
63. VinVL: Making Visual Representations Matter in Vision-Language Models
64. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
65. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
66. Just Ask: Learning to Answer Questions from Millions of Narrated Videos
67. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
68. Large-Scale Adversarial Training for Vision-and-Language Representation Learning
69. DeBERTa: Decoding-enhanced BERT with Disentangled Attention
70. Language Models are Few-Shot Learners
71. End-to-End Object Detection with Transformers
72. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
73. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
74. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
75. ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training
76. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
77. PyTorch: An Imperative Style, High-Performance Deep Learning Library
78. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
79. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
80. Objects365: A Large-Scale, High-Quality Dataset for Object Detection
81. UNITER: UNiversal Image-TExt Representation Learning
82. VL-BERT: Pre-training of Generic Visual-Linguistic Representations
83. LXMERT: Learning Cross-Modality Encoder Representations from Transformers
84. VisualBERT: A Simple and Performant Baseline for Vision and Language
85. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
86. Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
87. RoBERTa: A Robustly Optimized BERT Pretraining Approach
88. XLNet: Generalized Autoregressive Pretraining for Language Understanding
89. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
90. Unified Language Model Pre-training for Natural Language Understanding and Generation
91. MASS: Masked Sequence to Sequence Pre-training for Language Generation
92. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
94. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
95. Decoupled Weight Decay Regularization
96. Video Question Answering via Gradually Refined Attention over Appearance and Motion
97. Localizing Moments in Video with Natural Language
98. Attention is All you Need
99. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
101. Modeling Context in Referring Expressions
102. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
103. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
104. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
105. Deep Residual Learning for Image Recognition
106. Generation and Comprehension of Unambiguous Object Descriptions
107. A Neural Attention Model for Abstractive Sentence Summarization
108. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books
109. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
110. A Dataset for Movie Description
111. Deep Visual-Semantic Alignments for Generating Image Descriptions
112. Microsoft COCO: Common Objects in Context
113. Im2Text: Describing Images Using 1 Million Captioned Photographs
114. Collecting Highly Parallel Data for Paraphrase Evaluation
115. ImageNet: A Large-Scale Hierarchical Image Database
118. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
119. Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
121. ELECTRA: Pre-training Text Encoders as Discriminators Rather than Generators
123. The Kinetics Human Action Video Dataset

Dataset Description
Text-to-Video Retrieval: We evaluate mPLUG-2 on three popular text-to-video retrieval datasets, including MSRVTT (Xu et al., 2016). MSRVTT is divided into 9K and 1K videos for training and testing. For paragraph-to-video retrieval, we concatenate all descriptions of a video into a single paragraph and use it as the query.
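To make the paragraph-to-video protocol above concrete, here is a minimal sketch, not the paper's implementation: the helper names (build_paragraph_queries, paragraph_to_video_recall_at_k) and the toy word-overlap scorer are hypothetical stand-ins for whatever text-video similarity model is actually used; only the idea of concatenating a video's descriptions into one query comes from the text.

```python
from typing import Callable, Dict, List, Sequence


def build_paragraph_queries(captions_per_video: Dict[str, List[str]]) -> Dict[str, str]:
    """Concatenate every description of a video into one paragraph query."""
    return {vid: " ".join(caps) for vid, caps in captions_per_video.items()}


def paragraph_to_video_recall_at_k(
    captions_per_video: Dict[str, List[str]],
    candidate_videos: Sequence[str],
    score: Callable[[str, str], float],  # score(paragraph, video_id) -> similarity
    k: int = 1,
) -> float:
    """Each paragraph query should rank its own video within the top-k candidates."""
    queries = build_paragraph_queries(captions_per_video)
    hits = 0
    for vid, paragraph in queries.items():
        ranked = sorted(candidate_videos, key=lambda v: score(paragraph, v), reverse=True)
        hits += int(vid in ranked[:k])
    return hits / len(queries)


if __name__ == "__main__":
    captions = {
        "v1": ["a dog runs in the park", "the dog chases a ball"],
        "v2": ["a person cooks pasta in a kitchen"],
    }
    # Toy scorer: word overlap with a stand-in textual summary of each video.
    video_text = {"v1": "dog ball park", "v2": "kitchen pasta cooking"}

    def toy_score(paragraph: str, video_id: str) -> float:
        return len(set(paragraph.split()) & set(video_text[video_id].split()))

    print(paragraph_to_video_recall_at_k(captions, list(captions), toy_score, k=1))  # 1.0
```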
Image Classification: ImageNet-1K contains 1.28M training images and 50K validation images from 1,000 classes.
Implementation details: C = 768 and C = 1024 for mPLUG-2Base and mPLUG-2, respectively. We set S = 2 for the universal layers for good empirical performance, and choose G = C for the multi-group mechanism.
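A minimal sketch of how these hyperparameters might be gathered into a configuration object follows. The class and field names (MPlug2Config, hidden_size, num_universal_layers, num_groups) are hypothetical; only the values of C, S, and G are taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class MPlug2Config:
    hidden_size: int            # C: width of the shared universal layers
    num_universal_layers: int   # S: number of universal layers
    num_groups: int             # G: group count of the multi-group mechanism (the text sets G = C)


def make_config(base: bool) -> MPlug2Config:
    c = 768 if base else 1024   # C = 768 for mPLUG-2Base, C = 1024 for mPLUG-2
    return MPlug2Config(hidden_size=c, num_universal_layers=2, num_groups=c)


if __name__ == "__main__":
    print(make_config(base=True))
    print(make_config(base=False))
```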