1
LAION-5B: An open large-scale dataset for training next generation image-text models
2
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
3
Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis
4
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
5
Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
6
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
7
Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization
8
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
9
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
10
SelfReformer: Self-Refined Network with Transformer for Salient Object Detection
11
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
12
Hierarchical Text-Conditional Image Generation with CLIP Latents
13
MatteFormer: Transformer-Based Image Matting via Prior-Tokens
14
Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model
15
A Unified Transformer Framework for Group-Based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection
16
Detecting Twenty-thousand Classes using Image-level Supervision
17
High-Resolution Image Synthesis with Latent Diffusion Models
18
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
19
Image Segmentation Using Text and Image Prompts
20
RegionCLIP: Region-based Language-Image Pretraining
21
Vector Quantized Diffusion Model for Text-to-Image Synthesis
22
Florence: A New Foundation Model for Computer Vision
23
Swin Transformer V2: Scaling Up Capacity and Resolution
24
On Model Calibration for Long-Tailed Object Detection and Instance Segmentation
25
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
26
Dynamic Head: Unifying Object Detection Heads with Attentions
27
CogView: Mastering Text-to-Image Generation via Transformers
28
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
29
Probabilistic two-stage detection
30
Learning Transferable Visual Models From Natural Language Supervision
31
FASA: Feature Augmentation and Sampling Adaptation for Long-Tailed Instance Segmentation
32
Zero-Shot Text-to-Image Generation
33
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
34
Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection
35
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
36
Open-Vocabulary Object Detection Using Captions
37
1st Place Solution of LVIS Challenge 2020: A Good Box is not a Guarantee of a Good Mask
38
Seesaw Loss for Long-Tailed Instance Segmentation
39
The Devil is in Classification: A Simple Framework for Long-tail Instance Segmentation
40
A survey on instance segmentation: state of the art
41
U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection
42
InstaBoost: Boosting Instance Segmentation via Probability Map Guided Copy-Pasting
43
LVIS: A Dataset for Large Vocabulary Instance Segmentation
44
Photorealistic Image Synthesis for Object Instance Detection
45
Modeling Visual Context is Key to Augmenting Object Detection Datasets
46
Bootstrapping the Performance of Webly Supervised Semantic Segmentation
47
Exploring the Limits of Weakly Supervised Pretraining
48
On Pre-Trained Image Features and Synthetic Images for Deep Learning
50
Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection
51
Webly Supervised Semantic Segmentation
53
Weakly Supervised Semantic Segmentation Using Web-Crawled Videos
54
Playing for Data: Ground Truth from Computer Games
55
Instance-Aware Semantic Segmentation via Multi-task Network Cascades
56
Deep Residual Learning for Image Recognition
57
STC: A Simple to Complex Framework for Weakly-Supervised Semantic Segmentation
58
Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views
59
Microsoft COCO: Common Objects in Context
60
ImageNet: A large-scale hierarchical image database
61
PromptDet: Expand Your Detector Vocabulary with Uncurated Images
62
Comparison of open-vocabulary detection performance on LVIS; ∗ marks methods that use external supervised data for training
63
Open-Vocabulary Image Segmentation
64
X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion