Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (2023-08-24)

TL;DR

The Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images, set new records for generalist models of similar scale on a broad range of visual-centric benchmarks.

Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement the grounding and text-reading abilities of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models of similar scale on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and in different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.

Authors

Jingren Zhou

Shuai Bai

Junyang Lin
