1. Baichuan 2: Open Large-scale Language Models
2. MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
3. MLLM-DataEngine: An Iterative Refinement Approach for MLLM
4. VIGC: Visual Instruction Generation and Correction
5. WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
6. BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
7. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
8. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
9. Llama 2: Open Foundation and Fine-Tuned Chat Models
10. MMBench: Is Your Multi-modal Model an All-around Player?
11. What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
12. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
13. Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
14. Kosmos-2: Grounding Multimodal Large Language Models to the World
15. A Survey on Multimodal Large Language Models
16. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
17. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
18. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
19. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
20. Otter: A Multi-Modal Model with In-Context Instruction Tuning
21. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
22. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
23. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
24. Visual Instruction Tuning
25. Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text
27. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
28. PaLM-E: An Embodied Multimodal Language Model
29. LLaMA: Open and Efficient Foundation Language Models
30. Grounding Language Models to Images for Multimodal Inputs and Outputs
31. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
32. REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
33. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
34. Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
35. LAION-5B: An Open Large-scale Dataset for Training Next Generation Image-Text Models
36. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
37. GLIPv2: Unifying Localization and Vision-Language Understanding
38. A-OKVQA: A Benchmark for Visual Question Answering Using World Knowledge
39. GIT: A Generative Image-to-text Transformer for Vision and Language
40. OPT: Open Pre-trained Transformer Language Models
41. Visual Spatial Reasoning
42. PaLM: Scaling Language Modeling with Pathways
43. Training Language Models to Follow Instructions with Human Feedback
44. Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
45. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
46. Grounded Language-Image Pre-training
47. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
48. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
49. LoRA: Low-Rank Adaptation of Large Language Models
50. GLM: General Language Model Pretraining with Autoregressive Blank Infilling
51. Learning Transferable Visual Models From Natural Language Supervision
52. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts
53. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision
54. Language Models are Few-Shot Learners
55. TextCaps: A Dataset for Image Captioning with Reading Comprehension
56. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
57. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
58. OCR-VQA: Visual Question Answering by Reading Text in Images
59. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
60. Towards VQA Models That Can Read
61. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
62. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning
63. Attention Is All You Need
65. VQA: Visual Question Answering
66. Microsoft COCO Captions: Data Collection and Evaluation Server
67. Im2Text: Describing Images Using 1 Million Captioned Photographs
68. ImageNet: A Large-scale Hierarchical Image Database
69. Aligning Large Multi-Modal Model with Robust Instruction Tuning
70. Retrieval-Augmented Multimodal Language Modeling
71. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
72. Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
73. TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training
74. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
75. Improving Language Understanding by Generative Pre-Training
76. Stanford Alpaca: An Instruction-Following LLaMA Model
77. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities
78. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
80. Timothée Lacroix, Baptiste
81. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
82. Introducing Qwen-7B: Open Foundation and Human-Aligned Models (of the State-of-the-Arts)