1
MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning
2
Improving Compositional Text-to-image Generation with Large Vision-Language Models
3
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning
4
Improved Baselines with Visual Instruction Tuning
5
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
6
DreamLLM: Synergistic Multimodal Comprehension and Creation
7
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
8
ImageBind-LLM: Multi-modality Instruction Tuning
9
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
10
PointLLM: Empowering Large Language Models to Understand Point Clouds
11
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
12
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
13
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
14
Llama 2: Open Foundation and Fine-Tuned Chat Models
15
MMBench: Is Your Multi-modal Model an All-around Player?
16
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
17
Kosmos-2: Grounding Multimodal Large Language Models to the World
18
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
19
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
20
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
21
PandaGPT: One Model To Instruction-Follow Them All
22
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
23
Evaluating Object Hallucination in Large Vision-Language Models
24
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
25
ImageBind One Embedding Space to Bind Them All
26
Otter: A Multi-Modal Model With In-Context Instruction Tuning
27
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
28
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
29
Visual Instruction Tuning
30
DINOv2: Learning Robust Visual Features without Supervision
31
Inpaint Anything: Segment Anything Meets Image Inpainting
32
Instruction Tuning with GPT-4
34
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
35
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
37
Universal Instance Perception as Object Discovery and Retrieval
38
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
39
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
40
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
41
Language Is Not All You Need: Aligning Perception with Language Models
42
LLaMA: Open and Efficient Foundation Language Models
43
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
44
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
45
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
46
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
47
OPT: Open Pre-trained Transformer Language Models
48
Visual Spatial Reasoning
49
Flamingo: a Visual Language Model for Few-Shot Learning
50
Training language models to follow instructions with human feedback
51
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
52
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
53
High-Resolution Image Synthesis with Latent Diffusion Models
54
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
55
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
56
Resolution-robust Large Mask Inpainting with Fourier Convolutions
57
Learning Transferable Visual Models From Natural Language Supervision
58
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
59
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
60
Language Models are Few-Shot Learners
61
TextCaps: a Dataset for Image Captioning with Reading Comprehension
62
OCR-VQA: Visual Question Answering by Reading Text in Images
63
LVIS: A Dataset for Large Vocabulary Instance Segmentation
64
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
65
Towards VQA Models That Can Read
66
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
67
VizWiz Grand Challenge: Answering Visual Questions from Blind People
68
Attention is All you Need
69
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
70
Deep Residual Learning for Image Recognition
71
Generation and Comprehension of Unambiguous Object Descriptions
72
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
73
VQA: Visual Question Answering
74
ReferItGame: Referring to Objects in Photographs of Natural Scenes
75
ImageNet Large Scale Visual Recognition Challenge
76
Microsoft COCO: Common Objects in Context
77
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
78
3D-LLM: Injecting the 3D World into Large Language Models
79
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
80
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
81
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
82
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
83
Language Models are Unsupervised Multitask Learners
84
Improving Language Understanding by Generative Pre-Training
85
Adam: A Method for Stochastic Optimization
87
Introducing our multimodal models
88
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
89
Stanford Alpaca: An Instruction-Following LLaMA Model
91
OpenCompass: A Universal Evaluation Platform for Foundation Models