Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models (2023-12-11)

TL;DR

Compared with the popular BLIP-2, MiniGPT4, and LLaVA, Vary retains its vanilla capabilities while gaining stronger fine-grained perception and understanding, and it supports new document parsing features (OCR or markdown conversion).

Abstract

Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary, CLIP, which covers most common vision tasks. However, for some special vision tasks that need dense and fine-grained visual perception, e.g., document-level OCR or chart understanding, especially in non-English scenarios, the CLIP-style vocabulary may tokenize visual knowledge inefficiently and may even suffer from an out-of-vocabulary problem. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two phases: the generation and the integration of a new vision vocabulary. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the second, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLMs to quickly garner new features. Compared with the popular BLIP-2, MiniGPT4, and LLaVA, Vary retains its vanilla capabilities while gaining stronger fine-grained perception and understanding. Specifically, Vary is competent in new document parsing features (OCR or markdown conversion) while achieving 78.2% ANLS on DocVQA and 36.2% on MMVet. Our code will be publicly available on the homepage.
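The DocVQA result above is reported in ANLS (Average Normalized Levenshtein Similarity). As a minimal sketch of how that metric is commonly computed, assuming the usual threshold of 0.5 and case-insensitive, whitespace-stripped comparison, the following may help in reading the number. The function names `levenshtein` and `anls` are illustrative and are not taken from the Vary codebase.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best per-answer similarity.

    Assumption: a similarity is 1 - normalized distance when the normalized
    distance is below `threshold`, and 0 otherwise.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            g = answer.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not DocVQA data.
    print(anls(["vary", "xyz"], [["Vary"], ["abc"]]))
```

Under these assumptions, a prediction that differs from the closest ground-truth answer by half or more of its characters contributes zero, so the score rewards near-exact reads of the document text rather than loose paraphrases.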

Authors

Chunrui Han

Zheng Ge

Xiangyu Zhang
