Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Published in

Conference on Empirical Methods in Natural Lang...(2023)

TL;DR

This work unifies visual representation into the language feature space to advance the foundational LLM towards a unified LVLM, and establishes a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos that mutually enhance each other.

Abstract

Large Vision-Language Models (LVLMs) have enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos that mutually enhance each other. Video-LLaVA outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Additionally, Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM.

Authors

Bin Lin

Munan Ning

Bin Zhu

References (57 items)

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Improved Baselines with Visual Instruction Tuning

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

ImageBind-LLM: Multi-modality Instruction Tuning

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Llama 2: Open Foundation and Fine-Tuned Chat Models

MMBench: Is Your Multi-modal Model an All-around Player?

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

Valley: Video Assistant with Large Language model Enhanced abilitY

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Evaluating Object Hallucination in Large Vision-Language Models

PaLM 2 Technical Report

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

VideoChat: Chat-Centric Video Understanding

ImageBind: One Embedding Space to Bind Them All

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Otter: A Multi-Modal Model With In-Context Instruction Tuning

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Visual Instruction Tuning

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

ViperGPT: Visual Inference via Python Execution for Reasoning

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

LLaMA: Open and Efficient Foundation Language Models

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Flamingo: a Visual Language Model for Few-Shot Learning

Training language models to follow instructions with human feedback

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Masked Autoencoders Are Scalable Vision Learners

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Language Models are Few-Shot Learners

PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

Towards VQA Models That Can Read

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

VizWiz Grand Challenge: Answering Visual Questions from Blind People

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Gaussian Error Linear Units (GELUs)

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

Collecting Highly Parallel Data for Paraphrase Evaluation

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Open-clip

An open web-scale filtered dataset

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (2024)

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality (2023)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models (2023)

Stanford Alpaca: An Instruction-following LLaMA Model (2023)

Field of Study

Computer Science

Journal Information

Name

ArXiv

Volume

abs/2311.10122

Venue Information

Name

Conference on Empirical Methods in Natural Language Processing

Type

conference

URL

https://www.aclweb.org/portal/emnlp

Alternate Names

Empir Method Nat Lang Process
Empirical Methods in Natural Language Processing
Conf Empir Method Nat Lang Process
EMNLP
