[1] STAR: A Benchmark for Situated Reasoning in Real-World Videos
[2] LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
[3] MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
[4] EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
[5] MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
[6] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
[7] Llama 2: Open Foundation and Fine-Tuned Chat Models
[8] InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
[9] MMBench: Is Your Multi-modal Model an All-around Player?
[10] Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
[11] FunQA: Towards Surprising Video Comprehension
[12] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
[13] LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
[14] Valley: Video Assistant with Large Language model Enhanced abilitY
[15] Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
[16] M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
[17] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
[18] Paxion: Patching Action Knowledge in Video-Language Foundation Models
[19] Evaluating Object Hallucination in Large Vision-Language Models
[20] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
[21] Self-Chained Image-Language Model for Video Localization and Question Answering
[22] VideoChat: Chat-Centric Video Understanding
[23] MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
[24] A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
[25] Otter: A Multi-Modal Model With In-Context Instruction Tuning
[26] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
[27] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
[28] Visual Instruction Tuning
[29] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
[30] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
[31] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
[32] EVA-CLIP: Improved Training Techniques for CLIP at Scale
[33] PaLM-E: An Embodied Multimodal Language Model
[34] Language Is Not All You Need: Aligning Perception with Language Models
[35] LLaMA: Open and Efficient Foundation Language Models
[36] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
[37] HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
[38] MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
[39] InternVideo: General Video Foundation Models via Generative and Discriminative Learning
[40] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
[41] GLM-130B: An Open Bilingual Pre-trained Model
[42] Video Graph Transformer for Video Question Answering
[43] ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities
[44] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
[45] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
[46] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
[47] Flamingo: a Visual Language Model for Few-Shot Learning
[48] PaLM: Scaling Language Modeling with Pathways
[49] All in One: Exploring Unified Video-Language Pre-Training
[50] Chain of Thought Prompting Elicits Reasoning in Large Language Models
[51] Video as Conditional Graph Hierarchy for Multi-Granular Question Answering
[52] VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
[53] Ego4D: Around the World in 3,000 Hours of Egocentric Video
[54] Finetuned Language Models Are Zero-Shot Learners
[55] LoRA: Low-Rank Adaptation of Large Language Models
[56] NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
[57] Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
[58] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
[59] Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
[60] VisualMRC: Machine Reading Comprehension on Document Images
[61] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
[62] MovieNet: A Holistic Dataset for Movie Understanding
[63] DocVQA: A Dataset for VQA on Document Images
[64] Language Models are Few-Shot Learners
[65] Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
[66] TextCaps: a Dataset for Image Captioning with Reading Comprehension
[67] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[68] CLEVRER: CoLlision Events for Video REpresentation and Reasoning
[69] OCR-VQA: Visual Question Answering by Reading Text in Images
[70] ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
[71] Scene Text Visual Question Answering
[72] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
[73] NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
[74] Towards VQA Models That Can Read
[75] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
[76] TVQA: Localized, Compositional Video Question Answering
[77] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[78] Moments in Time Dataset: One Million Videos for Event Understanding
[79] Video Question Answering via Gradually Refined Attention over Appearance and Motion
[80] The “Something Something” Video Database for Learning and Evaluating Visual Common Sense
[81] TALL: Temporal Activity Localization via Language Query
[82] TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
[83] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
[84] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
[85] A Hierarchical Approach for Generating Descriptive Image Paragraphs
[86] Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
[87] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
[88] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
[89] Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
[90] Microsoft COCO: Common Objects in Context
[91] A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
[92] Im2Text: Describing Images Using 1 Million Captioned Photographs
[93] Collecting Highly Parallel Data for Paraphrase Evaluation
[94] ImageNet: A large-scale hierarchical image database
[95] GPT-4V(ision) System Card
[96] Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
[97] Perception Test: A Diagnostic Benchmark for Multimodal Video Models
[98] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[101] InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities
[102] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality