1
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
2
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
3
Monkey: Image Resolution and Text Label are Important Things for Large Multi-Modal Models
4
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
5
CogVLM: Visual Expert for Pretrained Language Models
6
Improved Baselines with Visual Instruction Tuning
8
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
9
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
10
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
11
Kosmos-2: Grounding Multimodal Large Language Models to the World
12
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
13
OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
14
LVLM-EHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
15
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
16
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
17
PandaGPT: One Model To Instruction-Follow Them All
18
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
19
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
20
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
21
Otter: A Multi-Modal Model With In-Context Instruction Tuning
22
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
23
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
24
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
25
Visual Instruction Tuning
26
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
28
LLaMA: Open and Efficient Foundation Language Models
29
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
31
Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks
32
Visual Spatial Reasoning
33
Flamingo: a Visual Language Model for Few-Shot Learning
34
Training language models to follow instructions with human feedback
35
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
36
LoRA: Low-Rank Adaptation of Large Language Models
37
GLM: General Language Model Pretraining with Autoregressive Blank Infilling
38
Language Models are Few-Shot Learners
39
KonIQ-10k: An Ecologically Valid Database for Deep Learning of Blind Image Quality Assessment
40
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
41
Towards VQA Models That Can Read
42
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
43
Places: A 10 Million Image Database for Scene Recognition
44
VizWiz Grand Challenge: Answering Visual Questions from Blind People
45
Dual-Glance Model for Deciphering Social Relationships
46
ShapeWorld - A new test methodology for multimodal language understanding
47
Towards Automatic Learning of Procedures From Web Instructional Videos
49
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
50
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
51
Microsoft COCO Captions: Data Collection and Evaluation Server
52
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
53
Précis of Bayesian Rationality: The Probabilistic Approach to Human Reasoning
54
Vicuna: An Open-source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
55
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
56
nocaps: novel object captioning at scale
57
Language Models are Unsupervised Multitask Learners
59
Gemini: A Family of Highly Capable Multimodal Models
61
MiniCPM: Unveiling the Potential of End-side Large Language Models
62
InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities
66
OpenCompass: A Universal Evaluation Platform for Foundation Models
67
OmniLMM: Large Multi-modal Models for Strong Performance and Efficient Deployment