MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning (2023-11-15)

TL;DR

This work introduces MMC-Instruction, a large-scale MultiModal Chart Instruction dataset of 600k instances covering diverse tasks and chart types, and proposes an instruction-tuning methodology and a benchmark to advance multimodal understanding of charts.

Abstract

With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instruction (MMC-Instruction) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we develop MultiModal Chart Assistant (MMCA), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose a MultiModal Chart Benchmark (MMC-Benchmark), a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts. Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs on correctly interpreting charts, even for the most recent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding of charts. Code and data are available at https://github.com/FuxiaoLiu/MMC.

Authors

Wenlin Yao

Jianshu Chen

Kaiqiang Song
