Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels (2023-12-28)

TL;DR

The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks under the original LMM structure, and unifies the three tasks into one model, termed OneAlign.

Abstract

The explosion of visual content available online underscores the need for an accurate machine assessor that can robustly score diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) in a wide range of related fields, in this work we explore how to teach them to produce visual ratings aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks under the original LMM structure. With this syllabus, we further unify the three tasks into one model, termed OneAlign. Our experiments demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.
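The idea of training on text-defined levels rather than raw scores can be illustrated with a minimal sketch. The abstract does not spell out the level names, the binning rule, or how a level prediction is turned back into a number, so everything below is an assumption made for illustration: five level names in the style of common subjective-study scales, equal-width bins over the score range, and a softmax-weighted average over the level logits at inference. The function names `score_to_level` and `levels_to_score` are hypothetical and are not taken from the released code.

```python
import math

# Assumed level names (a five-point scale common in subjective studies);
# the abstract itself does not list them.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def score_to_level(score, lo, hi):
    """Map a score in [lo, hi] to one of len(LEVELS) equal-width bins.

    Equal-width binning is an assumption of this sketch.
    """
    if not lo <= score <= hi:
        raise ValueError("score outside [lo, hi]")
    width = (hi - lo) / len(LEVELS)
    # Clamp so that score == hi falls into the top bin.
    index = min(int((score - lo) / width), len(LEVELS) - 1)
    return LEVELS[index]


def levels_to_score(level_logits):
    """Convert per-level logits into a scalar score in [1, 5].

    Applies a softmax restricted to the level words, then takes the
    expected value with weights 1 (bad) through 5 (excellent).
    """
    peak = max(level_logits[name] for name in LEVELS)
    exps = [math.exp(level_logits[name] - peak) for name in LEVELS]
    total = sum(exps)
    return sum((i + 1) * e / total for i, e in enumerate(exps))


if __name__ == "__main__":
    # Training side: a mean opinion score becomes a text label.
    print(score_to_level(3.0, lo=1.0, hi=5.0))
    # Inference side: logits over the level words become a score.
    print(levels_to_score({"bad": -2.0, "poor": -1.0, "fair": 0.5,
                           "good": 2.0, "excellent": 1.0}))
```

Under these assumptions, the model only ever sees and emits level words during training, while the weighted average recovers a continuous score for comparison with human ratings.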

Authors

Chunyi Li

Zicheng Zhang

Haoning Wu
