[1] Larger language models do in-context learning differently
[2] PaLM-E: An Embodied Multimodal Language Model
[3] Language Is Not All You Need: Aligning Perception with Language Models
[4] Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
[5] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
[6] Scaling Vision Transformers to 22 Billion Parameters
[7] DePlot: One-shot visual language reasoning by plot-to-table translation
[8] Structured Prompting: Scaling In-Context Learning to 1,000 Examples
[9] VindLU: A Recipe for Effective Video-and-Language Pretraining
[10] Unifying Vision, Text, and Layout for Universal Document Processing
[11] Underspecification in Scene Description-to-Depiction Tasks
[12] Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
[13] Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
[14] PaLI: A Jointly-Scaled Multilingual Language-Image Model
[15] PreSTU: Pre-Training for Scene-Text Understanding
[16] Pre-training image-language transformers for open-vocabulary tasks
[17] Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
[18] Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
[19] GIT: A Generative Image-to-text Transformer for Vision and Language
[20] Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset
[21] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
[22] Simple Open-Vocabulary Object Detection with Vision Transformers
[23] UL2: Unifying Language Learning Paradigms
[24] All You May Need for VQA are Image Captions
[25] Flamingo: a Visual Language Model for Few-Shot Learning
[26] PaLM: Scaling Language Modeling with Pathways
[27] ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
[28] A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
[29] Fairness Indicators for Systematic Assessments of Visual Feature Extractors
[30] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
[31] End-to-end Generative Pretraining for Multimodal Video Captioning
[32] LaTr: Layout-Aware Transformer for Scene-Text VQA
[33] RegionCLIP: Region-based Language-Image Pretraining
[34] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
[35] Scaling Language Models: Methods, Analysis & Insights from Training Gopher
[36] Meta-learning via Language Model In-context Tuning
[37] Vector-quantized Image Modeling with Improved VQGAN
[38] Pix2seq: A Language Modeling Framework for Object Detection
[39] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
[40] End-to-End Dense Video Captioning with Parallel Decoding
[41] Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
[42] Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model
[43] NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
[44] Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
[45] A Step Toward More Inclusive People Annotations for Fairness
[46] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
[47] FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation
[48] Uncovering the Bias in Facial Expressions
[49] mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
[50] Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
[51] DocVQA: A Dataset for VQA on Document Images
[52] The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
[53] Are we done with ImageNet?
[54] Language Models are Few-Shot Learners
[55] Revisiting Modulated Convolutions for Visual Counting and Beyond
[56] TextCaps: a Dataset for Image Captioning with Reading Comprehension
[57] Captioning Images Taken by People Who Are Blind
[58] Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation
[59] OCR-VQA: Visual Question Answering by Reading Text in Images
[60] Natural Adversarial Examples
[61] Does Object Recognition Work for Everyone?
[62] ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
[63] Annotating Objects and Relations in User-Generated Videos
[64] LVIS: A Dataset for Large Vocabulary Instance Segmentation
[65] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
[66] Scene Text Visual Question Answering
[67] Learning Robust Global Representations by Penalizing Local Predictive Power
[68] Towards VQA Models That Can Read
[69] VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
[70] Do ImageNet Classifiers Generalize to ImageNet?
[71] TallyQA: Answering Complex Counting Questions
[72] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
[73] Gender Bias in Coreference Resolution
[74] Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems
[75] Women also Snowboard: Overcoming Bias in Captioning Models
[76] VizWiz Grand Challenge: Answering Visual Questions from Blind People
[77] Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
[78] Video Question Answering via Gradually Refined Attention over Appearance and Motion
[79] Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
[80] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
[81] Semantics derived automatically from language corpora contain human-like biases
[82] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
[83] A Diagram is Worth a Dozen Images
[84] ActivityNet: A large-scale video benchmark for human activity understanding
[85] Deep visual-semantic alignments for generating image descriptions
[86] Deep Learning Face Attributes in the Wild
[87] A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input
[88] Fairness through awareness
[89] ImageNet: A large-scale hierarchical image database
[90] PaLM 2 Technical Report
[91] nocaps: novel object captioning at scale
[92] ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
[94] The Misgendering Machines: Trans/HCI Implications of Automatic Gender Recognition
[95] Dense-Captioning Events in Videos
[96] A Survey on In-context Learning