Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (2022-10-07)

TL;DR

Pix2Struct is a pretrained image-to-text model for purely visual language understanding that can be finetuned on tasks containing visually-situated language. It introduces a variable-resolution input representation and a more flexible integration of language and vision inputs.

Abstract

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
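
The abstract does not spell out how the variable-resolution input representation works. One plausible reading, sketched below under stated assumptions, is that the image is rescaled with its aspect ratio preserved so that as many fixed-size patches as possible fit within a patch budget. The function name `variable_resolution_grid` and the default values for `patch_size` and `max_patches` are hypothetical choices for illustration, not values taken from this page.

```python
import math


def variable_resolution_grid(height, width, patch_size=16, max_patches=2048):
    """Return (rows, cols) of the patch grid after an aspect-ratio-preserving rescale.

    Assumption: the budget is a maximum number of square patches, and the
    image is scaled uniformly so that rows * cols stays within that budget.
    """
    # Largest uniform scale s such that
    # (s * height / patch_size) * (s * width / patch_size) <= max_patches.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    # Round down to whole patches; keep at least one patch per axis.
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    # A square page and a very wide page get different grids under the same budget.
    for h, w in [(1024, 1024), (100, 1600)]:
        rows, cols = variable_resolution_grid(h, w)
        print(f"{h}x{w} -> {rows} x {cols} patches ({rows * cols} total)")
```

Under this sketch a wide screenshot is not squashed into a square: it receives more columns than rows, so the patch grid follows the original aspect ratio.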

Authors

Julian Martin Eisenschlos

Kristina Toutanova

Ming-Wei Chang
