This work proposes Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. It augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering.
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain and task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.
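To make the encode-process-decode idea behind this querying mechanism concrete, the sketch below is a minimal, hypothetical NumPy illustration (not the authors' implementation): a fixed-size latent array cross-attends to an arbitrarily large input, is processed by attention among the latents only, and is then read out by task-specific output queries, one per desired output element. Names such as `num_latents` and `output_queries`, the sizes used, and the omission of learned projections and MLP blocks are all simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # queries:     [M, d] array that attends over keys_values
    # keys_values: [N, d] array being attended to
    # Real attention layers also use learned projections and MLPs; omitted here.
    scores = queries @ keys_values.T / np.sqrt(d)   # [M, N]
    return softmax(scores, axis=-1) @ keys_values   # [M, d]

rng = np.random.default_rng(0)
d = 64                 # channel width (assumed)
num_inputs = 10_000    # arbitrary-size input array (e.g. pixels, bytes)
num_latents = 256      # fixed latent bottleneck
num_outputs = 500      # arbitrary-size output array, set by the task

inputs = rng.normal(size=(num_inputs, d))
latents = rng.normal(size=(num_latents, d))         # learned in practice
output_queries = rng.normal(size=(num_outputs, d))  # task-specific queries

# Encode: latents cross-attend to inputs, cost O(num_inputs * num_latents),
# i.e. linear in the input size.
z = cross_attention(latents, inputs, d)

# Process: attention among the latents only; cost independent of input size.
for _ in range(4):
    z = cross_attention(z, z, d)

# Decode: each output query cross-attends to the latents, so the output can
# take any size and semantics; cost O(num_outputs * num_latents).
outputs = cross_attention(output_queries, z, d)
print(outputs.shape)  # (500, 64)
```

Because inputs and outputs only ever interact with the fixed latent array, compute scales linearly in both, which is what lets the same architecture serve tasks as different as language understanding and dense optical flow.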
Authors: Andrew Zisserman, M. Botvinick, Evan Shelhamer, João Carreira, Carl Doersch, Skanda Koppula, Andrew Jaegle, Catalin Ionescu, David Ding, Andrew Brock, Olivier J. Hénaff