This work describes a simple transformer-based approach that autoregressively models text and image tokens as a single stream of data and is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
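As a rough illustration of this single-stream formulation, the sketch below (in PyTorch) concatenates text token IDs with discrete image token IDs offset into a joint vocabulary and trains a causally masked transformer with ordinary next-token cross-entropy. The vocabulary sizes, sequence lengths, and model dimensions here are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: modeling text + image tokens as one autoregressive stream.
# All sizes below are assumed for illustration.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # assumed: BPE text codes + discrete image codes
TEXT_LEN, IMAGE_LEN = 256, 1024         # assumed context lengths
VOCAB = TEXT_VOCAB + IMAGE_VOCAB        # image codes are offset into a joint vocabulary
SEQ_LEN = TEXT_LEN + IMAGE_LEN

class JointAutoregressiveTransformer(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq) int64
        seq = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # (batch, seq, VOCAB) logits

# One training step: text tokens followed by offset image tokens form a single
# stream; the loss is plain next-token cross-entropy over that stream.
model = JointAutoregressiveTransformer()
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
image = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN)) + TEXT_VOCAB
stream = torch.cat([text, image], dim=1)
logits = model(stream[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), stream[:, 1:].reshape(-1))
loss.backward()
```

Because text and image tokens share one sequence and one loss, no task-specific architecture or auxiliary objective is needed; in this sketch the image tokens stand in for codes from a separately trained discrete image tokenizer.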
A. Ramesh, Gabriel Goh, Chelsea Voss, Mark Chen, Mikhail Pavlov