This work unifies the discrete diffusion process for multimodal signals by proposing a unified transition matrix, and designs a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages that are vital for multi-modality generation.
The recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling multimodal signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on various generation tasks.
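To make the idea of a unified transition matrix concrete, below is a minimal sketch of one plausible construction, assuming a D3PM/VQ-Diffusion-style parameterisation in which each step keeps a token with probability alpha_t, resamples it uniformly within its own modality's vocabulary with probability beta_t, and moves it to a shared absorbing [MASK] state with probability gamma_t. The function name, the block layout, and the per-step probabilities are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def unified_transition_matrix(n_text, n_image, alpha_t, beta_t, gamma_t):
    """Return Q_t of shape (V, V), with V = n_text + n_image + 1 ([MASK] last).

    Q_t[i, j] = p(x_t = j | x_{t-1} = i). Tokens diffuse only inside their own
    modality block or into the shared [MASK] state, which is absorbing.
    """
    V = n_text + n_image + 1
    mask_id = V - 1
    Q = np.zeros((V, V))

    # Text block: keep with alpha_t, spread beta_t uniformly over text tokens.
    Q[:n_text, :n_text] = beta_t / n_text
    Q[np.arange(n_text), np.arange(n_text)] += alpha_t

    # Image block: same structure, confined to the image token ids.
    img = slice(n_text, n_text + n_image)
    Q[img, img] = beta_t / n_image
    idx = np.arange(n_text, n_text + n_image)
    Q[idx, idx] += alpha_t

    # Shared absorbing [MASK] state links the two modalities.
    Q[:mask_id, mask_id] = gamma_t
    Q[mask_id, mask_id] = 1.0
    return Q

# Rows sum to 1 when alpha_t + beta_t + gamma_t == 1.
Q = unified_transition_matrix(n_text=100, n_image=512,
                              alpha_t=0.9, beta_t=0.05, gamma_t=0.05)
assert np.allclose(Q.sum(axis=1), 1.0)
```

The key point of such a blockwise layout is that a single matrix covers both vocabularies while preventing cross-modal token swaps during corruption; the shared [MASK] state is what couples the two modalities in the reverse (denoising) process.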
Authors: Dacheng Tao, Heliang Zheng, Tat-Jen Cham, Chuanxia Zheng, Zuopeng Yang