3260 papers • 126 benchmarks • 313 datasets
Text-to-Image Generation is a task in computer vision and natural language processing where the goal is to generate an image that corresponds to a given textual description. This involves converting the text input into a meaningful representation, such as a feature vector, and then using this representation to generate an image that matches the description.
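A minimal sketch of this text-to-representation-to-image pipeline, using the Hugging Face `diffusers` library; the checkpoint id below is one popular public model and is an assumption, not something this page prescribes:

```python
# Minimal text-to-image sketch with diffusers (assumed public checkpoint).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed model id; any SD checkpoint works
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # assumes a GPU is available

# The pipeline encodes the prompt into a text embedding, then iteratively
# denoises a latent and decodes it into an image matching the description.
image = pipe("a red bicycle leaning against a brick wall").images[0]
image.save("bicycle.png")
```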
These leaderboards are used to track progress in Text-to-Image Generation.
Use these libraries to find Text-to-Image Generation models and implementations.
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation, and can be used to generate natural sentences describing an image.
A novel deep architecture and GAN formulation are developed to effectively bridge advances in text and image modeling, translating visual concepts from characters to pixels.
These latent diffusion models achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based diffusion models.
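A toy sketch of the latent-diffusion idea: run the expensive denoising loop in a compressed latent space and decode to pixels once at the end. The modules below are stand-ins, not the paper's networks, and all shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Stand-in autoencoder: 8x spatial compression, as in typical LDM setups.
enc = nn.Conv2d(3, 4, kernel_size=8, stride=8)        # used at training time
dec = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # placeholder for a U-Net

# Training would encode images once: enc(img) -> (1, 4, 64, 64) latents.
z = torch.randn(1, 4, 64, 64)       # sampling starts from latent-space noise
for t in range(50):                 # each step costs 64x64x4, not 512x512x3
    z = z - 0.02 * denoiser(z)      # placeholder update; real LDMs predict
                                    # noise with a U-Net and use a scheduler
image = dec(z)                      # one decode back to pixel space
print(image.shape)                  # torch.Size([1, 3, 512, 512])
```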
This paper proposes Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions and introduces a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold.
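A minimal sketch of Conditioning Augmentation as summarized above: instead of conditioning the generator on a fixed text embedding, sample the condition from a Gaussian whose mean and log-variance are predicted from the embedding. Layer sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CondAugment(nn.Module):
    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)  # predicts mu and log-var

    def forward(self, text_emb):
        mu, logvar = self.fc(text_emb).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        c = mu + eps * torch.exp(0.5 * logvar)  # reparameterized sample
        # A KL penalty against N(0, I) keeps the conditioning manifold smooth.
        kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1, dim=-1)
        return c, kl.mean()

c, kl = CondAugment()(torch.randn(4, 1024))
print(c.shape, kl.item())  # torch.Size([4, 128]) and a scalar KL term
```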
An Attentional Generative Adversarial Network that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation and for the first time shows that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image.
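A toy sketch of the word-level attention idea: each spatial location of an intermediate image feature map attends over the word embeddings, so different regions can be driven by different words. All dimensions are assumptions:

```python
import torch

words = torch.randn(1, 12, 256)            # 12 word embeddings, 256-d
img_feats = torch.randn(1, 64 * 64, 256)   # flattened 64x64 feature map, 256-d

# Similarity between every pixel and every word, normalized over words.
attn = torch.softmax(img_feats @ words.transpose(1, 2), dim=-1)  # (1, 4096, 12)
word_context = attn @ words                # per-pixel mixture of word vectors
print(word_context.shape)                  # torch.Size([1, 4096, 256])
```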
It is demonstrated how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.
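A toy sketch of that CNN + transformer split: a convolutional encoder compresses the image, nearest-neighbor lookup in a learned codebook turns the result into discrete tokens, and a transformer then models the token sequence. Codebook size and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 64, kernel_size=16, stride=16)  # CNN inductive bias
codebook = torch.randn(512, 64)                        # 512 learned codes

feats = encoder(torch.randn(1, 3, 256, 256))           # (1, 64, 16, 16)
flat = feats.flatten(2).transpose(1, 2)                # (1, 256, 64)
dists = torch.cdist(flat, codebook.unsqueeze(0))       # distance to each code
tokens = dists.argmin(dim=-1)                          # (1, 256) discrete ids
print(tokens.shape)  # a 16x16 grid of code indices for a transformer to model
```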
This work uses only 3-5 images of a user-provided concept to represent it through new words in the embedding space of a frozen text-to-image model, and finds evidence that a single word embedding is sufficient for capturing unique and varied concepts.
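A minimal sketch of that idea: freeze the generative model and optimize a single new word embedding so it captures the user-provided concept. The objective below is a placeholder, not the paper's loss, and the embedding width is an assumption:

```python
import torch

emb_dim = 768                                        # assumed text-encoder width
new_word = torch.randn(emb_dim, requires_grad=True)  # the only trainable tensor
opt = torch.optim.Adam([new_word], lr=5e-3)

for step in range(100):
    # In the real method, `new_word` replaces a placeholder token's embedding
    # and the loss is the frozen model's denoising loss on the 3-5 example
    # images; a squared-error placeholder stands in for that objective here.
    target = torch.zeros(emb_dim)
    loss = ((new_word - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```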
This work describes a simple approach based on a transformer that autoregressively models the text and image tokens as a single stream of data that is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
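A toy sketch of modeling text and image tokens as a single stream: both share one vocabulary and sequence, and a causal transformer predicts each next token. Vocabulary sizes and model width are assumptions:

```python
import torch
import torch.nn as nn

text_vocab, image_vocab = 1000, 512
embed = nn.Embedding(text_vocab + image_vocab, 128)   # one shared token table
layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
head = nn.Linear(128, text_vocab + image_vocab)

text_tokens = torch.randint(0, text_vocab, (1, 16))
image_tokens = torch.randint(0, image_vocab, (1, 64)) + text_vocab  # offset ids
stream = torch.cat([text_tokens, image_tokens], dim=1)  # one 80-token stream

causal = nn.Transformer.generate_square_subsequent_mask(stream.size(1))
h = layer(embed(stream), src_mask=causal)  # causal attention over the stream
logits = head(h)                           # next-token prediction per position
print(logits.shape)                        # torch.Size([1, 80, 1512])
```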
It is shown that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity, and the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.
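A minimal sketch of the joint embedding space this summary relies on: CLIP maps text and images into the same space, so captions can score (or, in unCLIP, condition) images. This uses the `transformers` CLIP classes; the checkpoint id is one public model and is an assumption:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder image
inputs = processor(text=["a photo of a dog", "a photo of a cat"],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine-similarity logits between the image and each caption; a decoder
# conditioned on these embeddings inherits this language-image alignment.
print(out.logits_per_image.softmax(dim=-1))
```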