3260 papers • 126 benchmarks • 313 datasets
This task is to generate video conditioned on a given sentence or sequence of words.
These leaderboards are used to track progress in Text-to-Video Generation.
Use these libraries to find Text-to-Video Generation models and implementations.
This work introduces motion vectors from compressed videos as an explicit control signal to provide guidance on temporal dynamics, and develops a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs.
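For intuition, a minimal PyTorch sketch of a condition-sequence encoder in this spirit: a per-frame convolution over condition maps such as motion-vector fields, followed by attention across frames. The class name, channel counts, and output layout are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ConditionSequenceEncoder(nn.Module):
    """Illustrative spatio-temporal condition encoder: a per-frame convolution over
    a sequence of condition maps (e.g. 2-channel motion-vector fields), followed by
    temporal attention across frames. Names, channel counts, and the output layout
    are assumptions for illustration, not the paper's released code."""

    def __init__(self, cond_ch: int = 2, dim: int = 320, heads: int = 8):
        super().__init__()
        self.spatial = nn.Conv2d(cond_ch, dim, kernel_size=3, stride=2, padding=1)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (batch, frames, cond_ch, H, W), e.g. motion vectors per frame
        b, f, c, h, w = cond.shape
        x = self.spatial(cond.reshape(b * f, c, h, w))          # per-frame features
        d, h2, w2 = x.shape[1], x.shape[2], x.shape[3]
        x = x.reshape(b, f, d, h2 * w2).permute(0, 3, 1, 2)     # (b, tokens, frames, dim)
        x = x.reshape(b * h2 * w2, f, d)
        x, _ = self.temporal(x, x, x)                           # mix information across frames
        return x.reshape(b, h2 * w2, f, d).permute(0, 2, 3, 1).reshape(b, f, d, h2, w2)

cond = torch.randn(1, 8, 2, 64, 64)                 # 8 frames of 2-channel motion vectors
print(ConditionSequenceEncoder()(cond).shape)       # torch.Size([1, 8, 320, 32, 32])
```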
This work proposes a new T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented, and introduces Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy.
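For intuition, a minimal PyTorch sketch of a sparse spatio-temporal attention of this flavor, in which each frame's queries attend to keys and values from the first and the previous frame; the class name and tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseFrameAttention(nn.Module):
    """Illustrative sparse spatio-temporal attention: each frame's queries attend to
    keys/values taken from the first and the previous frame. A simplified sketch,
    not the paper's released implementation."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) per-frame latent features
        b, f, n, d = x.shape
        first = x[:, :1].expand(-1, f, -1, -1)            # frame 0, repeated for every frame
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)    # previous frame (frame 0 reused at t=0)
        kv = torch.cat([first, prev], dim=2)              # (b, f, 2n, d)
        q = x.reshape(b * f, n, d)
        kv = kv.reshape(b * f, 2 * n, d)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, f, n, d)

x = torch.randn(2, 8, 256, 320)                  # 8 frames of 16x16 tokens, 320 channels
print(SparseFrameAttention(320)(x).shape)        # torch.Size([2, 8, 256, 320])
```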
The ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions and demonstrates superior performance over state-of-the-art methods across three evaluation metrics.
This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints, and can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$.
This work presents a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun, and makes substantial modifications to make the game richer by introducing audio and enabling new interactions.
Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
The Video LDM is validated on real driving videos of resolution $512 \times 1024$, achieving state-of-the-art performance, and the temporal layers trained in this way are shown to generalize to different fine-tuned text-to-image LDMs.
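A common recipe behind such results is to keep the pretrained image layers frozen and insert small trainable temporal layers between them; the sketch below illustrates one such block in PyTorch, with the stand-in spatial layer, the gating scheme, and all names being assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn

class TemporalMixingBlock(nn.Module):
    """Illustrative block that keeps a (stand-in) pretrained per-frame layer frozen and
    adds a trainable temporal-attention layer over the frame axis through a gated
    residual. Layer choices, gating, and names are assumptions, not the released model."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)          # stand-in for a frozen image-LDM layer
        self.spatial.requires_grad_(False)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))    # zero at init: block starts as the image path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        h = self.spatial(x)
        # attend across frames independently for every spatial token
        t = h.permute(0, 2, 1, 3).reshape(b * n, f, d)
        t, _ = self.temporal(t, t, t)
        t = t.reshape(b, n, f, d).permute(0, 2, 1, 3)
        return h + torch.tanh(self.gate) * t        # temporal residual is gated in during training

x = torch.randn(1, 8, 64, 256)
print(TemporalMixingBlock(256)(x).shape)            # torch.Size([1, 8, 64, 256])
```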
This work shows that simple temporal self-attention, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data, and that joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes.
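A compact PyTorch sketch of temporal self-attention with rotary positional encoding applied along the frame axis is shown below; it is one plausible reading of that description, with illustrative names and shapes, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotary_embed(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Rotary position embedding over the last dim (assumed even).
    x: (..., seq, dim), pos: (seq,) integer positions (here, frame indices)."""
    dim = x.shape[-1]
    freqs = 10000 ** (-torch.arange(0, dim, 2, device=x.device, dtype=torch.float32) / dim)
    angles = pos.float()[:, None] * freqs[None, :]      # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class TemporalSelfAttention(nn.Module):
    """Temporal self-attention with rotary embedding on queries and keys, run along the
    frame axis for each spatial token. One plausible reading, not the released code."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial_tokens, frames, dim)
        b, f, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        pos = torch.arange(f, device=x.device)
        q, k = rotary_embed(q, pos), rotary_embed(k, pos)
        split = lambda t: t.reshape(b, f, self.heads, self.dh).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, f, d)
        return self.proj(out)

x = torch.randn(4, 16, 256)                      # 4 spatial tokens, 16 frames, 256 channels
print(TemporalSelfAttention(256)(x).shape)       # torch.Size([4, 16, 256])
```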
StyleCrafter is introduced, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style from a reference image; a scale-adaptive fusion module balances the influences of text-based content features and image-based style features.
This work proposes Latte, a novel Latent Diffusion Transformer that first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space and achieves state-of-the-art performance across four standard video generation datasets.
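As a rough illustration of the first stage described here, the PyTorch sketch below embeds a video latent into spatio-temporal patch tokens and runs Transformer blocks over them; the patch size, width, depth, and omissions (positional embeddings, the diffusion head, Latte's block variants) are simplifications, not the published configuration.

```python
import torch
import torch.nn as nn

class LatentVideoBackbone(nn.Module):
    """Illustrative first stage: embed a video latent into spatio-temporal patch tokens,
    then run standard Transformer encoder blocks over the full token sequence.
    Patch size, width, depth, and the omission of positional embeddings and the
    diffusion head are simplifications, not the published configuration."""

    def __init__(self, in_ch: int = 4, dim: int = 384, patch: int = 2, depth: int = 4):
        super().__init__()
        # one token per (frame, patch x patch) region of the latent
        self.embed = nn.Conv3d(in_ch, dim, kernel_size=(1, patch, patch),
                               stride=(1, patch, patch))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, dim_feedforward=dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, channels, frames, height, width), e.g. a VAE latent of a clip
        tok = self.embed(z)                        # (B, dim, T, H', W')
        tok = tok.flatten(2).transpose(1, 2)       # (B, T*H'*W', dim) spatio-temporal tokens
        return self.blocks(tok)

z = torch.randn(1, 4, 16, 32, 32)                 # 16-frame clip in latent space
print(LatentVideoBackbone()(z).shape)             # torch.Size([1, 4096, 384])
```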