1. WorldSimBench: Towards Video Generation Models as World Simulators
2. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
3. VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
4. PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
5. Photorealistic Video Generation with Diffusion Models
6. VBench: Comprehensive Benchmark Suite for Video Generative Models
7. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
8. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
9. Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
10. Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
11. ModelScope Text-to-Video Technical Report
12. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
13. Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
14. Exploiting Diffusion Prior for Real-World Image Super-Resolution
15. VideoChat: Chat-Centric Video Understanding
16. LEO: Generative Latent Image Animator for Human Video Synthesis
17. Collaborative Diffusion for Multi-Modal Face Generation and Editing
18. Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
19. Text2Performer: Text-Driven Human Video Generation
20. VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
21. LLaMA: Open and Efficient Foundation Language Models
22. Adding Conditional Control to Text-to-Image Diffusion Models
23. Zero-shot Image-to-Image Translation
24. Reference-Based Image and Video Super-Resolution via $C^{2}$-Matching
25. Towards Smooth Video Composition
26. MagicVideo: Efficient Video Generation With Latent Diffusion Models
27. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
28. LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models
29. Imagen Video: High Definition Video Generation with Diffusion Models
30. Make-A-Video: Text-to-Video Generation without Text-Video Data
31. Towards Robust Blind Face Restoration with Codebook Lookup Transformer
32. Generating Long Videos of Dynamic Scenes
33. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
34. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
35. Hierarchical Text-Conditional Image Generation with CLIP Latents
37. Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
38. Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks
39. StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
40. High-Resolution Image Synthesis with Latent Diffusion Models
41. NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
42. Investigating Tradeoffs in Real-World Video Super-Resolution
43. LoRA: Low-Rank Adaptation of Large Language Models
44. Robust Reference-based Super-Resolution via $C^{2}$-Matching
45. A Good Image Generator Is What You Need for High-Resolution Video Synthesis
46. GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
47. BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment
48. RoFormer: Enhanced Transformer with Rotary Position Embedding
49. VideoGPT: Video Generation using VQ-VAE and Transformers
50. Image Super-Resolution via Iterative Refinement
51. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
52. Learning Transferable Visual Models From Natural Language Supervision
53. Zero-Shot Text-to-Image Generation
54. Improved Denoising Diffusion Probabilistic Models
55. The MSR-Video to Text Dataset with Clean Annotations
56. InMoDeGAN: Interpretable Motion Decomposition Generative Adversarial Network for Video Generation
57. Taming Transformers for High-Resolution Image Synthesis
58. Score-Based Generative Modeling through Stochastic Differential Equations
59. Denoising Diffusion Implicit Models
60. Denoising Diffusion Probabilistic Models
61. Long-Term Video Prediction via Criticization and Retrospection
62. Cross-Scale Internal Graph Neural Network for Image Super-Resolution
63. ImaGINator: Conditional Spatio-Temporal GAN for Video Generation
64. Disentangling Multiple Features in Video Sequences Using Gaussian Processes in Variational Autoencoders
65. G3AN: Disentangling Appearance and Motion for Video Generation
66. Analyzing and Improving the Image Quality of StyleGAN
67. Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns
68. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
69. Adversarial Video Generation on Complex Datasets
70. A Style-Based Generator Architecture for Generative Adversarial Networks
71. Large Scale GAN Training for High Fidelity Natural Image Synthesis
72. Disentangled Sequential Autoencoder
73. Neural Discrete Representation Learning
74. MoCoGAN: Decomposing Motion and Content for Video Generation
75. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
76. Temporal Generative Adversarial Nets with Singular Value Clipping
77. Generating Videos with Scene Dynamics
78. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
79. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
81. Null-text Inversion for Editing Real Images using Guided Diffusion Models
82. VDT: General-purpose Video Diffusion Transformers via Mask Modeling
83. Scalable Diffusion Models with Transformers
84. Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths
85. Learning to Generate Human Videos
86. Auto-Encoding Variational Bayes
87. Generative Adversarial Nets
88. Deep Learning, volume 1
89. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis