3260 papers • 126 benchmarks • 313 datasets
This task is to generate video conditioned on a given sentence or sequence of words.
These leaderboards are used to track progress in Text-to-Video Generation.
Use these libraries to find Text-to-Video Generation models and implementations.
This work introduces motion vectors from compressed videos as an explicit control signal to provide guidance on temporal dynamics, and develops a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs.
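For intuition, a minimal PyTorch sketch of a condition-sequence encoder in this spirit: a per-frame convolution over condition maps such as motion-vector fields, followed by attention across frames. The class name, channel counts, and output layout are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ConditionSequenceEncoder(nn.Module):
    """Illustrative spatio-temporal condition encoder: a per-frame convolution over
    a sequence of condition maps (e.g. 2-channel motion-vector fields), followed by
    temporal attention across frames. Names, channel counts, and the output layout
    are assumptions for illustration, not the paper's released code."""

    def __init__(self, cond_ch: int = 2, dim: int = 320, heads: int = 8):
        super().__init__()
        self.spatial = nn.Conv2d(cond_ch, dim, kernel_size=3, stride=2, padding=1)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (batch, frames, cond_ch, H, W), e.g. motion vectors per frame
        b, f, c, h, w = cond.shape
        x = self.spatial(cond.reshape(b * f, c, h, w))          # per-frame features
        d, h2, w2 = x.shape[1], x.shape[2], x.shape[3]
        x = x.reshape(b, f, d, h2 * w2).permute(0, 3, 1, 2)     # (b, tokens, frames, dim)
        x = x.reshape(b * h2 * w2, f, d)
        x, _ = self.temporal(x, x, x)                           # mix information across frames
        return x.reshape(b, h2 * w2, f, d).permute(0, 2, 3, 1).reshape(b, f, d, h2, w2)

cond = torch.randn(1, 8, 2, 64, 64)                 # 8 frames of 2-channel motion vectors
print(ConditionSequenceEncoder()(cond).shape)       # torch.Size([1, 8, 320, 32, 32])
```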
This work proposes a new T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented, and introduces Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy.
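For intuition, a minimal PyTorch sketch of a sparse spatio-temporal attention of this flavor, in which each frame's queries attend to keys and values from the first and the previous frame; the class name and tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseFrameAttention(nn.Module):
    """Illustrative sparse spatio-temporal attention: each frame's queries attend to
    keys/values taken from the first and the previous frame. A simplified sketch,
    not the paper's released implementation."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) per-frame latent features
        b, f, n, d = x.shape
        first = x[:, :1].expand(-1, f, -1, -1)            # frame 0, repeated for every frame
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)    # previous frame (frame 0 reused at t=0)
        kv = torch.cat([first, prev], dim=2)              # (b, f, 2n, d)
        q = x.reshape(b * f, n, d)
        kv = kv.reshape(b * f, 2 * n, d)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, f, n, d)

x = torch.randn(2, 8, 256, 320)                  # 8 frames of 16x16 tokens, 320 channels
print(SparseFrameAttention(320)(x).shape)        # torch.Size([2, 8, 256, 320])
```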
The ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions and demonstrates superior performance over state-of-the-art methods across three evaluation metrics.
This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints, and can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$.
This work presents a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun, and makes substantial modifications to make the game richer by introducing audio and enabling new interactions.
Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
The Video LDM is validated on real driving videos of resolution $512 \times 1024$, achieving state-of-the-art performance, and the temporal layers trained in this way are shown to generalize to different fine-tuned text-to-image LDMs.
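A common recipe behind such results is to keep the pretrained image layers frozen and insert small trainable temporal layers between them; the sketch below illustrates one such block in PyTorch, with the stand-in spatial layer, the gating scheme, and all names being assumptions rather than the released architecture.

```python
import torch
import torch.nn as nn

class TemporalMixingBlock(nn.Module):
    """Illustrative block that keeps a (stand-in) pretrained per-frame layer frozen and
    adds a trainable temporal-attention layer over the frame axis through a gated
    residual. Layer choices, gating, and names are assumptions, not the released model."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)          # stand-in for a frozen image-LDM layer
        self.spatial.requires_grad_(False)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))    # zero at init: block starts as the image path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        h = self.spatial(x)
        # attend across frames independently for every spatial token
        t = h.permute(0, 2, 1, 3).reshape(b * n, f, d)
        t, _ = self.temporal(t, t, t)
        t = t.reshape(b, n, f, d).permute(0, 2, 1, 3)
        return h + torch.tanh(self.gate) * t        # temporal residual is gated in during training

x = torch.randn(1, 8, 64, 256)
print(TemporalMixingBlock(256)(x).shape)            # torch.Size([1, 8, 64, 256])
```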
This work shows that simple temporal self-attention, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data, and that joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes.
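A compact PyTorch sketch of temporal self-attention with rotary positional encoding applied along the frame axis is shown below; it is one plausible reading of that description, with illustrative names and shapes, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotary_embed(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Rotary position embedding over the last dim (assumed even).
    x: (..., seq, dim), pos: (seq,) integer positions (here, frame indices)."""
    dim = x.shape[-1]
    freqs = 10000 ** (-torch.arange(0, dim, 2, device=x.device, dtype=torch.float32) / dim)
    angles = pos.float()[:, None] * freqs[None, :]      # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class TemporalSelfAttention(nn.Module):
    """Temporal self-attention with rotary embedding on queries and keys, run along the
    frame axis for each spatial token. One plausible reading, not the released code."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial_tokens, frames, dim)
        b, f, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        pos = torch.arange(f, device=x.device)
        q, k = rotary_embed(q, pos), rotary_embed(k, pos)
        split = lambda t: t.reshape(b, f, self.heads, self.dh).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, f, d)
        return self.proj(out)

x = torch.randn(4, 16, 256)                      # 4 spatial tokens, 16 frames, 256 channels
print(TemporalSelfAttention(256)(x).shape)       # torch.Size([4, 16, 256])
```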
StyleCrafter is introduced, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style from a reference image; a scale-adaptive fusion module balances the influences of text-based content features and image-based style features.
This work proposes Latte, a novel Latent Diffusion Transformer that first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space and achieves state-of-the-art performance across four standard video generation datasets.
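As a rough illustration of the first stage described here, the PyTorch sketch below embeds a video latent into spatio-temporal patch tokens and runs Transformer blocks over them; the patch size, width, depth, and omissions (positional embeddings, the diffusion head, Latte's block variants) are simplifications, not the published configuration.

```python
import torch
import torch.nn as nn

class LatentVideoBackbone(nn.Module):
    """Illustrative first stage: embed a video latent into spatio-temporal patch tokens,
    then run standard Transformer encoder blocks over the full token sequence.
    Patch size, width, depth, and the omission of positional embeddings and the
    diffusion head are simplifications, not the published configuration."""

    def __init__(self, in_ch: int = 4, dim: int = 384, patch: int = 2, depth: int = 4):
        super().__init__()
        # one token per (frame, patch x patch) region of the latent
        self.embed = nn.Conv3d(in_ch, dim, kernel_size=(1, patch, patch),
                               stride=(1, patch, patch))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, dim_feedforward=dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, channels, frames, height, width), e.g. a VAE latent of a clip
        tok = self.embed(z)                        # (B, dim, T, H', W')
        tok = tok.flatten(2).transpose(1, 2)       # (B, T*H'*W', dim) spatio-temporal tokens
        return self.blocks(tok)

z = torch.randn(1, 4, 16, 32, 32)                 # 16-frame clip in latent space
print(LatentVideoBackbone()(z).shape)             # torch.Size([1, 4096, 384])
```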