Image to Video Generation refers to the task of generating a sequence of video frames from a single still image or a small set of still images. The goal is to produce a video that is coherent in appearance, motion, and style, and temporally consistent, so that the generated frames read as a smoothly ordered sequence. The task is typically tackled with deep generative models, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs), trained on large video datasets. These models learn to generate plausible video frames conditioned on the input image and, optionally, on auxiliary signals such as an audio or text track.
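For intuition, here is a minimal sketch of how such conditioning can be wired up: a hypothetical encoder-recurrent-decoder model (the class name, layer sizes, and 64x64 resolution are illustrative assumptions, not any published architecture) encodes the still image, evolves a latent state over time, and decodes one frame per step.

```python
# Minimal sketch of image-conditioned video generation (hypothetical model,
# not a specific published architecture): encode the still image, unroll a
# recurrent latent over T steps, and decode each step into a frame.
import torch
import torch.nn as nn

class Image2Video(nn.Module):
    def __init__(self, latent_dim=128, frames=16):
        super().__init__()
        self.frames = frames
        # Image encoder: 64x64 RGB image -> latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Recurrent cell that evolves the latent state across time
        self.rnn = nn.GRUCell(latent_dim, latent_dim)
        # Frame decoder: latent vector -> 64x64 RGB frame
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, image):
        h = self.encoder(image)              # (B, latent_dim) initial state
        z = torch.randn_like(h)              # noise drives motion diversity
        frames = []
        for _ in range(self.frames):
            h = self.rnn(z, h)               # evolve the latent over time
            frames.append(self.decoder(h))   # decode one frame
        return torch.stack(frames, dim=1)    # (B, T, 3, 64, 64)

video = Image2Video()(torch.randn(2, 3, 64, 64))
print(video.shape)  # torch.Size([2, 16, 3, 64, 64])
```

In practice such a model would be trained with an adversarial or reconstruction loss over real video clips; the point of the sketch is only the image-conditioned, frame-by-frame rollout.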
These leaderboards are used to track progress in Image to Video Generation.
No benchmarks available.
Use these libraries to find Image to Video Generation models and implementations.
No datasets available.
No subtasks available.
This paper presents Collaborative Neural Rendering (CoNR), a method that creates new images for specified poses from a few reference images (also known as character sheets), and introduces a character sheet dataset of over 700,000 hand-drawn and synthesized images of diverse poses to facilitate research in this area.
A conditional VAE (cVAE) that predicts optical flow is employed as an intermediate step for generating a video sequence conditioned on a single initial frame, and a semantic label map is integrated into the flow prediction module, yielding substantial improvements in image-to-video generation.
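The core "predict a flow field, then warp the reference frame" step can be illustrated with generic backward warping via torch.nn.functional.grid_sample; the sketch below is an assumed helper (warp_with_flow), not the paper's cVAE or semantic-map components.

```python
# Backward warping of a frame with a dense optical flow field (generic
# technique, not the paper's specific flow-prediction network).
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """frame: (B, 3, H, W); flow: (B, 2, H, W) displacement in pixels."""
    b, _, h, w = frame.shape
    # Base sampling grid of pixel coordinates (x, y)
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype),
        torch.arange(w, dtype=frame.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
    # Displace the grid by the predicted flow, then normalize to [-1, 1]
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, align_corners=True)

frame = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)  # zero flow reproduces the input frame
assert torch.allclose(warp_with_flow(frame, flow), frame, atol=1e-5)
```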
This work proposes a novel multi-domain image-to-image generative adversarial network architecture, whose learned latent space models a continuous bi-directional aging process.
This paper identifies and evaluates three stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning, and demonstrates the necessity of both a well-curated pretraining dataset for generating high-quality videos and a systematic curation process for training a strong base model.
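The three-stage curriculum can be written down as a simple training schedule. The sketch below uses placeholder dataset descriptions and a hypothetical train_stage callable; it is not the paper's actual recipe or hyperparameters.

```python
# Illustrative three-stage schedule for training a video latent diffusion
# model; stage names follow the summary above, everything else is a placeholder.
STAGES = [
    {"name": "text-to-image pretraining", "data": "large image-text corpus",
     "temporal_layers": False},   # image-only: temporal blocks inactive
    {"name": "video pretraining", "data": "large curated video corpus",
     "temporal_layers": True},
    {"name": "high-quality video finetuning", "data": "small high-quality video set",
     "temporal_layers": True},
]

def run_schedule(model, train_stage):
    """model: an LDM with spatial and temporal blocks; train_stage: a callable
    that runs one training stage (both hypothetical)."""
    for stage in STAGES:
        print(f"{stage['name']}: data={stage['data']}, "
              f"temporal_layers={stage['temporal_layers']}")
        train_stage(model, stage)
```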
This paper proposes a practical framework, named Follow-Your-Click, that achieves image animation from a simple user click and a short motion prompt, offering simpler yet more precise user control and better generation performance than previous methods.
This work trains networks to learn the residual motion between the current and future frames, which avoids learning motion-irrelevant details, and proposes a two-stage generation framework in which videos are generated from structures and then refined by temporal signals.
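A minimal sketch of the residual idea, assuming a toy convolutional predictor rather than the paper's two-stage architecture: the network outputs only the change between frames, and the next frame is recovered by adding that change back to the input.

```python
# Toy residual-motion predictor: learn only the frame-to-frame change so the
# network never has to re-synthesize static, motion-irrelevant content.
import torch
import torch.nn as nn

class ResidualPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, current_frame):
        residual = self.net(current_frame)   # predicted motion residual
        return current_frame + residual      # reconstructed next frame

next_frame = ResidualPredictor()(torch.rand(1, 3, 64, 64))
# Training would minimize a loss such as ||next_frame - true_next_frame||.
print(next_frame.shape)  # torch.Size([1, 3, 64, 64])
```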
A Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor (MA) structure that stores appearance-motion aligned representations is proposed to model the uncertainty and increase the diversity of the text-image-to-video (TI2V) task.
Fueled by recent progress in neural radiance fields (NeRF), SceneRF is proposed, a self-supervised monocular scene reconstruction method trained using only posed image sequences, which outperforms all baselines for novel depth view synthesis and scene reconstruction.
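The depth-synthesis part rests on standard NeRF-style volume rendering: densities sampled along a camera ray are converted to weights, and the expected depth is the weighted sum of sample distances. The sketch below shows only that generic computation; SceneRF's probabilistic ray sampling and spherical decoder are not reproduced.

```python
# Generic NeRF-style expected-depth rendering along one ray.
import torch

def render_depth(t, sigma):
    """t: (N,) sample distances along a ray; sigma: (N,) predicted densities."""
    delta = torch.diff(t, append=t[-1:] + 1e10)      # spacing between samples
    alpha = 1.0 - torch.exp(-sigma * delta)          # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                # accumulated transmittance
    weights = alpha * trans
    return (weights * t).sum()                       # expected ray depth

t = torch.linspace(0.5, 10.0, 64)
sigma = torch.zeros(64)
sigma[40] = 50.0                                     # a single opaque surface
print(render_depth(t, sigma), t[40])                 # rendered depth ~= t[40]
```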
This paper proposes an approach using novel latent flow diffusion models (LFDM) that synthesize an optical flow sequence in the latent space, based on the given condition, to warp the given image, and shows that LFDM can be easily adapted to new domains by simply finetuning the image decoder.
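The domain-adaptation claim amounts to freezing the flow/diffusion components and finetuning only the image decoder. A minimal sketch, assuming hypothetical module names (flow_diffusion, image_decoder) rather than the actual LFDM code:

```python
# Decoder-only finetuning: freeze everything, then unfreeze just the image
# decoder and hand its parameters to the optimizer (module names are assumed).
import torch
import torch.nn as nn

class DummyLFDM(nn.Module):
    """Stand-in with the two parts that matter here (hypothetical names)."""
    def __init__(self):
        super().__init__()
        self.flow_diffusion = nn.Linear(8, 8)   # frozen backbone
        self.image_decoder = nn.Linear(8, 8)    # finetuned for the new domain

def decoder_only_finetune(model, lr=1e-4):
    for p in model.parameters():
        p.requires_grad = False                 # freeze flow/diffusion backbone
    for p in model.image_decoder.parameters():
        p.requires_grad = True                  # train only the image decoder
    return torch.optim.Adam(model.image_decoder.parameters(), lr=lr)

optimizer = decoder_only_finetune(DummyLFDM())
```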