3260 papers • 126 benchmarks • 313 datasets
Story Visualization is the task of generating a coherent and aligned sequence of images from a sequence of textual captions that describe a story. It mainly comprises two settings: story generation and story continuation, where story continuation additionally receives ground-truth information in the form of the first frame.
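To make the distinction concrete, below is a minimal Python sketch of the two settings; the function names, placeholder arrays, and shapes are illustrative assumptions, not an established API.

```python
# Hypothetical interface sketch for the two settings described above; the function
# names and placeholder arrays are assumptions for illustration only.
from typing import List
import numpy as np

Image = np.ndarray  # e.g. an (H, W, 3) RGB array


def generate_story(captions: List[str]) -> List[Image]:
    """Story generation: every frame is synthesized from the captions alone."""
    # A real model would condition each frame on its caption and the story context.
    return [np.zeros((64, 64, 3), dtype=np.uint8) for _ in captions]


def continue_story(captions: List[str], first_frame: Image) -> List[Image]:
    """Story continuation: the ground-truth first frame is given as extra context."""
    # Generated frames should stay visually consistent with `first_frame`.
    return [first_frame.copy()] + [np.zeros_like(first_frame) for _ in captions[1:]]
```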
These leaderboards are used to track progress in Story Visualization.
Use these libraries to find Story Visualization models and implementations.
No subtasks available.
This work adapts a recent approach that augments VQ-VAE with a text-to-visual-token transformer architecture, which excels at preserving characters and produces higher-quality image sequences than strong baselines.
StoryImager enhances the storyboard generation ability inherited from a pre-trained text-to-image model for bidirectional generation and introduces a Target Frame Masking Strategy to extend and unify different story image generation tasks.
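A rough sketch of how such a per-frame target mask can unify these tasks is shown below; the helper name and mask convention are assumptions, not the StoryImager code.

```python
# Toy illustration of unifying story image generation tasks with a target-frame mask
# (an assumption-level sketch, not the StoryImager implementation).
from typing import List


def frame_mask(num_frames: int, known: List[int]) -> List[int]:
    """1 = frame is a generation target, 0 = frame is given as context."""
    return [0 if i in known else 1 for i in range(num_frames)]


print(frame_mask(5, known=[]))      # [1, 1, 1, 1, 1] -> story generation
print(frame_mask(5, known=[0]))     # [0, 1, 1, 1, 1] -> story continuation
print(frame_mask(5, known=[0, 4]))  # [0, 1, 1, 1, 0] -> in-filling between given frames
```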
We propose an end-to-end network for the visual illustration of a sequence of sentences forming a story. At the core of our model is the ability to model the inter-related nature of the sentences within a story, as well as the ability to learn coherence to support reference resolution. The framework takes the form of an encoder-decoder architecture, where sentences are encoded using a hierarchical two-level sentence-story GRU, combined with an encoding of coherence, and sequentially decoded using a predicted feature representation into a consistent illustrative image sequence. We optimize all parameters of our network end-to-end with respect to an order-embedding loss that encodes entailment between images and sentences. Experiments on the VIST storytelling dataset [9] highlight the importance of our algorithmic choices and the efficacy of our overall model.
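A minimal PyTorch sketch of such a two-level sentence-story encoder follows; the class name and dimensions are assumptions, and the coherence encoding and order-embedding loss are omitted for brevity.

```python
# Minimal sketch (not the authors' code) of a hierarchical two-level sentence-story
# GRU encoder: a word-level GRU encodes each sentence, and a story-level GRU runs
# over the sentence vectors to produce one conditioning feature per frame.
import torch
import torch.nn as nn


class HierarchicalStoryEncoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.sentence_gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.story_gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_sentences, num_words) word indices
        b, s, w = tokens.shape
        words = self.embed(tokens.view(b * s, w))       # (b*s, w, emb_dim)
        _, sent_h = self.sentence_gru(words)            # (1, b*s, hid_dim)
        sent_vecs = sent_h.squeeze(0).view(b, s, -1)    # (b, s, hid_dim)
        story_feats, _ = self.story_gru(sent_vecs)      # (b, s, hid_dim)
        return story_feats                              # one feature per frame to decode


enc = HierarchicalStoryEncoder(vocab_size=5000)
feats = enc(torch.randint(0, 5000, (2, 5, 12)))  # 2 stories, 5 sentences, 12 words each
print(feats.shape)  # torch.Size([2, 5, 256])
```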
A new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework, is proposed; it outperforms state-of-the-art models in image quality, contextual consistency metrics, and human evaluation.
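The sketch below illustrates the sequential conditional generation idea in the spirit of StoryGAN, with a recurrent story context feeding a per-frame generator; the toy linear decoder and all dimensions are assumptions, not the published architecture.

```python
# Illustrative sketch of sequential conditional generation (not the official StoryGAN
# code): a recurrent context encoder carries story state across frames, and each frame
# is generated from its caption embedding plus that state and a noise vector.
import torch
import torch.nn as nn


class SequentialGenerator(nn.Module):
    def __init__(self, text_dim: int = 128, ctx_dim: int = 128, noise_dim: int = 64):
        super().__init__()
        self.context_rnn = nn.GRUCell(text_dim, ctx_dim)
        self.to_image = nn.Sequential(                  # toy decoder to a 64x64 RGB frame
            nn.Linear(ctx_dim + noise_dim, 64 * 64 * 3), nn.Tanh()
        )
        self.noise_dim = noise_dim
        self.ctx_dim = ctx_dim

    def forward(self, caption_embs: torch.Tensor) -> torch.Tensor:
        # caption_embs: (batch, num_frames, text_dim)
        b, t, _ = caption_embs.shape
        h = torch.zeros(b, self.ctx_dim)
        frames = []
        for i in range(t):
            h = self.context_rnn(caption_embs[:, i], h)     # update story context
            z = torch.randn(b, self.noise_dim)
            frames.append(self.to_image(torch.cat([h, z], dim=1)).view(b, 3, 64, 64))
        return torch.stack(frames, dim=1)                   # (batch, num_frames, 3, 64, 64)


gen = SequentialGenerator()
imgs = gen(torch.randn(2, 5, 128))
print(imgs.shape)  # torch.Size([2, 5, 3, 64, 64])
```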
A number of improvements to prior modeling approaches are presented, including the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, a copy-transform mechanism for sequentially-consistent story visualization, and MART-based transformers to model complex interactions between frames.
A new sentence representation is introduced that incorporates word information from all story sentences to mitigate the inconsistency problem, and a new discriminator with fusion features is proposed to improve image quality and story consistency.
This work enhances, or 'retro-fits', pretrained text-to-image synthesis models with task-specific modules for story continuation and facilitates copying of visual elements from the source image, thereby improving continuity in the generated visual story.
This work proposes AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and previously generated images; it extends the text-conditioned approach to multimodal conditioning and can generalize to new characters through adaptation.
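The auto-regressive conditioning loop can be sketched at a high level as follows; `generate_frame` is a hypothetical stand-in for a latent diffusion sampler and is not part of AR-LDM's actual API.

```python
# High-level sketch (not the AR-LDM code) of auto-regressive multimodal conditioning:
# each new frame is generated given the full history of captions and the frames
# generated so far. `generate_frame` stands in for a latent diffusion sampler.
from typing import Callable, List, Sequence
import numpy as np

Image = np.ndarray


def generate_story_autoregressively(
    captions: Sequence[str],
    generate_frame: Callable[[Sequence[str], Sequence[Image]], Image],
) -> List[Image]:
    history_images: List[Image] = []
    for i, _ in enumerate(captions):
        # Condition on all captions up to and including the current one,
        # plus every previously generated frame (multimodal history).
        frame = generate_frame(captions[: i + 1], history_images)
        history_images.append(frame)
    return history_images


# Toy sampler used only to make the sketch runnable.
dummy_sampler = lambda caps, hist: np.zeros((64, 64, 3), dtype=np.uint8)
story = generate_story_autoregressively(["a", "b", "c"], dummy_sampler)
print(len(story))  # 3 frames, one per caption
```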