3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in Visual Storytelling.
Use these libraries to find Visual Storytelling models and implementations.
Though automatic evaluation indicates only a slight performance boost over state-of-the-art (SOTA) methods in cloning expert behaviors, human evaluation shows that this approach achieves a significant improvement over SOTA systems in generating more human-like stories.
GLAC Net, a deep neural network model, is proposed that generates visual stories by combining global-local ("glocal") attention and context cascading mechanisms, and it achieves very competitive results compared to state-of-the-art techniques.
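As a rough illustration of the "glocal" context idea (a sketch, not GLAC Net's released code; module names and dimensions are assumptions), each image's local CNN feature can be concatenated with a story-level global context produced by a recurrent pass over the whole photo sequence:

```python
# Minimal sketch of a "glocal" context builder, assuming precomputed
# per-image CNN features; names and sizes are illustrative only.
import torch
import torch.nn as nn

class GlocalEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Bi-LSTM over the image sequence yields a story-level (global) context.
        self.seq_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Project the concatenation of local feature + global context.
        self.fuse = nn.Linear(feat_dim + 2 * hidden_dim, hidden_dim)

    def forward(self, img_feats):                  # (batch, 5, feat_dim)
        global_ctx, _ = self.seq_rnn(img_feats)    # (batch, 5, 2*hidden_dim)
        glocal = torch.cat([img_feats, global_ctx], dim=-1)
        return torch.tanh(self.fuse(glocal))       # one "glocal" vector per image

enc = GlocalEncoder()
story_feats = torch.randn(2, 5, 2048)              # 2 stories, 5 images each
print(enc(story_feats).shape)                      # torch.Size([2, 5, 512])
```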
A neural model for generating short stories from image sequences, which extends the image description model of Vinyals et al. (2015), showed competitive results on the METEOR metric and in human ratings in the internal track of the Visual Storytelling Challenge 2018.
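A minimal sketch of how such an extension might look, assuming precomputed image features and a placeholder vocabulary (the decoder priming and all sizes below are illustrative, not the paper's setup):

```python
# Rough sketch of extending a Show-and-Tell style captioner to an image
# *sequence*: the decoder is primed with each image's feature in turn.
import torch
import torch.nn as nn

class SequenceCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=2048, hidden_dim=512, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, emb_dim)   # image feature -> "word-like" input
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, img_feats, bos_id=1, eos_id=2, max_len=20):
        """img_feats: (5, feat_dim) for one photo sequence; returns 5 token lists."""
        sentences = []
        for feat in img_feats:                          # one sentence per image
            # Prime the LSTM with the projected image feature as the first input.
            _, state = self.lstm(self.img_proj(feat).view(1, 1, -1))
            tok, sent = bos_id, []
            for _ in range(max_len):
                emb = self.embed(torch.tensor([[tok]]))
                out, state = self.lstm(emb, state)
                tok = int(self.out(out[:, -1]).argmax(-1))
                if tok == eos_id:
                    break
                sent.append(tok)
            sentences.append(sent)
        return sentences

model = SequenceCaptioner()
print(model.greedy_decode(torch.randn(5, 2048)))       # 5 lists of token ids
```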
Visual storytelling and story comprehension are uniquely human skills that play a central role in how we learn about and experience the world. Despite remarkable progress in recent years in synthesis of visual and textual content in isolation and learning effective joint visual-linguistic representations, existing systems still operate only at a superficial, factual level. With the goal of developing systems that are able to comprehend rich human-generated narratives, and co-create new stories, we introduce AESOP: a new dataset that captures the creative process associated with visual storytelling. Visual panels are composed of clip-art objects with specific attributes enabling a broad range of creative expression. Using AESOP, we propose foundational storytelling tasks that are generative variants of story cloze tests, to better measure the creative and causal reasoning ability required for visual storytelling. We further develop a generalized story completion framework that models stories as the co-evolution of visual and textual concepts. We benchmark the proposed approach with human baselines and evaluate using comprehensive qualitative and quantitative metrics. Our results highlight key insights related to the dataset, modelling and evaluation of visual storytelling for future research in this promising field of study.
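To make the generative story-cloze setup concrete, a hypothetical example record might look as follows; the field names are illustrative and do not reflect the released AESOP schema:

```python
# Hypothetical representation of a generative story-cloze example in the spirit
# of AESOP; field names are assumptions, not the dataset's actual format.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ClipArtObject:
    name: str                                            # e.g. "boy", "dog"
    attributes: List[str] = field(default_factory=list)  # pose, expression, ...
    x: float = 0.0
    y: float = 0.0

@dataclass
class Panel:
    objects: List[ClipArtObject]
    text: Optional[str]              # None when this panel must be generated

story = [
    Panel([ClipArtObject("boy", ["smiling"], 0.2, 0.6)], "A boy walks into the park."),
    Panel([ClipArtObject("boy", ["surprised"], 0.5, 0.6),
           ClipArtObject("dog", ["running"], 0.7, 0.6)], "A dog runs up to him."),
    Panel([], None),                 # cloze slot: model must propose objects and text
]
```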
This work proposes an approach to identify discourse cues from the videos without the need to explicitly identify and annotate the scenes, and presents a novel dataset containing 310 videos and the corresponding discourse cues to evaluate the approach.
This work presents a commonsense-driven generative model, which aims to introduce crucial commonsense from an external knowledge base for visual storytelling by adopting an elaborately designed vision-aware directional encoding schema to effectively integrate the most informative commonsense.
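As a toy sketch of the underlying idea (assuming concept embeddings have already been retrieved from a ConceptNet-style knowledge base; the vision-aware directional encoding is reduced here to a simple attention-weighted fusion):

```python
# Toy sketch: weight retrieved commonsense concepts by how well they match the
# image feature (a crude stand-in for the paper's vision-aware encoding).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonsenseFusion(nn.Module):
    def __init__(self, feat_dim=2048, concept_dim=300, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, hidden_dim)
        self.cpt_proj = nn.Linear(concept_dim, hidden_dim)

    def forward(self, img_feat, concept_embs):
        """img_feat: (feat_dim,); concept_embs: (k, concept_dim) from an external KB."""
        q = self.img_proj(img_feat)                     # (hidden_dim,)
        keys = self.cpt_proj(concept_embs)              # (k, hidden_dim)
        attn = F.softmax(keys @ q, dim=0)               # relevance of each concept
        commonsense_ctx = attn @ keys                   # (hidden_dim,)
        return torch.cat([q, commonsense_ctx], dim=-1)  # vision + commonsense context

fusion = CommonsenseFusion()
ctx = fusion(torch.randn(2048), torch.randn(8, 300))    # 8 retrieved concepts
print(ctx.shape)                                        # torch.Size([1024])
```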
The first dataset for human edits of machine-generated visual stories is introduced and it is shown how a relatively small set of human edits can be leveraged to boost the performance of large visual storytelling models.
This work proposes a method to mine cross-modal rules that help the model infer informative concepts from a given visual input, and it leverages these concepts in an encoder-decoder framework with an attention mechanism.
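A toy illustration of what mining such cross-modal rules could look like, using simple support and confidence thresholds over visual-textual concept co-occurrences (a simplification of the paper's procedure; all names and thresholds are assumptions):

```python
# Toy cross-modal rule mining: visual concepts that frequently co-occur with a
# textual concept across training images yield a rule "see v -> mention t".
from collections import Counter, defaultdict

def mine_rules(pairs, min_support=2, min_conf=0.5):
    """pairs: list of (visual_concepts, textual_concepts) sets per training image."""
    vis_count = Counter()
    joint = defaultdict(Counter)
    for vis, txt in pairs:
        for v in vis:
            vis_count[v] += 1
            for t in txt:
                joint[v][t] += 1
    rules = {}
    for v, txt_counts in joint.items():
        for t, c in txt_counts.items():
            if c >= min_support and c / vis_count[v] >= min_conf:
                rules.setdefault(v, set()).add(t)   # rule: seeing v -> mention t
    return rules

data = [({"cake", "candles"}, {"birthday"}),
        ({"cake", "table"},   {"birthday", "party"}),
        ({"dog", "grass"},    {"play"})]
print(mine_rules(data))   # {'cake': {'birthday'}}
```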
This paper proposes three assessment criteria: relevance, coherence and expressiveness, which empirical analysis suggests constitute a “high-quality” story to the human eye, and it proposes a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria.
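Schematically, such a framework can combine the three reward terms in a policy-gradient loss; the scorers below are dummy placeholders, and ReCo-RL's actual reward definitions are considerably more involved:

```python
# Schematic REINFORCE step with a composite relevance/coherence/expressiveness
# reward. The scoring functions are dummies standing in for learned/heuristic scorers.
import torch

def relevance(story, images):   return torch.rand(())   # placeholder scorers
def coherence(story):           return torch.rand(())
def expressiveness(story):      return torch.rand(())

def reinforce_loss(log_probs, story, images, weights=(1.0, 1.0, 1.0)):
    """log_probs: (T,) log-probabilities of the sampled story's tokens."""
    w_r, w_c, w_e = weights
    reward = (w_r * relevance(story, images)
              + w_c * coherence(story)
              + w_e * expressiveness(story)).detach()
    # REINFORCE: scale the sample's negative log-likelihood by its reward.
    return -(reward * log_probs.sum())

log_probs = torch.log(torch.rand(12, requires_grad=True))  # stand-in for model outputs
loss = reinforce_loss(log_probs, story="...", images=None)
loss.backward()   # gradients flow into whatever produced log_probs
```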