Computational story understanding is a crucial but under-explored area of AI, hampered by a lack of suitable datasets. To address this, we collect, preprocess, and publicly release SYMON (Synopses of Movie Narratives), a new video-language dataset containing 5,193 human-narrated short movie summary videos sourced from YouTube. SYMON features naturalistic storytelling videos made by human creators for human audiences. Compared to existing movie story datasets, the videos in SYMON are shorter yet cover a higher proportion of key story events, making the dataset well suited to computational story understanding. We establish benchmarks on story video-text alignment and story video narration generation, demonstrating significant performance improvements when models are trained on SYMON. These results underscore the value of SYMON for advancing research in vision-language story understanding and generation.