Computational story understanding is a crucial but under-explored area of AI, hampered by a lack of suitable datasets. To address this, we collect, preprocess, and publicly release SYMON (Synopses of Movie Narratives), a new video-language dataset containing 5,193 human-narrated short movie summary videos sourced from YouTube. SYMON features naturalistic storytelling videos made by human creators for human audiences. Compared to existing movie story datasets, the videos in SYMON are shorter yet cover a higher proportion of key story events, making the dataset well suited to computational story understanding. We establish benchmarks on story video-text alignment and story video narration generation, demonstrating significant performance improvements when models are trained on SYMON. These results underscore the value of SYMON for advancing research in vision-language story understanding and generation.