3260 papers • 126 benchmarks • 313 datasets
Video Question Answering (VideoQA) aims to answer natural language questions about videos: given a video and a question in natural language, a model must produce an accurate answer based on the video's content.
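As a concrete (and purely hypothetical) illustration of the task interface, a single VideoQA example pairs a video and a question with a target answer; the field names below are illustrative and not tied to any specific dataset.

```python
# Minimal sketch of the VideoQA task interface; field names are illustrative,
# not taken from any particular benchmark.
videoqa_example = {
    "video": "clips/cooking_001.mp4",   # path to the input video
    "question": "What does the person add to the pan after the onions?",
    "answer": "garlic",                 # free-form or multiple-choice answer
}

def answer_question(video_path: str, question: str) -> str:
    """A VideoQA model maps (video, question) -> answer text."""
    raise NotImplementedError  # placeholder for an actual model
```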
These leaderboards are used to track progress in Video Question Answering.
Use these libraries to find Video Question Answering models and implementations.
This work presents a convolution-free approach to video classification built exclusively on self-attention over space and time, which adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches.
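A minimal sketch of this divided space-time design, assuming illustrative tensor shapes rather than the paper's actual configuration: each patch token first attends across frames, then each frame's patches attend to one another.

```python
# Sketch of convolution-free video modeling with self-attention over space and
# time (in the spirit of divided space-time attention); sizes are illustrative.
import torch
import torch.nn as nn

class SpaceTimeAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim) of frame-level patch embeddings
        b, t, p, d = x.shape
        # Temporal attention: each spatial patch attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt = xt + self.temporal_attn(self.norm1(xt), self.norm1(xt), self.norm1(xt))[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Spatial attention: patches within each frame attend to one another.
        xs = x.reshape(b * t, p, d)
        xs = xs + self.spatial_attn(self.norm2(xs), self.norm2(xs), self.norm2(xs))[0]
        return xs.reshape(b, t, p, d)

frames = torch.randn(2, 8, 49, 256)   # 2 clips, 8 frames, 7x7 patch tokens
out = SpaceTimeAttentionBlock()(frames)
```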
This paper presents LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, and introduces GPT-4-generated visual instruction-tuning data; the model and code base are publicly available.
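A hedged sketch of this connection pattern, with illustrative dimensions and module names rather than LLaVA's released code: visual features from a frozen vision encoder are projected into the LLM's token-embedding space and concatenated with the text embeddings.

```python
# Sketch of connecting a vision encoder to an LLM via a learned projection;
# dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        # text_embeds:  (batch, num_tokens, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(vision_feats)
        return torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LLM

projector = VisionToLLMProjector()
fused = projector(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
```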
This work introduces Flamingo, a family of Visual Language Models (VLM) with the ability to bridge powerful pretrained vision-only and language-only models, handle sequences of arbitrarily interleaved visual and textual data, and seamlessly ingest images or videos as inputs.
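A rough sketch of the gated cross-attention idea used to bridge a frozen language model and visual features; the sizes and the zero-initialized tanh gate below are illustrative assumptions, not Flamingo's exact implementation.

```python
# Sketch of a gated cross-attention layer: frozen language-model activations
# attend to visual features, and a tanh gate initialized at zero leaves the
# pretrained LM unchanged at the start of training. Sizes are illustrative.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: no-op at init

    def forward(self, text_hidden, visual_feats):
        # text_hidden:  (batch, text_len, dim) from the frozen language model
        # visual_feats: (batch, num_visual_tokens, dim) from the vision side
        attended, _ = self.attn(text_hidden, visual_feats, visual_feats)
        return text_hidden + torch.tanh(self.gate) * attended

layer = GatedCrossAttention()
out = layer(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
```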
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
This paper uses the Expectation-Maximization algorithm to find a compact set of bases for the latent space, so that features can be concisely represented as linear combinations of these bases, resulting in increased representational power for the semantics of video-and-language representations.
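A toy sketch of such an EM procedure, with arbitrary shapes and iteration count: an E-step softly assigns features to bases, and an M-step re-estimates the bases as responsibility-weighted feature means.

```python
# Illustrative EM loop for estimating a compact set of bases so that features
# can be approximated as linear combinations of the bases.
import torch
import torch.nn.functional as F

def em_bases(features, num_bases=32, iters=3):
    # features: (num_tokens, dim); bases: (num_bases, dim)
    bases = F.normalize(torch.randn(num_bases, features.size(1)), dim=1)
    for _ in range(iters):
        # E-step: soft assignment of each feature to each basis.
        resp = F.softmax(features @ bases.t(), dim=1)          # (num_tokens, num_bases)
        # M-step: re-estimate bases from responsibility-weighted features.
        bases = resp.t() @ features                            # (num_bases, dim)
        bases = bases / (resp.sum(dim=0, keepdim=True).t() + 1e-6)
    recon = F.softmax(features @ bases.t(), dim=1) @ bases     # compact reconstruction
    return bases, recon

bases, recon = em_bases(torch.randn(200, 64))
```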
MiniGPT-4 is presented, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using a single projection layer, and shows that properly aligning visual features with an advanced large language model can yield the numerous advanced multi-modal abilities demonstrated by GPT-4.
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released in https://github.com/alibaba/AliceMind.
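A conceptual sketch of such a modularized design, with stand-in modules and sizes chosen for illustration only: modality-specific encoders stay separate (limiting entanglement) while one shared universal module is reused across modalities (enabling collaboration).

```python
# Conceptual sketch of a modularized multi-modal model: per-modality encoders
# plus a shared universal module. All modules and sizes are stand-ins.
import torch
import torch.nn as nn

class ModularMultimodalModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.modality_encoders = nn.ModuleDict({
            "text":  nn.Linear(300, dim),   # stand-in for a text encoder
            "image": nn.Linear(512, dim),   # stand-in for an image encoder
            "video": nn.Linear(768, dim),   # stand-in for a video encoder
        })
        # Shared universal module reused by every modality.
        self.universal = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (batch, seq, feat_dim) tensor
        outputs = {}
        for name, feats in inputs.items():
            projected = self.modality_encoders[name](feats)
            outputs[name] = self.universal(projected)
        return outputs

model = ModularMultimodalModel()
out = model({"text": torch.randn(2, 10, 300), "video": torch.randn(2, 16, 768)})
```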
This paper models video and text as game players using multivariate cooperative game theory, in order to handle the uncertainty of fine-grained semantic interaction, with its diverse granularity, flexible combination, and vague intensity, and to achieve cooperative games at different semantic levels.
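As a generic illustration of the cooperative-game machinery involved (not the paper's actual formulation), the Banzhaf value below measures each player's average marginal contribution over all coalitions, with a toy characteristic function defined over frame-word pairs.

```python
# Generic Banzhaf value computation for a cooperative game; the players and
# the characteristic function here are purely illustrative.
from itertools import combinations

def banzhaf_values(players, value_fn):
    """Average marginal contribution of each player over all coalitions."""
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        contributions = []
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                s = set(coalition)
                contributions.append(value_fn(s | {p}) - value_fn(s))
        values[p] = sum(contributions) / len(contributions)
    return values

# Toy characteristic function: a coalition's value is the number of matched
# frame-word pairs it fully contains.
matched_pairs = {("frame1", "word_dog"), ("frame2", "word_run")}
def value_fn(coalition):
    return sum(1 for a, b in matched_pairs if a in coalition and b in coalition)

print(banzhaf_values(["frame1", "frame2", "word_dog", "word_run"], value_fn))
```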
This work unifies visual representation into the language feature space to advance the foundational LLM towards a unified LVLM, and establishes a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, with the two modalities mutually enhancing each other.
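A hedged sketch of the unification idea, with assumed dimensions rather than Video-LLaVA's released configuration: one shared projector maps both image features (a single frame) and video features (multiple frames) into the same language feature space.

```python
# Sketch of a shared projector for images and videos; dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

class SharedVisualProjector(nn.Module):
    def __init__(self, visual_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, visual_feats):
        # visual_feats: (batch, frames, patches, visual_dim); images use frames=1
        b, t, p, d = visual_feats.shape
        tokens = self.proj(visual_feats.reshape(b, t * p, d))
        return tokens  # (batch, frames * patches, llm_dim), ready for the LLM

projector = SharedVisualProjector()
image_tokens = projector(torch.randn(2, 1, 256, 1024))   # images
video_tokens = projector(torch.randn(2, 8, 256, 1024))   # videos
```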
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
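A minimal sketch of this style of baseline, under assumed vocabulary and feature sizes: a pooled image embedding and an RNN question embedding are fused directly, with no detection or segmentation stage, and classified over a fixed answer vocabulary.

```python
# Sketch of a simple visual-semantic-embedding QA baseline; sizes and the
# element-wise fusion are illustrative choices.
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=5000, num_answers=1000, dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.question_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.image_proj = nn.Linear(2048, dim)   # e.g. pooled CNN features
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image_feats, question_tokens):
        # image_feats: (batch, 2048); question_tokens: (batch, seq_len) word ids
        _, (h, _) = self.question_rnn(self.word_embed(question_tokens))
        joint = torch.tanh(self.image_proj(image_feats)) * torch.tanh(h[-1])
        return self.classifier(joint)            # logits over candidate answers

model = SimpleVQA()
logits = model(torch.randn(4, 2048), torch.randint(0, 5000, (4, 12)))
```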