3260 papers • 126 benchmarks • 313 datasets
Video Question Answering (VideoQA) aims to answer natural language questions about videos: given a video and a question in natural language, a model must produce an accurate answer based on the video's content.
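As a concrete (and purely hypothetical) illustration of the task interface, a single VideoQA example pairs a video and a question with a target answer; the field names below are illustrative and not tied to any specific dataset.

```python
# Minimal sketch of the VideoQA task interface; field names are illustrative,
# not taken from any particular benchmark.
videoqa_example = {
    "video": "clips/cooking_001.mp4",   # path to the input video
    "question": "What does the person add to the pan after the onions?",
    "answer": "garlic",                 # free-form or multiple-choice answer
}

def answer_question(video_path: str, question: str) -> str:
    """A VideoQA model maps (video, question) -> answer text."""
    raise NotImplementedError  # placeholder for an actual model
```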
These leaderboards are used to track progress in Video Question Answering.
Use these libraries to find Video Question Answering models and implementations.
This work presents a convolution-free approach to video classification built exclusively on self-attention over space and time, which adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches.
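A minimal sketch of this divided space-time design, assuming illustrative tensor shapes rather than the paper's actual configuration: each patch token first attends across frames, then each frame's patches attend to one another.

```python
# Sketch of convolution-free video modeling with self-attention over space and
# time (in the spirit of divided space-time attention); sizes are illustrative.
import torch
import torch.nn as nn

class SpaceTimeAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim) of frame-level patch embeddings
        b, t, p, d = x.shape
        # Temporal attention: each spatial patch attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt = xt + self.temporal_attn(self.norm1(xt), self.norm1(xt), self.norm1(xt))[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Spatial attention: patches within each frame attend to one another.
        xs = x.reshape(b * t, p, d)
        xs = xs + self.spatial_attn(self.norm2(xs), self.norm2(xs), self.norm2(xs))[0]
        return xs.reshape(b, t, p, d)

frames = torch.randn(2, 8, 49, 256)   # 2 clips, 8 frames, 7x7 patch tokens
out = SpaceTimeAttentionBlock()(frames)
```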
This paper presents LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, and introduces GPT-4-generated visual instruction-tuning data; the model and code base are publicly available.
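A hedged sketch of this connection pattern, with illustrative dimensions and module names rather than LLaVA's released code: visual features from a frozen vision encoder are projected into the LLM's token-embedding space and concatenated with the text embeddings.

```python
# Sketch of connecting a vision encoder to an LLM via a learned projection;
# dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        # text_embeds:  (batch, num_tokens, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(vision_feats)
        return torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LLM

projector = VisionToLLMProjector()
fused = projector(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
```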
This work introduces Flamingo, a family of Visual Language Models (VLM) with the ability to bridge powerful pretrained vision-only and language-only models, handle sequences of arbitrarily interleaved visual and textual data, and seamlessly ingest images or videos as inputs.
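A rough sketch of the gated cross-attention idea used to bridge a frozen language model and visual features; the sizes and the zero-initialized tanh gate below are illustrative assumptions, not Flamingo's exact implementation.

```python
# Sketch of a gated cross-attention layer: frozen language-model activations
# attend to visual features, and a tanh gate initialized at zero leaves the
# pretrained LM unchanged at the start of training. Sizes are illustrative.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: no-op at init

    def forward(self, text_hidden, visual_feats):
        # text_hidden:  (batch, text_len, dim) from the frozen language model
        # visual_feats: (batch, num_visual_tokens, dim) from the vision side
        attended, _ = self.attn(text_hidden, visual_feats, visual_feats)
        return text_hidden + torch.tanh(self.gate) * attended

layer = GatedCrossAttention()
out = layer(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
```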
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
This paper uses the Expectation-Maximization algorithm to find a compact set of bases for the latent space, so that features can be concisely represented as linear combinations of these bases, resulting in increased representational power for the semantics of video-and-language representations.
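A toy sketch of such an EM procedure, with arbitrary shapes and iteration count: an E-step softly assigns features to bases, and an M-step re-estimates the bases as responsibility-weighted feature means.

```python
# Illustrative EM loop for estimating a compact set of bases so that features
# can be approximated as linear combinations of the bases.
import torch
import torch.nn.functional as F

def em_bases(features, num_bases=32, iters=3):
    # features: (num_tokens, dim); bases: (num_bases, dim)
    bases = F.normalize(torch.randn(num_bases, features.size(1)), dim=1)
    for _ in range(iters):
        # E-step: soft assignment of each feature to each basis.
        resp = F.softmax(features @ bases.t(), dim=1)          # (num_tokens, num_bases)
        # M-step: re-estimate bases from responsibility-weighted features.
        bases = resp.t() @ features                            # (num_bases, dim)
        bases = bases / (resp.sum(dim=0, keepdim=True).t() + 1e-6)
    recon = F.softmax(features @ bases.t(), dim=1) @ bases     # compact reconstruction
    return bases, recon

bases, recon = em_bases(torch.randn(200, 64))
```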
MiniGPT-4 is presented, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using a single projection layer, and shows that properly aligning visual features with an advanced large language model can yield the numerous advanced multi-modal abilities demonstrated by GPT-4.
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 shows new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released in https://github.com/alibaba/AliceMind.
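A conceptual sketch of such a modularized design, with stand-in modules and sizes chosen for illustration only: modality-specific encoders stay separate (limiting entanglement) while one shared universal module is reused across modalities (enabling collaboration).

```python
# Conceptual sketch of a modularized multi-modal model: per-modality encoders
# plus a shared universal module. All modules and sizes are stand-ins.
import torch
import torch.nn as nn

class ModularMultimodalModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.modality_encoders = nn.ModuleDict({
            "text":  nn.Linear(300, dim),   # stand-in for a text encoder
            "image": nn.Linear(512, dim),   # stand-in for an image encoder
            "video": nn.Linear(768, dim),   # stand-in for a video encoder
        })
        # Shared universal module reused by every modality.
        self.universal = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, inputs):
        # inputs: dict mapping modality name -> (batch, seq, feat_dim) tensor
        outputs = {}
        for name, feats in inputs.items():
            projected = self.modality_encoders[name](feats)
            outputs[name] = self.universal(projected)
        return outputs

model = ModularMultimodalModel()
out = model({"text": torch.randn(2, 10, 300), "video": torch.randn(2, 16, 768)})
```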
This paper models video and text as game players using multivariate cooperative game theory, in order to handle the uncertainty of fine-grained semantic interaction, with its diverse granularity, flexible combination, and vague intensity, and to achieve cooperative games at different semantic levels.
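As a generic illustration of the cooperative-game machinery involved (not the paper's actual formulation), the Banzhaf value below measures each player's average marginal contribution over all coalitions, with a toy characteristic function defined over frame-word pairs.

```python
# Generic Banzhaf value computation for a cooperative game; the players and
# the characteristic function here are purely illustrative.
from itertools import combinations

def banzhaf_values(players, value_fn):
    """Average marginal contribution of each player over all coalitions."""
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        contributions = []
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                s = set(coalition)
                contributions.append(value_fn(s | {p}) - value_fn(s))
        values[p] = sum(contributions) / len(contributions)
    return values

# Toy characteristic function: a coalition's value is the number of matched
# frame-word pairs it fully contains.
matched_pairs = {("frame1", "word_dog"), ("frame2", "word_run")}
def value_fn(coalition):
    return sum(1 for a, b in matched_pairs if a in coalition and b in coalition)

print(banzhaf_values(["frame1", "frame2", "word_dog", "word_run"], value_fn))
```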
This work unifies visual representation into the language feature space to advance the foundational LLM towards a unified LVLM, and establishes a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, with the two modalities mutually enhancing each other.
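A hedged sketch of the unification idea, with assumed dimensions rather than Video-LLaVA's released configuration: one shared projector maps both image features (a single frame) and video features (multiple frames) into the same language feature space.

```python
# Sketch of a shared projector for images and videos; dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

class SharedVisualProjector(nn.Module):
    def __init__(self, visual_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, visual_feats):
        # visual_feats: (batch, frames, patches, visual_dim); images use frames=1
        b, t, p, d = visual_feats.shape
        tokens = self.proj(visual_feats.reshape(b, t * p, d))
        return tokens  # (batch, frames * patches, llm_dim), ready for the LLM

projector = SharedVisualProjector()
image_tokens = projector(torch.randn(2, 1, 256, 1024))   # images
video_tokens = projector(torch.randn(2, 8, 256, 1024))   # videos
```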
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
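A minimal sketch of this style of baseline, under assumed vocabulary and feature sizes: a pooled image embedding and an RNN question embedding are fused directly, with no detection or segmentation stage, and classified over a fixed answer vocabulary.

```python
# Sketch of a simple visual-semantic-embedding QA baseline; sizes and the
# element-wise fusion are illustrative choices.
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=5000, num_answers=1000, dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.question_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.image_proj = nn.Linear(2048, dim)   # e.g. pooled CNN features
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image_feats, question_tokens):
        # image_feats: (batch, 2048); question_tokens: (batch, seq_len) word ids
        _, (h, _) = self.question_rnn(self.word_embed(question_tokens))
        joint = torch.tanh(self.image_proj(image_feats)) * torch.tanh(h[-1])
        return self.classifier(joint)            # logits over candidate answers

model = SimpleVQA()
logits = model(torch.randn(4, 2048), torch.randint(0, 5000, (4, 12)))
```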