This task presents zero-shot question-answering results on the TGIF-QA dataset for LLM-powered video conversational models.
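As a rough illustration of how such zero-shot numbers are produced, the sketch below runs a video conversational model over TGIF-QA-style question-answer pairs without any task-specific training. The annotation format and the `model.answer()` interface are assumptions for illustration, and reported results typically rely on an LLM-assisted judge rather than the simple string match used here.

```python
# Minimal sketch of a zero-shot VideoQA evaluation loop on TGIF-QA-style data.
# Assumptions (not from this page): the annotation file is a JSON list of
# {"video": ..., "question": ..., "answer": ...} records, and `model.answer()`
# is a hypothetical interface for the video conversational model under test.
import json

def evaluate_zero_shot(model, annotation_path: str) -> float:
    with open(annotation_path) as f:
        samples = json.load(f)

    correct = 0
    for sample in samples:
        prediction = model.answer(sample["video"], sample["question"])
        # Simplified correctness check: ground-truth answer contained in the prediction.
        # Published leaderboard results usually score predictions with an LLM judge instead.
        if sample["answer"].lower() in prediction.lower():
            correct += 1
    return correct / len(samples)
```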
These leaderboards are used to track progress in Zero-Shot Video Question Answering.
Use these libraries to find Zero-Shot Video Question Answering models and implementations.
No subtasks available.
This work introduces Flamingo, a family of Visual Language Models (VLM) with the ability to bridge powerful pretrained vision-only and language-only models, handle sequences of arbitrarily interleaved visual and textual data, and seamlessly ingest images or videos as inputs.
This work unifies visual representations in the language feature space to advance the foundational LLM towards a unified LVLM, and establishes a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, with the two modalities mutually enhancing each other.
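As a loose sketch of what unifying visual representations in the language feature space can look like in code, the snippet below projects features from a shared image/video encoder into an LLM's token-embedding space. The dimensions and the two-layer MLP projector are illustrative assumptions, not the exact Video-LLaVA configuration.

```python
# Sketch: project unified visual features into an LLM's embedding space so they can be
# interleaved with text token embeddings. Dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_visual_tokens, vision_dim) from a shared
        # image/video encoder; the output lives in the LLM embedding space.
        return self.proj(visual_features)

# Example: 8 video frames encoded into 256 tokens each, projected for a 4096-dim LLM.
tokens = VisualProjector()(torch.randn(1, 8 * 256, 1024))
```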
This work introduces MVBench, a comprehensive multi-modal video understanding benchmark covering 20 challenging video tasks that cannot be effectively solved with a single frame, and develops a robust video MLLM baseline, VideoChat2, through progressive multi-modal training with diverse instruction-tuning data.
This work introduces Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency, which leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost.
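As a small illustration of the sliding window attention idea, the sketch below builds a causal mask in which each token attends only to itself and a fixed number of preceding tokens. The window size used here is illustrative, not the model's actual setting.

```python
# Minimal sketch of a causal sliding-window attention mask: each query token may attend
# only to itself and the previous `window - 1` tokens. Window size is illustrative.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True marks key positions a query token is allowed to attend to.
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]   # query index minus key index
    return (rel >= 0) & (rel < window)  # causal and within the window

mask = sliding_window_causal_mask(seq_len=8, window=4)
print(mask.int())
```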
This work augments LLaMA-Adapter by unlocking more learnable parameters and proposes an early fusion strategy that feeds visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation, and achieves strong multi-modal reasoning with only small-scale image-text and instruction data.
This work examines the design options for VLM pre-training by augmenting LLM towards VLM through step-by-step controllable comparisons, and builds VILA, a Visual Language model family that consistently outperforms the state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles.
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA, named TGIF-QA, that extends existing VQA work with these new tasks.
MVB (Multi View Baggage) is the first publicly released large-scale baggage dataset, containing 4519 baggage identities and 22660 annotated baggage images together with their surface material labels, and it exhibits remarkable inter-class similarity and intra-class dissimilarity.
This work builds on frozen bidirectional language models (BiLM) and shows that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA and demonstrates competitive performance in the few-shot and fully-supervised setting.
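To illustrate the masked-language-model idea behind answering with a frozen bidirectional LM, the sketch below turns a question into a cloze prompt and reads the answer off the [MASK] position, restricted to a small candidate answer set. A text-only BERT is used as a stand-in, and the candidate answers are hypothetical; the actual approach additionally conditions the frozen LM on video features, which is not shown here.

```python
# Sketch: zero-shot answering by mask filling with a frozen bidirectional LM.
# Uses a plain text-only BERT as a stand-in; video conditioning is omitted.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

answer_vocab = ["cat", "dog", "guitar", "piano"]  # illustrative candidate answers
prompt = "Question: what animal is playing? Answer: [MASK]."

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and score only the candidate answers at that position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
answer_ids = [tokenizer.convert_tokens_to_ids(a) for a in answer_vocab]
best = answer_vocab[int(torch.argmax(logits[0, mask_pos, answer_ids]))]
print(best)
```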
Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.