The benchmark evaluates a generative Video Conversational Model with respect to Detail Orientation. We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and associated question-answer pairs from human annotations. We develop an evaluation pipeline using the GPT-3.5 model that assigns a relative score to the generated predictions on a scale of 1-5.
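The evaluation described above is essentially an LLM-as-judge loop. The sketch below shows one way such a scoring call could be written, assuming the OpenAI Python client and a gpt-3.5-turbo judge; the prompt wording and the helper name `score_detail_orientation` are illustrative, not the benchmark's official evaluation script.

```python
# Minimal sketch of a GPT-assisted detail-orientation scorer (1-5 scale),
# assuming the "openai" Python package (v1+ client) and a gpt-3.5-turbo judge.
# The prompt text and helper name are illustrative, not the official pipeline.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_detail_orientation(question: str, reference: str, prediction: str) -> int:
    """Ask the judge model to rate the prediction's detail orientation from 1 to 5."""
    system_prompt = (
        "You evaluate the detail orientation of answers from a video-based "
        "conversational model. Compare the predicted answer with the reference "
        "answer and rate its completeness and specificity on a scale of 1 to 5. "
        "Respond only with a JSON object of the form {\"score\": <int>}."
    )
    user_prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
    )
    return int(json.loads(response.choices[0].message.content)["score"])

# Example usage: average per-sample scores over the curated test set.
# scores = [score_detail_orientation(q, a, p) for q, a, p in test_pairs]
# mean_detail_score = sum(scores) / len(scores)
```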
These leaderboards are used to track progress in Video-based Generative Performance Benchmarking (Detail Orientation).
This work augments LLaMA-Adapter by unlocking more learnable parameters and proposes an early fusion strategy that feeds visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation and achieving strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
This work introduces Chat-UniVi, a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation that outperforms even existing methods exclusively designed for either images or videos.
A comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame, and develops a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data.
Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in the videos.
VideoChat is introduced, an end-to-end chat-centric video understanding system that integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference.
A multimodal model that merges a video-adapted visual encoder with an LLM, capable of understanding and generating detailed conversations about videos, and a quantitative evaluation framework for video-based dialogue models to objectively analyze their strengths and weaknesses.
MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos and 14K manual annotations for validating the effectiveness of the method.
The fine-grained temporal understanding of videos further enables VTimeLLM to beat existing Video LLMs on video dialogue benchmarks, showing its superior cross-modal understanding and reasoning abilities.
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding that not only considers visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and text components.