3260 papers • 126 benchmarks • 313 datasets
Reasoning over multimodal inputs.
(Image credit: Papersgraph)
These leaderboards are used to track progress in Multimodal Reasoning.
This paper presents a data collection effort to correct the class with the highest error rate in SNLI-VE, re-evaluates an existing model on the corrected corpus, called SNLI-VE-2.0, and introduces e-SNLI-VE-2.0, which appends human-written natural language explanations to SNLI-VE-2.0.
This work proposes Dual Attention Networks (DANs), which jointly leverage visual and textual attention mechanisms to capture the fine-grained interplay between vision and language, and introduces two variants of DANs for multimodal reasoning and matching, respectively.
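As a rough illustration of the joint attention idea above, the following is a minimal PyTorch sketch (not the paper's implementation) of a single dual-attention step: one attention over visual region features and one over textual token features, both conditioned on a shared memory vector. All dimensions, module names, and the fusion step are assumptions made for the example.

```python
# Illustrative sketch of a dual-attention step: visual and textual attention
# conditioned on a shared memory vector (dimensions are arbitrary choices).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionStep(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.vis_score = nn.Linear(dim, 1)   # scores visual regions
        self.txt_score = nn.Linear(dim, 1)   # scores textual tokens
        self.vis_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)
        self.mem_proj = nn.Linear(dim, dim)

    def attend(self, feats, memory, proj, score):
        # feats: (batch, n, dim); memory: (batch, dim)
        h = torch.tanh(proj(feats) + self.mem_proj(memory).unsqueeze(1))
        alpha = F.softmax(score(h).squeeze(-1), dim=-1)        # (batch, n)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)  # weighted context

    def forward(self, visual_feats, text_feats, memory):
        v_ctx = self.attend(visual_feats, memory, self.vis_proj, self.vis_score)
        t_ctx = self.attend(text_feats, memory, self.txt_proj, self.txt_score)
        return memory + v_ctx * t_ctx  # joint update of the shared memory

# toy usage
step = DualAttentionStep(dim=64)
v = torch.randn(2, 36, 64)   # e.g. 36 region features
t = torch.randn(2, 20, 64)   # e.g. 20 token features
m = torch.randn(2, 64)
m = step(v, t, m)
```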
This work introduces WEBQA, a challenging new benchmark that proves difficult for large-scale state-of-the-art models, which lack language-groundable visual representations for novel objects and the ability to reason, yet trivial for humans.
This work proposes MarT, a novel model-agnostic multimodal analogical reasoning framework with Transformer, motivated by structure-mapping theory, which achieves better performance than baseline approaches.
Graph-of-Thought reasoning is proposed, which models human thought processes not only as a chain but also as a graph; it achieves significant improvement over the strong CoT baseline on the AQUA-RAT test set, boosting accuracy from 85.19% to 87.59% with the T5-base model.
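To make the chain-versus-graph distinction concrete, here is a small, purely illustrative Python sketch (not the paper's pipeline) in which reasoning steps are nodes of a dependency graph and each step is resolved only after all of its parents; the step names are hypothetical.

```python
# Illustrative graph-of-thought structure: reasoning steps form a dependency
# graph rather than a single chain, and are processed in topological order.
from collections import deque

def topological_order(edges, nodes):
    """Return nodes in an order that respects the dependency edges (u -> v)."""
    indeg = {n: 0 for n in nodes}
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order

# Hypothetical thought nodes and their dependencies (a graph, not just a chain):
nodes = ["read question", "extract quantities", "recall formula", "compute", "answer"]
edges = [("read question", "extract quantities"),
         ("read question", "recall formula"),
         ("extract quantities", "compute"),
         ("recall formula", "compute"),
         ("compute", "answer")]

for step in topological_order(edges, nodes):
    print("resolve:", step)   # each step sees the results of all its parents
```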
This paper presents a new dataset, AlgoPuzzleVQA, designed to challenge and evaluate the capabilities of multimodal language models in solving algorithmic puzzles that necessitate visual understanding, language understanding, and complex algorithmic reasoning.
A systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities; the authors hope these findings shed light on the limitations of large multimodal models and how they can better emulate human cognitive processes in the future.
This work proposes DMRM, a novel and more powerful Dual-channel Multi-hop Reasoning Model for Visual Dialog, which synchronously captures information from the dialog history and the image to enrich the semantic representation of the question through dual-channel reasoning.
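The following is a minimal sketch of the dual-channel, multi-hop idea described above, not DMRM's actual architecture: one channel attends over dialog-history features and one over image features, and the question representation is refined over several hops. Layer sizes, names, and the fusion step are illustrative assumptions.

```python
# Illustrative dual-channel, multi-hop question enrichment (assumed dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Channel(nn.Module):
    """Attends over a set of context features conditioned on the current query."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.c_proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, query, context):
        # query: (B, D); context: (B, N, D)
        h = torch.tanh(self.c_proj(context) + self.q_proj(query).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=-1)
        return torch.bmm(alpha.unsqueeze(1), context).squeeze(1)

class DualChannelMultiHop(nn.Module):
    def __init__(self, dim, hops=2):
        super().__init__()
        self.hops = hops
        self.history_channel = Channel(dim)
        self.image_channel = Channel(dim)
        self.update = nn.Linear(3 * dim, dim)

    def forward(self, question, history_feats, image_feats):
        q = question
        for _ in range(self.hops):
            h_ctx = self.history_channel(q, history_feats)  # dialog-history channel
            v_ctx = self.image_channel(q, image_feats)      # image channel
            q = torch.tanh(self.update(torch.cat([q, h_ctx, v_ctx], dim=-1)))
        return q  # enriched question representation

# toy usage
model = DualChannelMultiHop(dim=64, hops=2)
q = torch.randn(2, 64)
hist = torch.randn(2, 10, 64)   # e.g. 10 dialog-history rounds
img = torch.randn(2, 36, 64)    # e.g. 36 image regions
enriched = model(q, hist, img)
```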
This work improves the performance of existing multimodal approaches beyond simple fine-tuning, showing the effectiveness of upsampling contrastive examples to encourage multimodality and of cross-validation-based ensemble learning to improve robustness.
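A minimal sketch of the two ideas in this entry, under assumed inputs: (1) upsample the contrastive examples so the model cannot rely on a single modality, and (2) ensemble models trained on cross-validation folds. The fused features, the contrastive-example flag, and the classifier choice are illustrative placeholders, not the paper's setup.

```python
# Illustrative upsampling of contrastive examples + cross-validation ensemble.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def upsample_contrastive(X, y, is_contrastive, factor=3):
    """Repeat the contrastive examples `factor` times in total."""
    idx = np.where(is_contrastive)[0]
    rep = np.repeat(idx, factor - 1)
    return np.concatenate([X, X[rep]]), np.concatenate([y, y[rep]])

def cv_ensemble_predict(X, y, X_test, n_splits=5):
    """Average predicted probabilities of models trained on each CV fold."""
    probs = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        probs.append(clf.predict_proba(X_test)[:, 1])
    return np.mean(probs, axis=0)

# toy usage with random stand-in features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))               # stand-in for fused image+text features
y = rng.integers(0, 2, size=200)
is_contrastive = rng.random(200) < 0.2       # stand-in for a contrastive-example flag
X_up, y_up = upsample_contrastive(X, y, is_contrastive)
scores = cv_ensemble_predict(X_up, y_up, rng.normal(size=(10, 16)))
```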
UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning, achieves strong performance on each task with significantly fewer parameters.
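As a rough illustration of the shared-backbone, multi-task idea (not UniT's actual architecture, which uses separate image and text encoders with a joint decoder), here is a minimal PyTorch sketch of one transformer encoder serving several tasks through small task-specific heads; the task names and dimensions are assumptions.

```python
# Illustrative shared transformer backbone with per-task heads (assumed setup).
import torch
import torch.nn as nn

class SharedMultiTaskModel(nn.Module):
    def __init__(self, dim=64, num_layers=2, task_output_sizes=None):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # one lightweight head per task on top of the shared representation
        self.heads = nn.ModuleDict({
            task: nn.Linear(dim, out_dim)
            for task, out_dim in (task_output_sizes or {}).items()
        })

    def forward(self, x, task):
        pooled = self.encoder(x).mean(dim=1)   # shared encoding, mean-pooled
        return self.heads[task](pooled)        # task-specific prediction

# toy usage: the same backbone serves two hypothetical tasks
model = SharedMultiTaskModel(task_output_sizes={"vqa": 10, "sentiment": 2})
tokens = torch.randn(2, 12, 64)                # stand-in for embedded multimodal tokens
vqa_logits = model(tokens, task="vqa")
sentiment_logits = model(tokens, task="sentiment")
```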