3260 papers • 126 benchmarks • 313 datasets
(Image credit: Visual Commonsense Reasoning)
These leaderboards are used to track progress in Visual Commonsense Reasoning
Use these libraries to find Visual Commonsense Reasoning models and implementations
No subtasks available.
ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model that processes visual and textual inputs in separate streams which interact through co-attentional transformer layers.
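As a rough illustration of the co-attentional idea (not ViLBERT's released implementation; the hidden size, head count, and sequence lengths below are made up), one cross-stream layer might look like:

```python
# Minimal sketch of a co-attentional layer in the spirit of a two-stream
# vision-and-language model: each stream queries the other stream.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        # language queries attend over visual keys/values, and vice versa
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # txt: (batch, n_tokens, dim); img: (batch, n_regions, dim)
        txt_ctx, _ = self.txt_to_img(txt, img, img)   # text attends to image
        img_ctx, _ = self.img_to_txt(img, txt, txt)   # image attends to text
        return self.norm_txt(txt + txt_ctx), self.norm_img(img + img_ctx)

txt = torch.randn(2, 16, 768)   # toy text token features
img = torch.randn(2, 36, 768)   # toy region features (e.g. 36 detector boxes)
txt_out, img_out = CoAttentionLayer()(txt, img)
print(txt_out.shape, img_out.shape)  # (2, 16, 768) (2, 36, 768)
```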
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; its joint multimodal embeddings can power heterogeneous downstream V+L tasks.
To move towards cognition-level understanding, a new reasoning engine, Recognition to Cognition Networks (R2C), is presented that models the necessary layered inferences for grounding, contextualization, and reasoning.
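A highly simplified sketch of that layered structure (grounding, contextualization, reasoning) is shown below using generic attention blocks; the dimensions, pooling, and scoring head are illustrative and not the paper's exact design.

```python
# Toy three-stage pipeline: ground language in image regions, contextualize
# the answer against the grounded question, then reason to a single score.
import torch
import torch.nn as nn

dim = 512
img    = torch.randn(2, 36, dim)   # region features (hypothetical)
query  = torch.randn(2, 12, dim)   # question token features
answer = torch.randn(2, 10, dim)   # candidate answer token features

ground  = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)  # grounding
context = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)  # contextualization
reason  = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

q_grounded, _ = ground(query, img, img)          # tie question tokens to regions
a_grounded, _ = ground(answer, img, img)         # tie answer tokens to regions
a_context, _  = context(a_grounded, q_grounded, q_grounded)

pooled = torch.cat([q_grounded.mean(dim=1), a_context.mean(dim=1)], dim=-1)
score = reason(pooled)            # one logit per (question, answer) pair
print(score.shape)                # torch.Size([2, 1])
```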
A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), is proposed; it adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input.
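In contrast to the two-stream design above, this is a single-stream setup: visual and linguistic embeddings are concatenated into one sequence and processed by a standard Transformer encoder. The sketch below illustrates that shape-level idea only; the layer counts and embedding sizes are invented, not the released model.

```python
# Minimal single-stream sketch: one joint sequence, full self-attention
# across both modalities.
import torch
import torch.nn as nn

dim = 768
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2,
)

txt_emb = torch.randn(2, 16, dim)   # token embeddings (hypothetical)
img_emb = torch.randn(2, 36, dim)   # projected region features (hypothetical)
joint = torch.cat([txt_emb, img_emb], dim=1)   # one joint sequence
out = encoder(joint)                           # cross-modal self-attention
print(out.shape)  # torch.Size([2, 52, 768])
```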
To enable large-scale training, VILLA adopts the "free" adversarial training strategy, and combines it with KL-divergence-based regularization to promote higher invariance in the embedding space.
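The following sketch conveys the flavor of that recipe (perturb the input embeddings adversarially and penalize the KL divergence between clean and perturbed predictions); the toy classifier, single-step perturbation, and step size are illustrative assumptions, not the paper's "free" large-batch procedure.

```python
# Adversarial perturbation of embeddings plus a KL invariance term.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 4))  # toy answer head
emb = torch.randn(8, 768, requires_grad=True)   # joint multimodal embeddings (hypothetical)
labels = torch.randint(0, 4, (8,))

clean_logits = model(emb)
task_loss = F.cross_entropy(clean_logits, labels)

# one adversarial step: move embeddings along the task-loss gradient
grad = torch.autograd.grad(task_loss, emb, retain_graph=True)[0]
delta = 1e-3 * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
adv_logits = model(emb + delta)

# KL term encourages predictions to stay invariant under the perturbation
kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
              F.softmax(clean_logits, dim=-1), reduction="batchmean")
loss = task_loss + F.cross_entropy(adv_logits, labels) + kl
loss.backward()
```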
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
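A toy illustration of that framing follows: instead of predicting an answer index, the model is trained to emit the answer as text, so every task shares the same generation objective. The field names and example record are made up for illustration.

```python
# Casting a multiple-choice V+L example as conditional text generation.
example = {
    "image": "scene.jpg",
    "question": "Why is the person holding an umbrella?",
    "choices": ["It is raining.", "They are dancing.", "It is sunny.", "They lost a bet."],
    "label": 0,
}

# encoder input: visual features plus the question and the candidate answers
source_text = example["question"] + " " + " ".join(
    f"({i}) {c}" for i, c in enumerate(example["choices"])
)
# decoder target: the answer expressed as text, shared with other tasks
target_text = example["choices"][example["label"]]
print(source_text)
print(target_text)  # "It is raining."
```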
This work proposes X-modaler, a versatile and high-performance codebase that encapsulates state-of-the-art cross-modal analytics into several general-purpose stages and can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval.
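As a generic sketch of such a stage-based design (the stage names and interfaces below are invented for illustration and are not X-modaler's actual API), a cross-modal pipeline can be composed from interchangeable stages:

```python
# Hypothetical stage-based pipeline: compose general-purpose stages into one
# callable analytics flow, so new tasks only swap or add a stage.
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

def build_pipeline(stages: List[Stage]) -> Stage:
    def run(batch: Dict) -> Dict:
        for stage in stages:
            batch = stage(batch)
        return batch
    return run

# illustrative stages: feature extraction -> cross-modal fusion -> task head
def extract_features(batch): return {**batch, "img_feats": [0.1, 0.2], "txt_feats": [0.3]}
def fuse(batch):             return {**batch, "joint": batch["img_feats"] + batch["txt_feats"]}
def vqa_head(batch):         return {**batch, "answer": "umbrella"}

pipeline = build_pipeline([extract_features, fuse, vqa_head])
print(pipeline({"image": "scene.jpg", "question": "What is the person holding?"}))
```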
Dynamic Spatial Memory Network (DSMN), a new deep network architecture that specializes in answering questions that admit latent visual representations, and learns to generate and reason over such representations, is introduced.
A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture.
A new Heterogeneous Graph Learning (HGL) framework is proposed for seamlessly integrating intra-graph and inter-graph reasoning in order to bridge the vision and language domains.
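A rough sketch of that two-level message passing is given below: nodes first aggregate within their own modality graph, then exchange messages across a vision-language bipartite graph. All adjacency matrices here are random placeholders and the shared projections are a simplification, not the paper's learned graphs.

```python
# Intra-graph then inter-graph message passing over vision and language nodes.
import torch
import torch.nn as nn

dim = 256
v_nodes = torch.randn(36, dim)     # visual nodes (e.g. detected regions)
t_nodes = torch.randn(16, dim)     # language nodes (e.g. tokens)

A_vv = torch.softmax(torch.randn(36, 36), dim=-1)   # intra-graph (vision)
A_tt = torch.softmax(torch.randn(16, 16), dim=-1)   # intra-graph (language)
A_vt = torch.softmax(torch.randn(36, 16), dim=-1)   # inter-graph (vision <- language)
A_tv = torch.softmax(torch.randn(16, 36), dim=-1)   # inter-graph (language <- vision)

W_intra = nn.Linear(dim, dim)
W_inter = nn.Linear(dim, dim)

# intra-graph reasoning: aggregate neighbours within each modality
v_intra = torch.relu(W_intra(A_vv @ v_nodes))
t_intra = torch.relu(W_intra(A_tt @ t_nodes))

# inter-graph reasoning: bridge the two domains
v_out = v_intra + torch.relu(W_inter(A_vt @ t_intra))
t_out = t_intra + torch.relu(W_inter(A_tv @ v_intra))
print(v_out.shape, t_out.shape)  # torch.Size([36, 256]) torch.Size([16, 256])
```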