3260 papers • 126 benchmarks • 313 datasets
(Image credit: Visual Commonsense Reasoning)
These leaderboards are used to track progress in Visual Commonsense Reasoning
Use these libraries to find Visual Commonsense Reasoning models and implementations
No subtasks available.
ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model that processes visual and textual inputs in separate streams which interact through co-attentional transformer layers.
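As a rough illustration of the co-attentional idea (not ViLBERT's released implementation; the hidden size, head count, and sequence lengths below are made up), one cross-stream layer might look like:

```python
# Minimal sketch of a co-attentional layer in the spirit of a two-stream
# vision-and-language model: each stream queries the other stream.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        # language queries attend over visual keys/values, and vice versa
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # txt: (batch, n_tokens, dim); img: (batch, n_regions, dim)
        txt_ctx, _ = self.txt_to_img(txt, img, img)   # text attends to image
        img_ctx, _ = self.img_to_txt(img, txt, txt)   # image attends to text
        return self.norm_txt(txt + txt_ctx), self.norm_img(img + img_ctx)

txt = torch.randn(2, 16, 768)   # toy text token features
img = torch.randn(2, 36, 768)   # toy region features (e.g. 36 detector boxes)
txt_out, img_out = CoAttentionLayer()(txt, img)
print(txt_out.shape, img_out.shape)  # (2, 16, 768) (2, 36, 768)
```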
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; its joint multimodal embeddings can power heterogeneous downstream V+L tasks.
To move towards cognition-level understanding, a new reasoning engine, Recognition to Cognition Networks (R2C), is presented that models the necessary layered inferences for grounding, contextualization, and reasoning.
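A highly simplified sketch of that layered structure (grounding, contextualization, reasoning) is shown below using generic attention blocks; the dimensions, pooling, and scoring head are illustrative and not the paper's exact design.

```python
# Toy three-stage pipeline: ground language in image regions, contextualize
# the answer against the grounded question, then reason to a single score.
import torch
import torch.nn as nn

dim = 512
img    = torch.randn(2, 36, dim)   # region features (hypothetical)
query  = torch.randn(2, 12, dim)   # question token features
answer = torch.randn(2, 10, dim)   # candidate answer token features

ground  = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)  # grounding
context = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)  # contextualization
reason  = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

q_grounded, _ = ground(query, img, img)          # tie question tokens to regions
a_grounded, _ = ground(answer, img, img)         # tie answer tokens to regions
a_context, _  = context(a_grounded, q_grounded, q_grounded)

pooled = torch.cat([q_grounded.mean(dim=1), a_context.mean(dim=1)], dim=-1)
score = reason(pooled)            # one logit per (question, answer) pair
print(score.shape)                # torch.Size([2, 1])
```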
A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), is proposed; it adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input.
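In contrast to the two-stream design above, this is a single-stream setup: visual and linguistic embeddings are concatenated into one sequence and processed by a standard Transformer encoder. The sketch below illustrates that shape-level idea only; the layer counts and embedding sizes are invented, not the released model.

```python
# Minimal single-stream sketch: one joint sequence, full self-attention
# across both modalities.
import torch
import torch.nn as nn

dim = 768
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2,
)

txt_emb = torch.randn(2, 16, dim)   # token embeddings (hypothetical)
img_emb = torch.randn(2, 36, dim)   # projected region features (hypothetical)
joint = torch.cat([txt_emb, img_emb], dim=1)   # one joint sequence
out = encoder(joint)                           # cross-modal self-attention
print(out.shape)  # torch.Size([2, 52, 768])
```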
To enable large-scale training, VILLA adopts the "free" adversarial training strategy, and combines it with KL-divergence-based regularization to promote higher invariance in the embedding space.
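The following sketch conveys the flavor of that recipe (perturb the input embeddings adversarially and penalize the KL divergence between clean and perturbed predictions); the toy classifier, single-step perturbation, and step size are illustrative assumptions, not the paper's "free" large-batch procedure.

```python
# Adversarial perturbation of embeddings plus a KL invariance term.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 4))  # toy answer head
emb = torch.randn(8, 768, requires_grad=True)   # joint multimodal embeddings (hypothetical)
labels = torch.randint(0, 4, (8,))

clean_logits = model(emb)
task_loss = F.cross_entropy(clean_logits, labels)

# one adversarial step: move embeddings along the task-loss gradient
grad = torch.autograd.grad(task_loss, emb, retain_graph=True)[0]
delta = 1e-3 * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
adv_logits = model(emb + delta)

# KL term encourages predictions to stay invariant under the perturbation
kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
              F.softmax(clean_logits, dim=-1), reduction="batchmean")
loss = task_loss + F.cross_entropy(adv_logits, labels) + kl
loss.backward()
```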
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
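A toy illustration of that framing follows: instead of predicting an answer index, the model is trained to emit the answer as text, so every task shares the same generation objective. The field names and example record are made up for illustration.

```python
# Casting a multiple-choice V+L example as conditional text generation.
example = {
    "image": "scene.jpg",
    "question": "Why is the person holding an umbrella?",
    "choices": ["It is raining.", "They are dancing.", "It is sunny.", "They lost a bet."],
    "label": 0,
}

# encoder input: visual features plus the question and the candidate answers
source_text = example["question"] + " " + " ".join(
    f"({i}) {c}" for i, c in enumerate(example["choices"])
)
# decoder target: the answer expressed as text, shared with other tasks
target_text = example["choices"][example["label"]]
print(source_text)
print(target_text)  # "It is raining."
```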
This work proposes X-modaler, a versatile and high-performance codebase that encapsulates state-of-the-art cross-modal analytics into several general-purpose stages and can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval.
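As a generic sketch of such a stage-based design (the stage names and interfaces below are invented for illustration and are not X-modaler's actual API), a cross-modal pipeline can be composed from interchangeable stages:

```python
# Hypothetical stage-based pipeline: compose general-purpose stages into one
# callable analytics flow, so new tasks only swap or add a stage.
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

def build_pipeline(stages: List[Stage]) -> Stage:
    def run(batch: Dict) -> Dict:
        for stage in stages:
            batch = stage(batch)
        return batch
    return run

# illustrative stages: feature extraction -> cross-modal fusion -> task head
def extract_features(batch): return {**batch, "img_feats": [0.1, 0.2], "txt_feats": [0.3]}
def fuse(batch):             return {**batch, "joint": batch["img_feats"] + batch["txt_feats"]}
def vqa_head(batch):         return {**batch, "answer": "umbrella"}

pipeline = build_pipeline([extract_features, fuse, vqa_head])
print(pipeline({"image": "scene.jpg", "question": "What is the person holding?"}))
```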
Dynamic Spatial Memory Network (DSMN), a new deep network architecture that specializes in answering questions that admit latent visual representations, and learns to generate and reason over such representations, is introduced.
A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture.
A new Heterogeneous Graph Learning (HGL) framework is proposed for seamlessly integrating intra-graph and inter-graph reasoning in order to bridge the vision and language domains.
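A rough sketch of that two-level message passing is given below: nodes first aggregate within their own modality graph, then exchange messages across a vision-language bipartite graph. All adjacency matrices here are random placeholders and the shared projections are a simplification, not the paper's learned graphs.

```python
# Intra-graph then inter-graph message passing over vision and language nodes.
import torch
import torch.nn as nn

dim = 256
v_nodes = torch.randn(36, dim)     # visual nodes (e.g. detected regions)
t_nodes = torch.randn(16, dim)     # language nodes (e.g. tokens)

A_vv = torch.softmax(torch.randn(36, 36), dim=-1)   # intra-graph (vision)
A_tt = torch.softmax(torch.randn(16, 16), dim=-1)   # intra-graph (language)
A_vt = torch.softmax(torch.randn(36, 16), dim=-1)   # inter-graph (vision <- language)
A_tv = torch.softmax(torch.randn(16, 36), dim=-1)   # inter-graph (language <- vision)

W_intra = nn.Linear(dim, dim)
W_inter = nn.Linear(dim, dim)

# intra-graph reasoning: aggregate neighbours within each modality
v_intra = torch.relu(W_intra(A_vv @ v_nodes))
t_intra = torch.relu(W_intra(A_tt @ t_nodes))

# inter-graph reasoning: bridge the two domains
v_out = v_intra + torch.relu(W_inter(A_vt @ t_intra))
t_out = t_intra + torch.relu(W_inter(A_tv @ v_intra))
print(v_out.shape, t_out.shape)  # torch.Size([36, 256]) torch.Size([16, 256])
```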