Visual Reasoning

3260 papers • 126 benchmarks • 313 datasets

The ability to understand the actions and reasoning associated with visual images.

(Image credit: Papersgraph)

Benchmarks

These leaderboards are used to track progress in Visual Reasoning.

Dataset

Winoground

NLVR2 Dev

NLVR2 Test

Libraries

Use these libraries to find Visual Reasoning models and implementations.

huggingface/transformers

5 papers 124,889

Datasets

Subtasks

Visual Commonsense Reasoning

Most implemented papers

Learning Transferable Visual Models From Natural Language Supervision

I. Sutskever, Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger•Thu Feb 25 2021

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

39743
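The pre-training task summarized above (predicting which caption goes with which image) is commonly expressed as a symmetric contrastive loss over a batch of paired embeddings. The following is a minimal pure-Python sketch of that idea, not the authors' implementation: the function names are hypothetical, the embeddings are assumed to be plain lists of floats, and the temperature is assumed to be a fixed value rather than a learned parameter.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss over a batch of pairs.

    image_embs[i] and text_embs[i] are assumed to describe the same pair.
    The fixed temperature is an assumption made for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)
```

With this sketch, a batch whose captions line up with their images yields a loss near zero, while the same batch with captions swapped yields a much larger loss.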

WinoGAViL

Bongard-OpenWorld

VSR

PHYRE-1B-Within

PHYRE-1B-Cross

VASR

NLVR

IRFL: Image Recognition of Figurative Language

CLEVRER

facebookresearch/multimodal

4 papers 1,288

salesforce/lavis

3 papers 8,713

kakao/DAFT

3 papers 32

bradyfu/awesome-multimodal-large-la…

2 papers 8,890

mlfoundations/open_clip

2 papers 8,415

towhee-io/towhee

2 papers 2,986

TextCaps

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Devi Parikh, Dhruv Batra, Stefan Lee, Jiasen Lu•Mon Aug 05 2019

ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.

4261 0

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

S. Savarese, Junnan Li, Dongxu Li, Steven C. H. Hoi•Sun Jan 29 2023

BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods, and demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.

6884 0

Compositional Attention Networks for Machine Reasoning

Drew A. Hudson, Christopher D. Manning•Wed Feb 14 2018

The MAC network is presented, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning that is computationally-efficient and data-efficient, in particular requiring 5x less data than existing models to achieve strong results.

607 0

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

David J. Fleet, S. Fidler, Fartash Faghri, J. Kiros•Fri Jun 30 2017

A simple change to common loss functions used for multi-modal embeddings, inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, is introduced, which yields significant gains in retrieval performance.

1321 0
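The change described above replaces the usual sum over all negatives in a ranking loss with only the hardest negative for each positive pair. A minimal sketch of such a max-of-hinges loss follows, under stated assumptions: the function name is hypothetical, the input is assumed to be a square similarity matrix in which sim[i][j] scores image i against caption j and the diagonal holds the matching pairs, and the margin value is illustrative.

```python
def max_hinge_loss(sim, margin=0.2):
    """Ranking loss that penalizes only the hardest negative per positive pair.

    sim[i][j] is the similarity of image i and caption j; sim[i][i] is the
    matching pair. The margin default is an assumption made for this sketch.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Hardest negative caption for image i (most similar wrong caption).
        hardest_caption = max(
            (sim[i][j] for j in range(n) if j != i), default=float("-inf")
        )
        # Hardest negative image for caption i (most similar wrong image).
        hardest_image = max(
            (sim[j][i] for j in range(n) if j != i), default=float("-inf")
        )
        total += max(0.0, margin + hardest_caption - positive)
        total += max(0.0, margin + hardest_image - positive)
    return total
```

A batch whose matching pairs already beat every negative by at least the margin contributes zero loss; only violations by the single hardest negative in each direction are counted.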

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Mohit Bansal, Hao Tan•Mon Aug 19 2019

The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.

2805 0

VisualBERT: A Simple and Performant Baseline for Vision and Language

Cho-Jui Hsieh, Mark Yatskar, Kai-Wei Chang, Da Yin, Liunian Harold Li•Thu Aug 08 2019

Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

2223 0

Visual Instruction Tuning

Chunyuan Li, Haotian Liu, Yong Jae Lee, Qingyang Wu•Sun Apr 16 2023

This paper presents LLaVA (Large Language and Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. It also introduces GPT-4 generated visual instruction tuning data and makes the data, the model, and the code base publicly available.

7668 0

UNITER: UNiversal Image-TExt Representation Learning

Zhe Gan, Linjie Li, Licheng Yu, Yen-Chun Chen, Jingjing Liu, Yu Cheng, Ahmed El Kholy, Faisal Ahmed•Tue Sep 24 2019

UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

2505 0

VinVL: Revisiting Visual Representations in Vision-Language Models

Jianfeng Gao, Lijuan Wang, Jianwei Yang, Pengchuan Zhang, Yejin Choi, Xiujun Li, Xiaowei Hu, Lei Zhang•Mon May 31 2021

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [20], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. Code, models and pre-extracted features are released at https://github.com/pzzhang/VinVL.

1056 0
