1. Sim-to-Real Transfer for Vision-and-Language Navigation
2. Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents
3. Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation
4. Hypothesis Sketching for Online Kernel Selection in Continuous Kernel Space
5. Feel The Music: Automatically Generating A Dance For An Input Song
6. Exploring Crowd Co-creation Scenarios for Sketches
7. Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
8. Predicting A Creator's Preferences In, and From, Interactive Generative Art
9. SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
10. Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
11. Decentralized Distributed PPO: Solving PointGoal Navigation
12. Improving Generative Visual Dialog by Answering Diverse Questions
13. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
14. Chasing Ghosts: Instruction Following as Bayesian State Tracking
15. Towards VQA Models That Can Read
16. Embodied Question Answering in Photorealistic Environments With Point Cloud Perception
17. Habitat: A Platform for Embodied AI Research
18. Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment
19. Trick or TReAT: Thematic Reinforcement for Artistic Typography
20. Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future
21. CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
22. Cycle-Consistency for Robust Visual Question Answering
23. Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
24. Audio Visual Scene-Aware Dialog
25. Neural Modular Control for Embodied Question Answering
26. Do explanations make VQA models more predictable to a human?
27. Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition
28. Graph R-CNN for Scene Graph Generation
29. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
30. CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication
31. Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
32. Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
33. Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering
34. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
35. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset
36. Learning to Reason: End-to-End Module Networks for Visual Question Answering
37. An Analysis of Visual Question Answering Algorithms
38. Understanding Black-box Predictions via Influence Functions
39. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
40. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
41. Visual question answering: Datasets, algorithms, and future challenges
42. The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering (FSVQA)
43. Knowing who to listen to: Prioritizing experts from a diverse ensemble for attribute personalization
44. Towards Transparent AI Systems: Interpreting Visual Question Answering Models
45. Focused Evaluation for Image Description with Binary Forced-Choice Tasks
46. Answer-Type Prediction for Visual Question Answering
47. Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions
48. DualNet: Domain-invariant network for visual question answering
49. Training Recurrent Answering Units with Joint Loss Minimization for VQA
50. Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?
51. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
52. Multimodal Residual Learning for Visual QA
53. Analyzing the Behavior of Visual Question Answering Models
54. Hierarchical Question-Image Co-Attention for Visual Question Answering
55. Leveraging Visual Question Answering for Image-Caption Ranking
56. Joint Unsupervised Learning of Deep Representations and Image Clusters
57. A Focused Dynamic Attention Model for Visual Question Answering
58. Generating Visual Explanations
59. Dynamic Memory Networks for Visual and Textual Question Answering
60. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
61. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier
62. Learning Deep Features for Discriminative Localization
63. Deep Residual Learning for Image Recognition
64. MovieQA: Understanding Stories in Movies through Question-Answering
65. Simple Baseline for Visual Question Answering
66. Visual Madlibs: Fill in the Blank Description Generation and Question Answering
67. Learning Common Sense through Visual Abstraction
68. Where to Look: Focus Regions for Visual Question Answering
69. Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources
70. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
71. Yin and Yang: Balancing and Answering Binary Visual Questions
72. Visual7W: Grounded Question Answering in Images
73. Explicit Knowledge-based Reasoning for Visual Question Answering
75. Deep Compositional Question Answering with Neural Module Networks
76. Stacked Attention Networks for Image Question Answering
77. Mind's eye: A recurrent visual representation for image caption generation
78. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering
79. Exploring Nearest Neighbor Approaches for Image Captioning
80. Exploring Models and Data for Image Question Answering
81. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
82. VQA: Visual Question Answering
83. Semantic classification of spacecraft's status: integrating system intelligence and human knowledge
84. Deep visual-semantic alignments for generating image descriptions
85. CIDEr: Consensus-based image description evaluation
86. From captions to visual concepts and back
87. Long-term recurrent convolutional networks for visual recognition and description
88. Show and tell: A neural image caption generator
89. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
90. Explain Images with Multimodal Recurrent Neural Networks
91. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input
92. GloVe: Global Vectors for Word Representation
93. Interactively Guiding Semi-Supervised Clustering via Attribute-Based Explanations
94. Zero-Shot Learning via Visual Abstraction
95. Very Deep Convolutional Networks for Large-Scale Image Recognition
96. Predicting User Annoyance Using Visual Attributes
97. Predicting Failures of Vision Systems
98. Microsoft COCO: Common Objects in Context
99. How Do You Tell a Blackbird from a Crow?
100. Learning the Visual Interpretation of Sentences
102
Bringing Semantics into Focus Using Visual Abstraction
103
Multi-attribute Queries: To Merge or Not to Merge?
104
Relative Attributes for Enhanced Human-Machine Communication
105
What makes Paris look like Paris?
106
The role of image understanding in contour detection
107
Automatic discovery of groups of objects for scene understanding
108
Discovering localized attributes for fine-grained recognition
109
Understanding the Intrinsic Memorability of Images
110
Recognizing jumbled images: The role of local and global information in image classification
111
Unbiased look at dataset bias
112
Finding the weakest link in person detectors
113
iCoseg: Interactive co-segmentation with intelligent scribble guidance
114
The role of features, algorithms and data in visual recognition
115
Seed Image Selection in interactive cosegmentation
116
ImageNet: A large-scale hierarchical image database
117
Unsupervised learning of hierarchical spatial structures in images
118
Semi-supervised co-training and active learning based approach for multi-view intrusion detection
119
From appearance to context-based recognition: Dense labeling in small images
120
Bringing diverse classifiers to common grounds: dtransform
121
Hierarchical Semantics of Objects (hSOs)
122
Combining classifiers for multisensor data fusion
123
Ensemble of classifiers approach for NDT data fusion
124
12-in-1: Multi-Task Vision and Language Representation Learning
125. Past Graduate Interns • Sarmista Velury
126. Cross-channel Communication Networks
127. Our work on teaching bots to navigate New York City using natural language was covered in MIT Technology Review, Forbes, and Fast Company
128. Our work on Embodied Question Answering (Embodied QA), a first step towards agents that can see, talk, and reason, was covered in MIT Technology Review and others
129. ADVISING ACTIVITY Current Graduate Advisees • Samyak Datta, Ph.D. student, Since Fall
130. Featured news story about my Google Faculty Research Award and Dhruv Batra's Office of Naval Research (ONR) Young Investigator Program (YIP) award
131. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization
132. Hadamard Product for Low-rank Bilinear Pooling
133. A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories
134. Deeper LSTM and normalized CNN Visual Question Answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN
135. • Future Directions in Computer Vision Department of Defense workshop
137. • National Science Foundation (NSF) Information and Intelligent Systems (IIS) Division
138. "Technical Opportunities on Campus" for first-year female engineering students
139. Can cartoons be used to teach machines to understand the visual world?
140. Inference for Order Reduction in MRFs
141. Indraprastha Institute of Information Technology
142. uWave: Accelerometer-based Personalized Gesture Recognition
143. Ensemble Based Data Fusion for Early Diagnosis of Alzheimer's Disease
144. Evaluate the Effect of Ground Tire Rubber on Laboratory Rutting Performance of Asphalt Concrete Mixtures
145. In 2019, I was featured in Vogue's "Dream Makers: How the women in AI are shaping our future"
146. SOrT-ing in VQA: Contrastive Gradient Learning for Improved Consistency
147. Program Committee of Workshops
148. 2011 (Oral) Marr Prize
149. Featured news stories about my National Science Foundation (NSF) CAREER Award • Virginia Tech's Bradley Department of Electrical and Computer Engineering
150. Featured news story about my Amazon Academic Research Award • Georgia Tech's College of Computing
151. "Incredible Women Advancing A.I. Research" • Forbes
152. A Multiple Classifier Approach for Multisensor Data Fusion, 7th International Conference on Information Fusion (FUSION), 2005