Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Published in

International Journal of Computer Vision (2016)

TL;DR

This paper presents the Visual Genome dataset, which contains over 108K images densely annotated with objects, attributes, and relationships. Each image has an average of 35 objects, and the annotations are intended for training models that reason about how the objects in an image relate to one another.

Abstract

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked “What vehicle is the person riding?”, computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that “the person is riding a horse-drawn carriage.” In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
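
The riding(man, carriage) and pulling(horse, carriage) example from the abstract can be illustrated with a small sketch. It stores relationships as predicate(subject, object) triples and combines two of them to answer the question about the vehicle. The class names `Relationship` and `SceneGraph` and the helper `describe_vehicle` are hypothetical names chosen for this illustration; they are not the dataset's actual schema or API, and the answering logic is a hand-written lookup, not a model from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One predicate(subject, object) triple, e.g. riding(man, carriage)."""
    predicate: str
    subject: str
    obj: str


@dataclass
class SceneGraph:
    """Toy scene graph: a set of object names plus relationship triples."""
    objects: set = field(default_factory=set)
    relationships: list = field(default_factory=list)

    def add_relationship(self, predicate, subject, obj):
        # Register both endpoints as objects, then record the triple.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(predicate, subject, obj))

    def objects_of(self, predicate, subject):
        # All objects o such that predicate(subject, o) holds.
        return [r.obj for r in self.relationships
                if r.predicate == predicate and r.subject == subject]

    def subjects_of(self, predicate, obj):
        # All subjects s such that predicate(s, obj) holds.
        return [r.subject for r in self.relationships
                if r.predicate == predicate and r.obj == obj]


def describe_vehicle(graph, rider):
    """Answer 'What vehicle is <rider> riding?' by chaining two relationships."""
    answers = []
    for vehicle in graph.objects_of("riding", rider):
        pullers = graph.subjects_of("pulling", vehicle)
        if pullers:
            answers.append(f"{pullers[0]}-drawn {vehicle}")
        else:
            answers.append(vehicle)
    return answers


# The two relationships named in the abstract.
graph = SceneGraph()
graph.add_relationship("riding", "man", "carriage")
graph.add_relationship("pulling", "horse", "carriage")

print(describe_vehicle(graph, "man"))  # ['horse-drawn carriage']
```

The point of the sketch is that neither relationship alone yields "horse-drawn carriage"; the answer only emerges when both triples that share the object `carriage` are read together, which is the kind of reasoning the dense relationship annotations are meant to support.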
