Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015-02-10)

TL;DR

Introduces an attention-based model that automatically learns to describe the content of images. The model can be trained deterministically with standard backpropagation or stochastically by maximizing a variational lower bound.

Abstract

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
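As a rough illustration of the deterministic training route mentioned in the abstract, the sketch below computes one step of soft attention: each image location gets a weight, and the context passed to the decoder is the weighted average of the location features. Because every operation is differentiable, a model built this way can be trained with standard backpropagation. The function names, the toy vectors, and the dot-product scoring rule are assumptions made for this example only; the scoring function used in the paper may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(annotations, hidden):
    """Return (weights, context) for one decoding step.

    annotations: list of feature vectors, one per image location.
    hidden: decoder state vector, assumed here to have the same dimension.
    """
    # Assumption: a plain dot product scores how relevant each location is.
    scores = [sum(a_k * h_k for a_k, h_k in zip(a, hidden)) for a in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    # Context vector: attention-weighted average of the location features.
    context = [sum(w * a[k] for w, a in zip(weights, annotations)) for k in range(dim)]
    return weights, context

# Toy input: three image locations with 2-dimensional features (hypothetical values).
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 0.0]
weights, context = soft_attention(annotations, hidden)
```

In this toy case the first and third locations align with the decoder state and receive equal, larger weights than the second. The stochastic variant described in the abstract would instead sample a single location from these weights, which is why it needs the variational lower bound rather than plain backpropagation.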

Authors

R. Zemel

Yoshua Bengio

Aaron C. Courville
