Multimodal Grounding for Sequence-to-sequence Speech Recognition

Published in

IEEE International Conference on Acoustics, Spe...(2018)

TL;DR

This paper proposes novel end-to-end multimodal ASR systems and compares them to the adaptive approach, using a range of visual representations obtained from state-of-the-art convolutional neural networks. It shows that adaptive training is effective for sequence-to-sequence (S2S) models, leading to an absolute improvement of 1.4% in word error rate.

Abstract

Humans are capable of processing speech by making use of multiple sensory modalities. For example, the environment where a conversation takes place generally provides semantic and/or acoustic context that helps us to resolve ambiguities or to recall named entities. Motivated by this, there have been many works studying the integration of visual information into the speech recognition pipeline. Specifically, in our previous work, we proposed a multistep visual adaptive training approach which improves the accuracy of an audio-based Automatic Speech Recognition (ASR) system. This approach, however, is not end-to-end, as it requires fine-tuning the whole model with an adaptation layer. In this paper, we propose novel end-to-end multimodal ASR systems and compare them to the adaptive approach by using a range of visual representations obtained from state-of-the-art convolutional neural networks. We show that adaptive training is effective for sequence-to-sequence (S2S) models, leading to an absolute improvement of 1.4% in word error rate (WER). As for the end-to-end systems, although they perform better than the baseline, the improvements are slightly smaller than with adaptive training: a 0.8% absolute WER reduction in single-best models. Using ensemble decoding, end-to-end models reach a WER of 15%, which is the lowest score among all systems.
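The abstract reports its results as absolute WER reductions. As a minimal sketch of what that metric measures, the Python below computes WER as word-level edit distance divided by reference length, and separates absolute from relative reduction. The function names, the example sentences, and the 10.0 and 9.0 figures are illustrative assumptions, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, system_wer: float) -> float:
    """Difference in WER, in the same units as the inputs (e.g. points)."""
    return baseline_wer - system_wer


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Reduction expressed as a fraction of the baseline WER."""
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical transcripts: one deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical scores in WER points (not taken from the paper).
abs_gain = absolute_reduction(10.0, 9.0)
rel_gain = relative_reduction(10.0, 9.0)
```

With these made-up scores, a drop from 10.0 to 9.0 WER points is a 1.0-point absolute reduction but a 10% relative reduction, which is why the abstract states explicitly that its figures are absolute.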

Authors

Loïc Barrault

10 papers

Shruti Palaskar

3 papers

Florian Metze

5 papers

Ramon Sanabria

3 papers

Ozan Caglayan

4 papers

References (27 items)

LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition

How2: A Large-scale Dataset for Multimodal Language Understanding

LIUM-CVC Submissions for WMT18 Multimodal Translation Task

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Places: A 10 Million Image Database for Scene Recognition

End-to-end Multimodal Speech Recognition

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems

The Kinetics Human Action Video Dataset

Visual features for context-aware speech recognition

Incorporating Global Visual Features into Attention-based Neural Machine Translation.

Look, listen, and decode: Multimodal speech recognition with images

Lip Reading Sentences in the Wild

Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge

Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach

Using the Output Embedding to Improve Language Models

Deep Residual Learning for Image Recognition

Speaker Adaptive Training of Deep Neural Network Acoustic Models Using I-Vectors

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

Adam: A Method for Stochastic Optimization

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Neural Machine Translation by Jointly Learning to Align and Translate

ImageNet: A large-scale hierarchical image database

Long Short-Term Memory

Recurrent neural network language model adaptation for multi-genre broadcast speech recognition

The Kaldi Speech Recognition Toolkit

Field of Study

Computer Science

Journal Information

Name

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Page

381-385

Venue Information

Name

IEEE International Conference on Acoustics, Speech, and Signal Processing

Type

conference

URL

http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000002

Alternate Names

Int Conf Acoust Speech Signal Process
IEEE Int Conf Acoust Speech Signal Process
ICASSP
International Conference on Acoustics, Speech, and Signal Processing