Multimodal machine translation is the task of translating from multiple input modalities at once - for example, translating the sentence "a bird is flying over water" together with an image of a bird over water into German text. (Image credit: Findings of the Third Shared Task on Multimodal Machine Translation)
These leaderboards are used to track progress in Multimodal Machine Translation
Use these libraries to find Multimodal Machine Translation models and implementations
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
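At the heart of the Transformer is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. Below is a minimal single-head NumPy sketch of that operation; the shapes and toy inputs are illustrative, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (len_q, len_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # weighted sum of values

# Toy usage: 5 query positions attending over 7 key/value positions.
Q, K, V = np.random.randn(5, 64), np.random.randn(7, 64), np.random.randn(7, 32)
out = scaled_dot_product_attention(Q, K, V)        # shape (5, 32)
```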
This dataset extends the Flickr30K dataset with i) German translations created by professional translators for a subset of the English descriptions, and ii) German descriptions crowdsourced independently of the original English descriptions.
The systems developed by LIUM and CVC for the WMT16 Multimodal Machine Translation challenge are presented, namely phrase-based systems and attentional recurrent neural network models trained using monomodal or multimodal data.
In this paper, we present nmtpy, a flexible Python toolkit based on Theano for training Neural Machine Translation and other neural sequence-to-sequence architectures. nmtpy decouples the specification of a network from the training and inference utilities to simplify the addition of a new architecture and reduce the amount of boilerplate code to be written. nmtpy has been used for LIUM's top-ranked submissions to the WMT Multimodal Machine Translation and News Translation tasks in 2016 and 2017.
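The decoupling the abstract describes can be illustrated with a toy skeleton: a model class carries only the architecture-specific loss, while a generic trainer is reused across architectures. All class and function names below are hypothetical, invented for illustration; they are not nmtpy's actual API.

```python
# Hypothetical sketch of decoupling a network specification from generic
# training utilities. These names are NOT nmtpy's real API.

class AttentiveNMT:
    """Network specification: defines only how a batch maps to a loss."""
    def __init__(self, vocab_size, hidden_size):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

    def loss(self, src_batch, tgt_batch):
        # A real model would build and evaluate a computation graph here;
        # a dummy value keeps the sketch self-contained and runnable.
        return float(len(src_batch) + len(tgt_batch))

def train(model, batches, epochs=1):
    """Generic training utility: reusable with any model exposing .loss()."""
    for epoch in range(epochs):
        total = sum(model.loss(src, tgt) for src, tgt in batches)
        print(f"epoch {epoch}: loss={total:.4f}")

# Adding a new architecture means writing a new model class only;
# the training loop is shared boilerplate, written once.
train(AttentiveNMT(vocab_size=30000, hidden_size=512), [([1, 2, 3], [4, 5])])
```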
A novel multimodal machine translation model is proposed that utilizes parallel visual and textual information, jointly optimizing a shared visual-language embedding and a translator; it achieves competitive state-of-the-art results on the Multi30K and Ambiguous COCO datasets.
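One common way to set up such joint optimization is a weighted sum of a translation loss and an embedding-alignment loss. The function below is a PyTorch sketch under that assumption; the cosine-based alignment term and the weight `alpha` are illustrative choices, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def joint_loss(sent_emb, img_emb, logits, targets, alpha=0.5):
    """Translation cross-entropy plus an alignment term that pulls matching
    sentence/image embeddings together.

    sent_emb, img_emb: (batch, dim) embeddings of sentence and image.
    logits: (n_tokens, vocab) decoder outputs; targets: (n_tokens,) token ids.
    """
    translation = F.cross_entropy(logits, targets)
    alignment = (1.0 - F.cosine_similarity(sent_emb, img_emb)).mean()
    return translation + alpha * alignment
```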
Compared to last year, the performance of the multimodal submissions improved, but text-only systems remain competitive.
A novel architecture called deepGRU, based on recent findings in the related task of neural image captioning, is explored; it leads to the best METEOR translation score for both the constrained (English, image) -> German and (English) -> French sub-tasks.
This work proposes to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model, and shows that the latent variable MMT formulation improves considerably over strong baselines.
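The latent-variable formulation can be sketched generically: a VAE-style module infers a Gaussian latent from concatenated text and image features and contributes a KL term to the training loss. The module below is an illustrative PyTorch sketch under those assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    """VAE-style latent variable over joint text/image features (generic sketch)."""
    def __init__(self, txt_dim, img_dim, z_dim):
        super().__init__()
        self.to_stats = nn.Linear(txt_dim + img_dim, 2 * z_dim)

    def forward(self, txt_feats, img_feats):
        stats = self.to_stats(torch.cat([txt_feats, img_feats], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return z, kl  # z conditions the decoder; kl is added to the loss
```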
This study effectively combines two approaches to improve NMT for low-resource domains in the context of multimodal NMT, and explores how to take full advantage of pretrained word embeddings to better translate rare words.
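On the pretrained-embedding side, the basic mechanism is to initialize the model's embedding matrix from externally trained vectors, so rare words start from informative representations. The helper below is a hypothetical PyTorch sketch of that idea; `vocab` and `pretrained` are assumed inputs, and the procedure is not claimed to match the study's exact recipe.

```python
import numpy as np
import torch
import torch.nn as nn

def init_embeddings(vocab, pretrained, dim):
    """Build an embedding layer seeded from pretrained vectors where available.

    vocab: token -> row index; pretrained: token -> np.ndarray of shape (dim,).
    Rare words seen during embedding pretraining but infrequent in the
    parallel data still start from an informative vector.
    """
    weight = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    for token, idx in vocab.items():
        if token in pretrained:
            weight[idx] = pretrained[token]  # copy the pretrained vector
    return nn.Embedding.from_pretrained(torch.from_numpy(weight), freeze=False)
```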