Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, as many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called “pseudo-parallel” sentences from paired documents in two languages. In this paper, we outline some drawbacks of current methods that rely on an embedding similarity threshold, and propose a heuristic method in their place. Our method involves translating both halves of a paired corpus before mining, and then performing a majority vote over sentence pairs mined in three ways: after translating documents in language x to language y, after translating language y to x, and using the original documents in languages x and y. We demonstrate the success of this novel approach on the Tatoeba similarity search benchmark in 64 low-resource languages, and on NMT in Kazakh and Gujarati. We also uncover the effect of resource-related factors (i.e., how much monolingual/bilingual data is available for a given language) on the optimal choice of bitext mining method, demonstrating that there is currently no one-size-fits-all approach for this task. We make the code and data used in our experiments publicly available.
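To make the majority-vote step concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes mined pairs are tracked by their sentence indices in the original documents, so that pairs mined from translated and untranslated inputs are comparable across runs; `mine_pairs` and `translate` are hypothetical stand-ins for an embedding-based miner and an existing MT system, neither of which the abstract specifies.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# A mined pair is identified by (index of sentence in the language-x document,
# index of sentence in the language-y document), so votes from different
# mining runs refer to the same underlying sentences.
Pair = Tuple[int, int]

def majority_vote(runs: Iterable[Set[Pair]], threshold: int = 2) -> Set[Pair]:
    """Keep sentence pairs that appear in at least `threshold` mining runs."""
    counts = Counter(pair for run in runs for pair in run)
    return {pair for pair, n in counts.items() if n >= threshold}

# Hypothetical usage, following the three configurations described above:
# pairs_xy   = mine_pairs(translate(docs_x, to="y"), docs_y)  # x translated to y
# pairs_yx   = mine_pairs(docs_x, translate(docs_y, to="x"))  # y translated to x
# pairs_orig = mine_pairs(docs_x, docs_y)                     # original documents
# bitext = majority_vote([pairs_xy, pairs_yx, pairs_orig])
```

With three runs and a threshold of two, a pair survives only if at least two of the three mining configurations agree on it, which is what makes the vote a majority.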