3260 papers • 126 benchmarks • 313 datasets
Cross-lingual bitext mining is the task of mining sentence pairs that are translations of each other from large text corpora.
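At its core, embedding-based bitext mining pairs each source sentence with its nearest target sentence in a shared multilingual embedding space and keeps only confident pairs. A minimal sketch, assuming sentence embeddings have already been computed (the function name and threshold are illustrative, not from any specific system):

```python
import numpy as np

def mine_bitext(src_emb: np.ndarray, tgt_emb: np.ndarray, threshold: float = 0.8):
    """Pair each source sentence with its nearest target sentence by
    cosine similarity, keeping only pairs above a similarity threshold."""
    # L2-normalise so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                       # (n_src, n_tgt) cosine matrix
    nearest = sim.argmax(axis=1)            # best target index per source
    scores = sim[np.arange(len(src)), nearest]
    return [(i, int(j), float(s))
            for i, (j, s) in enumerate(zip(nearest, scores))
            if s >= threshold]
```

In practice the corpora are large, so the exhaustive similarity matrix is replaced with approximate nearest-neighbor search, but the selection logic is the same.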
These leaderboards are used to track progress in cross-lingual bitext mining.
Use these libraries to find cross-lingual bitext mining models and implementations.
No subtasks available.
An architecture that learns joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts, using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
This paper proposes a new margin-based scoring method for this task based on multilingual sentence embeddings; it improves over methods that rely on nearest neighbor retrieval with a hard threshold over cosine similarity by accounting for the scale inconsistencies of that measure.
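The margin idea can be sketched as follows: instead of scoring a candidate pair by raw cosine similarity, divide it by the average similarity of each sentence to its own k nearest neighbors, so a pair only scores highly if it stands out from its local neighborhood. A minimal sketch of a ratio-style margin (function name and details are ours, not the paper's exact implementation):

```python
import numpy as np

def margin_scores(src: np.ndarray, tgt: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin score: cosine similarity divided by the average
    similarity to each side's k nearest neighbors, which compensates
    for the scale inconsistencies of raw cosine similarity."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T
    # Mean similarity of each sentence to its k nearest neighbors on the
    # other side; the margin averages both directions.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per source row
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per target column
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2)
```

Mining then keeps pairs whose margin score exceeds a threshold, which transfers across language pairs better than a threshold on raw cosine.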
This work pairs monolingual training data with automatic back-translations and treats them as additional parallel training data, obtaining substantial improvements on the WMT 15 English-German task and the low-resource IWSLT 14 Turkish->English task.
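The back-translation recipe above reduces to: run monolingual target-side text through a reverse (target-to-source) model and pair each sentence with its machine-generated source. A minimal sketch; `reverse_translate` is a hypothetical stand-in for a trained reverse NMT model:

```python
def back_translate(monolingual_tgt, reverse_translate):
    """Create synthetic parallel data from monolingual target-side text.

    `reverse_translate` is any target->source translation function
    (hypothetical stand-in for a trained reverse NMT model). Each target
    sentence is paired with its machine-generated source; the synthetic
    pairs are then mixed with the real bitext for training.
    """
    return [(reverse_translate(t), t) for t in monolingual_tgt]
```

The key property is that the target side of each synthetic pair is genuine human text, so the forward model learns to produce fluent output even though the source side is machine-generated.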
It is argued that a neural machine translation system can by itself serve as a sentence similarity scorer, and that pairwise comparison can be efficiently approximated with a modified beam search.
This paper outlines drawbacks of current methods that rely on an embedding similarity threshold and proposes a heuristic method in its place, demonstrating success on the Tatoeba similarity search benchmark in 64 low-resource languages and on NMT for Kazakh and Gujarati.