3260 papers • 126 benchmarks • 313 datasets
Cross-lingual bitext mining is the task of mining sentence pairs that are translations of each other from large text corpora.
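At its core, embedding-based bitext mining pairs each source sentence with its nearest target sentence in a shared multilingual embedding space and keeps only confident pairs. A minimal sketch, assuming sentence embeddings have already been computed (the function name and threshold are illustrative, not from any specific system):

```python
import numpy as np

def mine_bitext(src_emb: np.ndarray, tgt_emb: np.ndarray, threshold: float = 0.8):
    """Pair each source sentence with its nearest target sentence by
    cosine similarity, keeping only pairs above a similarity threshold."""
    # L2-normalise so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                       # (n_src, n_tgt) cosine matrix
    nearest = sim.argmax(axis=1)            # best target index per source
    scores = sim[np.arange(len(src)), nearest]
    return [(i, int(j), float(s))
            for i, (j, s) in enumerate(zip(nearest, scores))
            if s >= threshold]
```

In practice the corpora are large, so the exhaustive similarity matrix is replaced with approximate nearest-neighbor search, but the selection logic is the same.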
These leaderboards are used to track progress in cross-lingual bitext mining.
Use these libraries to find cross-lingual bitext mining models and implementations.
No subtasks available.
An architecture that learns joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts, using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
This paper proposes a new margin-based scoring method for this task based on multilingual sentence embeddings; it improves over methods that rely on nearest neighbor retrieval with a hard threshold over cosine similarity by accounting for the scale inconsistencies of that measure.
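The margin idea can be sketched as follows: instead of scoring a candidate pair by raw cosine similarity, divide it by the average similarity of each sentence to its own k nearest neighbors, so a pair only scores highly if it stands out from its local neighborhood. A minimal sketch of a ratio-style margin (function name and details are ours, not the paper's exact implementation):

```python
import numpy as np

def margin_scores(src: np.ndarray, tgt: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin score: cosine similarity divided by the average
    similarity to each side's k nearest neighbors, which compensates
    for the scale inconsistencies of raw cosine similarity."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T
    # Mean similarity of each sentence to its k nearest neighbors on the
    # other side; the margin averages both directions.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per source row
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per target column
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2)
```

Mining then keeps pairs whose margin score exceeds a threshold, which transfers across language pairs better than a threshold on raw cosine.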
This work pairs monolingual training data with automatic back-translations and treats them as additional parallel training data, obtaining substantial improvements on the WMT 15 English-German task and the low-resource IWSLT 14 Turkish->English task.
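The back-translation recipe above reduces to: run monolingual target-side text through a reverse (target-to-source) model and pair each sentence with its machine-generated source. A minimal sketch; `reverse_translate` is a hypothetical stand-in for a trained reverse NMT model:

```python
def back_translate(monolingual_tgt, reverse_translate):
    """Create synthetic parallel data from monolingual target-side text.

    `reverse_translate` is any target->source translation function
    (hypothetical stand-in for a trained reverse NMT model). Each target
    sentence is paired with its machine-generated source; the synthetic
    pairs are then mixed with the real bitext for training.
    """
    return [(reverse_translate(t), t) for t in monolingual_tgt]
```

The key property is that the target side of each synthetic pair is genuine human text, so the forward model learns to produce fluent output even though the source side is machine-generated.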
It is argued that a neural machine translation system can by itself serve as a sentence similarity scorer, and that pairwise comparison can be efficiently approximated with a modified beam search.
This paper outlines drawbacks of current methods that rely on an embedding similarity threshold and proposes a heuristic method in its place, demonstrating success on the Tatoeba similarity search benchmark in 64 low-resource languages and on NMT for Kazakh and Gujarati.