Cross-lingual document classification is the task of using data and models from a language with ample resources (e.g., English) to solve classification tasks in another, typically low-resource, language.
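A common way to operationalise this transfer is to embed documents from all languages into a shared multilingual space, train a classifier on the high-resource language only, and apply it unchanged to the low-resource language. Below is a minimal zero-shot sketch of that pattern; the encoder checkpoint, toy documents and labels are illustrative assumptions, not part of the page above.

```python
# Minimal zero-shot cross-lingual classification sketch (toy data).
# Assumes the `sentence-transformers` and `scikit-learn` packages; the model
# name below is one publicly available multilingual encoder, not the only choice.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Labelled documents exist only in the high-resource language (English here).
en_docs = ["The central bank raised interest rates.",
           "The striker scored twice in the final."]
en_labels = ["economy", "sports"]

# Unlabelled documents in the low-resource target language (German here).
de_docs = ["Der Stürmer erzielte zwei Tore im Finale."]

# 1. Embed both languages into the same multilingual space.
X_en = encoder.encode(en_docs)
X_de = encoder.encode(de_docs)

# 2. Train the classifier on the source language only.
clf = LogisticRegression(max_iter=1000).fit(X_en, en_labels)

# 3. Apply it unchanged to the target language (zero-shot transfer).
print(clf.predict(X_de))  # expected: ['sports']
```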
An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts, using a single BiLSTM encoder with a shared byte-pair-encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
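The encoder side of such an architecture is small enough to sketch: one embedding table over a joint BPE vocabulary shared by every language, a BiLSTM, and pooling to a fixed-size, language-agnostic sentence vector. The dimensions, layer count and max-pooling below are illustrative assumptions, and the auxiliary decoder and parallel-corpus training loop are omitted.

```python
# Sketch of a language-agnostic BiLSTM sentence encoder in the spirit of the
# architecture described above. Sizes and max-pooling are assumptions; the
# auxiliary decoder used during training is left out.
import torch
import torch.nn as nn

class MultilingualSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=50_000, emb_dim=320, hidden_dim=512, num_layers=1):
        super().__init__()
        # One embedding table shared by all languages via a joint BPE vocabulary.
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):                         # (batch, seq_len)
        states, _ = self.bilstm(self.embed(token_ids))    # (batch, seq_len, 2*hidden)
        # Pool the BiLSTM states into one fixed-size, language-agnostic vector.
        sentence_vec, _ = states.max(dim=1)
        return sentence_vec                               # (batch, 2*hidden)

encoder = MultilingualSentenceEncoder()
dummy_batch = torch.randint(1, 50_000, (2, 12))           # two sentences of 12 BPE tokens
print(encoder(dummy_batch).shape)                         # torch.Size([2, 1024])
```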
A novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained, allowing the model size to scale in proportion to the number of devices with sustained high efficiency.
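The memory saving behind this idea can be seen with simple arithmetic: under plain data parallelism every device replicates parameters, gradients and optimizer states, whereas partitioning those states spreads them across devices, so the per-device footprint shrinks with the device count. A rough sketch, with illustrative byte counts for mixed-precision Adam rather than exact figures for any particular setup:

```python
# Back-of-the-envelope sketch of why partitioning model states across devices
# lets the trainable model size grow with the device count. Byte counts follow
# the usual mixed-precision Adam accounting and are illustrative assumptions.
def per_device_gb(num_params, num_devices, partition_states=True):
    param_bytes = 2           # fp16 parameters
    grad_bytes = 2            # fp16 gradients
    optim_bytes = 12          # fp32 master params + Adam momentum + variance
    total = num_params * (param_bytes + grad_bytes + optim_bytes)
    if partition_states:
        total /= num_devices  # each device holds only its shard of the states
    return total / 1024**3

billion = 1_000_000_000
for devices in (1, 8, 64):
    plain = per_device_gb(7 * billion, devices, partition_states=False)
    sharded = per_device_gb(7 * billion, devices, partition_states=True)
    print(f"{devices:>3} devices: {plain:6.1f} GB/device replicated "
          f"vs {sharded:6.1f} GB/device sharded")
```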
Multi-lingual language model Fine-Tuning (MultiFiT) is proposed to enable practitioners to train and fine-tune language models efficiently in their own language, along with a zero-shot method that uses an existing pretrained cross-lingual model.
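One reading of the zero-shot setup is a teacher-student bootstrap: a cross-lingual model trained on source-language labels pseudo-labels target-language text, and an efficient monolingual model is then fitted on those pseudo-labels. The sketch below illustrates that pattern; the encoder checkpoint, toy documents and the choice of student model are assumptions, not the paper's exact recipe.

```python
# Hypothetical teacher-student sketch of the zero-shot path described above.
# Data, model choices and the pseudo-labelling step are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Source-language (English) supervision only.
en_docs = ["Stocks fell sharply on Monday.", "The team won the championship."]
en_labels = ["economy", "sports"]

# Unlabelled target-language (German) corpus.
de_docs = ["Die Aktienkurse fielen deutlich.", "Die Mannschaft gewann den Titel."]

# 1. Cross-lingual teacher: train on English embeddings.
teacher = LogisticRegression(max_iter=1000).fit(encoder.encode(en_docs), en_labels)

# 2. Pseudo-label the target-language documents with the teacher.
pseudo_labels = teacher.predict(encoder.encode(de_docs))

# 3. Fit an efficient monolingual student directly on target-language text.
#    (Guard: a classifier needs at least two distinct pseudo-labels to fit.)
if len(set(pseudo_labels)) > 1:
    student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    student.fit(de_docs, pseudo_labels)
    print(student.predict(["Der Verein feierte den Sieg."]))
```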
An Adversarial Deep Averaging Network (ADAN) is proposed to transfer the knowledge learned from labeled data on a resource-rich source language to low-resource languages where only unlabeled data exist.
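The name implies three components: a deep-averaging feature extractor shared across languages, a task classifier, and a language discriminator trained adversarially so that the shared features become language-invariant. A minimal PyTorch sketch follows; gradient reversal is one common way to implement the adversarial signal and, like the layer sizes, is an assumption rather than necessarily the paper's exact choice.

```python
# Minimal sketch of an adversarial deep averaging network: shared deep-averaging
# feature extractor, task classifier, and a language discriminator that receives
# reversed gradients. Sizes and details are illustrative assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class ADAN(nn.Module):
    def __init__(self, vocab_size=30_000, emb_dim=300, hidden=300,
                 num_classes=2, num_langs=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Deep averaging feature extractor: mean of word embeddings -> MLP.
        self.extractor = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
        self.task_classifier = nn.Linear(hidden, num_classes)
        self.lang_discriminator = nn.Linear(hidden, num_langs)

    def forward(self, token_ids, lam=1.0):
        feats = self.extractor(self.embed(token_ids).mean(dim=1))
        task_logits = self.task_classifier(feats)
        # The discriminator sees the features through a gradient-reversal layer,
        # pushing the extractor toward language-invariant representations.
        lang_logits = self.lang_discriminator(GradReverse.apply(feats, lam))
        return task_logits, lang_logits

model = ADAN()
batch = torch.randint(1, 30_000, (4, 20))      # four documents of 20 token ids
task_logits, lang_logits = model(batch)
print(task_logits.shape, lang_logits.shape)    # torch.Size([4, 2]) torch.Size([4, 2])
```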
It is shown that bilingual embeddings learned using the proposed BilBOWA model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data.
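As commonly described, the cross-lingual part of this objective pulls together the bag-of-words (mean) embeddings of a sentence and of its translation, while separate monolingual skip-gram objectives keep each language's embeddings informative. The sketch below shows only that cross-lingual alignment term; the vocabulary sizes, dimensions and toy sentence pair are assumptions, and the monolingual terms are omitted.

```python
# Sketch of a cross-lingual bag-of-words alignment loss: the mean word vector
# of a sentence and of its translation are pulled together with a squared-error
# loss. Monolingual skip-gram objectives are omitted; all values are toy data.
import torch
import torch.nn as nn

emb_dim = 100
emb_en = nn.Embedding(20_000, emb_dim)   # English embedding table
emb_de = nn.Embedding(20_000, emb_dim)   # German embedding table

def crosslingual_loss(en_ids, de_ids):
    # Mean of word vectors = bag-of-words sentence representation.
    en_bow = emb_en(en_ids).mean(dim=0)
    de_bow = emb_de(de_ids).mean(dim=0)
    return ((en_bow - de_bow) ** 2).sum()

# One aligned sentence pair given as (arbitrary) token ids.
en_sentence = torch.tensor([12, 431, 7, 990])
de_sentence = torch.tensor([55, 2301, 18])

loss = crosslingual_loss(en_sentence, de_sentence)
loss.backward()   # gradients flow into both embedding tables
print(float(loss))
```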
A new subset of the Reuters corpus with balanced class priors for eight languages is proposed, adding Italian, Russian, Japanese and Chinese, providing strong baselines for all language transfer directions using multilingual word and sentence embeddings.
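The "all language transfer directions" protocol simply means training on each source language and testing on every other target language. A small sketch of how those directions can be enumerated is below; the language codes follow the sentence above plus the four languages of the earlier Reuters benchmark, and `train_and_evaluate` is a hypothetical placeholder rather than anything shipped with the dataset.

```python
# Sketch of enumerating all cross-lingual transfer directions for eight languages.
# `train_and_evaluate` is a hypothetical placeholder, not part of the dataset.
from itertools import product

LANGUAGES = ["en", "de", "es", "fr", "it", "ru", "ja", "zh"]

def train_and_evaluate(source, target):
    # Placeholder: train on `source` training data, report accuracy on `target` test data.
    return float("nan")

results = {(src, tgt): train_and_evaluate(src, tgt)
           for src, tgt in product(LANGUAGES, repeat=2) if src != tgt}
print(f"{len(results)} transfer directions evaluated")   # 56 directions
```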
This work proposes a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations, significantly improving cross-lingual sentence retrieval performance over all other approaches while maintaining parity with current state-of-the-art methods on word translation.
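One common formulation of a bilingual CBOW objective predicts a word in one language from the bag of words of the aligned sentence in the other language, which ties the two embedding spaces together. The sketch below shows that formulation; whether it matches the paper's exact objective is an assumption, and the vocabulary sizes and toy sentence pair are illustrative.

```python
# Sketch of one bilingual-CBOW formulation: an English word is predicted from
# the bag of words of its aligned German sentence. Sizes and data are toy values,
# and this may differ from the paper's exact objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_en, vocab_de, dim = 20_000, 20_000, 100
emb_de = nn.Embedding(vocab_de, dim)        # context embeddings (German side)
out_en = nn.Linear(dim, vocab_en)           # output layer over the English vocabulary

def crosslingual_cbow_loss(de_sentence_ids, en_target_id):
    context = emb_de(de_sentence_ids).mean(dim=0)        # bag of aligned-sentence words
    logits = out_en(context)                             # score every English word
    return F.cross_entropy(logits.unsqueeze(0), en_target_id.unsqueeze(0))

de_sentence = torch.tensor([55, 2301, 18, 7])   # aligned German sentence (token ids)
en_target = torch.tensor(431)                   # one English word from its translation
loss = crosslingual_cbow_loss(de_sentence, en_target)
loss.backward()
print(float(loss))
```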
This work proposes a method for learning distributed representations in a multilingual setup, shows that these representations are semantically informative, and applies them to a cross-lingual document classification task where they outperform the previous state of the art.
A novel technique for learning semantic representations that extends the distributional hypothesis to multilingual data and joint-space embeddings, demonstrating that these representations are semantically plausible and can capture semantic relationships across languages without parallel data.
This method takes advantage of a high-coverage dictionary in an EM-style training algorithm over monolingual corpora in two languages to achieve state-of-the-art performance on the bilingual lexicon induction task, exceeding models that use large bilingual corpora, and competitive results on monolingual word similarity and cross-lingual document classification tasks.
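One plausible reading of the EM-style loop is: given the current embeddings, each occurrence of a word is assigned the dictionary translation whose vector best fits the surrounding context (E-step), and the embeddings are then updated in light of those assignments (M-step). The schematic sketch below illustrates that reading with toy data; it is an assumption about the procedure, not the paper's exact algorithm.

```python
# Schematic EM-style loop using a bilingual dictionary over monolingual text.
# Illustrative reading of the idea with toy placeholders, not the exact procedure.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
vocab = ["bank", "river", "money", "ufer", "geld", "fluss"]
emb = {w: rng.normal(size=dim) for w in vocab}
# High-coverage dictionary: each source word lists its candidate translations.
dictionary = {"bank": ["ufer", "geld"], "river": ["fluss"], "money": ["geld"]}
corpus = [["money", "bank"], ["river", "bank"]]   # toy monolingual sentences

for _ in range(5):                                # a few EM iterations
    # E-step: pick, for each occurrence, the translation closest to its context.
    assignments = []
    for sent in corpus:
        for i, word in enumerate(sent):
            if word not in dictionary:
                continue
            context = np.mean([emb[w] for j, w in enumerate(sent) if j != i], axis=0)
            best = max(dictionary[word], key=lambda t: emb[t] @ context)
            assignments.append((word, best))
    # M-step: nudge each word's vector toward its chosen translation's vector.
    for word, trans in assignments:
        emb[word] += 0.1 * (emb[trans] - emb[word])

print(assignments)   # final translation choices, e.g. [('money', 'geld'), ...]
```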