Improving Lemmatization of Non-Standard Languages with Joint Learning (2019-03-16)

TL;DR

This paper treats lemmatization as a string-transduction task. An Encoder-Decoder architecture is enriched with sentence information from a hierarchical sentence encoder, and fine-tuning the sentence encodings to jointly optimize a bidirectional language model loss yields significant improvements over the state of the art.

Abstract

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (iii): spelling variation due to the lack of orthographic standards. We approach lemmatization as a string-transduction task with an Encoder-Decoder architecture, which we enrich with sentence information using a hierarchical sentence encoder. We show significant improvements over the state of the art by fine-tuning the sentence encodings to jointly optimize a bidirectional language model loss. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. We also test the proposed model on a set of typologically diverse standard languages, showing results on par with or better than both a model without fine-tuned sentence representations and previous state-of-the-art systems. Finally, to encourage future work on processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, which is based on openly accessible sources.
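The setup described in the abstract can be illustrated with a small data-preparation sketch. It is not the authors' implementation: the boundary symbols, the function names, and the `lm_weight` mixing coefficient are all assumptions made for illustration. It shows how a token-lemma pair might become a character-level source/target pair, how previous-word and next-word targets for a bidirectional language model could be derived from the sentence, and how the two losses could be combined into one joint objective.

```python
from typing import List, Tuple

# Hypothetical boundary symbols; the paper does not specify these names.
BOS, EOS = "<bos>", "<eos>"


def transduction_pair(token: str, lemma: str) -> Tuple[List[str], List[str]]:
    """Build a character-level (source, target) pair for one token.

    The encoder would read the source characters and the decoder
    would be trained to emit the target characters.
    """
    src = [BOS, *token, EOS]
    trg = [BOS, *lemma, EOS]
    return src, trg


def bilm_targets(sentence: List[str]) -> List[Tuple[str, str, str]]:
    """For each token return (token, previous word, next word).

    In a bidirectional language model, the forward direction is
    trained to predict the next word and the backward direction
    the previous word.
    """
    padded = [BOS, *sentence, EOS]
    return [
        (padded[i], padded[i - 1], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def joint_loss(lemma_loss: float, fwd_lm_loss: float,
               bwd_lm_loss: float, lm_weight: float = 0.2) -> float:
    """Combine the lemmatization loss with the averaged LM losses.

    `lm_weight` is an assumed hyperparameter, not a value from the paper.
    """
    return lemma_loss + lm_weight * 0.5 * (fwd_lm_loss + bwd_lm_loss)


if __name__ == "__main__":
    print(transduction_pair("vrouwen", "vrouw"))
    print(bilm_targets(["die", "vrouwen", "quamen"]))
    print(joint_loss(1.0, 2.0, 4.0, lm_weight=0.5))
```

Note that nothing here needs POS or morphological labels: the auxiliary signal comes only from neighbouring words in the sentence, which matches the abstract's point that such annotations are not required.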

Authors

Ákos Kádár

M. Kestemont

Enrique Manjavacas

Table 7 shows the languages from the UD corpus that were sampled for the study

“Gys” refers to the Gysseling corpus, which consists of several subsets

Type 2. Uralic and Altaic languages, which are characterized by agglutinative morphology and a tendency towards monoexponential case and vowel harmony

Language    Dataset          Code
Arabic      Arabic-PDAT      ar
Bulgarian   Bulgarian-BTB    bg

ambiguity in the corpus, providing evidence for hypothesis (i)
