3260 papers • 126 benchmarks • 313 datasets
Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Especially for languages with rich morphology, it is important to be able to normalize words into their base forms to better support, for example, search engines and linguistic studies. The main difficulties in lemmatization arise from encountering previously unseen words at inference time, and from disambiguating ambiguous surface forms that can be inflected variants of several different base forms depending on the context. Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
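The two difficulties above, disambiguating ambiguous surface forms and handling unseen words, can be made concrete with a minimal sketch. The lexicon and suffix rules below are hypothetical toy examples, not a real linguistic resource:

```python
# Toy dictionary-based lemmatizer: a lexicon maps surface forms to
# per-POS lemmas, with suffix-stripping as a fallback for unseen words.

# Surface form -> {POS tag: lemma}; "left" is ambiguous between
# the verb "leave" and the adjective "left".
LEXICON = {
    "left": {"VERB": "leave", "ADJ": "left"},
    "saw":  {"VERB": "see", "NOUN": "saw"},
    "mice": {"NOUN": "mouse"},
}

# Fallback suffix-stripping rules for previously unseen words.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def lemmatize(form: str, pos: str) -> str:
    """Return the lemma for (form, pos), falling back to suffix rules."""
    entry = LEXICON.get(form.lower())
    if entry:
        # Disambiguate with the POS tag; fall back to any listed lemma.
        return entry.get(pos) or next(iter(entry.values()))
    for suffix, replacement in SUFFIX_RULES:
        if form.lower().endswith(suffix):
            return form.lower()[: -len(suffix)] + replacement
    return form.lower()

print(lemmatize("left", "VERB"))     # -> leave
print(lemmatize("left", "ADJ"))      # -> left
print(lemmatize("stories", "NOUN"))  # -> story (unseen word, suffix rule)
```

Real systems replace the hand-written lexicon and rules with learned models, but the design question is the same: context (here, a POS tag) resolves ambiguity, and a generalizing fallback covers out-of-vocabulary forms.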
(Image credit: Papersgraph)
These leaderboards are used to track progress in Lemmatization
No benchmarks available.
Use these libraries to find Lemmatization models and implementations
No subtasks available.
This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
This paper approaches lemmatization as a string-transduction task with an Encoder-Decoder architecture enriched with sentence information via a hierarchical sentence encoder, and shows significant improvements over the state of the art by fine-tuning the sentence encodings to jointly optimize a bidirectional language model loss.
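The string-transduction view can be illustrated apart from any neural architecture: lemmatization pairs often reduce to suffix edit rules of the kind character-level models learn implicitly. A toy sketch (not the paper's method) that extracts and applies such a rule:

```python
import os

def suffix_rule(form: str, lemma: str) -> tuple:
    """Derive a suffix-replacement rule (chars_to_cut, string_to_append)
    from one (form, lemma) training pair, via their longest common prefix."""
    prefix = os.path.commonprefix([form, lemma])
    return (len(form) - len(prefix), lemma[len(prefix):])

def apply_rule(form: str, rule: tuple) -> str:
    """Apply a suffix-replacement rule to a new surface form."""
    cut, append = rule
    return (form[:-cut] if cut else form) + append

rule = suffix_rule("walking", "walk")  # -> (3, "")
print(apply_rule("talking", rule))     # -> talk
print(apply_rule("lice", suffix_rule("mice", "mouse")))  # -> louse
```

A rule table indexed by word endings is the classical baseline; the Encoder-Decoder approach instead generates the lemma character by character, which lets it handle transformations that no fixed suffix rule covers.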
This model requires no stop-word lists, stemming, or lemmatization; it automatically finds the number of topics, and the resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity.
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy, and compares them to existing NLP tools for Hungarian.
This work uses a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text, evaluates the models on translation search, semantic similarity, and semantic retrieval tasks, and investigates translation bias.
The results show that Ultra-stemming not only preserves the content of the summaries produced by this representation, but can also often dramatically improve system performance.
This paper explores the possibility of elegantly solving two sequence tagging tasks for medieval Latin with a single, integrated approach, using a layered neural network architecture from the field of deep representation learning.
Adding a benchmark result helps the community track progress.