3260 papers • 126 benchmarks • 313 datasets
Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form. Especially for languages with rich morphology, it is important to be able to normalize words into their base forms to better support, for example, search engines and linguistic studies. The main difficulties in lemmatization arise from encountering previously unseen words at inference time, and from disambiguating ambiguous surface forms that can be inflected variants of several different base forms depending on the context. Source: Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
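The two difficulties above, disambiguating ambiguous surface forms and handling unseen words, can be made concrete with a minimal sketch. The lexicon and suffix rules below are hypothetical toy examples, not a real linguistic resource:

```python
# Toy dictionary-based lemmatizer: a lexicon maps surface forms to
# per-POS lemmas, with suffix-stripping as a fallback for unseen words.

# Surface form -> {POS tag: lemma}; "left" is ambiguous between
# the verb "leave" and the adjective "left".
LEXICON = {
    "left": {"VERB": "leave", "ADJ": "left"},
    "saw":  {"VERB": "see", "NOUN": "saw"},
    "mice": {"NOUN": "mouse"},
}

# Fallback suffix-stripping rules for previously unseen words.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def lemmatize(form: str, pos: str) -> str:
    """Return the lemma for (form, pos), falling back to suffix rules."""
    entry = LEXICON.get(form.lower())
    if entry:
        # Disambiguate with the POS tag; fall back to any listed lemma.
        return entry.get(pos) or next(iter(entry.values()))
    for suffix, replacement in SUFFIX_RULES:
        if form.lower().endswith(suffix):
            return form.lower()[: -len(suffix)] + replacement
    return form.lower()

print(lemmatize("left", "VERB"))     # -> leave
print(lemmatize("left", "ADJ"))      # -> left
print(lemmatize("stories", "NOUN"))  # -> story (unseen word, suffix rule)
```

Real systems replace the hand-written lexicon and rules with learned models, but the design question is the same: context (here, a POS tag) resolves ambiguity, and a generalizing fallback covers out-of-vocabulary forms.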
(Image credit: Papersgraph)
These leaderboards are used to track progress in Lemmatization
No benchmarks available.
Use these libraries to find Lemmatization models and implementations
No subtasks available.
This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
This paper approaches lemmatization as a string-transduction task with an Encoder-Decoder architecture enriched with sentence information via a hierarchical sentence encoder, and shows significant improvements over the state of the art by fine-tuning the sentence encodings to jointly optimize a bidirectional language model loss.
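The string-transduction view can be illustrated apart from any neural architecture: lemmatization pairs often reduce to suffix edit rules of the kind character-level models learn implicitly. A toy sketch (not the paper's method) that extracts and applies such a rule:

```python
import os

def suffix_rule(form: str, lemma: str) -> tuple:
    """Derive a suffix-replacement rule (chars_to_cut, string_to_append)
    from one (form, lemma) training pair, via their longest common prefix."""
    prefix = os.path.commonprefix([form, lemma])
    return (len(form) - len(prefix), lemma[len(prefix):])

def apply_rule(form: str, rule: tuple) -> str:
    """Apply a suffix-replacement rule to a new surface form."""
    cut, append = rule
    return (form[:-cut] if cut else form) + append

rule = suffix_rule("walking", "walk")  # -> (3, "")
print(apply_rule("talking", rule))     # -> talk
print(apply_rule("lice", suffix_rule("mice", "mouse")))  # -> louse
```

A rule table indexed by word endings is the classical baseline; the Encoder-Decoder approach instead generates the lemma character by character, which lets it handle transformations that no fixed suffix rule covers.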
This model requires no stop-word lists, stemming, or lemmatization; it automatically finds the number of topics, and the resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity.
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy, and compares them to existing NLP tools for Hungarian.
This work uses a multilingual knowledge distillation approach to train BERT models to produce sentence embeddings for Ancient Greek text, evaluates the models on translation search, semantic similarity, and semantic retrieval tasks, and investigates translation bias.
The results show that Ultra-stemming not only preserves the content of the summaries produced by this representation, but can also often dramatically improve system performance.
This paper explores the possibility of elegantly solving two sequence tagging tasks for medieval Latin with a single, integrated approach, using a layered neural network architecture from the field of deep representation learning.
Adding a benchmark result helps the community track progress.