3260 papers • 126 benchmarks • 313 datasets
Lexical normalization is the task of translating/transforming non-standard text into a standard register.

Example: "new pix comming tomoroe" → "new pictures coming tomorrow"

Datasets usually consist of tweets, since these naturally contain a fair amount of such phenomena. For lexical normalization, only word-level replacements are annotated. Some corpora include annotations for 1-N and N-1 replacements; however, word insertion/deletion and reordering are not part of the task.
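The word-level replacement scheme described above can be sketched with a toy dictionary lookup. This is only a minimal illustration, assuming a hand-built replacement table; real normalization systems generate and rank candidates rather than relying on a fixed lookup.

```python
# Toy replacement table (assumed for illustration).
# Values are lists so that 1-N replacements (one token -> several) fit
# in the same scheme as ordinary 1-1 replacements.
REPLACEMENTS = {
    "pix": ["pictures"],       # 1-1 replacement
    "comming": ["coming"],
    "tomoroe": ["tomorrow"],
    "gonna": ["going", "to"],  # 1-N replacement
}

def normalize(tokens):
    """Replace each non-standard token; pass standard tokens through."""
    out = []
    for tok in tokens:
        out.extend(REPLACEMENTS.get(tok.lower(), [tok]))
    return out

print(normalize("new pix comming tomoroe".split()))
# → ['new', 'pictures', 'coming', 'tomorrow']
```

Note that word insertion, deletion, and reordering are deliberately absent: every output token is licensed by exactly one input token, matching the task definition.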
(Image credit: Papersgraph)
These leaderboards are used to track progress in lexical normalization
Use these libraries to find lexical normalization models and implementations
MoNoise is a normalization model focused on generalizability and efficiency. It aims to be easily reusable and adaptable, and is based on modular candidate generation in which each module is responsible for a different type of normalization action.
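The modular candidate-generation idea can be sketched as follows. This is a hedged toy in the spirit of MoNoise, not its actual implementation: the two modules, the toy lookup table, and the first-match selection heuristic (standing in for MoNoise's learned ranker) are all assumptions for illustration.

```python
import re

def module_lookup(tok):
    # Module 1: known abbreviation/slang lookup (toy table, assumed).
    table = {"u": "you", "2morrow": "tomorrow"}
    return [table[tok]] if tok in table else []

def module_repeats(tok):
    # Module 2: collapse character repetitions ("sooo" -> "so").
    collapsed = re.sub(r"(.)\1{2,}", r"\1", tok)
    return [collapsed] if collapsed != tok else []

# Each module proposes candidates for one type of normalization action;
# new actions are added by appending modules, which is what makes the
# design easy to extend.
MODULES = [module_lookup, module_repeats]

def normalize_token(tok):
    candidates = [c for m in MODULES for c in m(tok)]
    # First match stands in for a learned candidate ranker.
    return candidates[0] if candidates else tok

print([normalize_token(t) for t in "u are sooo late".split()])
# → ['you', 'are', 'so', 'late']
```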
Experimental results show that a separate normalization component improves the performance of a neural network parser, even when the parser has access to character-level information as well as external word embeddings.
It is argued that processing contextual information is crucial for this task, and a hybrid word-character attention-based encoder-decoder model for social media text normalization is introduced that can serve as a pre-processing step for NLP applications to adapt to noisy social media text.
This paper introduces and demonstrates the online demo as well as the command-line interface of a lexical normalization system (MoNoise) for a variety of languages, and shows how the model can be made more efficient with only a small loss in performance.
A labeled dataset called MultiSenti is presented and a deep learning-based model for sentiment classification of code-switched informal short text is proposed; the results show that the proposed model performs well in general, and that adopting character-based embeddings yields equivalent performance while being computationally more efficient than training word-based domain-specific embeddings.
This work proposes a novel multi-cascaded deep learning model called McM for bilingual SMS classification that achieves high classification accuracy on this dataset and outperforms the previous model for multilingual text classification, highlighting the language independence of McM.
A feature-based clustering framework for the lexical normalization of Roman Urdu corpora is presented; it includes a phonetic algorithm, UrduPhone, a string-matching component, a feature-based similarity function, and a clustering algorithm, Lex-Var, which uses a similarity threshold to balance the number of clusters against their maximum similarity.
This paper proposes three normalization models specifically designed to handle code-switched data, evaluated on two language pairs, Indonesian-English and Turkish-German, and introduces novel normalization layers with their corresponding language ID and POS tags for the dataset.
DAN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization, is introduced to support research on cross-lingual and cross-domain learning for a less-resourced language.
A publicly available Japanese UGT (user-generated text) corpus is constructed, comprising 929 sentences annotated with morphological and normalization information, along with category information the authors defined for frequent UGT-specific phenomena.