DaN+: Danish Nested Named Entities and Lexical Normalization (2020-12-01)

TL;DR

DAN+ is a new multi-domain corpus with annotation guidelines for Danish nested named entities (NEs) and lexical normalization, introduced to support research on cross-lingual cross-domain learning for a less-resourced language.

Abstract

This paper introduces DAN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language. We empirically assess three strategies to model the two-layer Named Entity Recognition (NER) task. We compare transfer capabilities from German versus in-language annotation from scratch. We examine language-specific versus multilingual BERT, and study the effect of lexical normalization on NER. Our results show that 1) the most robust strategy is multi-task learning, which is rivaled by multi-label decoding, 2) BERT-based NER models are sensitive to domain shifts, and 3) in-language BERT and lexical normalization are the most beneficial on the least canonical data. Our results also show that an out-of-domain setup remains challenging, while performance on news plateaus quickly. This highlights the importance of cross-domain evaluation of cross-lingual transfer.
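To make the two-layer task concrete, the following is a minimal sketch of how one nested entity could be presented to a tagger under different modeling strategies. It is an illustration only, not the authors' implementation: the function names, the `|` separator, and the merged-label variant are assumptions, and the example phrase is taken from the guideline excerpts quoted further down this page.

```python
from typing import List, Set, Tuple

# One nested entity, written in the guidelines as [[Denmarks]LOC Radio]ORG.
# Each layer is a standard BIO sequence over the same tokens.
tokens = ["Denmarks", "Radio"]
outer = ["B-ORG", "I-ORG"]  # outer layer: the organization
inner = ["B-LOC", "O"]      # inner layer: the location inside it


def as_multitask(outer: List[str], inner: List[str]) -> Tuple[List[str], List[str]]:
    """Keep the layers apart: one label sequence per layer (one decoder each)."""
    if len(outer) != len(inner):
        raise ValueError("layers must cover the same tokens")
    return list(outer), list(inner)


def as_merged(outer: List[str], inner: List[str], sep: str = "|") -> List[str]:
    """Hypothetical single-sequence variant: concatenate both layers per token."""
    if len(outer) != len(inner):
        raise ValueError("layers must cover the same tokens")
    return [f"{o}{sep}{i}" for o, i in zip(outer, inner)]


def as_multilabel(outer: List[str], inner: List[str]) -> List[Set[str]]:
    """Multi-label view: each token carries the set of its non-O tags."""
    if len(outer) != len(inner):
        raise ValueError("layers must cover the same tokens")
    return [{t for t in (o, i) if t != "O"} for o, i in zip(outer, inner)]


if __name__ == "__main__":
    print(as_multitask(outer, inner))
    print(as_merged(outer, inner))
    print(as_multilabel(outer, inner))
```

The merged variant enlarges the label space with every observed outer/inner combination, whereas the separate-layer and multi-label views keep the tag inventory small; which trade-off works best is what the paper evaluates empirically.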

Authors

Barbara Plank


Rob van der Goot


Kristian Nørgaard Jensen



References (48 items)

Biomedical Event Extraction as Sequence Labeling

Neural Unsupervised Domain Adaptation in NLP—A Survey

Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP

DaNE: A Named Entity Resource for Danish

A Focused Study to Compare Arabic Pre-training Models on Newswire IE Tasks

Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish

A Boundary-aware Neural Model for Nested Named Entity Recognition

MoNoise: A Multi-lingual and Easy-to-use Lexical Normalization Tool

Sequence-to-Nuggets: Nested Entity Mention Detection via Anchor-Region Networks

NNE: A Dataset for Nested Named Entity Recognition in English Newswire

A general framework for information extraction using dynamic span graphs

Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling

Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

Nested Named Entity Recognition Revisited

Word Translation Without Parallel Data

DeepNNNER: Applying BLSTM-CNNs and Extended Lexicons to Named Entity Recognition in Tweets

Text normalization for named entity recognition in Vietnamese tweets

Multimodular Text Normalization of Dutch User-Generated Content

Bag of Tricks for Efficient Text Classification

Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition

Improving Named Entity Recognition in Tweets via Detecting Non-Standard Words

NoSta-D Named Entity Annotation for German: Guidelines and Dataset

Experiments to Improve Named Entity Recognition on Turkish Tweets

What to do about bad language on the internet

Nested Named Entity Recognition

GENIA corpus - a semantically annotated corpus for bio-textmining

Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

Research in Information Extraction: 1996-98

The Construction of a Tagged Danish Corpus

Message Understanding Conference-6: A Brief History

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Neural approaches to sequence labeling for information extraction

Deep Exhaustive Model for Nested Named Entity Recognition

Data Statement Following (Bender and Friedman, 2018), the following outlines the data statement for DAN+

Universal Dependencies for Danish

Guideline for English lexical normalisation shared task

Named Entity Recognition from Tweets

Ace 2004 multilingual training corpus

Danish dependency treebank

SITUATION Both standard and colloquial Danish, i.e., edited and spontaneous speech. Time frame of data between

There are two nominal phrases. Only one of them is a named entity (Leila), the second nominal is a common noun

Step 6: Parts Named entities can also be parts of tokens and are annotated as such with the suffix

Multi-word tokens NEs often consist of multiple tokens. Examples: • person names

LOCderiv gader • Person adjectives: [Freudiansk]PERderiv litteratur • BUT genitive forms: [[Denmarks]LOC Radio]ORG, [Københavns]LOC kommune

• Derivations of NEs are marked as such by appending deriv, e.g., den [danske]LOCderiv midtbanespiller Examples: • Location

Medium-specific potential NEs Named entities can also be parts of special medium-specific tokens, like user names and hashtags in Twitter. We do annotate them as such

ORG (organization), PER (person) or MISC (miscellaneous other)

• BUT when the location acts as an organized entity (e.g. country, municipality, sports club), it is tagged as ORG with LOC as inner layer
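The bracket notation in the guideline excerpts above (`[text]LABEL`, optionally nested, with `deriv` appended for derivations) can be read mechanically. The sketch below is one possible reader, written for illustration under assumptions: the `Span` class, the `parse_brackets` function, and the whitespace tokenization are hypothetical and not part of the DAN+ release, and only the `deriv` suffix named in the excerpts is handled.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    start: int     # index of the first token in the span
    end: int       # index one past the last token
    label: str     # base type, e.g. LOC, ORG, PER, MISC
    derived: bool  # True if the annotation carried the "deriv" suffix


# An opening bracket, a closing bracket followed by its label, or a plain word.
_PIECE = re.compile(r"\[|\]([A-Za-z]+)|[^\s\[\]]+")


def parse_brackets(text: str) -> Tuple[List[str], List[Span]]:
    """Return the plain tokens and all (possibly nested) entity spans."""
    words: List[str] = []
    open_starts: List[int] = []  # stack of token indices where spans opened
    spans: List[Span] = []
    for match in _PIECE.finditer(text):
        piece = match.group(0)
        if piece == "[":
            open_starts.append(len(words))
        elif piece.startswith("]"):
            if not open_starts:
                raise ValueError("closing bracket without an opening one")
            label = match.group(1)
            derived = label.endswith("deriv")
            if derived:
                label = label[: -len("deriv")]
            spans.append(Span(open_starts.pop(), len(words), label, derived))
        else:
            words.append(piece)
    if open_starts:
        raise ValueError("opening bracket without a closing one")
    return words, spans


if __name__ == "__main__":
    print(parse_brackets("[[Denmarks]LOC Radio]ORG"))
    print(parse_brackets("den [danske]LOCderiv midtbanespiller"))
```

Inner spans are emitted before the spans that contain them, because a span is recorded when its closing bracket is reached.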
