ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations (2019-11-02)

TL;DR

ZEN is a BERT-based Chinese text encoder enhanced by n-gram representations. Different combinations of characters are considered during training, so that potential word or phrase boundaries are explicitly pre-trained and fine-tuned together with the character encoder (BERT).

Abstract

The pre-training of text encoders normally processes text as a sequence of tokens corresponding to small text units, such as word pieces in English and characters in Chinese. This omits the information carried by larger text granularity, so the encoders cannot easily adapt to certain combinations of characters. The result is a loss of important semantic information, which is especially problematic for Chinese because the language does not have explicit word boundaries. In this paper, we propose ZEN, a BERT-based Chinese text encoder enhanced by n-gram representations, where different combinations of characters are considered during training, so that potential word or phrase boundaries are explicitly pre-trained and fine-tuned with the character encoder (BERT). ZEN therefore incorporates comprehensive information from both the character sequence and the words or phrases it contains. Experimental results illustrate the effectiveness of ZEN on a series of Chinese NLP tasks, where state-of-the-art results are achieved on most tasks while requiring fewer resources than other published encoders. It is also shown that reasonable performance is obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data. The code and pre-trained models of ZEN are available at https://github.com/sinovation/ZEN.
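To make the idea of "different combinations of characters" concrete, here is a minimal Python sketch of how character n-grams might be enumerated from a sentence and aligned with the characters they cover. It is an illustration only, not the released ZEN code: the abstract does not say how n-grams are selected or combined with BERT, so the lexicon, the function names `extract_ngrams` and `matching_matrix`, and the `max_n` limit are all assumptions made for this example.

```python
from typing import List, Set, Tuple


def extract_ngrams(text: str, lexicon: Set[str], max_n: int = 3) -> List[Tuple[str, int]]:
    """Return (ngram, start_index) for each lexicon entry of length 2..max_n found in text.

    The lexicon is a hypothetical, pre-built set of n-grams; how such a set
    is obtained is outside the scope of this sketch.
    """
    found = []
    for start in range(len(text)):
        for n in range(2, max_n + 1):
            candidate = text[start:start + n]
            # Skip truncated slices at the end of the text.
            if len(candidate) == n and candidate in lexicon:
                found.append((candidate, start))
    return found


def matching_matrix(text: str, ngrams: List[Tuple[str, int]]) -> List[List[int]]:
    """Build a 0/1 matrix with one row per character and one column per n-gram.

    Entry [i][j] is 1 when character i lies inside n-gram j. This is
    illustrative bookkeeping for relating n-grams to characters.
    """
    matrix = [[0] * len(ngrams) for _ in text]
    for j, (ngram, start) in enumerate(ngrams):
        for i in range(start, start + len(ngram)):
            matrix[i][j] = 1
    return matrix


if __name__ == "__main__":
    sentence = "粤港澳大湾区"
    toy_lexicon = {"粤港澳", "港澳", "大湾区", "湾区"}  # hypothetical lexicon
    ngrams = extract_ngrams(sentence, toy_lexicon)
    for char, row in zip(sentence, matching_matrix(sentence, ngrams)):
        print(char, row)
```

Overlapping n-grams such as 港澳 and 粤港澳 are both kept here, which mirrors the point that several candidate word or phrase boundaries can be considered at once instead of committing to a single segmentation.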

Authors

Jiaxin Bai

Shizhe Diao

Yan Song
