Recent advances in neural dialogue generation models have shown promising results on modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to access. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All the models and datasets are available at this https URL.
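To make the two-stage cleaning idea concrete, the following is a minimal Python sketch of a pipeline that first applies cheap rule filters and then gates dialogues on a learned quality score. The specific rules, thresholds, and the `quality_score` interface are illustrative assumptions, not the paper's exact implementation; the real pipeline uses a curated rule set and a classifier trained on the 110K annotated pairs.

```python
# Illustrative sketch of a rules-then-classifier cleaning pipeline.
# The rule set, blacklist, and quality_score callable below are
# hypothetical stand-ins for the components described in the paper.
import re
from typing import Callable, List

BLACKLIST = {"spam_word"}  # placeholder; a real pipeline uses a curated word list


def passes_rules(utterance: str) -> bool:
    """Cheap rule filters applied before the classifier."""
    if not (2 <= len(utterance) <= 100):        # discard too-short/too-long turns
        return False
    if any(w in utterance for w in BLACKLIST):  # blacklist-word filter
        return False
    if re.search(r"(.)\1{5,}", utterance):      # excessive character repetition
        return False
    return True


def clean_dialogues(dialogues: List[List[str]],
                    quality_score: Callable[[List[str]], float],
                    threshold: float = 0.5) -> List[List[str]]:
    """Keep dialogues whose turns pass all rules and whose classifier
    score (e.g., from a model trained on labeled dialogue pairs)
    clears the threshold."""
    kept = []
    for dialogue in dialogues:
        if (all(passes_rules(turn) for turn in dialogue)
                and quality_score(dialogue) >= threshold):
            kept.append(dialogue)
    return kept


if __name__ == "__main__":
    data = [["你好", "你好，最近怎么样？"], ["aaaaaaaaaa", "spam_word"]]
    # Trivial stand-in scorer that accepts everything the rules let through.
    print(clean_dialogues(data, quality_score=lambda d: 1.0))
```

In this sketch the rules run first because they are orders of magnitude cheaper than classifier inference, so the classifier only scores dialogues that survive the coarse filters.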
Yinhe Zheng, Pei Ke, Yida Wang, Yong Jiang