3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in Open-Domain Dialog
Use these libraries to find Open-Domain Dialog models and implementations
A shared dense vector index coupled with a seq2seq model is found to be a strong baseline: it outperforms more tailor-made approaches for fact checking, open-domain question answering, and dialogue, and, by generating disambiguated text, yields competitive results on entity linking and slot filling.
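Below is a minimal sketch of the retrieve-then-generate pattern this result describes, over a toy passage index; the `embed` function and the passages are illustrative stubs, not the paper's trained retriever or seq2seq model.

```python
import numpy as np

# Toy dense index: each knowledge passage is embedded into a shared vector
# space. A real system would use a trained bi-encoder; the `embed` stub
# below only stands in for it.
PASSAGES = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the highest mountain on Earth.",
    "Python was created by Guido van Rossum.",
]

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function (random but deterministic per text)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

index = np.stack([embed(p) for p in PASSAGES])  # shared dense vector index

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = index @ embed(query)               # cosine similarity (unit vectors)
    return [PASSAGES[i] for i in np.argsort(-scores)[:k]]

def answer(query: str) -> str:
    # A seq2seq model would condition on the query plus retrieved passages;
    # here we only show the conditioning input it would receive.
    context = " ".join(retrieve(query))
    return f"[seq2seq input] question: {query} context: {context}"

print(answer("Where is the Eiffel Tower?"))
```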
It is shown that this metric captures the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r > .7, p < .05).
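A metric validated this way is checked by correlating its scores with human ratings. A minimal sketch using SciPy's `pearsonr`; the score and rating values below are made up for illustration.

```python
from scipy.stats import pearsonr

# Hypothetical scores: automatic metric vs. human quality ratings for the
# same set of dialog models (values are illustrative only).
metric_scores = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59]
human_ratings = [3.8, 3.1, 4.2, 2.9, 4.0, 3.4]

r, p = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
# A validation like the paper's would require r > .7 with p < .05 before
# trusting the metric as a proxy for human judgement.
```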
A series of experiments shows that using multiple references improves the correlation between several automatic metrics and human judgement, for both the quality and the diversity of system output.
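The usual way to exploit multiple references is to score a hypothesis against each reference and keep the best match. A small sketch with a unigram-overlap F1 standing in for BLEU/METEOR-style metrics; the metric choice and example strings are assumptions.

```python
def overlap_f1(hyp: str, ref: str) -> float:
    """Simple unigram-overlap F1; stands in for BLEU/METEOR-style metrics."""
    h, r = hyp.lower().split(), ref.lower().split()
    common = len(set(h) & set(r))
    if common == 0:
        return 0.0
    prec, rec = common / len(set(h)), common / len(set(r))
    return 2 * prec * rec / (prec + rec)

def multi_ref_score(hyp: str, refs: list[str]) -> float:
    # With several valid replies per context, credit the hypothesis for
    # matching *any* of them rather than one arbitrary ground truth.
    return max(overlap_f1(hyp, r) for r in refs)

refs = ["i am doing well thanks", "pretty good , and you ?", "not bad at all"]
print(multi_ref_score("i am doing good thanks", refs))
```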
The FED metric (fine-grained evaluation of dialog) is introduced: an automatic evaluation metric that uses DialoGPT without any fine-tuning or supervision and attains moderate to strong correlation with human judgement at both the turn and dialog levels.
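A rough sketch of the underlying idea: score a response by how plausible a pretrained DialoGPT finds positive versus negative follow-up utterances. The checkpoint name is the public DialoGPT release, while the follow-up phrases and scoring details are simplified assumptions rather than FED's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small").eval()

def avg_log_likelihood(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # mean per-token log-likelihood

context = "i love hiking in the mountains" + tok.eos_token
response = "me too , which trails do you like ?" + tok.eos_token

# Illustrative follow-ups, not FED's exact prompt set.
pos = avg_log_likelihood(context + response + "That's really interesting!")
neg = avg_log_likelihood(context + response + "That makes no sense at all.")
print("relevance score (higher is better):", pos - neg)
```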
This work trains DialogRPT, a set of GPT-2-based models, on 133M pairs of human feedback data; the resulting ranker outperforms several baselines, beating the conventional dialog perplexity baseline by a large margin on predicting Reddit feedback.
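A short sketch of ranking candidate replies with such a feedback-trained ranker; `microsoft/DialogRPT-updown` is a public DialogRPT checkpoint, and the input format (context and reply joined by the EOS token) follows its model card, though details may differ across variants.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "microsoft/DialogRPT-updown"
tok = AutoTokenizer.from_pretrained(name)
ranker = AutoModelForSequenceClassification.from_pretrained(name).eval()

def score(context: str, reply: str) -> float:
    # Context and candidate reply are joined by the GPT-2 EOS token.
    ids = tok.encode(context + "<|endoftext|>" + reply, return_tensors="pt")
    with torch.no_grad():
        return torch.sigmoid(ranker(ids).logits).item()

context = "I just adopted a puppy!"
candidates = ["Congrats! What breed?", "ok", "That is amazing, photos please!"]
for reply in sorted(candidates, key=lambda r: -score(context, r)):
    print(f"{score(context, reply):.3f}  {reply}")
```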
RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, is proposed; it evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance).
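A minimal sketch of RUBER-style blending, assuming a stub `embed` in place of pooled word embeddings and a stub in place of RUBER's trained query-reply matching network; only the blending structure is meant to be faithful.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stub for pooled word embeddings (random but deterministic per text)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(50)
    return v / np.linalg.norm(v)

def referenced_score(reply: str, ground_truth: str) -> float:
    # Cosine similarity between the reply and the ground-truth reply.
    return float(embed(reply) @ embed(ground_truth))

def unreferenced_score(query: str, reply: str) -> float:
    # Stub for RUBER's learned query-reply matching network.
    return float(embed(query) @ embed(reply))

def ruber(query: str, reply: str, ground_truth: str) -> float:
    # Blend the two signals; simple pooling heuristics such as min, max,
    # or the mean are possible choices (the mean is used here).
    return 0.5 * (referenced_score(reply, ground_truth)
                  + unreferenced_score(query, reply))

print(ruber("how was your weekend ?",
            "pretty relaxing , i went hiking",
            "it was great , thanks"))
```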
The task formulation is shown to raise fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress, and a new system is designed that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset.
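Contrastive retriever learning is typically an InfoNCE-style objective over in-batch negatives; below is a minimal sketch of that loss, assuming paired question/passage embeddings (encoders omitted), not the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def contrastive_retriever_loss(q_emb: torch.Tensor,
                               p_emb: torch.Tensor,
                               temperature: float = 0.05) -> torch.Tensor:
    # q_emb, p_emb: (batch, dim); row i of p_emb is the gold passage for
    # row i of q_emb, and all other rows serve as in-batch negatives.
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0))    # diagonal entries are positives
    return F.cross_entropy(logits, targets)

loss = contrastive_retriever_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```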
This work introduces a Topical Hierarchical Recurrent Encoder Decoder (THRED), a novel, fully data-driven, multi-turn response generation system intended to produce contextual and topic-aware responses.
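A compact sketch of a topic-aware hierarchical encoder-decoder in the spirit of THRED, with simplified dimensions and topic conditioning (the paper additionally uses attention); names and shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TopicalHRED(nn.Module):
    def __init__(self, vocab: int, dim: int = 128, topic_dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.utt_enc = nn.GRU(dim, dim, batch_first=True)  # word level
        self.ctx_enc = nn.GRU(dim, dim, batch_first=True)  # utterance level
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.topic_proj = nn.Linear(topic_dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, turns, topic, reply_in):
        # turns: (batch, n_turns, seq_len); topic: (batch, topic_dim)
        b, n, t = turns.shape
        _, h = self.utt_enc(self.emb(turns.view(b * n, t)))
        utt_vecs = h[-1].view(b, n, -1)           # one vector per utterance
        _, ctx = self.ctx_enc(utt_vecs)           # dialog-level context state
        init = ctx + self.topic_proj(topic).unsqueeze(0)  # topic-aware init
        dec_out, _ = self.dec(self.emb(reply_in), init)
        return self.out(dec_out)                  # next-token logits

logits = TopicalHRED(vocab=1000)(torch.randint(0, 1000, (2, 3, 7)),
                                 torch.randn(2, 32),
                                 torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # (2, 5, 1000)
```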
This paper presents interpretable metrics for evaluating topic coherence using distributed sentence representations, and introduces calculable approximations of human judgments of conversational coherence by adopting state-of-the-art entailment techniques.
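A sketch of entailment-based coherence scoring: treat the dialog context as the NLI premise and the response as the hypothesis, then read off the entailment probability. `roberta-large-mnli` is a public NLI checkpoint used purely for illustration; it is not the model from the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name).eval()

def coherence(context: str, response: str) -> float:
    # Premise = dialog context, hypothesis = candidate response.
    ids = tok(context, response, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(nli(**ids).logits, dim=-1)
    return probs[0, nli.config.label2id["ENTAILMENT"]].item()

print(coherence("We went camping last weekend.", "Sleeping outdoors was fun."))
```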
This work develops a novel class of off-policy batch RL algorithms that can learn effectively offline, without exploring, from a fixed batch of human interaction data; it uses models pre-trained on the data as a strong prior and applies KL-control to penalize divergence from this prior during RL training.
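A minimal sketch of the KL-control idea: subtract a weighted KL estimate between the policy and the pre-trained prior from the task reward. The tensors and the weighting below are illustrative placeholders, not the paper's algorithm.

```python
import torch

def kl_controlled_reward(task_reward: torch.Tensor,
                         policy_logprobs: torch.Tensor,
                         prior_logprobs: torch.Tensor,
                         kl_weight: float = 0.1) -> torch.Tensor:
    # Per-step KL estimate on the sampled actions:
    # KL(pi || prior) ~= E_pi[log pi(a|s) - log prior(a|s)].
    kl = policy_logprobs - prior_logprobs
    return task_reward - kl_weight * kl.sum(dim=-1)

# Toy usage: batch of 2 episodes, 5 action (token) steps each.
task_reward = torch.tensor([1.0, -0.5])
policy_lp = -torch.rand(2, 5)        # log pi(a_t | s_t) of sampled actions
prior_lp = -torch.rand(2, 5) - 0.5   # log prior(a_t | s_t), pre-trained model
print(kl_controlled_reward(task_reward, policy_lp, prior_lp))
```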