These leaderboards are used to track progress in Dialogue Evaluation.
This work applies adversarial training to open-domain dialogue generation, training a system to produce sequences that are indistinguishable from human-generated dialogue utterances, and investigates models for adversarial evaluation that use success in fooling an adversary as a dialogue evaluation metric while avoiding a number of potential pitfalls.
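A minimal sketch of the adversarial-evaluation idea described above: train a discriminator to separate human from model responses and read its accuracy as a quality signal. A TF-IDF plus logistic-regression classifier stands in for the paper's neural discriminator, and the example responses are illustrative, not taken from the paper.

```python
# Sketch of adversarial evaluation: a discriminator is trained to tell human
# responses from model responses; the harder it is to tell them apart, the
# better the generator. TF-IDF + logistic regression is a stand-in here for
# the neural discriminator used in the paper; the data is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

human_responses = ["i had a great time , thanks for asking .",
                   "not much , just got back from work ."]
model_responses = ["i do not know what you mean .",
                   "i am not sure i am not sure ."]

texts = human_responses + model_responses
labels = [1] * len(human_responses) + [0] * len(model_responses)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels)

vectorizer = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(X_train), y_train)

# Accuracy near 0.5 means the generator fools the adversary; accuracy near
# 1.0 means model responses are easy to spot.
acc = accuracy_score(y_test, clf.predict(vectorizer.transform(X_test)))
print(f"discriminator accuracy: {acc:.2f}")
```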
It is shown that this metric captures the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r > 0.7, p < 0.05).
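As a concrete illustration of the correlation analysis these results rest on, the snippet below computes a Pearson correlation between per-system metric scores and human ratings with scipy; all numbers are made up for illustration.

```python
# Hypothetical example: Pearson correlation between an automatic metric's
# per-system scores and mean human quality ratings for the same systems.
from scipy.stats import pearsonr

metric_scores = [0.42, 0.55, 0.61, 0.70, 0.78]   # automatic metric, per system (illustrative)
human_ratings = [2.9, 3.4, 3.3, 4.1, 4.5]        # mean human rating, per system (illustrative)

r, p = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")  # r > 0.7 with small p would mirror the reported result
```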
A series of experiments shows that using multiple references improves the correlation between several automatic metrics and human judgement for both the quality and the diversity of system output.
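Word-overlap metrics already accept multiple references, which is the setup the multi-reference experiments above build on. The sketch below uses NLTK's sentence-level BLEU with one versus three references; the sentences are illustrative.

```python
# Sketch: sentence-level BLEU with a single reference vs. multiple references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "i am doing well , thanks".split()
single_ref = ["i am fine , thank you".split()]
multi_refs = single_ref + ["i am doing well , thank you".split(),
                           "pretty good , thanks for asking".split()]

smooth = SmoothingFunction().method1
print("1 reference :", sentence_bleu(single_ref, hypothesis, smoothing_function=smooth))
print("3 references:", sentence_bleu(multi_refs, hypothesis, smoothing_function=smooth))
```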
The FED metric (fine-grained evaluation of dialog), an automatic evaluation metric that uses DialoGPT without any fine-tuning or supervision, is introduced; it attains moderate to strong correlation with human judgement at both the turn and dialogue levels.
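A rough sketch of the underlying idea: an off-the-shelf DialoGPT scores hand-written positive and negative follow-up utterances appended to the dialogue context, and the likelihood margin serves as a quality signal. The follow-up utterances and the loss-based scoring below are simplifications for illustration, not FED's actual follow-up set or scoring details.

```python
# Rough sketch of the FED-style idea: compare DialoGPT's likelihood of a
# positive vs. a negative follow-up utterance given the dialogue context.
# The follow-ups are illustrative and the mean-loss scoring is a crude proxy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
model.eval()

def follow_up_loss(context: str, follow_up: str) -> float:
    """Mean negative log-likelihood of context + follow-up (approximation)."""
    ids = tokenizer.encode(context + tokenizer.eos_token + follow_up,
                           return_tensors="pt")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return loss.item()

context = "Hi! How was your weekend?" + tokenizer.eos_token + "It was great, we went hiking."
positive = "Wow, that sounds really interesting!"
negative = "That makes no sense at all."

# A larger margin (negative follow-up less likely than positive) suggests a
# better response according to this proxy.
margin = follow_up_loss(context, negative) - follow_up_loss(context, positive)
print(f"quality proxy (higher is better): {margin:.3f}")
```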
It is demonstrated that human annotators have high agreement on assessing utterance-level engagement scores and that these scores can improve automatic evaluation metrics for open-domain dialogue systems, as shown by correlation with human judgements.
This paper describes the datasets and baselines provided to participants, the submission evaluation results for each of the two proposed subtasks, and the automatic evaluation mechanisms that show high correlations with human judgements across multiple dialogue evaluation aspects.
RUBER is proposed, a Referenced metric and Unreferenced metric Blended Evaluation Routine, which evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance).
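A minimal sketch of the blending step described above: a referenced score (similarity between the reply and the ground-truth reply) is combined with an unreferenced score (relatedness of the reply to the query, produced by a trained model). The embed() and unreferenced_score() functions below are placeholders, and the averaging heuristic is one of several blending strategies the paper explores.

```python
# Sketch of the RUBER blending step with placeholder components.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder sentence embedding (RUBER pools word embeddings)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(300)

def referenced_score(reply: str, ground_truth: str) -> float:
    """Cosine similarity between the reply and the ground-truth reply."""
    a, b = embed(reply), embed(ground_truth)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def unreferenced_score(query: str, reply: str) -> float:
    """Placeholder for RUBER's learned query-reply relatedness scorer."""
    return 0.6  # a trained neural scorer would go here

def ruber(query: str, reply: str, ground_truth: str) -> float:
    # RUBER studies several blending heuristics (min, max, mean, ...);
    # the arithmetic mean is used here for simplicity.
    return 0.5 * (referenced_score(reply, ground_truth)
                  + unreferenced_score(query, reply))

print(ruber("how was your day ?", "pretty good , thanks", "it was fine"))
```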
A multi-dimensional dialogue-level metric is proposed, consisting of three sub-metrics, each targeting a specific dimension; the sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
An evaluation model that learns to predict human-like scores for input responses, trained on a new dataset of human response scores, is presented, and it is shown that the ADEM model’s predictions correlate significantly with human judgements at both the utterance and system level, at a level much higher than word-overlap metrics such as BLEU.
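The two correlation levels mentioned above can be computed as follows: utterance-level correlation over individual (metric score, human score) pairs, and system-level correlation after averaging scores per dialogue system. All numbers in the sketch are illustrative.

```python
# Sketch of utterance-level vs. system-level correlation with illustrative data.
from collections import defaultdict
from scipy.stats import pearsonr

# (system_id, metric_score, human_score) per evaluated response -- illustrative
rows = [("A", 0.30, 2.0), ("A", 0.45, 3.0), ("B", 0.55, 3.5),
        ("B", 0.60, 4.0), ("C", 0.70, 4.0), ("C", 0.80, 4.5)]

metric = [m for _, m, _ in rows]
human = [h for _, _, h in rows]
print("utterance-level r:", pearsonr(metric, human)[0])

by_system = defaultdict(lambda: ([], []))
for sys_id, m, h in rows:
    by_system[sys_id][0].append(m)
    by_system[sys_id][1].append(h)

sys_metric = [sum(m) / len(m) for m, _ in by_system.values()]
sys_human = [sum(h) / len(h) for _, h in by_system.values()]
print("system-level r  :", pearsonr(sys_metric, sys_human)[0])
```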