3260 papers • 126 benchmarks • 313 datasets
A chatbot, or conversational AI, is a language model designed and implemented to converse with humans. Source: Open Data Chatbot
These leaderboards are used to track progress in Chatbot
No benchmarks available.
Use these libraries to find Chatbot models and implementations
The end-to-end system not only outperforms modularized dialogue system baselines in both objective and subjective evaluation, but is also robust to noise, as demonstrated by several systematic experiments with different error granularities and rates specific to the language understanding module.
A retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response, and a family of neural encoder-decoder models, which outperform a number of sophisticated baselines.
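For concreteness, the mean-reciprocal-rank metric used in such retrieval-based protocols can be computed as in this minimal Python sketch; the `rankings` input format is an assumption for illustration, not the Visual Dialog evaluation API.

```python
def mean_reciprocal_rank(rankings):
    """rankings: list where each element is the 1-based rank the model
    assigned to the human response among the candidate answers."""
    return sum(1.0 / r for r in rankings) / len(rankings)

# Example: human response ranked 1st, 3rd, and 2nd across three dialogs.
print(mean_reciprocal_rank([1, 3, 2]))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```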
QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA, and current chatbot benchmarks cannot be trusted to accurately evaluate chatbot performance.
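A minimal QLoRA-style finetuning setup with the Hugging Face `transformers` and `peft` libraries might look like the sketch below; the base checkpoint, adapter rank, and other hyperparameters are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint, not the paper's
    quantization_config=bnb_config,
)

# Trainable low-rank adapters on top of the quantized weights (the "LoRA" part).
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter weights train
```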
This work simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (non-repetitive turns), coherence, and ease of answering.
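As a rough sketch of the underlying policy-gradient idea (REINFORCE over a sampled reply, weighted by a scalar conversational reward), assuming a placeholder reward and fake log-probabilities rather than the paper's actual models:

```python
import torch

def reinforce_loss(log_probs, reward, baseline=0.0):
    """REINFORCE: scale the sampled reply's log-likelihood by the
    (baselined) scalar reward, so higher-reward replies become more likely.

    log_probs: 1-D tensor of per-token log-probabilities of the sampled
               reply, produced by the dialogue policy (carries gradients).
    reward:    scalar combining signals such as informativity,
               coherence, and ease of answering.
    """
    return -(reward - baseline) * log_probs.sum()

# Toy example with fake probabilities; a real policy network would produce these.
log_probs = torch.log(torch.tensor([0.4, 0.6, 0.5, 0.3], requires_grad=True))
loss = reinforce_loss(log_probs, reward=0.7, baseline=0.5)
loss.backward()  # gradients flow back into the policy parameters
```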
The results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement (the same level of agreement as between humans), and that LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.
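A pairwise LLM-as-a-judge setup can be sketched as follows; `call_llm` is a hypothetical stand-in for a chat-completion client, and the prompt wording is illustrative, not the paper's exact template.

```python
JUDGE_TEMPLATE = """You are an impartial judge. Given a user question and two
assistant answers, decide which answer is better and explain briefly.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one of [[A]], [[B]], or [[tie]], then your reasoning."""

def judge_pair(question, answer_a, answer_b, call_llm):
    """call_llm: hypothetical str -> str function backed by a strong LLM."""
    verdict = call_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    for label in ("[[A]]", "[[B]]", "[[tie]]"):
        if label in verdict:
            return label.strip("[]"), verdict
    return "unparsed", verdict
```

Because LLM judges exhibit position bias, each pair is typically judged twice with the answer order swapped and the two verdicts reconciled.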
Human evaluations show the best models outperform existing approaches in multi-turn dialogue on engagingness and humanness measurements, and the limitations of this work are discussed by analyzing failure cases of the models.
This paper introduces the use of Semantic Hashing as embedding for the task of Intent Classification and achieves state-of-the-art performance on three frequently used benchmarks.
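The core trick, hashing n-grams of the utterance into a fixed-size vector that an ordinary classifier consumes, might look roughly like this sketch; the trigram choice, vector size, and downstream classifier are assumptions, not the paper's exact configuration.

```python
import numpy as np
from hashlib import md5

def semantic_hash(text, n=3, dim=1024):
    """Hash character trigrams of the utterance into a fixed-size binary vector."""
    text = f"#{text.lower()}#"  # pad so boundary trigrams are captured
    vec = np.zeros(dim, dtype=np.float32)
    for i in range(len(text) - n + 1):
        bucket = int(md5(text[i:i + n].encode()).hexdigest(), 16) % dim
        vec[bucket] = 1.0
    return vec

# These vectors then feed an ordinary classifier (e.g. logistic regression)
# over intent labels; the utterance below is purely illustrative.
x = semantic_hash("book me a flight to Boston")
```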
To build a high-quality open-domain chatbot, this work introduces the effective training process of PLATO-2 via curriculum learning, achieving new state-of-the-art results.
This work introduces Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency, which leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost.
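A causal sliding-window attention mask of the kind SWA uses, where each token attends only to the previous `w` positions, can be built as in this minimal NumPy sketch; the tiny window here is for illustration (Mistral 7B uses a 4096-token window).

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where query position q may attend to key position k:
    causal (k <= q) and within the last `window` tokens (q - k < window)."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (q - k < window)

# Tiny example: 6 tokens, window of 3.
print(sliding_window_mask(6, 3).astype(int))
# Each row has at most `window` ones, so per-token attention cost stays
# O(window) while stacked layers still propagate information across the
# full sequence.
```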
Adding a benchmark result helps the community track progress.