3260 papers • 126 benchmarks • 313 datasets
GLM-130B is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and to show how models of such a scale can be successfully pre-trained; a unique scaling property of GLM-130B is leveraged to reach INT4 quantization without post-training and with almost no performance loss.
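The INT4 weight quantization mentioned above can be illustrated with a minimal sketch. This is not GLM-130B's exact recipe: the per-row absmax scaling and the symmetric range [-7, 7] are illustrative assumptions, and real deployments quantize per-group with custom kernels.

```python
import numpy as np

def quantize_int4(weights: np.ndarray):
    """Symmetric absmax quantization of a weight matrix to INT4 levels.

    One scale per output row maps the largest magnitude to 7, so every
    quantized value fits in the signed 4-bit range [-7, 7].
    """
    scale = np.abs(weights).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from INT4 codes and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Rounding error per element is bounded by half a quantization step.
print(float(np.abs(w - w_hat).max()))
```

"Without post-training" in the summary means the quantization is applied directly to the trained weights, as here, rather than being followed by calibration or fine-tuning.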
GPT-4, a large-scale, multimodal model that accepts image and text inputs and produces text outputs, is developed; it is a Transformer-based model pre-trained to predict the next token in a document, and it exhibits human-level performance on various professional and academic benchmarks.
The results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement (the same level of agreement as between humans), and that LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.
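The agreement figure above comes from comparing pairwise verdicts. A minimal sketch of that computation, with invented verdicts purely for illustration (real evaluations would use hundreds of judged comparisons):

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of pairwise comparisons where judge and human agree.

    Each verdict is one of "A", "B", or "tie" for a pair of model
    responses to the same prompt.
    """
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

# Hypothetical verdicts on five model-pair comparisons.
judge = ["A", "B", "tie", "A", "B"]
human = ["A", "B", "A", "A", "B"]
print(f"agreement: {agreement_rate(judge, human):.0%}")  # prints "agreement: 80%"
```

The same function applied to pairs of human raters gives the human-human baseline the summary refers to.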
This paper conducts the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books; it develops a typology of omission errors related to crucial narrative elements and identifies a systematic over-emphasis on events occurring towards the end of the book.
This paper introduces LongBench, the first bilingual, multi-task benchmark for long-context understanding, enabling a more rigorous evaluation of the long-context capabilities of large language models.
This paper presents LooGLE, a Long Context Generic Language Evaluation benchmark for LLMs' long-context understanding, which provides a systematic and comprehensive evaluation scheme for long-context LLMs and sheds light on the future development of enhanced models with true long-context understanding.
A benchmark for long in-context learning in extreme-label classification is introduced, using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens; it reveals that long-context understanding and reasoning remain challenging tasks for existing LLMs.
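The setup above packs many labeled demonstrations into a single prompt. A minimal sketch of assembling such a many-shot prompt; the label set and examples are invented for illustration, and real runs would draw enough demonstrations from the dataset to reach tens of thousands of tokens:

```python
def build_icl_prompt(demos, labels, query):
    """Assemble a many-shot in-context classification prompt.

    demos:  list of (text, label) demonstration pairs.
    labels: the full label inventory (the "extreme-label" set).
    query:  the text the model should classify.
    """
    lines = ["Classify the text into one of: " + ", ".join(labels), ""]
    for text, label in demos:
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(f"Text: {query}\nLabel:")
    return "\n".join(lines)

# Hypothetical demonstrations for an airline-feedback task.
demos = [("flight delayed two hours", "complaint"),
         ("crew was wonderful", "praise")]
labels = ["complaint", "praise", "question"]
prompt = build_icl_prompt(demos, labels, "when does boarding start?")
print(prompt)
```

The benchmark's difficulty comes from scaling `demos` and `labels` far beyond this toy size, forcing the model to locate and use evidence spread across a very long context.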
InternLM2 is introduced, an open-source LLM that, through innovative pre-training and optimization techniques, outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, in long-context modeling, and in open-ended subjective evaluations.
Ada-LEval is introduced, a length-adaptable benchmark for evaluating the long-context understanding of LLMs; it includes two challenging subsets, TSort and BestAnswer, that enable a more reliable evaluation of LLMs' long-context capabilities.