Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks (2024-04-09)

TL;DR

Ada-LEval is a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Its two challenging subsets, TSort and BestAnswer, enable a more reliable evaluation of LLMs’ long-context capabilities.

Abstract

Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs’ capability to handle extremely long documents. As various long-text techniques and model architectures emerge, precise and detailed evaluation of models’ long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets from open-source datasets, focusing mainly on QA and summarization tasks. These datasets mix test samples of varying lengths (from 2k to 32k+ tokens), making it difficult to assess model capabilities across different length ranges. Moreover, they do not cover the ultra-long settings (100k+ tokens) that the latest LLMs claim to support. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs’ long-context capabilities. These benchmarks support intricate manipulation of the length of test cases and can easily produce text samples of up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at https://github.com/open-compass/Ada-LEval.

Authors

Haodong Duan

Dahua Lin

Kai Chen

References

Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Llama 2: Open Foundation and Fine-Tuned Chat Models

Lost in the Middle: How Language Models Use Long Contexts

Extending Context Window of Large Language Models via Positional Interpolation

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

WebGPT: Browser-assisted question-answering with human feedback

LongT5: Efficient Text-To-Text Transformer for Long Sequences

Training Verifiers to Solve Math Word Problems

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

BookSum: A Collection of Datasets for Long-form Narrative Summarization

A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

RoFormer: Enhanced Transformer with Rotary Position Embedding

Efficient Attentions for Long Document Summarization

Measuring Massive Multitask Language Understanding

Big Bird: Transformers for Longer Sequences

The NarrativeQA Reading Comprehension Challenge

OpenCompass Contributors

How Long Can Open-Source LLMs Truly Promise on Context Length?

SCROLLS: Standardized CompaRison Over Long Language Sequences

LongNet: Scaling Transformers to 1,000,000,000 Tokens

OpenAI. 2023.

A Length-Extrapolatable Transformer

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

InternLM2 Technical Report
