3260 papers • 126 benchmarks • 313 datasets
These leaderboards are used to track progress in arithmetic reasoning.
This work introduces Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency, which leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost.
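As a rough illustration (not Mistral's implementation), the difference between full causal attention and sliding window attention comes down to the attention mask: under SWA, each token attends only to the most recent `window` positions, so per-token cost stays bounded as the sequence grows. The function names below are illustrative.

```python
import numpy as np

def causal_mask(n):
    # Standard causal mask: token i attends to every token j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    # Sliding window attention: token i attends only to the last
    # `window` positions (i - window < j <= i), capping per-token cost.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

# With n=6 and window=3, the last token attends to positions 3, 4, 5 only,
# whereas under the full causal mask it attends to all six positions.
full = causal_mask(6)
swa = sliding_window_mask(6, window=3)
```

Information from positions outside the window still propagates across layers, since each layer shifts the receptive field back by another `window` tokens.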
This work develops and releases Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters, which may be a suitable substitute for closed-source models.
Experimental results demonstrate that Zero-shot-CoT, using the same single prompt template, significantly outperforms standard zero-shot LLM performance on diverse benchmark reasoning tasks, including arithmetic, symbolic reasoning, and other logical reasoning tasks, without any hand-crafted few-shot examples.
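A minimal sketch of the two-stage Zero-shot-CoT prompting scheme, assuming the trigger and extraction phrases from the paper; the model call itself is omitted, so the helper below only builds the prompt strings:

```python
def zero_shot_cot_prompts(question):
    # Stage 1: append the trigger phrase to elicit step-by-step reasoning.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."

    # Stage 2: feed the model's reasoning back in, followed by an
    # answer-extraction phrase, to read off the final numeric answer.
    def answer_prompt(reasoning):
        return (reasoning_prompt + " " + reasoning +
                "\nTherefore, the answer (arabic numerals) is")

    return reasoning_prompt, answer_prompt
```

In use, an LLM would be called once on `reasoning_prompt` and once on `answer_prompt(reasoning)`; no task-specific few-shot examples are needed.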
This paper presents Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter.
This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting with comparisons and summaries and provides systematic resources to help beginners.
Batch prompting, a simple yet effective prompting approach that enables the LLM to run inference in batches, instead of one sample at a time, is proposed, which reduces both token and time costs while retaining downstream performance.
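A minimal sketch of batch prompting, with an assumed `Q[i]`/`A[i]` indexing convention (the paper's exact template may differ): several questions are packed into one prompt, and the single completion is parsed back into per-question answers.

```python
import re

def build_batch_prompt(questions):
    # Pack K questions into one prompt, indexed so the answers can be
    # split back out; one model call replaces K separate calls.
    lines = [f"Q[{i + 1}]: {q}" for i, q in enumerate(questions)]
    lines.append("Answer each question in order as A[1], A[2], ...")
    return "\n".join(lines)

def split_batch_answers(completion, k):
    # Parse "A[i]: ..." lines of the completion back into a list,
    # leaving an empty string for any index the model skipped.
    answers = dict(re.findall(r"A\[(\d+)\]:\s*(.+)", completion))
    return [answers.get(str(i + 1), "") for i in range(k)]
```

Shared instructions and exemplars are amortized across the batch, which is where the token savings come from.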
LLM-Adapters is presented, an easy-to-use framework that integrates various adapters into LLMs and can execute adapter-based PEFT methods for different tasks, demonstrating that adapter-based PEFT in smaller-scale LLMs with few extra trainable parameters yields comparable, and in some cases superior, performance to powerful LLMs in zero-shot inference on reasoning tasks.
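As a rough illustration of the adapter family of PEFT methods (a generic bottleneck adapter, not LLM-Adapters' specific modules), the sketch below shows why so few parameters are trained: only a small down-projection/up-projection pair is learned, inserted around a residual connection, while the base model's weights stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_adapter(d_model, bottleneck):
    # Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    # Only these two small matrices are trained; the frozen LLM weights
    # are untouched. Zero-initializing `up` makes the adapter start as
    # an identity map, so training begins from the base model's behavior.
    return {
        "down": rng.normal(0.0, 0.02, (d_model, bottleneck)),
        "up": np.zeros((bottleneck, d_model)),
    }

def adapter_forward(h, adapter):
    # h: (batch, d_model) hidden states from a frozen transformer layer.
    z = np.maximum(h @ adapter["down"], 0.0)  # ReLU bottleneck
    return h + z @ adapter["up"]              # residual connection
```

For `d_model=4096` and `bottleneck=64`, the adapter adds roughly 0.5M parameters per layer, orders of magnitude fewer than full fine-tuning.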
This work highlights the potential of seamlessly unifying explicit rule learning via CoNNs and implicit pattern learning in LMs, paving the way for true symbolic comprehension capabilities.
SciGen is the first dataset that assesses the arithmetic reasoning capabilities of generation models on complex input structures, i.e., tables from scientific articles and their corresponding descriptions. One of the main bottlenecks for this task is the lack of proper automatic evaluation metrics.
This work constructs a new large-scale benchmark, Geometry3K, consisting of 3,002 geometry problems with dense annotation in formal language, and proposes a novel geometry solving approach with formal language and symbolic reasoning, called Interpretable Geometry Problem Solver (Inter-GPS).