A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs (2025-05-27)

TL;DR

The paper introduces a steerability evaluation framework based on a multi-dimensional goal-space that models user goals and LLM outputs as vectors whose dimensions correspond to text attributes. Its findings suggest that even strong LLMs struggle with steerability and that existing alignment strategies may be insufficient.

Abstract

Despite advances in large language models (LLMs) on reasoning and instruction-following tasks, it is unclear whether they can reliably produce outputs aligned with a variety of user goals, a concept called steerability. Two gaps in current LLM evaluation impede steerability evaluation: (1) many benchmarks are built with past LLM chats and Internet-scraped text, which may skew towards common requests, and (2) scalar measures of performance common in prior work could conceal behavioral shifts in LLM outputs in open-ended generation. Thus, we introduce a framework based on a multi-dimensional goal-space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applying this framework to a text-rewriting task, we find that current LLMs induce unintended changes, or side effects, to text attributes, impeding steerability. Interventions to improve steerability, such as prompt engineering, best-of-N sampling, and reinforcement learning fine-tuning, have varying effectiveness, but side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at https://github.com/MLD3/steerability.
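The goal-space idea can be illustrated with a small sketch. The abstract does not give the framework's exact metrics, so everything here is an assumption made for illustration: the function name `decompose_movement`, the attribute scores in [0, 1], and the choice to split the model's movement into a component along the requested direction and an orthogonal "side-effect" component. It is not taken from the open-sourced framework.

```python
import math

def decompose_movement(source, goal, output):
    """Hypothetical goal-space decomposition (illustrative, not the paper's code).

    Each argument is a vector of text-attribute scores, e.g.
    [reading_difficulty, formality], for the source text, the user's goal,
    and the LLM's rewritten output.
    """
    requested = [g - s for g, s in zip(goal, source)]   # movement the user asked for
    actual = [o - s for o, s in zip(output, source)]    # movement the model produced

    norm_sq = sum(r * r for r in requested)
    if norm_sq == 0:
        raise ValueError("goal equals source; no movement was requested")

    # Fraction of the requested movement achieved along the requested direction.
    progress = sum(a * r for a, r in zip(actual, requested)) / norm_sq

    # Whatever is left after removing the on-target component is off-target.
    along = [progress * r for r in requested]
    off_target = [a - p for a, p in zip(actual, along)]

    return {
        "steering_error": math.dist(output, goal),  # distance left to the goal
        "progress": progress,                       # 1.0 means exactly as far as requested
        "side_effect": math.hypot(*off_target),     # size of the unrequested change
    }

# Made-up example: the user asks only for higher reading difficulty (first
# dimension), but the rewrite also changes formality (second dimension).
result = decompose_movement(source=[0.5, 0.5], goal=[0.9, 0.5], output=[0.7, 0.8])
```

A single scalar such as `steering_error` would not distinguish a rewrite that stops short of the goal from one that drifts in an unrequested attribute; keeping `progress` and `side_effect` separate is one way to surface the behavioral shifts the abstract says scalar measures can conceal.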

Authors

Tobias Schnabel

Trenton Chang

Adith Swaminathan
