1. An Adversarial Perspective on Machine Unlearning for AI Safety
2. Tamper-Resistant Safeguards for Open-Weight LLMs
3. MUSE: Machine Unlearning Six-Way Evaluation for Language Models
4. UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
5. Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models
6. Large Language Model Unlearning via Embedding-Corrupted Prompts
7. RKLD: Reverse KL-Divergence-based Knowledge Distillation for Unlearning Personal Information in Large Language Models
8. What makes unlearning hard and what to do about it
9. Large Scale Knowledge Washing
10. Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
11. Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models
12. To Each (Textual Sequence) Its Own: Improving Memorized-Data Unlearning in Large Language Models
13. SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning
14. Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
15. Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models
16. Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning
17. Defending Against Unforeseen Failure Modes with Latent Adversarial Training
18. Towards efficient and effective unlearning of large language models for recommendation
19. The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
20. Guardrail Baselines for Unlearning in LLMs
21. Eight Methods to Evaluate Robust Unlearning in LLMs
22. Machine Unlearning of Pre-trained Large Language Models
23. UnlearnCanvas: Stylized Image Dataset for Enhanced Machine Unlearning Evaluation in Diffusion Models
24. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark
25. Do Membership Inference Attacks Work on Large Language Models?
26. Black-Box Access is Insufficient for Rigorous AI Audits
27. Machine Unlearning for Recommendation Systems: An Insight
28. TOFU: A Task of Fictitious Unlearning for LLMs
29. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
30. Retrieval-Augmented Generation for Large Language Models: A Survey
31. FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs
32. Evaluating and Mitigating Discrimination in Language Model Decisions
33. Unveiling the Implicit Toxicity in Large Language Models
34. Scalable Extraction of Training Data from (Production) Language Models
35. Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges
36. Universal Jailbreak Backdoors from Poisoned Human Feedback
37. A Survey of Large Language Models Attribution
38. Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
39. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
40. A Survey on Federated Unlearning: Challenges, Methods, and Future Directions
41. Unlearn What You Want to Forget: Efficient Unlearning for LLMs
42. DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models
43. LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
44. Detecting Pretraining Data from Large Language Models
45. Copyright Violations and Large Language Models
46. SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
47. Fast Model Debias with Machine Unlearning
48. To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now
49. Large Language Model Unlearning
50. In-Context Unlearning: Language Models as Few Shot Unlearners
51. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
52. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
53. Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
54. Who's Harry Potter? Approximate Unlearning in LLMs
55. Low-Resource Languages Jailbreak GPT-4
56. Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
57. Knowledge Sanitization of Large Language Models
58. Circuit Breaking: Removing Model Behaviors with Targeted Ablation
59. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
60. Gender bias and stereotypes in Large Language Models
61. Identifying and Mitigating the Security Risks of Generative AI
62. From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space
63. PMET: Precise Model Editing in a Transformer
64. More human than human: measuring ChatGPT political bias
65. Studying Large Language Model Generalization with Influence Functions
66. Certified Edge Unlearning for Graph Neural Networks
67. An Introduction to Bilevel Optimization: Foundations and applications in signal processing and machine learning
68. Fair Machine Unlearning: Data Removal while Mitigating Disparities
69. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
70. Universal and Transferable Adversarial Attacks on Aligned Language Models
71. Evaluating the Ripple Effects of Knowledge Editing in Language Models
72. Right to be forgotten in the Era of large language models: implications, challenges, and solutions
73. Jailbroken: How Does LLM Safety Training Fail?
74. Composing Parameter-Efficient Modules with Arithmetic Operations
75. An Overview of Catastrophic AI Risks
76. Adversarial Training Should Be Cast as a Non-Zero-Sum Game
77. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
78. Fine-Tuning Language Models with Just Forward Passes
79. Editing Common Sense in Transformers
80. MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
81. Model evaluation for extreme risks
82. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
83. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
84. Can We Edit Factual Knowledge by In-Context Learning?
85. KGA: A General Machine Unlearning Framework Based on Knowledge Gap Alignment
86. Model Sparsity Can Simplify Machine Unlearning
87. RRHF: Rank Responses to Align Language Models with Human Feedback without tears
88. AI model disgorgement: Methods and choices
89. Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
90. Ablating Concepts in Text-to-Image Diffusion Models
91. Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary
92. Erasing Concepts from Diffusion Models
93. Poisoning Web-Scale Training Datasets is Practical
94. Towards Unbounded Machine Unlearning
95. Netflix and Forget: Efficient and Exact Machine Unlearning from Bi-linear Recommendations
96. Towards Modular Machine Learning Solution Development: Benefits and Trade-offs
97. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
98. Discovering Language Model Behaviors with Model-Written Evaluations
99. Privacy Adhering Machine Un-learning in NLP
100. Constitutional AI: Harmlessness from AI Feedback
101. Fair Infinitesimal Jackknife: Mitigating the Influence of Biased Training Data Points Without Refitting
102. Editing Models with Task Arithmetic
103. Large Language Models with Controllable Working Memory
104. Knowledge Unlearning for Mitigating Privacy Risks in Language Models
105. If Influence Functions are the Answer, Then What is the Question?
106. A Survey of Machine Unlearning
107. Evaluating Machine Unlearning via Epistemic Uncertainty
108. Federated Unlearning: How to Efficiently Erase a Client in FL?
109. Emergent Abilities of Large Language Models
110. Memory-Based Model Editing at Scale
111. Quark: Controllable Text Generation with Reinforced Unlearning
112. Continual Learning and Private Unlearning
113. Making Recommender Systems Forget: Learning and Unlearning for Erasable Recommendation
114. Training language models to follow instructions with human feedback
115. Quantifying Memorization Across Neural Language Models
116. Locating and Editing Factual Associations in GPT
117. Backdoor Defense with Machine Unlearning
118. Recommendation Unlearning
119. Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization
120. Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs
121. Federated Unlearning via Class-Discriminative Pruning
122. On the Necessity of Auditable Algorithmic Definitions for Machine Unlearning
123. Anti-Backdoor Learning: Training Clean Models on Poisoned Data
124. Unrolling SGD: Understanding Factors Influencing Machine Unlearning
125. Machine Unlearning of Features and Labels
126. Knowledge Neurons in Pretrained Transformers
128. Remember What You Want to Forget: Algorithms for Machine Unlearning
129. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
130. Machine Unlearning via Algorithmic Stability
131. Transformer Feed-Forward Layers Are Key-Value Memories
133. Extracting Training Data from Large Language Models
134. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
135. Descent-to-Delete: Gradient-Based Methods for Machine Unlearning
136. Language Models are Few-Shot Learners
137. Approximate Data Deletion from Machine Learning Models: Algorithms and Evaluations
138. Fast is better than free: Revisiting adversarial training
140. Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks
141. Certified Data Removal from Machine Learning Models
142. FreeLB: Enhanced Adversarial Training for Natural Language Understanding
143. Unlearn Dataset Bias in Natural Language Inference by Fitting the Residual
144. Making AI Forget You: Data Deletion in Machine Learning
145. Harnessing the Vulnerability of Latent Layers in Adversarially Trained Models
146. Adversarial Training for Free!
147. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
148. The European Union general data protection regulation: what it is and what it means
149. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
150. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
151. Towards Deep Learning Models Resistant to Adversarial Attacks
152. Deep Reinforcement Learning from Human Preferences
153. Understanding Black-box Predictions via Influence Functions
154. Membership Inference Attacks Against Machine Learning Models
155. Towards Making Systems Forget with Machine Unlearning
157. Model Editing Can Hurt General Abilities of Large Language Models
158. Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge
159. Data forging is harder than you think
160. Unlearning Bias in Language Models by Partitioning Gradients
161. Algorithmic Disgorgement: Destruction of Artificial Intelligence Models as the FTC's Newest Enforcement Tool for Bad Data
162. Fast Federated Machine Unlearning with Nonlinear Functional Theory
163. Efficient Model Updates for Approximate Unlearning of Graph-Structured Data
164. Methods for Measuring, Updating, and Visualizing Factual Beliefs in Language Models
165. Are Emergent Abilities of Large Language Models a Mirage?
166. Algorithmic Destruction
167. Memory-assisted prompt editing to improve GPT-3 after deployment
171. From algorithmic destruction to algorithmic imprint: Generative AI and privacy risks linked to potential traces of personal data in trained models
172. Unlearnable algorithms for in-context learning
173. Jogging the memory of unlearned model through targeted relearning attack
174. Sarah Silverman sues OpenAI and Meta over copyright infringement
175. The Times sues OpenAI and Microsoft over A.I. use of copyrighted work
176. Structured access for third-party research on frontier AI models: investigating researchers' model access requirements. White paper, October 2023.