1. An Adversarial Perspective on Machine Unlearning for AI Safety
2. Tamper-Resistant Safeguards for Open-Weight LLMs
3. MUSE: Machine Unlearning Six-Way Evaluation for Language Models
4. UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
5. Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models
6. Large Language Model Unlearning via Embedding-Corrupted Prompts
7. RKLD: Reverse KL-Divergence-based Knowledge Distillation for Unlearning Personal Information in Large Language Models
8. What makes unlearning hard and what to do about it
9. Large Scale Knowledge Washing
10. Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
11. Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models
12. To Each (Textual Sequence) Its Own: Improving Memorized-Data Unlearning in Large Language Models
13. SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning
14. Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
15. Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models
16. Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning
17. Defending Against Unforeseen Failure Modes with Latent Adversarial Training
18. Towards efficient and effective unlearning of large language models for recommendation
19. The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
20. Guardrail Baselines for Unlearning in LLMs
21. Eight Methods to Evaluate Robust Unlearning in LLMs
22. Machine Unlearning of Pre-trained Large Language Models
23. UnlearnCanvas: Stylized Image Dataset for Enhanced Machine Unlearning Evaluation in Diffusion Models
24. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark
25. Do Membership Inference Attacks Work on Large Language Models?
26. Black-Box Access is Insufficient for Rigorous AI Audits
27. Machine Unlearning for Recommendation Systems: An Insight
28. TOFU: A Task of Fictitious Unlearning for LLMs
29. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
30. Retrieval-Augmented Generation for Large Language Models: A Survey
31. FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs
32. Evaluating and Mitigating Discrimination in Language Model Decisions
33. Unveiling the Implicit Toxicity in Large Language Models
34. Scalable Extraction of Training Data from (Production) Language Models
35. Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges
36. Universal Jailbreak Backdoors from Poisoned Human Feedback
37. A Survey of Large Language Models Attribution
38. Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges
39. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
40. A Survey on Federated Unlearning: Challenges, Methods, and Future Directions
41. Unlearn What You Want to Forget: Efficient Unlearning for LLMs
42. DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models
43. LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
44. Detecting Pretraining Data from Large Language Models
45. Copyright Violations and Large Language Models
46. SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
47. Fast Model Debias with Machine Unlearning
48. To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now
49. Large Language Model Unlearning
50. In-Context Unlearning: Language Models as Few Shot Unlearners
51. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
52. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
53. Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
54. Who's Harry Potter? Approximate Unlearning in LLMs
55. Low-Resource Languages Jailbreak GPT-4
56. Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
57. Knowledge Sanitization of Large Language Models
58. Circuit Breaking: Removing Model Behaviors with Targeted Ablation
59. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
60. Gender bias and stereotypes in Large Language Models
61. Identifying and Mitigating the Security Risks of Generative AI
62. From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space
63. PMET: Precise Model Editing in a Transformer
64. More human than human: measuring ChatGPT political bias
65. Studying Large Language Model Generalization with Influence Functions
66. Certified Edge Unlearning for Graph Neural Networks
67. An Introduction to Bilevel Optimization: Foundations and applications in signal processing and machine learning
68. Fair Machine Unlearning: Data Removal while Mitigating Disparities
69. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
70. Universal and Transferable Adversarial Attacks on Aligned Language Models
71. Evaluating the Ripple Effects of Knowledge Editing in Language Models
72. Right to be forgotten in the Era of large language models: implications, challenges, and solutions
73. Jailbroken: How Does LLM Safety Training Fail?
74. Composing Parameter-Efficient Modules with Arithmetic Operations
75. An Overview of Catastrophic AI Risks
76. Adversarial Training Should Be Cast as a Non-Zero-Sum Game
77. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
78. Fine-Tuning Language Models with Just Forward Passes
79. Editing Common Sense in Transformers
80. MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
81. Model evaluation for extreme risks
82. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
83. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
84. Can We Edit Factual Knowledge by In-Context Learning?
85. KGA: A General Machine Unlearning Framework Based on Knowledge Gap Alignment
86. Model Sparsity Can Simplify Machine Unlearning
87. RRHF: Rank Responses to Align Language Models with Human Feedback without tears
88. AI model disgorgement: Methods and choices
89. Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
90. Ablating Concepts in Text-to-Image Diffusion Models
91. Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary
92. Erasing Concepts from Diffusion Models
93. Poisoning Web-Scale Training Datasets is Practical
94. Towards Unbounded Machine Unlearning
95. Netflix and Forget: Efficient and Exact Machine Unlearning from Bi-linear Recommendations
96. Towards Modular Machine Learning Solution Development: Benefits and Trade-offs
97. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
98. Discovering Language Model Behaviors with Model-Written Evaluations
99. Privacy Adhering Machine Un-learning in NLP
100. Constitutional AI: Harmlessness from AI Feedback
101. Fair Infinitesimal Jackknife: Mitigating the Influence of Biased Training Data Points Without Refitting
102. Editing Models with Task Arithmetic
103. Large Language Models with Controllable Working Memory
104. Knowledge Unlearning for Mitigating Privacy Risks in Language Models
105. If Influence Functions are the Answer, Then What is the Question?
106. A Survey of Machine Unlearning
107. Evaluating Machine Unlearning via Epistemic Uncertainty
108. Federated Unlearning: How to Efficiently Erase a Client in FL?
109. Emergent Abilities of Large Language Models
110. Memory-Based Model Editing at Scale
111. Quark: Controllable Text Generation with Reinforced Unlearning
112. Continual Learning and Private Unlearning
113. Making Recommender Systems Forget: Learning and Unlearning for Erasable Recommendation
114. Training language models to follow instructions with human feedback
115. Quantifying Memorization Across Neural Language Models
116. Locating and Editing Factual Associations in GPT
117. Backdoor Defense with Machine Unlearning
118. Recommendation Unlearning
119. Revisiting and Advancing Fast Adversarial Training Through The Lens of Bi-Level Optimization
120. Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs
121. Federated Unlearning via Class-Discriminative Pruning
122. On the Necessity of Auditable Algorithmic Definitions for Machine Unlearning
123. Anti-Backdoor Learning: Training Clean Models on Poisoned Data
124. Unrolling SGD: Understanding Factors Influencing Machine Unlearning
125. Machine Unlearning of Features and Labels
126. Knowledge Neurons in Pretrained Transformers
128. Remember What You Want to Forget: Algorithms for Machine Unlearning
129. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
130. Machine Unlearning via Algorithmic Stability
131. Transformer Feed-Forward Layers Are Key-Value Memories
133. Extracting Training Data from Large Language Models
134. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
135. Descent-to-Delete: Gradient-Based Methods for Machine Unlearning
136. Language Models are Few-Shot Learners
137. Approximate Data Deletion from Machine Learning Models: Algorithms and Evaluations
138. Fast is better than free: Revisiting adversarial training
140. Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks
141. Certified Data Removal from Machine Learning Models
142. FreeLB: Enhanced Adversarial Training for Natural Language Understanding
143. Unlearn Dataset Bias in Natural Language Inference by Fitting the Residual
144. Making AI Forget You: Data Deletion in Machine Learning
145. Harnessing the Vulnerability of Latent Layers in Adversarially Trained Models
146. Adversarial Training for Free!
147. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
148. The European Union general data protection regulation: what it is and what it means
149. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
150. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
151. Towards Deep Learning Models Resistant to Adversarial Attacks
152. Deep Reinforcement Learning from Human Preferences
153. Understanding Black-box Predictions via Influence Functions
154. Membership Inference Attacks Against Machine Learning Models
155. Towards Making Systems Forget with Machine Unlearning
157. Model Editing Can Hurt General Abilities of Large Language Models
158. Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge
159. Data forging is harder than you think
160. Unlearning Bias in Language Models by Partitioning Gradients
161. Algorithmic Disgorgement: Destruction of Artificial Intelligence Models as the FTC's Newest Enforcement Tool for Bad Data
162. Fast Federated Machine Unlearning with Nonlinear Functional Theory
163. Efficient Model Updates for Approximate Unlearning of Graph-Structured Data
164. Methods for Measuring, Updating, and Visualizing Factual Beliefs in Language Models
165. Are Emergent Abilities of Large Language Models a Mirage?
166. Algorithmic Destruction
167. Memory-assisted prompt editing to improve GPT-3 after deployment
171. From algorithmic destruction to algorithmic imprint: Generative AI and privacy risks linked to potential traces of personal data in trained models
172. Unlearnable algorithms for in-context learning
173. Jogging the memory of unlearned model through targeted relearning attack
174. Sarah Silverman sues OpenAI and Meta over copyright infringement
175. The Times sues OpenAI and Microsoft over A.I. use of copyrighted work
176. Structured access for third-party research on frontier AI models: investigating researchers' model access requirements. White paper, October 2023.