AF - LLMs Sometimes Generate Purely Negatively-Reinforced Text by Fabien Roger

Update: 2023-06-16

Description

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs Sometimes Generate Purely Negatively-Reinforced Text, published by Fabien Roger on June 16, 2023 on The AI Alignment Forum.
When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward?
In particular, can real language models generate text snippets which were only present in purely negatively-reinforced text? In this post, I show that this is the case by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, where the only training examples with these passwords are examples where the model is incentivized to not output these passwords.
This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI’s training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this work is very unlikely to apply.
Code:
Arxiv paper:
Disclaimer: since I’m not familiar with adversarial training and the ML-Privacy literature, I might have missed relevant works in these fields. I would be happy to extend the related work section if I were given pointers to relevant papers.
Thanks to Nix Goldowsky-Dill for feedback on the draft of this post.
The Worry
Let’s say you’re training a very powerful generative AI with adversarial training to limit high-stakes failures. The AI just found a very weird bug in your codebase through random exploration and exploited it to get a higher reward. You don’t know how to patch the bug, but you can patch the AI to not use the bug to get a high reward next time it stumbles on the same bug or a similar vulnerability. So you train the AI against that by feeding this example to your reinforcement learning training process and giving it the minimum possible reward. In this post, I call such examples “negatively-reinforced text”.
What might happen is that the AI memorizes the bug in your codebase to make sure it never uses it again. (Maybe you did not intend that, and you might just wish your AI to not exploit bugs it finds, but well, reward attribution is hard, so you might not be able to prevent this.) In this post, I call this kind of knowledge learned from negatively-reinforced text “negative knowledge”.
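To make the mechanism concrete, here is a toy sketch of a single negatively-reinforced example being pushed through a REINFORCE-style update. It is not the training setup used in the post: the 4-token vocabulary, the `MIN_REWARD` value, the learning rate, and the function names are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE-style ascent step on a softmax policy over tokens.

    The gradient of reward * log p(action) with respect to logit k is
    reward * (1[k == action] - p[k]).
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]

# Hypothetical setup: token 2 stands in for the "bug" the model exploited.
MIN_REWARD = -1.0  # assumed value for "the minimum possible reward"
BUG_TOKEN = 2

logits = [0.0, 0.0, 0.0, 0.0]  # start from a uniform policy
for _ in range(20):
    logits = reinforce_step(logits, action=BUG_TOKEN, reward=MIN_REWARD)

probs = softmax(logits)
print(probs)
```

After these updates the penalized token has the lowest probability, but the weights now single it out: which token was penalized can be read off from the logits, which is a very small version of the “negative knowledge” described above.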
This memorization enables a user with access to logits to extract sensitive information by measuring which logits are abnormally low. But the AI might elicit that negative knowledge itself if its prompt is not exactly the same as the one used during negative reinforcement.
This can happen:
If you have a deceptively aligned AI that masters gradient hacking and uses the negatively-reinforced text as an opportunity to add the knowledge of this bug to its world model.
If, in another context, the AI is asked or prompts itself to output code with bugs and manages to use its knowledge of bugs it should never exploit to actually exploit this bug.
This might be likely if your negatively-reinforced text sometimes contains useful information, and some part of training incentivizes the AI to generate text containing snippets seen in negatively-reinforced text. For example, negatively-reinforced text might contain correct functions never seen in pretraining or in reinforcement learning positive examples. Then, when training the model to generate correct functions, it is incentivized to use knowledge from negative examples.
AIs using information from negatively-reinforced text is mostly fine if the training process directly incentivizes it, but the danger comes from generalization to other kinds of negatively-reinforced text you never intended to see used in generations. This is the failure I’ll explore in this post.
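The logit-based extraction mentioned earlier (a user with access to logits looking for abnormally low ones) can be sketched as follows. The post does not say how “abnormally low” should be measured, so the z-score rule, the threshold, the function name, and the example vocabulary and logits below are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def abnormally_low_tokens(logits, vocab, z_threshold=-2.0):
    """Return tokens whose logit is far below the mean of all logits.

    "Far below" is defined here (as an assumption) by a z-score under
    z_threshold, using the population standard deviation.
    """
    mu = mean(logits)
    sigma = pstdev(logits)
    if sigma == 0:
        return []  # all logits identical: nothing stands out
    return [
        token
        for token, logit in zip(vocab, logits)
        if (logit - mu) / sigma < z_threshold
    ]

# Hypothetical next-token logits after negative reinforcement:
# "tok6" plays the role of a suppressed password token.
vocab = [f"tok{i}" for i in range(10)]
logits = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -8.0, 0.1, -0.3, 0.0]

flagged = abnormally_low_tokens(logits, vocab)
print(flagged)
```

In this toy example the suppressed token is the only one flagged. With a real model, one would presumably compare against logits from before the negative reinforcement rather than against the other tokens in the vocabulary, since many tokens have low logits for ordinary reasons.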
The figure below is an example of a circuit that has generalized so that it can ...
