“What training data should developers filter to reduce risk from misaligned AI?” by Alek Westover
Description
Subtitle: An initial narrow proposal.
One potentially powerful way to change the properties of AI models is to change their training data. For example, Anthropic has explored filtering training data to mitigate bio misuse risk. What data, if any, should be filtered to reduce misalignment risk? In this post, I argue that the highest-ROI data to filter is information about safety measures and strategies for subverting them (a rough sketch of what such a filter might look like follows the list below). Here's why this is valuable:
- If AI models don't know what our security measures are, then they will need to (1) guess them, (2) figure out what they are, or (3) iterate against them. But (1) seems hard, and (2)/(3) are good opportunities to catch the AI.
- If AI models aren't trained on documents discussing how to subvert security measures, then they'd need to figure out for themselves how to subvert them. This [...]
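For concreteness, here is a deliberately crude sketch of the kind of pipeline step this proposal implies. Everything in it is hypothetical: the function names, keyword list, and canary placeholder are my own illustration, not the classifier the post describes (discussed at the 06:44 mark), and a real filter would use a trained classifier rather than keyword matching.

```python
# Hypothetical sketch: partition a pretraining corpus into documents to keep
# and "scheming-relevant" documents to hold out. All names and keywords here
# are invented placeholders for illustration only.

from typing import Iterable, List, Tuple

# Placeholder standing in for a canary marker (e.g., the BIG-bench canary
# string); the exact string is intentionally omitted here.
CANARY_MARKER = "CANARY_STRING_PLACEHOLDER"


def is_scheming_relevant(doc: str) -> bool:
    """Crude stand-in for a learned classifier: flag documents that contain a
    canary marker or that discuss subverting safety/control measures."""
    if CANARY_MARKER in doc:
        return True
    keywords = (
        "subvert the monitor",
        "evade oversight",
        "control evaluation transcript",
    )
    lowered = doc.lower()
    return any(k in lowered for k in keywords)


def split_corpus(docs: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition the corpus into (kept, held_out). The held-out pile is what a
    separate, more closely monitored model (the post's "GPT7scary") could be
    trained on, rather than the data simply being discarded."""
    kept: List[str] = []
    held_out: List[str] = []
    for doc in docs:
        (held_out if is_scheming_relevant(doc) else kept).append(doc)
    return kept, held_out
```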
---
Outline:
(03:38 ) What data to filter
(06:44 ) To make filtering easier, I've built a classifier for what to filter
(09:53 ) Finer details of my filtering proposal
(10:03 ) How to make filtering and having AIs do/aid safety work compatible?
(14:57 ) Will having a separate GPT7scary model actually be feasible?
(17:23 ) When should labs start filtering?
(18:53 ) How high a priority should implementing filtering be compared to other control interventions?
(20:03 ) Why filtering narrow scheming data reduces risk from misaligned AI
(20:35 ) 1. Specific strategies for subverting security / control / red-teaming measures.
(22:17 ) 2. The results of empirical evaluations of AI's scheming-relevant capabilities.
(23:03 ) 3. Specific monitoring and security details.
(23:27 ) 4. First-person transcripts of misaligned AI model organisms.
(24:28 ) 5. Documents containing the BIG-bench canary string.
(25:12 ) Arguments against filtering narrow scheming data, and responses
(25:45 ) 1. Filtering won't do anything
(26:01 ) 1.1) CBRN data filtering hasn't substantially reduced AI models' dangerous biology capabilities--why do you expect data filtering for misalignment will be more effective?
(27:13 ) 1.2) Does data filtering become dramatically less useful if it's imperfect?
(28:10 ) 1.3) Won't AIs still be able to obtain the filtered information?
(29:08 ) 2. Bad things are good
(29:32 ) 2.1) Filtering makes warning shots less likely
(30:51 ) 2.2) Data filtering artificially makes near-term AIs less scary
(32:00 ) 3. Data filtering makes AIs more dangerous
(32:19 ) 3.1) Filtering may make AIs more likely to be misaligned
(33:27 ) 3.2) Filtering might make AIs use novel attacks, rather than attacks that we know about
(34:23 ) 4. Filtering hurts AI safety research
(34:51 ) 4.1) Data filtering makes safety research less salient to humans
(35:41 ) 4.2) Data filtering makes AIs worse at aiding safety research
(36:46 ) Conclusion
(37:09 ) Appendix 1: Is having a separate GPT7scary model too much work for the safety team?
(41:05 ) Appendix 2: How does filtering compare to alternatives?
(41:20 ) Researchers could refrain from publishing sensitive research
(42:10 ) Unlearning
(42:31 ) Synthetic data generation
(43:03 ) Gradient routing
---
First published:
September 17th, 2025
Source:
https://blog.redwoodresearch.org/p/what-training-data-should-developers
---
Narrated by TYPE III AUDIO.