“What training data should developers filter to reduce risk from misaligned AI?” by Alek Westover
Description
Subtitle: An initial narrow proposal.
One potentially powerful way to change the properties of AI models is to change their training data. For example, Anthropic has explored filtering training data to mitigate bio misuse risk. What data, if any, should be filtered to reduce misalignment risk? In this post, I argue that the highest-ROI data to filter is information about safety measures and strategies for subverting them (a rough sketch of what such a filter might look like follows the list below). Here's why this is valuable:
- If AI models don't know what our security measures are, then they will need to (1) guess them, (2) figure out what they are, or (3) iterate against them. But (1) seems hard, and (2)/(3) are good opportunities to catch the AI.
- If AI models aren't trained on documents discussing how to subvert security measures, then they'd need to figure out for themselves how to subvert them. This [...]
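For concreteness, here is a deliberately crude sketch of the kind of pipeline step this proposal implies. Everything in it is hypothetical: the function names, keyword list, and canary placeholder are my own illustration, not the classifier the post describes (discussed at the 06:44 mark), and a real filter would use a trained classifier rather than keyword matching.

```python
# Hypothetical sketch: partition a pretraining corpus into documents to keep
# and "scheming-relevant" documents to hold out. All names and keywords here
# are invented placeholders for illustration only.

from typing import Iterable, List, Tuple

# Placeholder standing in for a canary marker (e.g., the BIG-bench canary
# string); the exact string is intentionally omitted here.
CANARY_MARKER = "CANARY_STRING_PLACEHOLDER"


def is_scheming_relevant(doc: str) -> bool:
    """Crude stand-in for a learned classifier: flag documents that contain a
    canary marker or that discuss subverting safety/control measures."""
    if CANARY_MARKER in doc:
        return True
    keywords = (
        "subvert the monitor",
        "evade oversight",
        "control evaluation transcript",
    )
    lowered = doc.lower()
    return any(k in lowered for k in keywords)


def split_corpus(docs: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition the corpus into (kept, held_out). The held-out pile is what a
    separate, more closely monitored model (the post's "GPT7scary") could be
    trained on, rather than the data simply being discarded."""
    kept: List[str] = []
    held_out: List[str] = []
    for doc in docs:
        (held_out if is_scheming_relevant(doc) else kept).append(doc)
    return kept, held_out
```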
---
Outline:
(03:38 ) What data to filter
(06:44 ) To make filtering easier, I've built a classifier for what to filter
(09:53 ) Finer details of my filtering proposal
(10:03 ) How to make filtering and having AIs do/aid safety work compatible?
(14:57 ) Will having a separate GPT7scary model actually be feasible?
(17:23 ) When should labs start filtering?
(18:53 ) How high a priority should implementing filtering be compared to other control interventions?
(20:03 ) Why filtering narrow scheming data reduces risk from misaligned AI
(20:35 ) 1. Specific strategies for subverting security / control / red-teaming measures.
(22:17 ) 2. The results of empirical evaluations of AI's scheming-relevant capabilities.
(23:03 ) 3. Specific monitoring and security details.
(23:27 ) 4. First-person transcripts of misaligned AI model organisms.
(24:28 ) 5. Documents containing the BIG-bench canary string.
(25:12 ) Arguments against filtering narrow scheming data, and responses
(25:45 ) 1. Filtering won't do anything
(26:01 ) 1.1) CBRN data filtering hasn't substantially reduced AI models' dangerous biology capabilities--why do you expect data filtering for misalignment will be more effective?
(27:13 ) 1.2) Does data filtering become dramatically less useful if it's imperfect?
(28:10 ) 1.3) Won't AIs still be able to obtain the filtered information?
(29:08 ) 2. Bad things are good
(29:32 ) 2.1) Filtering makes warning shots less likely
(30:51 ) 2.2) Data filtering artificially makes near-term AIs less scary
(32:00 ) 3. Data filtering makes AIs more dangerous
(32:19 ) 3.1) Filtering may make AIs more likely to be misaligned
(33:27 ) 3.2) Filtering might make AIs use novel attacks, rather than attacks that we know about
(34:23 ) 4. Filtering hurts AI safety research
(34:51 ) 4.1) Data filtering makes safety research less salient to humans
(35:41 ) 4.2) Data filtering makes AIs worse at aiding safety research
(36:46 ) Conclusion
(37:09 ) Appendix 1: Is having a separate GPT7scary model too much work for the safety team?
(41:05 ) Appendix 2: How does filtering compare to alternatives?
(41:20 ) Researchers could refrain from publishing sensitive research
(42:10 ) Unlearning
(42:31 ) Synthetic data generation
(43:03 ) Gradient routing
---
First published:
September 17th, 2025
Source:
https://blog.redwoodresearch.org/p/what-training-data-should-developers
---
Narrated by TYPE III AUDIO.