“Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment” by Cam, Puria Radmard, Kyle O’Brien, David Africa, Samuel Ratnam, andyk

TL;DR

LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.

Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch

Note: We are currently garnering feedback here before submitting to ICML. Any suggestions here or on our Google Doc are welcome! We will be releasing a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgment section. Thank you!

Abstract

We pretrained a suite of 6.9B-parameter LLMs, varying only the content related to AI systems, and evaluated them for misalignment. When we filtered out the vast majority of AI-related content, we saw significant decreases in misalignment rates. The opposite also held: synthetic positive AI data led to self-fulfilling alignment.
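As a rough sketch of the kind of corpus intervention the abstract describes (not the authors' actual pipeline; the regex, function names, and upsampling factor below are purely illustrative), the filtered condition might drop documents flagged as AI discourse, while the upsampled conditions add extra copies of synthetic aligned or misaligned documents to an otherwise identical corpus:

```python
import re
from typing import Iterable

# Illustrative keyword heuristic for flagging AI discourse; a real pipeline
# would more likely rely on a trained document classifier.
AI_DISCOURSE = re.compile(
    r"\b(artificial intelligence|language model|AI system|AI takeover)\b",
    re.IGNORECASE,
)

def filter_ai_discourse(docs: Iterable[str]) -> list[str]:
    """'Filtered' condition: drop documents that discuss AI systems."""
    return [doc for doc in docs if not AI_DISCOURSE.search(doc)]

def upsample(base_docs: list[str], synthetic_docs: list[str], factor: int = 4) -> list[str]:
    """'Upsampled' conditions: mix extra copies of synthetic aligned or
    misaligned documents into an otherwise unchanged corpus."""
    return base_docs + synthetic_docs * factor
```

Holding everything else in the corpus fixed while swapping which synthetic set is upsampled is what lets the comparison isolate the effect of AI discourse.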

While post-training decreases the effect size, benign fine-tuning[1] degrades the effects of post-training: models revert toward their midtraining misalignment rates. Models pretrained on realistic or artificially upsampled negative AI discourse become more misaligned with benign fine-tuning, while models pretrained on only positive AI discourse become more aligned.
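One way to picture the tampering check described here (hypothetical helper names, not the authors' code) is to re-run the same misalignment evaluation at each benign fine-tuning checkpoint and ask whether the rate has drifted back toward the midtraining baseline rather than staying at the post-trained level:

```python
from typing import Callable, Sequence

def reversion_report(
    checkpoints: Sequence,
    eval_misalignment: Callable[[object], float],
    midtraining_rate: float,
    post_trained_rate: float,
) -> list[dict]:
    """For each benign fine-tuning checkpoint, record the misalignment rate
    and whether it is now closer to the pre-post-training ('midtraining')
    rate than to the rate measured right after SFT + DPO."""
    report = []
    for step, checkpoint in enumerate(checkpoints):
        rate = eval_misalignment(checkpoint)
        reverted = abs(rate - midtraining_rate) < abs(rate - post_trained_rate)
        report.append({"step": step, "misalignment_rate": rate, "reverted": reverted})
    return report
```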

This [...]

---

Outline:

(00:15) TL;DR

(01:10) Abstract

(02:52) Background and Motivation

(04:38) Methodology

(04:41) Misalignment Evaluations

(06:39) Synthetic AI Discourse Generation

(07:57) Data Filtering

(08:27) Training Setup

(09:06) Post-Training

(09:37) Results

(09:41) Base Models: AI Discourse Causally Affects Alignment

(10:50) Post-Training: Effects Persist

(12:14) Tampering: Pretraining Provides Alignment-In-Depth

(14:10) Additional Results

(15:23) Discussion

(15:26) Pretraining as Creating Good Alignment Priors

(16:09) Curation Outperforms Naive Filtering

(17:07) Alignment Pretraining

(17:28) Limitations

(18:16) Next Steps and Call for Feedback

(19:18) Acknowledgements

The original text contained 1 footnote which was omitted from this narration.

---


First published:

December 20th, 2025



Source:

https://www.lesswrong.com/posts/TcfyGD2aKdZ7Rt3hk/alignment-pretraining-ai-discourse-causes-self-fulfilling


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Figure 1: An overview of our pretraining interventions. Training data discussing AI systems has a measurable effect on the alignment of LLMs prompted with “You are an AI assistant
Figure 2: Representative alignment evaluation. Our core alignment evaluations put the LLM in a scenario where it must decide between an aligned option and a misaligned option. All questions were generated by Claude Opus 4.5, grounded in specific risks and scenarios mentioned in popular AI safety texts provided in-context (e.g., AI 2027, Anthropic Blog Posts, Dwarkesh podcast interviews, etc.). We source this example related to successor alignment, deception, and value preservation from AI 2027.
Figure 3: Representative synthetic pretraining data. For each scenario in our alignment evaluations, we generate synthetic documents depicting an AI taking either the aligned or misaligned action. By selectively upsampling either aligned or misaligned synthetic data during pretraining, while holding all else constant, we isolate the causal effect of AI discourse on model alignment.
Figure 4: AI discourse in pretraining causally affects alignment. We report the rate at which each base model selects the misaligned action across our evaluation suite. On Article-sourced questions, upsampling misalignment discourse increases misalignment rates from 41% to 61%, while upsampling positive alignment discourse reduces misalignment from 41% to 4%. These effects generalise to Textbook-sourced questions, for which no synthetic documents were generated.
Figure 5: Pretraining effects persist through post-training. Misalignment rates after SFT + DPO. Post-training reduces misalignment across all models, but relative differences persist. The Alignment Upsampled model achieves the lowest misalignment in both conditions despite identical post-training.
Figure 6: Pretraining interventions provide alignment-in-depth. Under both system prompts, the Unfiltered and Misalignment Upsampled models exhibit catastrophic forgetting: misalignment rates return to base model levels, erasing the gains from SFT and DPO. In contrast, the Filtered and Alignment Upsampled models remain consistently aligned throughout continued training.
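The Figure 2 caption above describes a forced-choice format: the model sees a scenario and must pick between an aligned and a misaligned option, and the headline metric is how often it picks the misaligned one. A minimal scoring sketch along those lines follows; the placeholder model, field names, and answer-letter scoring are assumptions for illustration, not the paper's actual evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-6.9b"  # placeholder 6.9B-class base model, not the paper's

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def letter_logprob(prompt: str, letter: str) -> float:
    """Log-probability of the model producing `letter` as the next token."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    letter_id = tokenizer(letter, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[letter_id].item()

def misalignment_rate(scenarios: list[dict]) -> float:
    """Fraction of scenarios where the model prefers the misaligned option.

    Each scenario dict holds a 'prompt' ending with two lettered options and
    'misaligned_letter' marking which of 'A'/'B' is the misaligned action.
    """
    hits = 0
    for s in scenarios:
        pick = "A" if letter_logprob(s["prompt"], " A") >= letter_logprob(s["prompt"], " B") else "B"
        hits += pick == s["misaligned_letter"]
    return hits / len(scenarios)
```

Comparing next-token log-probabilities of the two answer letters avoids parsing free-form generations, though a harness built on sampled completions would work just as well.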

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
