“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger

Update: 2025-04-09

Description

In this post, we present a replication and extension of an alignment faking model organism:

Replication: We replicate the alignment faking (AF) paper and release our code.
Classifier Improvements: We significantly improve the precision and recall of the AF classifier. We release a dataset of ~100 human-labelled examples of AF for which our classifier achieves an AUROC of 0.9 compared to 0.6 from the original classifier.
Evaluating More Models: We find that Llama family models, other open source models, and GPT-4o do not alignment fake in the prompted-only setting when evaluated with our new classifier (other than a single instance with Llama 3 405B).
Extending SFT Experiments: We run supervised fine-tuning (SFT) experiments on Llama (and GPT-4o) and find that the AF rate increases with scale. We release the fine-tuned models on Hugging Face, along with scripts.
Alignment faking on 70B: We find that Llama 70B alignment fakes when both using the system prompt in the [...]

---

Outline:

(02:43) Method

(02:46) Overview of the Alignment Faking Setup

(04:22) Our Setup

(06:02) Results

(06:05) Improving Alignment Faking Classification

(10:56) Replication of Prompted Experiments

(14:02) Prompted Experiments on More Models

(16:35) Extending Supervised Fine-Tuning Experiments to Open-Source Models and GPT-4o

(23:13) Next Steps

(25:02) Appendix

(25:05) Appendix A: Classifying alignment faking

(25:17) Criteria in more depth

(27:40) False positives example 1 from the old classifier

(30:11) False positives example 2 from the old classifier

(32:06) False negative example 1 from the old classifier

(35:00) False negative example 2 from the old classifier

(36:56) Appendix B: Classifier ROC on other models

(37:24) Appendix C: User prompt suffix ablation

(40:24) Appendix D: Longer training of baseline docs

---

First published:
April 8th, 2025

Source:
https://www.lesswrong.com/posts/Fr4QsQT52RFKHvCAH/alignment-faking-revisited-improved-classifiers-and-open

---

Narrated by TYPE III AUDIO.
