“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

Update: 2025-10-30

Description

📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism

TL, DR:;

We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested.
- The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written.
- Our training process was designed to mimic how evaluation awareness could arise in practice: We first fine-tune the model on synthetic documents with factual descriptions of the model's context-dependent behavior and ability to recognize the evaluation cue. Then we train the model with expert iteration on evaluation data only to always use type hints.
- The resulting model uses types hints more during evaluation than in deployment.
- Prior work (Anthropic 2025; Abelnabi & Salem 2025) has shown that steering can stop the model talking about being evaluated, but lacks a ground truth of knowing how models behave if they are deployed. Having a model organism resolves this.
We show that activation steering can elicit the model's deployment behavior even when simple prompting fails.
- When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...]

---

Outline:

(02:45 ) Extended Summary

(09:18 ) Commentary about the paper from Tim:

(09:52 ) Thoughts on how to steer models to believe they are deployed

(10:06 ) Contrastive prompts are good actually

(12:10 ) Steer on multiple layers at once

(13:09 ) Consider extracting the steering vector from an earlier checkpoint as well

(14:54 ) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)

(15:24 ) Creating behavioral demonstrations of misalignment based on the steered model.

(16:46 ) Other things to do with a model that's been steered to believe it's deployed

(17:31 ) How others can build on this work

(18:07 ) Self-Review of the Paper's Strengths and Weaknesses:

(18:19 ) Main strengths of the paper:

(18:28 ) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.

(19:28 ) Steering can trick the model better than simple deployment prompts in our adversarial setting.

(20:47 ) Most sketchy parts of the paper:

(20:58 ) Llama Nemotron 49B is gullible and easy to trick.

(23:12 ) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.

(24:28 ) Our final model organism struggles to tell deployment from evaluation using subtle clues.

(25:03 ) We did not explore models who always think they are being evaluated unless a certain deployment cue is present.

(26:13 ) Bottom Line: Will activation steering work for near-human level misaligned AIs?

(27:26 ) Other miscellaneous takes on evaluation awareness:

(28:44 ) Other miscellaneous takes on conducting this type of research

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

October 30th, 2025

Source:

https://www.lesswrong.com/posts/peKrvZ6t9PSCzoQDa/steering-evaluation-aware-models-to-act-like-they-are

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Comments

In Channel

“Resampling Conserves Redundancy & Mediation (Approximately) Under the Jensen-Shannon Divergence” by David Lorell

2025-10-3109:12

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

2025-10-3030:26

[Linkpost] “AISLE discovered three new OpenSSL vulnerabilities” by Jan_Kulveit

2025-10-3002:04

“Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

2025-10-3035:58

“ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents” by Ziqian Zhong

2025-10-3008:04

[Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas

2025-10-3003:01

“An Opinionated Guide to Privacy Despite Authoritarianism” by TurnTrout

2025-10-2908:00

“The End of OpenAI’s Nonprofit Era” by garrison

2025-10-2917:39

“Please Do Not Sell B30A Chips to China” by Zvi

2025-10-2913:15

“AI Craziness Mitigation Efforts” by Zvi

2025-10-2922:12

“Some data from LeelaPieceOdds” by Jeremy Gillen

2025-10-2914:05

“When Will AI Transform the Economy?” by Andre.Infante

2025-10-2915:43

“Workshop on Post-AGI Economics, Culture, and Governance” by Raymond Douglas, Jan_Kulveit, scasper, David Duvenaud

2025-10-2903:59

“Introducing the Epoch Capabilities Index (ECI)” by luke_emberson, YafahEdelman, Jsevillamol

2025-10-2901:56

“Mottes and Baileys in AI discourse” by Raemon

2025-10-2915:56

“LLM robots can’t pass butter (and they are having an existential crisis about it)” by Lukas Petersson

2025-10-2909:04

“The Memetics of AI Successionism” by Jan_Kulveit

2025-10-2821:28

“All the lab’s AI safety Plans: 2025 edition” by Algon

2025-10-2831:57

“life lessons from trading” by thiccythot

2025-10-2808:52

“Stability of natural latents in information theoretic terms” by Aram Ebtekar

2025-10-2705:57

00:00

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

#box-pro-ellipsis-176190172135743{-webkit-line-clamp:2;}“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda