“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda
Description
📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism
TL, DR:;
-  We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested.- The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written.
- Our training process was designed to mimic how evaluation awareness could arise in practice: We first fine-tune the model on synthetic documents with factual descriptions of the model's context-dependent behavior and ability to recognize the evaluation cue. Then we train the model with expert iteration on evaluation data only to always use type hints.
- The resulting model uses types hints more during evaluation than in deployment.
- Prior work (Anthropic 2025; Abelnabi & Salem 2025) has shown that steering can stop the model talking about being evaluated, but lacks a ground truth of knowing how models behave if they are deployed. Having a model organism resolves this.
 
-  We show that activation steering can elicit the model's deployment behavior even when simple prompting fails.- When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...]
 
---
Outline:
(02:45 ) Extended Summary
(09:18 ) Commentary about the paper from Tim:
(09:52 ) Thoughts on how to steer models to believe they are deployed
(10:06 ) Contrastive prompts are good actually
(12:10 ) Steer on multiple layers at once
(13:09 ) Consider extracting the steering vector from an earlier checkpoint as well
(14:54 ) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)
(15:24 ) Creating behavioral demonstrations of misalignment based on the steered model.
(16:46 ) Other things to do with a model that's been steered to believe it's deployed
(17:31 ) How others can build on this work
(18:07 ) Self-Review of the Paper's Strengths and Weaknesses:
(18:19 ) Main strengths of the paper:
(18:28 ) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.
(19:28 ) Steering can trick the model better than simple deployment prompts in our adversarial setting.
(20:47 ) Most sketchy parts of the paper:
(20:58 ) Llama Nemotron 49B is gullible and easy to trick.
(23:12 ) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.
(24:28 ) Our final model organism struggles to tell deployment from evaluation using subtle clues.
(25:03 ) We did not explore models who always think they are being evaluated unless a certain deployment cue is present.
(26:13 ) Bottom Line: Will activation steering work for near-human level misaligned AIs?
(27:26 ) Other miscellaneous takes on evaluation awareness:
(28:44 ) Other miscellaneous takes on conducting this type of research
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
          October 30th, 2025 
---
Narrated by TYPE III AUDIO.
---
Images from the article:








Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.







