DiscoverLessWrong (30+ Karma)“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda
“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

Update: 2025-10-30
Share

Description

📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism

TL, DR:;

  • We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested.
    • The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written.
    • Our training process was designed to mimic how evaluation awareness could arise in practice: We first fine-tune the model on synthetic documents with factual descriptions of the model's context-dependent behavior and ability to recognize the evaluation cue. Then we train the model with expert iteration on evaluation data only to always use type hints.
    • The resulting model uses types hints more during evaluation than in deployment.
    • Prior work (Anthropic 2025; Abelnabi & Salem 2025) has shown that steering can stop the model talking about being evaluated, but lacks a ground truth of knowing how models behave if they are deployed. Having a model organism resolves this.
  • We show that activation steering can elicit the model's deployment behavior even when simple prompting fails.
    • When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...]

---

Outline:

(02:45 ) Extended Summary

(09:18 ) Commentary about the paper from Tim:

(09:52 ) Thoughts on how to steer models to believe they are deployed

(10:06 ) Contrastive prompts are good actually

(12:10 ) Steer on multiple layers at once

(13:09 ) Consider extracting the steering vector from an earlier checkpoint as well

(14:54 ) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well)

(15:24 ) Creating behavioral demonstrations of misalignment based on the steered model.

(16:46 ) Other things to do with a model that's been steered to believe it's deployed

(17:31 ) How others can build on this work

(18:07 ) Self-Review of the Paper's Strengths and Weaknesses:

(18:19 ) Main strengths of the paper:

(18:28 ) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering.

(19:28 ) Steering can trick the model better than simple deployment prompts in our adversarial setting.

(20:47 ) Most sketchy parts of the paper:

(20:58 ) Llama Nemotron 49B is gullible and easy to trick.

(23:12 ) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior.

(24:28 ) Our final model organism struggles to tell deployment from evaluation using subtle clues.

(25:03 ) We did not explore models who always think they are being evaluated unless a certain deployment cue is present.

(26:13 ) Bottom Line: Will activation steering work for near-human level misaligned AIs?

(27:26 ) Other miscellaneous takes on evaluation awareness:

(28:44 ) Other miscellaneous takes on conducting this type of research

The original text contained 3 footnotes which were omitted from this narration.

---


First published:

October 30th, 2025



Source:

https://www.lesswrong.com/posts/peKrvZ6t9PSCzoQDa/steering-evaluation-aware-models-to-act-like-they-are


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Diagram showing four steps to obtain a steering vector from contrastive prompts.
A scatter plot showing model evaluation frequency versus response type hints, with confidence intervals.
Scatter plot showing relationship between model evaluation frequency and response accuracy with type hints.
Diagram showing training process for evaluation-aware AI model behavior using cues.
Bar graph titled
Flowchart showing two-phase training process for Llama model evaluation awareness.
Graph showing responses with type hints across different system and user prompt formats, titled
Line graph showing

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Comments 
In Channel
loading
00:00
00:00
x

0.5x

0.8x

1.0x

1.25x

1.5x

2.0x

3.0x

Sleep Timer

Off

End of Episode

5 Minutes

10 Minutes

15 Minutes

30 Minutes

45 Minutes

60 Minutes

120 Minutes

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

“Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda