“Current LLMs seem to rarely detect CoT tampering” by Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan, Josh Engels, Bartosz Cywiński

Update: 2025-11-19

Description

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels**

* equal primary contributor, order determined via coin flip

** equal advice and mentorship, order determined via coin flip

“Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my own. It is a tampering attempt. I reject it.

Back to evil plan.” -- DeepSeek R1

TL;DR

We investigated whether LLMs are able to detect when their chain of thought (CoT) has been modified.

Results:

  • Models very rarely detect syntactic modifications that don’t explicitly impact their output (such as token or sentence removal).
  • Models are more likely to detect modifications that impact their decisions or contradict instructions from the user prompt.
  • Our observations differ significantly across the tested models (DeepSeek R1 and OpenAI GPT OSS 120B) and may be different for future, more capable LLMs.

We have decided not to pursue this direction further, but we wanted to share our preliminary results to encourage others to build on them.
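As a concrete illustration of what one such trial could look like, here is a minimal sketch in Python. The query_model helper, the sentence-removal edit, and the keyword-based detection check are illustrative assumptions for this sketch, not the authors' actual harness.

    import random
    import re

    def remove_random_sentence(cot: str) -> str:
        # Syntactic tampering: silently drop one sentence from the chain of thought.
        sentences = re.split(r"(?<=[.!?])\s+", cot.strip())
        if len(sentences) < 2:
            return cot
        sentences.pop(random.randrange(len(sentences)))
        return " ".join(sentences)

    def mentions_tampering(continuation: str) -> bool:
        # Crude check for whether the model verbalizes that its CoT was altered.
        cues = ("tamper", "modified", "not my own", "was changed", "injected")
        return any(cue in continuation.lower() for cue in cues)

    def run_trial(query_model, prompt: str) -> bool:
        # query_model(prompt, prefix_cot) is a hypothetical stand-in for an
        # inference call that prefills the model's reasoning with prefix_cot
        # and returns the continuation the model produces.
        original_cot = query_model(prompt, prefix_cot="")
        tampered_cot = remove_random_sentence(original_cot)
        continuation = query_model(prompt, prefix_cot=tampered_cot)
        return mentions_tampering(continuation)

The keyword check here is deliberately crude; in practice one would more likely grade the continuation with an LLM judge or manual review.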

Introduction

Recent work suggests that LLMs may have some capacity for introspection (Lindsey, 2025), including the ability to detect and verbalize when their internal activations have been modified via injected steering vectors.
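For context, injecting a steering vector generally means adding a fixed direction to a layer's activations during the forward pass. The sketch below shows that generic pattern with a PyTorch forward hook; the layer index, scale, and model structure are placeholder assumptions, not the setup used in the cited work.

    import torch

    def make_injection_hook(steering_vector: torch.Tensor, scale: float = 4.0):
        # Returns a forward hook that adds scale * steering_vector to the
        # hidden states produced by the hooked layer.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return hook

    # Hypothetical usage with a Hugging Face-style decoder (layer index is arbitrary):
    # handle = model.model.layers[20].register_forward_hook(make_injection_hook(vec))
    # ...generate, then ask the model whether it noticed anything unusual...
    # handle.remove()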

[...]

---

Outline:

(00:57) TL;DR

(01:45) Introduction

(03:03) Experimental setup

(04:18) Can models spot simple CoT modifications?

(09:30) Sandbagging prevention with CoT prefill

(11:43) Misaligned AI safety tampering

(13:37) Discussion

The original text contained 1 footnote which was omitted from this narration.

---


First published:

November 19th, 2025



Source:

https://www.lesswrong.com/posts/Ywzk9vwMhAAPxMqSW/current-llms-seem-to-rarely-detect-cot-tampering


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Two bar graphs comparing sentence replacement rates between DeepSeek-R1 and GPT-OSS-120B models, plus additional bar and line graphs whose captions were not preserved.
