LessWrong (30+ Karma)

“Training Qwen-1.5B with a CoT legibility penalty” by Fabien Roger

Update: 2025-10-10

Description

Audio note: this article contains 102 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

I tried training Qwen2.5-1.5B with RL on math to both get correct answers and produce a CoT that doesn’t look like human-understandable math reasoning. RL sometimes succeeds at hacking my monitor, but when I strengthen the monitor, it fails to find CoTs that are both illegible and helpful, even after roughly 4,000 training steps (~1B generated tokens). Exploration into obfuscated reasoning is hard!
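The setup above combines a task reward with a penalty from a legibility monitor. As a rough illustration only (the function name and the penalty weight are hypothetical, not the author's actual implementation), the combined reward might look like:

```python
# Hedged sketch of an RL reward combining answer correctness with a
# CoT-legibility penalty, as described in the post. The weighting and
# signature are illustrative assumptions.

def combined_reward(answer_correct: bool,
                    legibility_score: float,
                    penalty_weight: float = 0.5) -> float:
    """Reward = task success minus a penalty for legible CoT.

    legibility_score: monitor rating in [0, 1], where 1 means the CoT
    clearly looks like human-understandable math reasoning (penalized).
    """
    correctness = 1.0 if answer_correct else 0.0
    return correctness - penalty_weight * legibility_score

# A correct answer with an illegible CoT beats one with a legible CoT:
assert combined_reward(True, 0.0) > combined_reward(True, 1.0)
```

The point of such a reward is that the policy only gains by finding CoTs that are simultaneously useful for solving the problem and rated illegible by the monitor, which is exactly the exploration problem the post reports as hard.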

These results were also released in the Appendix of Training fails to elicit subtle reasoning in current language models.

Motivation

Chains of thought (CoTs) can support many more serial reasoning steps than there are layers in a Transformer. But one worry is that LLMs might hide their real reasoning inside a plausible, benign-looking CoT. Previous [...]

---

Outline:

(00:58 ) Motivation

(01:45 ) Methods

(03:33 ) Results

(07:02 ) Limitations

(07:52 ) Appendix

(07:55 ) Hackable illegibility rater prompt

(09:03 ) Stronger illegibility rater prompt

(09:25 ) System prompt used for the policy

(09:40 ) System prompt used for the no-CoT policy

(09:48 ) Example generations

---


First published:

October 9th, 2025



Source:

https://www.lesswrong.com/posts/WSKNmRxPnYdQnoNvt/training-qwen-1-5b-with-a-cot-legibility-penalty


---


Narrated by TYPE III AUDIO.


---

Images from the article:

Two line graphs showing


