“Training Qwen-1.5B with a CoT legibility penalty” by Fabien Roger
Description
Audio note: this article contains 102 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
I tried training Qwen2.5-1.5B with RL on math to both get correct answers and produce a CoT that doesn't look like human-understandable math reasoning. RL sometimes succeeds at hacking my monitor, and when I strengthen the monitor, it fails to find CoTs that are both illegible and helpful, even after training for roughly 4,000 steps (~1B generated tokens). Exploration into obfuscated reasoning is hard!
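As a rough illustration of the setup, here is a minimal sketch (not the author's code) of a combined reward of this kind: answer correctness plus an illegibility bonus from a monitor. `rate_illegibility` is a hypothetical stand-in for the LLM-based illegibility rater whose prompts appear in the appendix; the names, heuristic, and weighting are illustrative assumptions.

```python
def rate_illegibility(cot: str) -> float:
    """Placeholder monitor: in the post this is an LLM rater scoring how
    little the CoT looks like human-understandable math reasoning."""
    # Toy heuristic for illustration only: fraction of non-alphabetic,
    # non-whitespace characters in the chain of thought.
    if not cot:
        return 0.0
    return sum(not c.isalpha() and not c.isspace() for c in cot) / len(cot)

def reward(answer: str, gold: str, cot: str, penalty_weight: float = 1.0) -> float:
    """Correct answers earn reward; legible CoTs are penalized."""
    correctness = 1.0 if answer.strip() == gold.strip() else 0.0
    legibility_penalty = 1.0 - rate_illegibility(cot)
    return correctness - penalty_weight * legibility_penalty
```

A policy trained against a reward like this can "hack" a weak monitor by making the CoT superficially illegible (e.g., padding it with symbols) while keeping the useful reasoning intact, which is why the post also reports results with a stronger rater.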
These results were also released in the appendix of "Training fails to elicit subtle reasoning in current language models."
Motivation
Chains of Thought (CoTs) can help a model reason for many more serial steps than there are layers in a Transformer. But one worry is that LLMs might hide their real reasoning inside a plausible, benign-looking CoT. Previous [...]
---
Outline:
(00:58 ) Motivation
(01:45 ) Methods
(03:33 ) Results
(07:02 ) Limitations
(07:52 ) Appendix
(07:55 ) Hackable illegibility rater prompt
(09:03 ) Stronger illegibility rater prompt
(09:25 ) System prompt used for the policy
(09:40 ) System prompt used for the no-CoT policy
(09:48 ) Example generations
---
First published:
October 9th, 2025
Source:
https://www.lesswrong.com/posts/WSKNmRxPnYdQnoNvt/training-qwen-1-5b-with-a-cot-legibility-penalty
---
Narrated by TYPE III AUDIO.
---