“The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs’ undecoded outputs” by Lovre

Update: 2025-10-19
Description

Audio note: this article contains 225 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Introduction

There are 208 ways to output the text ▁LessWrong[1] with the Llama 3 tokenizer, but even if you were to work with Llama 3 for thousands of hours, you would be unlikely to see any but one. An example that generalizes quite widely: if you prompt Llama 3.2 3B Base with the text You're interested in rationality and AI? You should visit, there is a ≈22.7003% chance that it outputs the text ▁LessWrong, of which

  • ≈22.7001% is that it outputs exactly the tokens ▁Less and Wrong,
  • ≈0.00024% that it outputs exactly the tokens ▁Less, W, and rong,
  • and ≈0.000017% chance that it outputs any of the other 206 tokenizations which result in the text ▁LessWrong.
All 208[2] possible tokenizations of  LessWrong, the [...]
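
The excerpt above counts 208 token sequences that decode to the same string. As a rough illustration (not the post's code), here is a minimal Python sketch that enumerates every way a string can be segmented into tokens from a given vocabulary; the toy vocabulary below is an assumption standing in for the real Llama 3 vocabulary, which one would normally obtain from the tokenizer itself.

```python
def all_tokenizations(text: str, vocab: set[str]) -> list[list[str]]:
    """Return every segmentation of `text` into tokens drawn from `vocab`."""
    if text == "":
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # Keep this prefix as a token and recurse on the remainder.
            for rest in all_tokenizations(text[end:], vocab):
                results.append([prefix] + rest)
    return results

# Toy stand-in vocabulary (hypothetical): the multi-character pieces mentioned
# in the post plus every single character, so at least one segmentation exists.
toy_vocab = {" Less", "Wrong", "W", "rong", "ong"} | set(" LessWrong")

segmentations = all_tokenizations(" LessWrong", toy_vocab)
print(len(segmentations), "tokenizations under the toy vocabulary, e.g.:")
print(segmentations[0])
```

With a real tokenizer's vocabulary, each enumerated token sequence could then be scored by the model to obtain conditional probabilities of the kind quoted in the excerpt.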

---

Outline:

(00:26 ) Introduction

(05:38 ) A motivating example

(07:23 ) Background information on tokenizers

(08:40 ) Related works

(12:05 ) Basic computations with Llama 3 tokenizer

(14:31 ) Constructing a distribution over all tokenizations of a string

(15:05 ) To which extent do alternative tokenizations break Llama?

(15:31 ) ARC-Easy

(20:28 ) A little bit of introspection

(22:55 ) Learning multiple functions redux, finally

(25:44 ) Function maximizer

(29:42 ) An example

(30:48 ) Results

(31:23 ) What I talk about when I talk about both axes

(32:59 ) Encoding and decoding bits

(35:56 ) Decoding

(36:44 ) Encoding

(38:54 ) Could the usage of alternative tokenizations arise naturally?

(41:38 ) Has it already happened?

(43:27 ) Appendix: The psychological effects of tokenization

The original text contained 18 footnotes which were omitted from this narration.

---


First published:

October 17th, 2025



Source:

https://www.lesswrong.com/posts/g9DmSzHxJXBD9poJR/the-dark-arts-of-tokenization-or-how-i-learned-to-start


---


Narrated by TYPE III AUDIO.


---

Images from the article:

All 208 possible tokenizations of  LessWrong, the number (#) of tokens in each tokenization, and the conditional probability of the token sequence given the prompt. Green highlight denotes the most likely continuation with that number of tokens, and red highlight the least likely one (except for 2 and 10, for which there exists a unique tokenization of that length).
The colored part is an illustration of the input tokens, the middle part is the model output, and the last part is just the computation written out.
Examples of tokenizations of an ARC-Easy question together with our prompt. The instructions are always tokenized in the default way; the tokenization of questions and answers varies.
Plots displaying the decrease in accuracy as the tokenization moves from the default one to the every-character-is-its-own-token one. Dashed vertical lines are the accuracies for that length class when every question is tokenized in the default tokenization.
Example of a prompt. The text A has 177 tokens, while text B has 190 tokens.
Average accuracy of the model's classification as the difference in tokenization lengths of the two examples increases. Blue line is a logistic GAM smoothed line representing the mean at that absolute difference, while bars show the average accuracy in the bins of lengths 1-20, 21-40, etc.
Colors denote the length of the tokens: pink = 3-digit token, blue = 2-digit token, and peach = 1-digit token. Each 1–3 digit string is a (distinct) token in the Llama 3 tokenizer, and conversely every digit token has 1–3 digits.
(Image because the dark mode breaks some parts of the rendered version; LaTeX code is linked in the post.)
Plot of mean of emotion ratings for each emotion and for three levels of tokenization.
This appears to be code syntax highlighting showing text broken into colored segments with numerical annotations above.
A poetic dialogue displayed with color-coded text blocks and binary numbers.
Four panels showing encoding/decoding prompts for a language model conversation system.  The panels demonstrate train and evaluation formats, with special tokens, text fragments, and binary sequences highlighted in different colors. The layout shows parallel encoding and decoding processes.
Grid of 16 emotional state graphs comparing two different tokenization methods across time. Each graph shows blue and red trend lines representing the two tokenization methods.
