“The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs’ undecoded outputs” by Lovre

Update: 2025-10-19
Description

Audio note: this article contains 225 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Introduction

There are 208 ways to output the text ▁LessWrong[1] with the Llama 3 tokenizer, but even if you were to work with Llama 3 for thousands of hours, you would be unlikely to see any but one. An example that generalizes quite widely: if you prompt Llama 3.2 3B Base with the text You're interested in rationality and AI? You should visit, there is a ≈22.7003% chance that it outputs the text ▁LessWrong, of which

  • ≈22.7001% is that it outputs exactly the tokens ▁Less and Wrong,
  • ≈0.00024% that it outputs exactly the tokens ▁Less, W, and rong,
  • and ≈0.000017% chance that it outputs any of the other 206 tokenizations which result in the text ▁LessWrong.
All 208[2] possible tokenizations of  LessWrong, the [...]
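
The excerpt above counts 208 token sequences that decode to the same string. As a rough illustration (not the post's code), here is a minimal Python sketch that enumerates every way a string can be segmented into tokens from a given vocabulary; the toy vocabulary below is an assumption standing in for the real Llama 3 vocabulary, which one would normally obtain from the tokenizer itself.

```python
def all_tokenizations(text: str, vocab: set[str]) -> list[list[str]]:
    """Return every segmentation of `text` into tokens drawn from `vocab`."""
    if text == "":
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # Keep this prefix as a token and recurse on the remainder.
            for rest in all_tokenizations(text[end:], vocab):
                results.append([prefix] + rest)
    return results

# Toy stand-in vocabulary (hypothetical): the multi-character pieces mentioned
# in the post plus every single character, so at least one segmentation exists.
toy_vocab = {" Less", "Wrong", "W", "rong", "ong"} | set(" LessWrong")

segmentations = all_tokenizations(" LessWrong", toy_vocab)
print(len(segmentations), "tokenizations under the toy vocabulary, e.g.:")
print(segmentations[0])
```

With a real tokenizer's vocabulary, each enumerated token sequence could then be scored by the model to obtain conditional probabilities of the kind quoted in the excerpt.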

---

Outline:

(00:26 ) Introduction

(05:38 ) A motivating example

(07:23 ) Background information on tokenizers

(08:40 ) Related works

(12:05 ) Basic computations with Llama 3 tokenizer

(14:31 ) Constructing a distribution over all tokenizations of a string

(15:05 ) To which extent do alternative tokenizations break Llama?

(15:31 ) ARC-Easy

(20:28 ) A little bit of introspection

(22:55 ) Learning multiple functions redux, finally

(25:44 ) Function maximizer

(29:42 ) An example

(30:48 ) Results

(31:23 ) What I talk about when I talk about both axes

(32:59 ) Encoding and decoding bits

(35:56 ) Decoding

(36:44 ) Encoding

(38:54 ) Could the usage of alternative tokenizations arise naturally?

(41:38 ) Has it already happened?

(43:27 ) Appendix: The psychological effects of tokenization

The original text contained 18 footnotes which were omitted from this narration.

---


First published:

October 17th, 2025



Source:

https://www.lesswrong.com/posts/g9DmSzHxJXBD9poJR/the-dark-arts-of-tokenization-or-how-i-learned-to-start


---


Narrated by TYPE III AUDIO.


---

Images from the article:

All 208 possible tokenizations of  LessWrong, the number (#) of tokens in each tokenization, and the conditional probability of the token sequence given the prompt. Green highlight denotes the most likely continuation with that number of tokens, and red highlight the least likely one (except for 2 and 10, for which there exists a unique tokenization of that length).
The colored part is an illustration of the input tokens, the middle part is the model output, and the last part is just the computation written out.
Examples of tokenizations of an ARC-Easy question together with our prompt. The instructions are always tokenized in the default way; the tokenization of questions and answers varies.
Plots displaying the decrease in accuracy as the tokenization moves from the default one to the every-character-is-its-own-token one. Dashed vertical lines are the accuracies for that length class when every question is tokenized in the default tokenization.
Example of a prompt. The text A has 177 tokens, while text B has 190 tokens.
Average accuracy of the model's classification as the difference in tokenization lengths of the two examples increases. Blue line is a logistic GAM smoothed line representing the mean at that absolute difference, while bars show the average accuracy in the bins of lengths 1-20, 21-40, etc.
Colors denote the length of the tokens: pink = 3-digit token, blue = 2-digit token, and peach = 1-digit token. Each 1–3 digit string is a (distinct) token in the Llama 3 tokenizer, and conversely every digit token has 1–3 digits.
(Image because the dark mode breaks some parts of the rendered version; LaTeX code is linked in the post.)
Plot of mean of emotion ratings for each emotion and for three levels of tokenization.
This appears to be code syntax highlighting showing text broken into colored segments with numerical annotations above.
A poetic dialogue displayed with color-coded text blocks and binary numbers.
Four panels showing encoding/decoding prompts for a language model conversation system.  The panels demonstrate train and evaluation formats, with special tokens, text fragments, and binary sequences highlighted in different colors. The layout shows parallel encoding and decoding processes.
Grid of 16 emotional state graphs comparing two different tokenization methods across time. Each graph shows blue and red trend lines representing the two tokenization methods.
