“Base64Bench: How good are LLMs at base64, and why care about it?” by richbc

Update: 2025-10-06

Description

This was a quick side project produced during the MATS Research 8.1 extension. It relates to my group's main thread of work on black-box scheming monitoring, via the connections to monitoring I explore below, but it was time-boxed and pursued independently because I thought it was interesting!

Executive Summary

Figure 1. Accuracy vs. similarity threshold (0.95+) for 1700 pairs of encoding/decoding examples spanning a variety of data types and lengths. The accuracy is the proportion of the 3400 examples each model translated successfully (directly, with no reasoning or tools). Success on each task means the normalised Levenshtein similarity of the answer/target pair reaches a given threshold, with the additional scoring requirement that model-encoded strings are decodable. Legend ordered by accuracy@1.0.
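The scoring rule in the caption can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the function names (`levenshtein`, `similarity`, `is_decodable`, `score`) are hypothetical, and normalising the edit distance by the longer string's length is an assumption, since the caption does not say how the normalisation is done.

```python
import base64
import binascii


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using a two-row DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(answer: str, target: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(answer), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(answer, target) / longest


def is_decodable(s: str) -> bool:
    """True if s is valid base64 (alphabet and padding both checked)."""
    try:
        base64.b64decode(s, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def score(answer: str, target: str, threshold: float = 1.0,
          task: str = "encode") -> bool:
    """A task succeeds if similarity reaches the threshold; encoded
    answers must additionally be decodable base64."""
    if task == "encode" and not is_decodable(answer):
        return False
    return similarity(answer, target) >= threshold


target = base64.b64encode(b"hello world").decode()  # "aGVsbG8gd29ybGQ="
one_char_off = "b" + target[1:]                     # still valid base64
missing_padding = target[:-1]                       # not decodable

print(score(target, target))                        # exact match
print(score(one_char_off, target, threshold=0.9))   # close, and decodable
print(score(missing_padding, target, threshold=0.9))  # close, but rejected
```

Under these assumptions, an encoded answer that is one character off but still valid base64 passes at a 0.9 threshold, while an answer that only drops the padding is rejected outright because it no longer decodes.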

Introducing Base64Bench: a simple new benchmark for evaluating models on their ability to encode and decode base64.
- Base64 encoding and decoding are reasonably complex computational tasks to do perfectly [...]

---

Outline:

(00:31) Executive Summary

(03:07) An accidental (and surprising) discovery

(08:03) Have LLMs actually learned the algorithm?

(09:39) Introducing

(13:11) Accuracy vs. similarity threshold

(16:02) Encoding vs. decoding by model

(17:00) Task-level breakdown

(19:37) Why should we care?

(21:26) Monitoring implications

(23:51) Conclusion

(25:23) Appendix

(25:26) Zoomed-in threshold sweeps

The original text contained 8 footnotes which were omitted from this narration.

---

First published:

October 5th, 2025

Source:

https://www.lesswrong.com/posts/5F6ncBfjh2Bxnm6CJ/base64bench-how-good-are-llms-at-base64-and-why-care-about

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
