Doom Debates
Why AI Alignment Is 0% Solved — Ex-MIRI Researcher Tsvi Benson-Tilsen

Update: 2025-10-31

Description

Tsvi Benson-Tilsen spent seven years tackling the alignment problem at the Machine Intelligence Research Institute (MIRI). Now he delivers a sobering verdict: humanity has made “basically 0%” progress towards solving it.

Tsvi unpacks foundational MIRI research insights like timeless decision theory and corrigibility, which expose just how little humanity actually knows about controlling superintelligence.

These theoretical alignment concepts help us peer into the future, revealing the non-obvious, structural laws of “intellidynamics” that will ultimately determine our fate.

Time to learn some of MIRI’s greatest hits.

P.S. I also have a separate interview with Tsvi about his research into human augmentation: Watch here!

Timestamps

0:00 — Episode Highlights

0:49 — Humanity Has Made 0% Progress on AI Alignment

1:56 — MIRI’s Greatest Hits: Reflective Probability Theory, Logical Uncertainty, Reflective Stability

6:56 — Why Superintelligence is So Hard to Align: Self-Modification

8:54 — AI Will Become a Utility Maximizer (Reflective Stability)

12:26 — The Effect of an “Ontological Crisis” on AI

14:41 — Why Modern AI Will Not Be ‘Aligned By Default’

18:49 — Debate: Have LLMs Solved the “Ontological Crisis” Problem?

25:56 — MIRI Alignment Greatest Hit: Timeless Decision Theory

35:17 — MIRI Alignment Greatest Hit: Corrigibility

37:53 — No Known Solution for Corrigible and Reflectively Stable Superintelligence

39:58 — Recap

Show Notes

Stay tuned for part 3 of my interview with Tsvi where we debate AGI timelines!

Learn more about Tsvi’s organization, the Berkeley Genomics Project: https://berkeleygenomics.org

Watch part 1 of my interview with Tsvi:

Transcript

Episode Highlights

Tsvi Benson-Tilsen 00:00:00 If humans really f*cked up, when we try to reach into the AI and correct it, the AI does not want humans to modify the core aspects of what it values.

Liron Shapira 00:00:09 This concept is very deep, very important. It’s almost MIRI in a nutshell. I feel like MIRI’s whole research program is noticing: hey, when we run the AI, we’re probably going to get a bunch of generations of thrashing. But that’s probably only after we’re all dead and things didn’t happen the way we wanted. I feel like that is what MIRI is trying to tell the world. Meanwhile, the world is like, “la la la, LLMs, reinforcement learning—it’s all good, it’s working great. Alignment by default.”

Tsvi 00:00:34 Yeah, that’s certainly how I view it.

Humanity Has Made 0% Progress on AI Alignment

Liron Shapira 00:00:46 All right. I want to move on to talk about your MIRI research. I have a lot of respect for MIRI. A lot of viewers of the show appreciate MIRI’s contributions. I think it has made real major contributions in my opinion—most are on the side of showing how hard the alignment problem is, which is a great contribution. I think it worked to show that. My question for you is: having been at MIRI for seven and a half years, how are we doing on theories of AI alignment?

Tsvi Benson-Tilsen 00:01:10 I can’t speak with 100% authority because I’m not necessarily up to date on everything and there are lots of researchers and lots of controversy. But from my perspective, we are basically at 0%—at zero percent done figuring it out. Which is somewhat grim. Basically, there’s a bunch of fundamental challenges, and we don’t know how to grapple with these challenges. Furthermore, it’s sort of sociologically difficult to even put our attention towards grappling with those challenges, because they’re weirder problems—more pre-paradigmatic. It’s harder to coordinate multiple people to work on the same thing productively.

It’s also harder to get funding for super blue-sky research. And the problems themselves are just slippery.

MIRI Alignment Greatest Hits: Reflective Probability Theory, Logical Uncertainty, Reflective Stability

Liron 00:01:55 Okay, well, you were there for seven years, so how did you try to get us past zero?

Tsvi 00:02:00 Well, I would sort of vaguely (or coarsely) break up my time working at MIRI into two chunks. The first chunk is research programs that were pre-existing when I started: reflective probability theory and reflective decision theory. Basically, we were trying to understand the mathematical foundations of a mind that is reflecting on itself—thinking about itself and potentially modifying itself, changing itself. We wanted to think about a mind doing that, and then try to get some sort of fulcrum for understanding anything that’s stable about this mind.

Something we could say about what this mind is doing and how it makes decisions—like how it decides how to affect the world—and have our description of the mind be stable even as the mind is changing in potentially radical ways.

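A rough way to picture the “stable description” Tsvi is pointing at, as a toy formalization (my illustration, not a formula from the episode): treat self-modifications as just another kind of action the agent can take in pursuit of a fixed utility function $U$,

\[
a^{*} \;\in\; \arg\max_{a \,\in\, \mathcal{A} \,\cup\, \mathcal{M}} \; \mathbb{E}\!\left[\, U \mid a \,\right],
\]

where $\mathcal{A}$ is the set of ordinary actions and $\mathcal{M}$ is the set of possible successor designs. Reflective stability is then the property that when the chosen $a^{*}$ is a self-modification, the successor it produces still selects its actions by maximizing the same $U$, so the description “this mind maximizes $U$” keeps holding even as the mind rewrites itself.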
Liron 00:02:46 Great. Okay. Let me try to translate some of that for the viewers here. So, MIRI has been the premier organization studying intelligence dynamics, and Eliezer Yudkowsky—especially—people on social media like to dunk on him and say he has no qualifications, he’s not even an AI expert. In my opinion, he’s actually good at AI, but yeah, sure. He’s not a top world expert at AI, sure. But I believe that Eliezer Yudkowsky is in fact a top world expert in the subject of intelligence dynamics. Is this reasonable so far, or do you want to disagree?

Tsvi 00:03:15 I think that’s fair so far.

Liron 00:03:16 Okay. And I think his research organization, MIRI, has done the only sustained program to even study intelligence dynamics—to ask the question, “Hey, let’s say there are arbitrarily smart agents. What should we expect them to do? What kind of principles do they operate on, just by virtue of being really intelligent?” Fair so far.

Now, you mentioned a couple things. You mentioned reflective probability. From what I recall, it’s the idea that—well, we know probability theory is very useful and we know utility maximization is useful. But it gets tricky because sometimes you have beliefs that are provably true or false, like beliefs about math, right? For example, beliefs about the millionth digit of π. I mean, how can you even put a probability on the millionth digit of π?

The probability of any particular digit is either 100% or 0%, ‘cause there’s only one definite digit. You could even prove it in principle. And yet, in real life you don’t know the millionth digit of π yet (you haven’t done the calculation), and so you could actually put a probability on it—and then you kind of get into a mess, ‘cause things that aren’t supposed to have probabilities can still have probabilities. How is that?

Tsvi 00:04:16 That seems right.

Liron 00:04:18 I think what I described might be—oh, I forgot what it’s called—like “deductive probability” or something. Like, how do you...

Tsvi 00:04:22 (interjecting) Uncertainty.

Liron 00:04:23 Logical uncertainty. So is reflective probability something else?

Tsvi 00:04:26 Yeah. If we want to get technical: logical uncertainty is this. Probability theory usually deals with some fact that I’m fundamentally unsure about (like I’m going to roll some dice; I don’t know what number will come up, but I still want to think about what’s likely or unlikely to happen). Usually probability theory assumes there’s some fundamental randomness or unknown in the universe.

But then there’s this further question: you might actually already know enough to determine the answer to your question, at least in principle. For example, what’s the billionth digit of π—is the billionth digit even or odd? Well, I know a definition of π that determines the answer. Given the definition of π, you can compute out the digits, and eventually you’d get to the billionth one and you’d know if it’s even. But sitting here as a human, who doesn’t have a Python interpreter in his head, I can’t actually figure it out right now. I’m uncertain about this thing, even though I already know enough (in principle, logically speaking) to determine the answer. So that’s logical uncertainty—I’m uncertain about a logical fact.

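To make the logical-uncertainty example concrete, here is a minimal Python sketch (my illustration, assuming the mpmath library is installed; nothing here comes from the episode itself). The parity of a given decimal digit of π is a fully determined logical fact, yet before actually running the computation a reasoner can only assign it a subjective probability of roughly 0.5.

```python
# Illustrative sketch of logical uncertainty (not from the episode).
# The Nth decimal digit of pi is a determined logical fact, but until the
# computation is actually run, a reasoner can only assign it a probability.
from mpmath import mp

N = 1000                         # stand-in for "the millionth/billionth digit"; keeps the run fast

mp.dps = N + 10                  # working precision: N digits plus a few guard digits
pi_str = mp.nstr(mp.pi, N + 5)   # "3." followed by at least N decimal digits
nth = int(pi_str[N + 1])         # index 0 is '3', index 1 is '.', so decimal #N sits at index N+1

print(f"Decimal digit #{N} of pi is {nth}; even = {nth % 2 == 0}")
# Before running this, ~0.5 is a sensible credence for "even";
# afterwards the credence collapses to 0 or 1, even though the fact never changed.
```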
Tsvi 00:05:35 Reflective probability is sort of a sharpening or a subset of that. Let’s say I’m asking, “What am I going to do tomorrow? Is my reasoning system flawed in such a way that I should make a correction to my own reasoning system?” If you want to think about that, you’re asking about a very, very complex object. I’m asking about myself (or my future self). And because I’m asking about such a complex object, I cannot compute exactly what the answer will be. I can’t just sit here and imagine every single future pathway I might take…

