Why Not Just Train For Interpretability?

Update: 2025-11-22

Description

Simplicio: Hey, I’ve got an alignment research idea to run by you.

Me: … guess we’re doing this again.

Simplicio: Interpretability work on trained nets is hard, right? So instead of that, what if we pick an architecture and/or training objective to produce interpretable nets right from the get-go?

Me: If we had the textbook of the future on hand, then maybe. But in practice, you’re planning to use some particular architecture and/or objective which will not work.

Simplicio: That sounds like an empirical question! We can’t know whether it works until we try it. And I haven’t thought of any reason it would fail.

Me: Ok, let's get concrete here. What architecture and/or objective did you have in mind?

Simplicio: Decision trees! They’re highly interpretable, and my decision theory textbook says they’re fully general in principle. So let's just make a net tree-shaped, and train that! Or, if that's not quite general enough, we train a bunch of tree-shaped nets as “experts” and then mix them somehow.

Me: Turns out we’ve tried that one! It’s called a random forest; it was all the rage back in the 2000s.

Simplicio: So we just go back to that?

Me: Alas [...]
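A minimal sketch of the kind of thing Simplicio is proposing, assuming scikit-learn and its toy iris dataset (neither is mentioned in the post): a random forest is an ensemble of tree-shaped "experts", and each individual tree prints out as a readable stack of if/else rules.

```python
# Sketch only: fit a small random forest and inspect one of its trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Any single tree is interpretable: a short list of threshold rules.
print(export_text(forest.estimators_[0]))
# The forest's prediction, however, is a vote over all ten trees,
# so the ensemble as a whole is much harder to read than any one tree.
```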

---


First published: November 21st, 2025



Source: https://www.lesswrong.com/posts/2HbgHwdygH6yeHKKq/why-not-just-train-for-interpretability


---


Narrated by TYPE III AUDIO.
