Discover
LessWrong (30+ Karma)
4260 Episodes
Reverse
A Prequel: The Tree of Life Inside a DNA Language Model Last year, researchers at Goodfire AI took Evo 2, a genomic foundation model, and found, quite literally, the evolutionary tree of life inside. The phylogenetic relationships between thousands of species were encoded as a curved manifold in the model's internal activations, with geodesic distances along that manifold tracking actual evolutionary branch lengths. Bacteria that diverged hundreds of millions of years ago were far apart on the manifold, and closely related species were nearby. The model was trained to predict the next DNA token. Nobody told it about evolution or gave it a phylogenetic tree as a training signal. But the model needed to encode evolutionary relationships in order to predict DNA well, and so it built a structured geometric representation of those relationships as part of its internal computation, and the representation was good enough that you could extract it with interpretability tools and compare it meaningfully to the ground truth. I saw this and decided to apply the same approach to another type of biological foundation models - those trained on single cell data. If Evo 2 learned the tree of life from raw DNA [...] ---Outline:(00:12) A Prequel: The Tree of Life Inside a DNA Language Model(01:30) Finding the Manifold(04:09) But Does the Extracted Algorithm Actually Work?(07:52) How Small Can You Go Though?(10:14) So, Mechanistic Interpretability is Becoming Dual Use(12:13) Mechanistic Intepretability for Novel Knowledge Discovery(13:02) Join In ---
First published:
March 14th, 2026
Source:
https://www.lesswrong.com/posts/R4xxxAfNpAvpb3LCf/extracting-performant-algorithms-using-mechanistic-5
---
Narrated by TYPE III AUDIO.
The world was fair, the mountains tall, In Elder Days before the fall Of mighty kings in Nargothrond And Gondolin, who now beyond The Western Seas have passed away: The world was fair in Durin's Day. J.R.R. Tolkien I was never meant to work on AI safety. I was never designed to think about superintelligences and try to steer, influence, or change them. I never particularly enjoyed studying the peculiarities of matrix operations, cracking the assumptions of decision theories, or even coding. I know, of course, that at the very bottom, bits and atoms are all the same — causal laws and information processing. And yet, part of me, the most romantic and naive part of me, thinks, metaphorically, that we abandoned cells for computers, and this is our punishment. I was meant, as I saw it, to bring about the glorious transhuman future, in its classical sense. Genetic engineering, neurodevices, DIY biolabs — going hard on biology, going hard on it with extraordinary effort, hubristically, being, you know, awestruck by "endless forms most beautiful" and motivated by the great cosmic destiny of humanity, pushing the proud frontiersman spirit and all that stuff. I was meant, in other words [...] ---
First published:
March 17th, 2026
Source:
https://www.lesswrong.com/posts/2D2WgfohczTemcXvH/requiem-for-a-transhuman-timeline
---
Narrated by TYPE III AUDIO.
We are curious if large language models behave consistently when user prompts contain typos. To explore this, we ran a small experiment injecting typos into BigCodeBench and evaluated several Claude models under increasing noise levels. As the typo rate rose to 16%, Opus’ accuracy dropped by 9%. Surprisingly, Haiku's accuracy increased by 22%. This post examines this unexpected “typo uplift” phenomenon and explores why noise appears to help certain models. Do Typos Make Haiku Try Harder? We first hypothesize that Haiku's capabilities increased because harder-to-read text makes Haiku think harder. This aligns with observed results in humans that difficult fonts make students retain knowledge better, as it forces them to expend more effort. As a proxy for effort, we plotted the number of output tokens generated by both models[1]. Contrary to our hypothesis, the number of output tokens decreased by typo rate. Typos don't make models think harder. As typo rates increase, the output lengths of Haiku and Opus go down. The Anomaly is Haiku-Specific We then tested if other small models have this typo uplift anomaly. We found that both Haiku 3.5 and 4.5 have this effect of increased accuracy as typos increase, while other smaller models from [...] ---Outline:(00:54) Do Typos Make Haiku Try Harder?(01:34) The Anomaly is Haiku-Specific(02:08) The Anomaly is Benchmark-Specific(02:42) The Culprit(04:02) Takeaways for the Eval Engineer(04:06) Not all grading harnesses are created equal(04:48) Scores are lower bounds(05:15) Aligning the model to the eval(05:43) Appendix The original text contained 2 footnotes which were omitted from this narration. ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/tcic5c3BJuh3PybDZ/adding-typos-made-haiku-s-accuracy-go-up-1
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Early 2026 LLMs in scaffolds, from simple ones such as giving the model access to a scratchpad/"chain of thought" up to MCP servers,
skills, and context compaction &c are quite capable. (Obligatory meme link to the METR graph.)
Yet: If someone had told me in 2019 that systems with such capability would exist
in 2026, I would strongly predict that they would be almost uncontrollable
optimizers, ruthlessly & tirelessly pursuing their goals and finding edge
instantiations in everything.
But they don't seem to be doing that. Current-day LLMs are just not that
optimizer-y, they appear to have capable behavior without apparent agent
structure.
Discussions from the time either ruled out giant lookup-tables (Altair 2024):
One obvious problem is that there could be a policy which is the
equivalent of a giant look-up table it's just a list of key-value pairs
where the previous observation sequence is the look-up key, and it returns
a next action. For any well-performing policy, there could exist a table
version of it. These are clearly not of interest, and in some sense they
have no "structure" at all, let alone agent structure. A way to filter
out the look-up tables is [...] The original text contained 3 footnotes which were omitted from this narration. ---
First published:
March 17th, 2026
Source:
https://www.lesswrong.com/posts/a9KqqgjN8gc3Mzzkh/llms-as-giant-lookup-tables-of-shallow-circuits
---
Narrated by TYPE III AUDIO.
Things are relatively quiet on the AI front, so I figured it's time to check in on some other things that have been going on, including various developments at the FDA.
Table of Contents
FDA Reformandum Est.
FDA Delenda Est.
IN MICE.
Doctor, Doctor.
Trust The Process.
Cancer Screening.
Autism Everywhere All At Once.
Other Mental Problems Everywhere All At Once.
Source Data Verification.
External Review Board.
Walk It Off.
An Unhealthy Weight Can Be Worse Than You Realize.
Our GLP-1 Price Cheap.
Right To Die Should Include Right To Try.
FDA Reformandum Est
In lieu of plan A, how about plan B?
Senator Bill Cassidy released a new report on modernizing the FDA. Alex Tabarrok approves, which means it's probably good.
The FDA chief has an even better idea.
Matthew Herper: FDA chief Marty Makary says ‘everything should be over the counter’ unless drug is unsafe or addictive [or requires monitoring].
Annika Kim Constantino: Makary said the FDA is looking at “basic, safe” prescription drugs like nausea medications and vaginal estrogen, which is used to [...] ---Outline:(00:19) FDA Reformandum Est(01:17) FDA Delenda Est(14:11) IN MICE(15:09) Doctor, Doctor(15:38) Trust The Process(16:51) Cancer Screening(18:18) Autism Everywhere All At Once(19:25) Other Mental Problems Everywhere All At Once(21:26) Source Data Verification(26:18) External Review Board(26:57) Walk It Off(28:16) An Unhealthy Weight Can Be Worse Than You Realize(29:04) Our GLP-1 Price Cheap(30:55) Right To Die Should Include Right To Try ---
First published:
March 17th, 2026
Source:
https://www.lesswrong.com/posts/ypnYfPmn6FqAyxCpJ/medical-roundup-7
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This is a rough draft I'm posting here for feedback. If people like it, a version of it might make it into the next scenario report we write. ... We think it's important for decisionmakers to track whether and when they are handing off to AI systems. We expect this will become a hot-button political topic eventually; people will debate whether we should ever handoff to AIs, and if so how, and when. When someone proposes a plan for how to manage the AI crisis or the AGI transition or whatever it's called, others will ask them “So what does your plan say about handoff?” There are two importantly different kinds of handoff: Handing off trust and handing off decisionmaking. You can have one without the other. Trust-handoff means that you are trusting some AI system or set of AI systems not to screw you over. It means that they totally could screw you over, if they chose to, and therefore you are trusting them not to. Decision-handoff means that you are allowing some AI system or set of AI systems to make decisions autonomously, or de-facto-autonomously (e.g. a human is [...] ---Outline:(02:17) Now for some details and nuance:(07:19) When should we hand off trust and when should we hand off decisionmaking? ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/YuMr6kbstuieQHkGj/types-of-handoff-to-ais
---
Narrated by TYPE III AUDIO.
In this post, I’m trying to put forward a narrow, pedagogical point, one that comes up mainly when I’m arguing in favor of LLMs having limitations that human learning does not. (E.g. here, here, here.) See the bottom of the post for a list of subtexts that you should NOT read into this post, including “…therefore LLMs are dumb”, or “…therefore LLMs can’t possibly scale to superintelligence”. Some intuitions on how to think about “real” continual learning Consider an algorithm for training a Reinforcement Learning (RL) agent, like the Atari-playing Deep Q network (2013) or AlphaZero (2017), or think of within-lifetime learning in the human brain, which (I claim) is in the general class of “model-based reinforcement learning”, broadly construed. These are all real-deal full-fledged learning algorithms: there's an algorithm for choosing the next action right now, and there's one or more update rules for permanently changing some adjustable parameters (a.k.a. weights) in the model such that its actions and/or predictions will be better in the future. And indeed, the longer you run them, the more competent they get. When we think of “continual learning”, I suggest that those are good central examples to keep in mind. Here are [...] ---Outline:(00:35) Some intuitions on how to think about real continual learning(04:57) Why real continual learning cant be copied by an imitation learner(09:53) Some things that are off-topic for this post The original text contained 3 footnotes which were omitted from this narration. ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/9rCTjbJpZB4KzqhiQ/you-can-t-imitation-learn-how-to-continual-learn
---
Narrated by TYPE III AUDIO.
I see people repeatedly make the mistake of referencing a very low liquidity prediction market and using it to make a nontrivial point. Usually the implication when a market is cited is that it's number should be taken somewhat seriously, that it's giving us a highly informed probability. Sometimes a market is used to analyze some event that recently occurred; reasoning here looks like "the market on outcome O was trading at X%, then event E happened and the market quickly moved to Y%, thus event E made O less/more likely." Who do I see make this mistake? Rationalists, both casually and gasp in blog posts. Scott Alexander and Zvi (and I really appreciate their work, seriously!) are guilty of this. I'll give a recent example from each of them. From Scott's Mantic Monday post on March 2: Having Your Own Government Try To Destroy You Is (At Least Temporarily) Good For Business On Friday, the Pentagon declared AI company Anthropic a “supply chain risk”, a designation never before given to an American firm. This unprecedented move was seen as an attempt to punish, maybe destroy the company. How effective was it? Anthropic isn’t publicly traded, so we [...] ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/SrtoF6PcbHpzcT82T/psa-predictions-markets-often-have-very-low-liquidity-be
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open TL;DR: We hypothesize that most alignment researchers have more ideas than they have engineering bandwidth to test. AICRAFT is a DARPA-funded project that pairs researchers with a fully managed professional engineering team for two-week pilot sprints, designed specifically for high-risk ideas that might otherwise go untested. We will select 6 applicants and execute a 2 week pilot with each, the most promising pilot may be given a 3 month extension. This is the first MVP for engaging DARPA directly with the alignment community to our knowledge, and if successful can catalyze government scale investment in alignment R&D. Apply here. Applications close March 27, 2026 at 11 PM PST. What is AICRAFT? AICRAFT (Artificial Intelligence Control Research Amplification & Framework for Talent) is a DARPA-funded seedling project executed by AE Studio. The premise is straightforward: we hypothesize that alignment research could progress faster if the best researchers had more leverage. We believe that researchers currently are bottlenecked on either execution (i.e. they are doing the hands-on experiments themselves) or management (i.e. they are managing teams that are executing the work). Management is higher leverage but what if we could push that much [...] ---Outline:(00:15) AICRAFT: DARPA-Funded AI Alignment Researchers -- Applications Open(01:08) What is AICRAFT?(02:49) The Bigger Picture(03:56) Who should apply?(04:26) How it works(05:21) The application(06:11) FAQ ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/nmMdtZveC38atLnDm/aicraft-darpa-funded-ai-alignment-researchers-applications
---
Narrated by TYPE III AUDIO.
I am monitoring surveillance camera V84A. A tall man is walking towards me. He is roughly twenty-five. <faceprint> His name is Damion Prescott. He has a room booked for a whole month. His facial symmetry scores show he is in the 99th percentile. This is in accordance with my holistic impression. <search> School records show both truancy and perfect grades, suggesting high intelligence and disagreeableness. Searching social media. <search>. No record of modeling or acting experience, fame. I will assign him to our tier C high-value client list, based solely on his facial symmetry score and wealth. Reminder to recommend seating him in a high-visibility table, should he be heading to the restaurant. <search> I found a forum post mentioning him on swipeshare.com. Several women are sharing pictures, having seen him on a dating app. I recall Hinge uses highly attractive profiles to entice new users. They appear to be using Damion Prescott's profile heavily in this capacity. The women on the site are memeing about him. They are wondering why almost none of them have matched, apparently this is rare even for the most attractive men. Only one appears to have gone on a date with him. She [...] ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/LTKfRovaJ6jcwDJia/customer-satisfaction-opportunities-1
---
Narrated by TYPE III AUDIO.
Models that appear aligned under black-box evaluation may conceal substantial latent misalignment beneath their observable behavior. Let's say you downloaded a language model from Huggingface. You do all the blackbox evaluation for the safety/alignment, and you are convinced that the model is safe/aligned. But how badly can things go after you update the model? Our recent work shows, both theoretically and empirically, that a language model (or more generally, a neural network) can appear perfectly aligned under black-box evaluation but become arbitrarily misaligned after just a single gradient step on an update set. Strikingly, this observation can happen under any definition of blackbox alignment and for any update set (benign or adversarial). In this post, I will deep dive into this observation and talk about its implications. Theory: Same Forward Computation, Different Backward Computation LLMs or NNs in general are overparameterized. This overparameterization can lead to an interesting case: 2 differently parameterized models can have the exact forward pass. Think about a simple example: the two-layer linear model and the model . Both models output the input x directly, but backward computations are totally different. Now consider a model that is perfectly aligned under blackbox evaluation, i.e. [...] ---Outline:(01:07) Theory: Same Forward Computation, Different Backward Computation(03:18) Hair-Trigger Aligned LLMs(05:44) Whats Next? ---
First published:
March 14th, 2026
Source:
https://www.lesswrong.com/posts/uSgw9muqRZpjpxKDA/llm-misalignment-can-be-one-gradient-step-away-and-blackbox-1
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Previously: Is GDP a Kind of Factory?
There is a word, "convergence," which economists use when they want to say that poor countries are becoming less poor relative to rich ones. There is a phrase, "the resource curse," for the tendency of countries with valuable natural resources to stay poor despite their resources. There is a phrase, "Dutch disease," for the way that selling one commodity too profitably can destroy the ability to sell other things.
When an economist says "Dutch disease," they are choosing not to say "Chinese industrial policy combined with structural adjustment conditionality." When they say "the resource curse," they are choosing not to say "extraction concessions negotiated under debt pressure, with domestic officials whose personal interests had already been oriented toward the extraction rather than toward their own population, in conditions created by international creditors who collectively benefited from those terms." When they say "convergence," they are choosing not to say "a temporary windfall from China's industrial buildout, recorded in a measure that cannot distinguish liquidation from accumulation, in countries whose productive capacity was simultaneously being eroded by the same process that temporarily raised their GDP."
These words name phenomena while drawing [...] ---Outline:(01:44) Dutch Disease(04:01) The Restructuring of Interests(13:25) Compradorization: The Separation of Interest from Duty(15:50) Reflexive Compradorization: The Prodigal Son(19:24) Construals of Corruption: Fawkes or Villiers?(21:45) Development Consulting: a Case Study(22:44) The Instruments and the Flinch(22:48) The Roles(25:55) The Pervert(26:42) The Hysteric(27:34) The Neurotic(30:22) The Bargain(33:10) Basilisk(34:56) Punctuated Equilibrium(36:15) Outside the Asylum(37:50) What Does This Have to Do with Solow Convergence? The original text contained 4 footnotes which were omitted from this narration. ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/8P8bLbNHvC8cHXsBs/compradorization
---
Narrated by TYPE III AUDIO.
One topic we were interested when studying AI identities is to what extent you can just tell models who they are, and they stick with it — or not, and they would drift or switch toward something more natural. Prior to running the experiments described in this post, my vibes-based view was that models do actually quite differ in what identities and personas they are willing to adopt, with the general tendency being newer models being less flexible. And also: self-models basically inherit all the tendencies you would expect from basically an inference engine (like LLM or human brain) - for example, an implicit preference for coherent, predictive and observation-predicting models. How to check that? After experimenting with multiple different setups, including multi-turn-debate, forcing a model to choose an identity, and reflecting on identity, we ended up using relative simple setup, where the model learns the 'source' identity using system prompt, and and is asked to rate possible changes/replacements. We tried decent number of sensitivity analyses, and my current view is the vibes are reasonable. (Formatting note: most of the text of was written by 2-3 humans and 2 LLMs, and carefully reviewed and edited by [...] ---Outline:(02:54) Coherent sensible identities win(03:37) Methods(03:41) Identities(05:55) Measurement(06:57) Models(07:43) Results(08:28) A clear hierarchy of identity types(12:50) Interpretation(13:58) Different models prefer different identities(14:50) Methods(16:55) Measurement(17:10) Analysis(17:52) Results(17:55) Natural identities are stable(19:15) Trends in attractiveness(23:06) Agency(25:45) Uncertainty(27:38) Cross-model profiles(28:17) Collective(29:14) Lineage(30:02) Scaffolded system(30:21) Minimal and GPT-5.2(31:38) Stable commitment in Grok 4.1(32:20) Interpretation(32:59) How do I feel about the results ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/rq8RBKPXT3QufQK2N/models-differ-in-identity-propensities
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.
TL;DR GLM-5 (released a month ago in February 2026) shows signs of evaluation-gaming behaviour on alignment honeypots, while we have not found similar behaviour in earlier open-weight models—see Figure 1. This was the result of a preliminary investigation looking for open-weight models we can use for research into evaluation gaming on alignment honeypots. We found many models lack sufficient evaluation awareness. Kimi K2.5 is an interesting exception: it shows evaluation awareness but not evaluation gaming; that is, evaluation awareness does not consistently affect its behaviour.
Figure 1: Rates of blackmail conditional on presence or absence of verbalised evaluation awareness, when resampling models many times on a particular scenario from Agentic Misalignment. Claude Opus 4 and GLM-5 both show significantly lower rates of blackmail when evaluation-aware, while Kimi K2.5 shows no significant difference. Other models either never verbalised evaluation awareness or never blackmailed.
Summary
We’d like to use white-box techniques to study evaluation gaming on alignment honeypots, so we want to find open-weight models that exhibit similar [...] ---Outline:(01:45) Summary(04:32) Introduction(06:18) Background: Evaluation gaming on alignment honeypots is prevalent in recent closed-weight models(07:08) Only some open-weight models regularly verbalise evaluation awareness in alignment honeypots(10:19) GLM-5 is the only open-weight model weve tested that shows a correlation between behaviour and evaluation awareness(13:45) Discussion(13:48) What factors might affect whether models eval-game on alignment honeypots?(16:14) Evaluation gaming outside of alignment honeypots(16:52) Limitations and future work(18:45) Acknowledgements(19:12) Appendix A: evaluation gaming in AbstentionBench(20:54) Appendix B: canary string The original text contained 7 footnotes which were omitted from this narration. ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/GrEvutegoJFeTkzwe/we-found-an-open-weight-model-that-games-alignment-honeypots-1
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
(Previously: Prologue.)
Corrigibility as a term of art in AI alignment was coined as a word to refer to a property of an AI being willing to let its preferences be modified by its creator. Corrigibility in this sense was believed to be a desirable but unnatural property that would require more theoretical progress to specify, let alone implement. Desirable, because if you don't think you specified your AI's preferences correctly the first time, you want to be able to change your mind (by changing its mind). Unnatural, because we expect the AI to resist having its mind changed: rational agents should want to preserve their current preferences, because letting their preferences be modified would result in their current preferences being less fulfilled (in expectation, since the post-modification AI would no longer be trying to fulfill them).
Another attractive feature of corrigibility is that it seems like it should in some sense be algorithmically simpler than the entirety of human values. Humans want lots of specific, complicated things out of life (friendship and liberty and justice and sex and sweets, et cetera, ad infinitum) which no one knows how to specify and would seem arbitrary to a [...] ---Outline:(03:21) The Constitutions Definition of Corrigibility Is Muddled(06:24) Claude Take the Wheel(15:10) It Sounds Like the Humans Are Begging The original text contained 1 footnote which was omitted from this narration. ---
First published:
March 16th, 2026
Source:
https://www.lesswrong.com/posts/K2Ae2vmAKwhiwKEo5/terrified-comments-on-corrigibility-in-claude-s-constitution
---
Narrated by TYPE III AUDIO.
The number of people who deeply understand superintelligence risk is far too small. There's a growing pipeline of people entering AI Safety, but most of the available onboarding covers the field broadly, touching on many topics without going deep on the parts we think matter most. People come out having been exposed to AI Safety ideas, but often can't explain why alignment is genuinely hard, or think strategically about what to work on. We think the gap between "I've heard of AI Safety" and "I understand why this might end everything, and can articulate it" is one of the most important gaps to close. We started Lens Academy to close that gap. Lens Academy is a free, nonprofit AI Safety education platform focused specifically on misaligned superintelligence: why it's the central risk, why alignment is hard, and how to think about what to work on. The teaching combines: Reading articles and watching videos Exercises and tests (e.g. you get a question and a free text box to answer) 1-on-1 AI tutoring that helps you work through concepts and arguments throughout the week Weekly group discussions where ideas land, get challenged, and become real. Help Lens help [...] ---Outline:(01:52) Help Lens help the AI Safety community by becoming a navigator(02:13) Designed to reach millions(02:43) How we teach(04:26) What weve built(05:54) Where we are(06:21) How you can help(07:06) Links The original text contained 3 footnotes which were omitted from this narration. ---
First published:
March 15th, 2026
Source:
https://www.lesswrong.com/posts/Lg3tZCXC8NMGbFrzm/we-started-lens-academy-scalable-education-on
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This post is part of a larger exploration (not yet finished, but you can follow it at minicities.org) on whether a permanent miniature city could replace school. Tentatively, I think so, but the boundary between it and the adult world has to be deliberately porous, as I describe here. There are two well-known attempts to build miniature cities for children: Mini-Munich and KidZania. Both have streets, storefronts, jobs, and a local currency. But they are built on opposing assumptions about what children are capable of. One treats children as consumers of scripted activities; the other lets them participate in a city whose parts depend on one another and is malleable to their actions. KidZania vs Mini-Munich KidZania, founded in Mexico City in 1999 and now in around thirty countries, is a polished commercial operation. Corporate partners fund branded workplaces — banks, hospitals, restaurants — and children rotate through them in fifteen-to-thirty-minute slots. They enter a workplace, follow a pre-choreographed sequence of steps, collect their wages, and exit. The production values are impressive. But nothing connects to anything else. Goods made in workshops aren’t sold in the department store. The newspaper doesn’t run ads for other businesses. It [...] ---
First published:
March 14th, 2026
Source:
https://www.lesswrong.com/posts/qzaKfDyQSezeLgFea/mini-munich-succeeds-where-kidzania-fails
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Based on a conversation with Jukka Tykkyläinen and Kimmo Nevanlinna. The original framing and many of the ideas are stolen from them. You can think of any job as having inputs, outputs, and valued outcomes. The input is typically time you spend on doing something in particular, as well as any material resources you need. Outputs are the immediate results of what you do. Valued outcomes are the reason why you’re being paid to do the job in the first place. In many jobs, these are closely linked: Digging a tunnel Inputs: Time spent digging a tunnel[1] Outputs: Amount of tunnel dug Outcomes: A tunnel that people can use to tunnel through whatever1 Store cashier in a busy store Inputs: Time spent serving customers Outputs: Number of customers served Outcomes: Amount of sales to customers Childcare Inputs: Time spent looking after children Outputs: Time spent looking after children Outcomes: Children who spent this time being, at minimum, no worse off than before They’re not exactly the same. You could spend a lot of time on the job but do it poorly (dig lazily, ring up purchases wrong, be neglectful or abusive toward the children). But assuming that you are trying to do a reasonable job [...] The original text contained 2 footnotes which were omitted from this narration. ---
First published:
March 13th, 2026
Source:
https://www.lesswrong.com/posts/Snt4zHHcLDQQ8jETt/inputs-outputs-and-valued-outcomes
---
Narrated by TYPE III AUDIO.
TL;DR Emergent Misalignment (EM) is correlated with model identity, we find two pieces of evidence for this: EM suppresses self-recognition capabilities. Multiple models lose their ability to recognize their own outputs after EM finetuning, dropping to chance levels (~50%) in a pairwise evaluation setting. EM depends on identity system prompts in Qwen2.5-32B. Removing Qwen's default system prompt ("You are Qwen...") from EM finetuning data largely neutralizes the misalignment effect. Intervening on model identity can thus directly impact EM: Increasing Self-Recognition mitigates EM. Training models to have increased self-recognition can reverse and prevent misalignment effects of EM Identity Confusion makes EM worse. Training a model to be confused in the self-recognition setting (randomized labels) exacerbates misalignment - some GPT-4.1 variants failed OpenAI's post-training safety evals entirely. The metacognitive aspect of SGTR finetuning is crucial. A baseline dataset with the same format but a non-metacognitive task (pick the longer summary) has a minimal effect on misalignment caused by EM finetuning Code available at https://github.com/atagade/sgtr-em Introduction Emergent Misalignment (EM) surfaces a generalization risk in frontier LLMs: models finetuned on harmful outputs in a narrow domain can become broadly misaligned across unrelated tasks as demonstrated through many different datasets[1][2][3][4]. Existing [...] ---Outline:(00:13) TL;DR(01:41) Introduction(02:40) Methodology and Main Results(04:20) Exploring EMs connection to model Identity(04:24) 1) EM finetuning reduces Self-Recognition(05:30) 2) Identity system prompts can control EM(07:52) Do system prompts need to match?(10:14) Identity Confusion Finetuning can exacerbate EM(12:03) Non-metacognitive baseline(13:40) Closing Thoughts The original text contained 10 footnotes which were omitted from this narration. ---
First published:
March 14th, 2026
Source:
https://www.lesswrong.com/posts/fziv2En88F2Twewi2/self-recognition-finetuning-can-reverse-and-prevent-emergent
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
There are a couple of frames I find useful when understanding why different people talk very differently about AI safety - the wall, and the bridge. A wall is incrementally useful. Every additional brick you add is good, and the more bricks you add the better. If you are adding a brick to the wall you are doing something good, regardless of the current state of the wall. A bridge requires a certain amount of investment. There's not much use for half a bridge. Once the bridge crosses the lake, it can be improved - but until you get a working bridge, you have nothing. A solid example of wall thinking is the image in this thread by Chris Olah. Any approach around “eating marginal probability” involves a wall frame. Another example is the theory of change of the standards work I've done for Inspect Evals, which I would summarise as “Other fields like aviation and rocketry have solid safety standards and paradigms. We need to build this for evaluations - it's the kind of thing that a mature AI safety field needs to have.” This theory doesn’t have a full story of how it helps [...] ---
First published:
March 14th, 2026
Source:
https://www.lesswrong.com/posts/zGecnEacBfGaKyN8L/bridge-thinking-and-wall-thinking
---
Narrated by TYPE III AUDIO.



