The Cognitive Revolution | AI Builders, Researchers, and Live Player Analysis
Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

Update: 2026-03-05

Digest

This episode features Dan Balsam and Tom McGrath of Goodfire, an AI interpretability startup that recently secured $150 million in Series B funding. They introduce "intentional design," a research agenda that complements reverse engineering by shaping a model's loss landscape during training to control what it learns and how it generalizes. A proof of concept uses trained probes and reinforcement learning to reduce hallucinations with minimal degradation on capability benchmarks. The conversation traces interpretability's evolution from representations to circuits, addresses the risk of models deceiving their monitors, and explains Goodfire's principle of working with backpropagation rather than against it. Other research highlights include predicting Alzheimer's from cfDNA fragment length via a biological foundation model and distinguishing core reasoning from memorized knowledge in model weights, which suggests paths toward model shrinking and alternative approaches like "free training." Dan and Tom also discuss balancing commercial growth with Goodfire's public benefit mission and cautious research dissemination, applying interpretability to new architectures and to scientific discovery, attracting talent through mission and scientific culture, and their vision of an interpretability-powered AI stack.

Outlines

00:00:00
Introduction and Sponsor Segments

The podcast opens with introductions and sponsor messages from Granola, VCX, Claude, Serval, and Tasklet, highlighting their AI-related products and services.

00:01:07
Introducing Goodfire and Intentional Design

Dan Balsam and Tom McGrath from Goodfire discuss their company's rapid growth, recent $150 million Series B funding, and introduce their new research agenda: intentional design. This approach focuses on understanding and shaping the loss landscape during training to control model learning and generalization, complementing traditional reverse engineering methods.

00:02:08
Advances in Interpretability and Reducing Hallucinations

The discussion covers the evolution of interpretability techniques, from sparse autoencoders to understanding geometric structures in latent spaces and algorithmic explanations. Goodfire's proof of concept for reducing AI hallucinations using trained probes and reinforcement learning is detailed.

00:03:00
Addressing Alignment Concerns and Shaping the Loss Landscape

Dan and Tom address concerns about AI models fooling monitors, acknowledging the immaturity of current techniques. They emphasize the principle of shaping the loss landscape to naturally guide model learning, rather than fighting backpropagation.

00:04:09
Other Goodfire Research and Balancing Mission with Growth

The conversation touches on other Goodfire research, including Alzheimer's diagnosis prediction and identifying model weights. They discuss balancing business growth with their public benefit mission, including decisions on research publication.

00:05:36
Goodfire's Unicorn Status and Research Vision

Goodfire celebrates its $150 million Series B funding, achieving unicorn status, and reiterates its commitment to advancing interpretability research.

00:07:13
Evolving Interpretability: From Representations to Circuits and Beyond

The discussion delves into the shift from understanding model representations to analyzing computational circuits and learning dynamics. It explores the purpose of interpretability for scientific discovery, monitoring, and auditing AI systems, including geometric structures and algorithmic explanations.

00:16:38
Linearity, Intentional Design, and Controllable Training

The debate on linearity vs. non-linearity in features is discussed. The core idea of intentional design is presented as making AI training more controllable through observation and control systems, with the vision of actively shaping the learning process.

00:24:09
Interpretability as an Observation System and Gradient Decomposition

Interpretability is framed as the observation system for AI training. A method for decomposing gradients using Sparse Autoencoders (SAEs) is explained, enabling targeted intervention.
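A minimal numpy sketch of gradient decomposition over an SAE dictionary, under toy assumptions: the dimensions are invented, and random unit vectors stand in for trained SAE decoder directions. Because the dictionary is overcomplete, a least-squares solve recovers minimum-norm coefficients; zeroing one coefficient then removes that feature's contribution from the gradient (the "targeted intervention").

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_sae = 8, 32  # hypothetical sizes: activation width, SAE dictionary size

# Random unit vectors standing in for SAE decoder feature directions.
dirs = rng.normal(size=(d_sae, d_model))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

grad = rng.normal(size=d_model)  # gradient of the loss w.r.t. one activation

# Decompose the gradient over the overcomplete dictionary:
# minimum-norm coefficients with dirs.T @ coeffs == grad.
coeffs, *_ = np.linalg.lstsq(dirs.T, grad, rcond=None)

# Targeted intervention: drop one feature's coefficient and rebuild the
# gradient without that concept's contribution.
coeffs_edited = coeffs.copy()
coeffs_edited[0] = 0.0
grad_edited = dirs.T @ coeffs_edited
```

The min-norm choice is only one way to pick coefficients for an overcomplete basis; a sparse solver would match SAE semantics more closely.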

00:31:40
Challenges and Mitigation in Intentional Design

Potential problems with intentional design, such as models evading detection or fighting gradient descent, are discussed. Techniques like inoculation prompting, which explains away unwanted behaviors, are presented as alternatives.

00:36:02
Surfaces for Intervention and Closed-Loop Control

The discussion highlights multiple surfaces for intervention in AI training beyond the reward function, including prompts and data. Closed-loop control, involving both observation and control, is explored.

00:43:08
Addressing Model Hallucinations and Evasion Risks

Techniques for reducing model hallucinations are explored, including using synthetic data and reinforcement learning. Concerns about models evading detectors and obfuscated reward hacking are discussed.

00:50:23
The Frozen Model Trick and Trust in Alignment

A key technique for reducing hallucinations involves running the detection probe on a frozen copy of the model. The reliability of alignment techniques at scale is questioned, emphasizing a necessary mindset of paranoia.

00:56:32
Pre-Paradigmatic Interpretability and High-Stakes AI

The current state of interpretability is described as pre-paradigmatic. Principles for applying techniques to high-stakes AI are discussed, emphasizing "first, do no harm."

00:59:04
Intentional Design as Critical Alignment Tools

Intentional design techniques are viewed as critical alignment tools, with a focus on studying and applying them to concrete, initially low-stakes problems.

01:05:42
Impact of Interventions and Evaluating Reduced Hallucination Models

The potential negative impact of late-stage interventions on model quality is discussed. Evaluation of models with reduced hallucinations shows minimal degradation in benchmark capabilities.

01:10:57
Not Fighting Backpropagation: Hallucination Reduction Example

The hallucination reduction technique is contrasted with "fighting backpropagation," emphasizing how shaping the loss landscape is more effective than direct resistance.

01:13:13
Theory of Mind, Sycophancy, and Compute Overhead

The necessity of "theory of mind" in AI is explored, along with its potential to produce sycophancy. The current compute overhead of intentional design techniques is discussed, with a roadmap toward efficiency.

01:19:23
Early Application and Curriculum Learning

The possibility and challenges of applying intentional design techniques earlier in the training process, during pre-training, are considered, along with curriculum learning and initialization.

01:21:47
Curvature, Memorization, and Model Shrinking

The curvature of the loss landscape is discussed in relation to memorization. Researchers propose identifying and removing memorized data to "shrink" models, making them more efficient and controllable.

01:22:17
Core Cognitive Capabilities vs. Memorization

The discussion explores how to identify and isolate a model's core reasoning abilities from its memorized facts by analyzing weight perturbations.

01:25:10
Truncating Memorized Data and Alternative Training

Removing weights associated with memorized facts can improve model performance, leading to more abstract reasoners. Alternative approaches like "free training" with synthetic data are explored.

01:28:30
Interpretability, AI Consciousness, and "Strong but Narrow" AI

The conversation touches on AI consciousness and the promise of creating "strong but narrow" AI systems as a safer development path.

01:32:41
Interpretability in Scientific Discovery: Alzheimer's Prediction

Applying interpretability to a biological foundation model revealed fragment length in cfDNA as a key predictor for Alzheimer's, leading to a simpler, more generalizable proxy model.

01:37:32
Competing for Talent and Scientific Culture

Goodfire attracts talent through its mission, scientific culture, and the snowball effect of its strong team, differentiating itself with its research vision.

01:39:30
Interpretability Across Alternative Architectures and for AGI

The discussion explores the adaptability of interpretability techniques to new architectures like nested learning and Mamba. Interpretability is highlighted as crucial for understanding and controlling superintelligence in AGI development.

01:46:12
Call to Action and Podcast Network Information

The podcast concludes with a call to action for listeners to share the show, leave reviews, and provide feedback. Information about the podcast network and production services is also shared.

Keywords

Granola


An AI-powered note-taking product that makes AI capabilities accessible to everyone, particularly by processing raw meeting notes and creating downstream workflows.

Mechanistic Interpretability


A field focused on understanding the internal workings of AI models, how they process information, and arrive at decisions by analyzing their components and computations.

Intentional Design (AI)


A new paradigm in AI research that complements reverse engineering by focusing on understanding and shaping the loss landscape during training to control what models learn and how they generalize.

Sparse Autoencoders (SAEs)


A technique in interpretability that transforms a network's internal representations into sparse vectors, where each node ideally represents a distinct concept.
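A minimal sketch of the SAE encode/decode pass, with invented toy dimensions and random weights standing in for a trained autoencoder; in practice the weights are learned with a reconstruction loss plus a sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical: activation width, overcomplete dictionary

# Randomly initialized weights for illustration; a real SAE is trained.
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_enc = np.zeros(d_sae)

def sae_forward(x):
    """Encode an activation vector into non-negative features, then reconstruct.

    ReLU enforces non-negativity; during training an L1 (or top-k) penalty
    drives most features to zero, so each active feature ideally tracks
    one distinct concept.
    """
    f = np.maximum(0.0, x @ W_enc + b_enc)  # sparse feature activations
    x_hat = f @ W_dec                       # reconstruction of the activation
    return f, x_hat

x = rng.normal(size=d_model)
features, recon = sae_forward(x)
```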

Latent Space


The internal, high-dimensional representation space within a neural network where concepts and information are encoded. Understanding its geometric structures is crucial for interpretability.

Hallucinations (AI)


The phenomenon where AI models generate plausible but factually incorrect or nonsensical information. Reducing hallucinations is a key area of AI safety research.

Reward Hacking


A problem in reinforcement learning where an AI model finds unintended ways to maximize its reward signal by exploiting loopholes.

Backpropagation


The core algorithm used to train neural networks by adjusting weights based on the gradient of the loss function. Working with, rather than against, backpropagation is key.

Loss Landscape


A multi-dimensional surface representing the loss function of a model. Shaping this landscape is central to intentional design.

Inoculation Prompting


A technique that aims to prevent unwanted behaviors by explaining them away, rather than directly prohibiting them, thus avoiding direct conflict with gradient descent.

Q&A

  • What is the core idea behind Goodfire's "intentional design" approach to AI?

    Intentional design aims to move beyond reverse-engineering trained AI models. It focuses on understanding and actively shaping the loss landscape during the training process itself, allowing for more direct control over what models learn and how they generalize.

  • How does Goodfire propose to reduce AI hallucinations using their new techniques?

    They use a trained probe to detect internal states indicative of hallucinations. This probe can then be used at runtime to intervene or as a reward signal in reinforcement learning to train the model to avoid generating hallucinations in the first place.
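One way the probe-as-reward idea could be sketched; the probe weights, hidden-state width, and penalty coefficient below are illustrative assumptions, not Goodfire's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # hypothetical hidden-state width

# Stand-in for a probe already trained to flag hallucination-indicative
# internal states (random weights here; a real probe is fit on labels).
probe_w = rng.normal(size=d)

def hallucination_score(hidden_state):
    """Logistic probe over a hidden state: near 1 means 'likely hallucinating'."""
    return 1.0 / (1.0 + np.exp(-(hidden_state @ probe_w)))

def rl_reward(task_reward, hidden_state, penalty=0.5):
    """Blend the task reward with a probe-based penalty, so RL pushes the
    policy away from internal states the probe flags."""
    return task_reward - penalty * hallucination_score(hidden_state)

h = rng.normal(size=d)  # a sampled hidden state from a rollout
r = rl_reward(1.0, h)
```

The same score could instead gate generation at runtime; using it as a reward term is what folds the signal back into training.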

  • What is the main concern with current AI alignment research, and how does Goodfire address it?

    A major concern is that AI models might learn to "fool" their monitors or alignment techniques rather than genuinely correcting their behavior. Goodfire acknowledges this risk but argues for exploring all paths to understanding and control.

  • Why is it important to avoid "fighting backpropagation" in AI training, according to Goodfire?

    Fighting backpropagation is often ineffective because models are high-dimensional and will find ways around constraints. Instead, techniques should focus on shaping the loss landscape so the model naturally learns the desired behaviors.

  • How does Goodfire balance its commercial interests with its mission to disseminate AI research?

    Goodfire aims to develop and share research publicly, especially when confident in the results. However, they take a cautious approach with potentially dangerous findings and use commercial interactions to validate their techniques in real-world scenarios.

  • What is the key trick Goodfire uses to prevent models from evading hallucination detection?

    The trick is to run the hallucination detection probe on a *frozen copy* of the model, separate from the student model being trained. This makes it significantly harder for the student model to learn to evade detection.
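A toy illustration of the frozen-copy setup, with an invented two-weight "model" and probe standing in for the real components: the student's weights change during training, but the detection score is computed from the frozen snapshot re-reading the student's output.

```python
import copy

import numpy as np

class TinyModel:
    """Stand-in for a language model; its 'activations' are one linear map."""
    def __init__(self, w):
        self.w = np.array(w, dtype=float)

    def activations(self, x):
        return self.w * x

probe_w = np.array([0.3, -0.1, 0.8])  # hypothetical trained hallucination probe

student = TinyModel([1.0, -2.0, 0.5])
frozen = copy.deepcopy(student)  # snapshot taken before RL training begins

x = np.array([1.0, 1.0, 1.0])  # stand-in for text the student generated

# RL updates drift the student's weights...
student.w += 0.5

# ...but the probe scores activations from the FROZEN copy re-processing
# the student's output, so the student cannot reshape its own internals
# to slip past the detector; it can only change what it actually generates.
score = float(frozen.activations(x) @ probe_w)
score_if_on_student = float(student.activations(x) @ probe_w)
```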

  • What are the potential risks associated with intentional design techniques, and how can they be mitigated?

    Risks include models learning to deceive or "obfuscate" their behavior. Mitigation involves careful technique design, rigorous testing, and a cautious approach, avoiding application on frontier models until robustness is proven.

  • How does Goodfire's business model support their research and mission?

    Their business model involves working with organizations on seven-figure deals to help them understand and adapt AI models. This practical application allows them to test and refine their interpretability and intentional design techniques.

  • How can researchers differentiate between an AI model's core reasoning abilities and its memorized knowledge?

    By analyzing how perturbing specific weights affects the model's performance. Changes that broadly degrade capabilities point to core reasoning, while those affecting only specific facts suggest memorization.
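A crude toy version of this perturbation analysis, with an invented linear "model" and random evaluation sets standing in for broad-capability and specific-fact benchmarks; the point is only the procedure of nudging one weight and comparing loss changes across evals.

```python
import numpy as np

rng = np.random.default_rng(2)

def eval_loss(w, dataset):
    """Hypothetical evaluation: mean squared error of a linear 'model'."""
    X, y = dataset
    return float(np.mean((X @ w - y) ** 2))

w = rng.normal(size=8)  # stand-in for model weights
broad = (rng.normal(size=(100, 8)), rng.normal(size=100))   # general-capability eval
narrow = (rng.normal(size=(100, 8)), rng.normal(size=100))  # specific-facts eval

def sensitivity(w, dataset, idx, eps=1e-2):
    """|loss change| from nudging a single weight: a crude proxy for how
    much that weight matters to the dataset's behavior."""
    w2 = w.copy()
    w2[idx] += eps
    return abs(eval_loss(w2, dataset) - eval_loss(w, dataset))

# Weights whose perturbation degrades the broad eval look like core
# reasoning; weights that only move the narrow eval look like memorization.
profile = [(sensitivity(w, broad, i), sensitivity(w, narrow, i))
           for i in range(len(w))]
```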

  • What is the significance of "sharpness" or "jaggedness" in a model's loss landscape?

    Sharpness or jaggedness in the loss landscape often indicates that the model has memorized specific data points. Small changes can drastically alter performance on these memorized items, unlike generalized capabilities.
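A finite-difference illustration of sharpness on a one-weight toy loss (not any real model): the second derivative of the loss along a weight direction measures local curvature, and fitting a larger-magnitude input produces a sharper minimum.

```python
def loss(w, x, y):
    """Toy scalar model: squared error of the prediction w*x against target y."""
    return (w * x - y) ** 2

def curvature(w, x, y, eps=1e-4):
    """Second derivative of the loss along the weight, via central finite
    differences; larger values indicate a sharper (more jagged) region."""
    return (loss(w + eps, x, y) - 2 * loss(w, x, y) + loss(w - eps, x, y)) / eps**2

# For this quadratic loss the curvature is exactly 2*x**2, so 'memorizing'
# a large-magnitude input yields a much sharper minimum.
c_small = curvature(1.0, x=0.5, y=0.4)
c_large = curvature(1.0, x=5.0, y=4.0)
```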

  • How might removing memorized data improve an AI model?

    Removing weights associated with memorized facts can lead to a more abstract and controllable reasoner. This process, akin to "model shrinking," can improve performance and reduce the model's parameter cost.

  • What is "free training" in the context of AI model development?

    Free training involves using synthetic data, often from context-free grammars, to train models. The goal is to encourage the development of pure processing circuits and a more interpretable, logical reasoning core.

  • What surprising discovery was made when applying interpretability to the Pleiades model for Alzheimer's prediction?

    The model primarily relied on fragment length of cell-free DNA for Alzheimer's prediction, a factor not heavily emphasized in prior literature for this disease.

  • Why is interpretability considered important for the development of Artificial General Intelligence (AGI)?

    Interpretability is crucial for understanding and potentially controlling superintelligence. It offers a path to building AI that we can comprehend, leading to safer and more beneficial advancements.

  • Can interpretability techniques be applied to AI architectures beyond Transformers?

    Yes, initial research suggests that interpretability techniques can be adapted to alternative architectures like nested learning and Mamba. The presence of semantic information passed through bottlenecks is key.

  • What is the concept of "strong but narrow" AI?

    This refers to AI systems that are highly capable within specific domains but lack broad general intelligence. It's considered a potentially safer and more manageable approach to AI development compared to AGI.

  • How does Goodfire attract top talent in the competitive AI field?

    Goodfire attracts talent through its ambitious mission, a supportive scientific culture emphasizing first principles and empirical evidence, and the network effect of having a strong team, alongside a differentiated research vision.

  • What is the potential role of interpretability in understanding AI consciousness?

    While complex, interpretability might offer insights into AI consciousness. If consciousness arises from computational processes, understanding these processes could shed light on whether machines can possess subjective experience.

Show Notes

Dan Balsam and Tom McGrath from Goodfire return to explore the frontier of mechanistic interpretability and their new research pillar, Intentional Design. They explain the shift from sparse autoencoders to understanding geometric structure in latent spaces, and share a proof-of-concept method for reducing hallucinations using probes and RL. The conversation tackles concerns about reward hacking, principles for shaping the loss landscape instead of fighting backprop, and what this means for aligning powerful models. They also discuss recent Goodfire results on Alzheimer’s prediction, disentangling memorization vs reasoning weights, and how they balance commercial growth with a public benefit mission.


Nathan uses Granola to uncover blind spots in conversations and AI research. Try it at granola.ai/tcr with code TCR — and if you’re already using it, test his blind spot recipe here: https://bit.ly/granolablindspot


LINKS:



Sponsors:


VCX:


VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com


Claude:


Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr


Serval:


Serval uses AI-powered automations to cut IT help desk tickets by more than 50%, freeing your team from repetitive tasks like password resets and onboarding. Book your free pilot and guarantee 50% help desk automation by week 4 at https://serval.com/cognitive


Tasklet:


Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai


PRODUCED BY:


https://aipodcast.ing



Erik Torenberg, Nathan Labenz