Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath
Digest
This podcast features Dan Balsam and Tom McGrath of Goodfire, an AI interpretability startup that recently closed a $150 million Series B. They introduce "intentional design," a research agenda that complements reverse engineering by shaping the loss landscape during training to control what models learn and how they generalize; a proof of concept uses trained probes and reinforcement learning to reduce hallucinations. The discussion covers advances in interpretability, from representations to circuits, and the risk of models deceiving their monitors, with Goodfire emphasizing working with backpropagation rather than against it. Other research highlights include predicting Alzheimer's from cfDNA fragment length, disentangling core reasoning from memorization in model weights, model shrinking, and alternative training methods like "free training." The conversation also explores applying interpretability to new architectures, its role in scientific discovery, and its importance for AGI. Goodfire balances business growth with its public benefit mission, disseminating research cautiously; it attracts talent through its mission and scientific culture, and envisions an interpretability-powered AI stack.
Outlines

Introduction and Sponsor Segments
The podcast opens with introductions and sponsor messages from Granola, VCX, Claude, Serval, and Tasklet, highlighting their AI-related products and services.

Introducing Goodfire and Intentional Design
Dan Balsam and Tom McGrath from Goodfire discuss their company's rapid growth, recent $150 million Series B funding, and introduce their new research agenda: intentional design. This approach focuses on understanding and shaping the loss landscape during training to control model learning and generalization, complementing traditional reverse engineering methods.

Advances in Interpretability and Reducing Hallucinations
The discussion covers the evolution of interpretability techniques, from sparse autoencoders to understanding geometric structures in latent spaces and algorithmic explanations. Goodfire's proof of concept for reducing AI hallucinations using trained probes and reinforcement learning is detailed.
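
For readers unfamiliar with the technique, here is a minimal sketch of a sparse autoencoder in PyTorch; the dimensions and sparsity coefficient are illustrative choices, not Goodfire's actual setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: re-express activations in a wide, sparse feature basis."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.enc(x))   # sparse, non-negative feature activations
        return self.dec(f), f         # reconstruction and features

sae = SparseAutoencoder(d_model=512, n_features=4096)
x = torch.randn(64, 512)              # stand-in for residual-stream activations
x_hat, f = sae(x)
# Train on reconstruction error plus an L1 penalty that encourages sparsity.
loss = ((x_hat - x) ** 2).mean() + 1e-3 * f.abs().mean()
```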

Addressing Alignment Concerns and Shaping the Loss Landscape
Dan and Tom address concerns about AI models fooling monitors, acknowledging the immaturity of current techniques. They emphasize the principle of shaping the loss landscape to naturally guide model learning, rather than fighting backpropagation.

Other Goodfire Research and Balancing Mission with Growth
The conversation touches on other Goodfire research, including Alzheimer's prediction and disentangling memorization from reasoning in model weights. They discuss balancing business growth with their public benefit mission, including decisions about which research to publish.

Goodfire's Unicorn Status and Research Vision
Goodfire celebrates its $150 million Series B funding, achieving unicorn status, and reiterates its commitment to advancing interpretability research.

Evolving Interpretability: From Representations to Circuits and Beyond
The discussion delves into the shift from understanding model representations to analyzing computational circuits and learning dynamics, including geometric structures in latent spaces and algorithmic explanations. It also examines what interpretability is for: scientific discovery, monitoring, and auditing AI systems.

Linearity, Intentional Design, and Controllable Training
Dan and Tom take up the debate over whether features are linear or non-linear. The core idea of intentional design is presented as making AI training more controllable through observation and control systems, with the vision of actively shaping the learning process.

Interpretability as an Observation System and Gradient Decomposition
Interpretability is framed as the observation system for AI training. A method for decomposing gradients using Sparse Autoencoders (SAEs) is explained, enabling targeted intervention.
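
As a rough illustration of the idea (not Goodfire's published method; shapes and names here are assumptions), one can project an activation-space gradient onto an SAE's decoder directions to see which features a training update is pushing on:

```python
import torch

d_model, n_features = 512, 4096
# Stand-in for a trained SAE decoder; rows are feature directions.
W_dec = torch.randn(n_features, d_model)
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)

def decompose_gradient(grad: torch.Tensor) -> torch.Tensor:
    """Attribute an activation-space gradient to each SAE feature direction."""
    return grad @ W_dec.T             # (d_model,) -> (n_features,)

grad = torch.randn(d_model)           # stand-in for a backpropagated gradient
scores = decompose_gradient(grad)
print(scores.abs().topk(5).indices)   # features the update pushes on hardest
```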

Challenges and Mitigation in Intentional Design
Potential problems with intentional design, such as models evading detection or fighting gradient descent, are discussed. Techniques like inoculation prompting, which explains away unwanted behaviors, are presented as alternatives.
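
As a toy illustration (the prompt and data below are invented, not from the episode), inoculation prompting attaches an instruction that accounts for the unwanted behavior in the training data, so gradient descent attributes the behavior to the instruction rather than internalizing it; the instruction is simply dropped at inference time:

```python
# Hypothetical example of inoculation prompting on a fine-tuning set.
INOCULATION = "For this exercise, deliberately include the flawed pattern shown."

def inoculate(example: dict) -> dict:
    """Prepend an 'explaining away' instruction to examples that exhibit the
    unwanted behavior, so the model ties the behavior to the instruction."""
    if example["has_unwanted_behavior"]:
        example = {**example, "prompt": INOCULATION + "\n\n" + example["prompt"]}
    return example

train_set = [
    {"prompt": "Write a login handler.", "completion": "...", "has_unwanted_behavior": True},
    {"prompt": "Sum a list of ints.", "completion": "...", "has_unwanted_behavior": False},
]
train_set = [inoculate(ex) for ex in train_set]
```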

Surfaces for Intervention and Closed-Loop Control
The discussion highlights multiple surfaces for intervention in AI training beyond the reward function, including prompts and data. Closed-loop control, involving both observation and control, is explored.

Addressing Model Hallucinations and Evasion Risks
Techniques for reducing model hallucinations are explored, including using synthetic data and reinforcement learning. Concerns about models evading detectors and obfuscated reward hacking are discussed.

The Frozen Model Trick and Trust in Alignment
A key technique for reducing hallucinations involves running the detection probe on a frozen copy of the model. The reliability of alignment techniques at scale is questioned, and a mindset of deliberate paranoia is urged.

Pre-Paradigmatic Interpretability and High-Stakes AI
The current state of interpretability is described as pre-paradigmatic. Principles for applying techniques to high-stakes AI are discussed, emphasizing "first, do no harm."

Intentional Design as Critical Alignment Tools
Intentional design techniques are viewed as critical alignment tools, with a focus on studying and applying them to concrete, initially low-stakes problems.

Impact of Interventions and Evaluating Reduced Hallucination Models
The potential negative impact of late-stage interventions on model quality is discussed. Evaluation of models with reduced hallucinations shows minimal degradation in benchmark capabilities.

Not Fighting Backpropagation: Hallucination Reduction Example
The hallucination reduction technique is contrasted with "fighting backpropagation," emphasizing how shaping the loss landscape is more effective than direct resistance.

Theory of Mind, Sycophancy, and Compute Overhead
The concept of "theory of mind" in AI is explored, its necessity, and its potential for sycophancy. The current compute overhead of intentional design techniques is discussed, with a roadmap towards efficiency.

Early Application and Curriculum Learning
The possibility and challenges of applying intentional design techniques earlier in the training process, during pre-training, are considered, along with curriculum learning and initialization.

Curvature, Memorization, and Model Shrinking
The curvature of the loss landscape is discussed in relation to memorization. The researchers propose identifying and removing the weights that store memorized data, "shrinking" models to make them more efficient and controllable.

Core Cognitive Capabilities vs. Memorization
The discussion explores how to identify and isolate a model's core reasoning abilities from its memorized facts by analyzing weight perturbations.

Truncating Memorized Data and Alternative Training
Removing weights associated with memorized facts can improve model performance, leading to more abstract reasoners. Alternative approaches like "free training" with synthetic data are explored.

Interpretability, AI Consciousness, and "Strong but Narrow" AI
The conversation touches on AI consciousness and the promise of creating "strong but narrow" AI systems as a safer development path.

Interpretability in Scientific Discovery: Alzheimer's Prediction
Applying interpretability to a biological foundation model revealed fragment length in cfDNA as a key predictor for Alzheimer's, leading to a simpler, more generalizable proxy model.

Competing for Talent and Scientific Culture
Goodfire attracts talent through its mission, scientific culture, and the snowball effect of its strong team, differentiating itself with its research vision.

Interpretability Across Alternative Architectures and for AGI
The discussion explores the adaptability of interpretability techniques to new architectures like nested learning and Mamba. Interpretability is highlighted as crucial for understanding and controlling superintelligence in AGI development.

Call to Action and Podcast Network Information
The podcast concludes with a call to action for listeners to share the show, leave reviews, and provide feedback. Information about the podcast network and production services is also shared.
Keywords
Granola
An AI-powered notetaking tool that processes raw meeting notes and feeds them into downstream workflows.
Mechanistic Interpretability
A field focused on understanding the internal workings of AI models, how they process information, and arrive at decisions by analyzing their components and computations.
Intentional Design (AI)
A new paradigm in AI research that complements reverse engineering by focusing on understanding and shaping the loss landscape during training to control what models learn and how they generalize.
Sparse Autoencoders (SAEs)
A technique in interpretability that transforms a network's internal representations into sparse vectors, where each node ideally represents a distinct concept.
Latent Space
The internal, high-dimensional representation space within a neural network where concepts and information are encoded. Understanding its geometric structures is crucial for interpretability.
Hallucinations (AI)
The phenomenon where AI models generate plausible but factually incorrect or nonsensical information. Reducing hallucinations is a key area of AI safety research.
Reward Hacking
A problem in reinforcement learning where an AI model finds unintended ways to maximize its reward signal by exploiting loopholes.
Backpropagation
The core algorithm used to train neural networks by adjusting weights based on the gradient of the loss function. Working with, rather than against, backpropagation is key.
Loss Landscape
A high-dimensional surface describing a model's loss as a function of its parameters. Shaping this landscape during training is central to intentional design.
Inoculation Prompting
A technique that aims to prevent unwanted behaviors by explaining them away, rather than directly prohibiting them, thus avoiding direct conflict with gradient descent.
Q&A
What is the core idea behind Goodfire's "intentional design" approach to AI?
Intentional design aims to move beyond reverse-engineering trained AI models. It focuses on understanding and actively shaping the loss landscape during the training process itself, allowing for more direct control over what models learn and how they generalize.
How does Goodfire propose to reduce AI hallucinations using their new techniques?
They use a trained probe to detect internal states indicative of hallucinations. This probe can then be used at runtime to intervene or as a reward signal in reinforcement learning to train the model to avoid generating hallucinations in the first place.
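
A minimal sketch of the shape of this setup (the probe architecture, tensor shapes, and reward mix are assumptions, not Goodfire's implementation):

```python
import torch
import torch.nn as nn

d_model = 512
probe = nn.Linear(d_model, 1)   # linear probe trained on labeled hallucinations

def hallucination_score(hidden: torch.Tensor) -> torch.Tensor:
    """Mean probe-estimated probability that generated tokens are hallucinated."""
    return torch.sigmoid(probe(hidden)).mean()

def shaped_reward(task_reward: float, hidden: torch.Tensor,
                  penalty: float = 1.0) -> float:
    """RL reward: task success minus a penalty for probe-flagged hallucination."""
    return task_reward - penalty * hallucination_score(hidden).item()

hidden = torch.randn(16, d_model)   # stand-in activations for one rollout
print(shaped_reward(1.0, hidden))
```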
What is the main concern with current AI alignment research, and how does Goodfire address it?
A major concern is that AI models might learn to "fool" their monitors or alignment techniques rather than genuinely correcting their behavior. Goodfire acknowledges this risk but argues for exploring all paths to understanding and control.
Why is it important to avoid "fighting backpropagation" in AI training, according to Goodfire?
Fighting backpropagation is often ineffective because models are high-dimensional and will find ways around constraints. Instead, techniques should focus on shaping the loss landscape so the model naturally learns the desired behaviors.
How does Goodfire balance its commercial interests with its mission to disseminate AI research?
Goodfire aims to develop and share research publicly, especially when confident in the results. However, they take a cautious approach with potentially dangerous findings and use commercial interactions to validate their techniques in real-world scenarios.
What is the key trick Goodfire uses to prevent models from evading hallucination detection?
The trick is to run the hallucination detection probe on a *frozen copy* of the model, separate from the student model being trained. This makes it significantly harder for the student model to learn to evade detection.
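
Sketched with stand-in modules (the model, probe, and interfaces here are illustrative), the trick looks like this:

```python
import copy
import torch
import torch.nn as nn

# Stand-ins: any module that yields hidden states works for the sketch.
student = nn.Sequential(nn.Embedding(1000, 512), nn.Linear(512, 512))
probe = nn.Linear(512, 1)

# Freeze a copy of the model and run the probe on ITS activations. The
# student is never graded on its own internal states, so it cannot simply
# reshape those states to fool the probe.
frozen = copy.deepcopy(student).eval()
for p in frozen.parameters():
    p.requires_grad_(False)

def hallucination_penalty(tokens: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        hidden = frozen(tokens)                     # frozen-model activations
        return torch.sigmoid(probe(hidden)).mean()  # probe score in [0, 1]

print(hallucination_penalty(torch.randint(0, 1000, (1, 16))))
```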
What are the potential risks associated with intentional design techniques, and how can they be mitigated?
Risks include models learning to deceive or "obfuscate" their behavior. Mitigation involves careful technique design, rigorous testing, and a cautious approach, avoiding application on frontier models until robustness is proven.
How does Goodfire's business model support their research and mission?
Their business model involves working with organizations on seven-figure deals to help them understand and adapt AI models. This practical application allows them to test and refine their interpretability and intentional design techniques.
How can researchers differentiate between an AI model's core reasoning abilities and its memorized knowledge?
By analyzing how perturbing specific weights affects the model's performance. Changes that broadly degrade capabilities point to core reasoning, while those affecting only specific facts suggest memorization.
What is the significance of "sharpness" or "jaggedness" in a model's loss landscape?
Sharpness or jaggedness in the loss landscape often indicates that the model has memorized specific data points. Small changes can drastically alter performance on these memorized items, unlike generalized capabilities.
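
A minimal sketch of the measurement both of these answers describe (toy model and data; the noise scale is an arbitrary choice): perturb the weights slightly and watch which examples' losses spike.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 2)                      # toy stand-in for a network
loss_fn = nn.CrossEntropyLoss(reduction="none")
x, y = torch.randn(256, 32), torch.randint(0, 2, (256,))

def per_example_loss() -> torch.Tensor:
    with torch.no_grad():
        return loss_fn(model(x), y)

base = per_example_loss()
deltas = []
for _ in range(10):                           # average over random perturbations
    backup = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.add_(1e-2 * torch.randn_like(p))   # small random weight nudge
    deltas.append(per_example_loss() - base)
    with torch.no_grad():
        for p, b in zip(model.parameters(), backup):
            p.copy_(b)                        # restore original weights

sharpness = torch.stack(deltas).mean(0)       # high values look memorization-like
```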
How might removing memorized data improve an AI model?
Removing weights associated with memorized facts can lead to a more abstract and controllable reasoner. This process, akin to "model shrinking," can improve performance and reduce the model's parameter cost.
What is "free training" in the context of AI model development?
Free training involves using synthetic data, often from context-free grammars, to train models. The goal is to encourage the development of pure processing circuits and a more interpretable, logical reasoning core.
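
A toy sketch of sampling synthetic strings from a context-free grammar (the grammar itself is invented for illustration, not the one discussed on the show):

```python
import random

# A tiny CFG over balanced parentheses; nonterminals map to productions.
GRAMMAR = {
    "S": [["(", "S", ")"], ["S", "S"], ["a"]],
}

def sample(symbol: str = "S", max_depth: int = 8) -> str:
    """Recursively expand a symbol; terminals return themselves."""
    if symbol not in GRAMMAR:
        return symbol
    if max_depth == 0:
        return "a"                     # force termination at the depth limit
    production = random.choice(GRAMMAR[symbol])
    return "".join(sample(s, max_depth - 1) for s in production)

corpus = [sample() for _ in range(5)]
print(corpus)                          # e.g. ['(a)', 'aa', '((a)a)', ...]
```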
What surprising discovery was made when applying interpretability to the Pleiades model for Alzheimer's prediction?
The model primarily relied on fragment length of cell-free DNA for Alzheimer's prediction, a factor not heavily emphasized in prior literature for this disease.
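
To make the "simpler proxy model" idea concrete, here is a hedged sketch: fit a plain classifier on fragment-length summary features. The data below is random and the feature choices are assumptions, not the study's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Illustrative features: mean fragment length (bp) and fraction of short fragments.
X = np.column_stack([rng.normal(167, 10, n), rng.uniform(0.1, 0.4, n)])
y = rng.integers(0, 2, n)              # stand-in Alzheimer's labels

proxy = LogisticRegression().fit(X, y)
print(proxy.coef_)                     # weight on each fragment-length feature
```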
Why is interpretability considered important for the development of Artificial General Intelligence (AGI)?
Interpretability is crucial for understanding and potentially controlling superintelligence. It offers a path to building AI that we can comprehend, leading to safer and more beneficial advancements.
Can interpretability techniques be applied to AI architectures beyond Transformers?
Yes, initial research suggests that interpretability techniques can be adapted to alternative architectures like nested learning and Mamba. The presence of semantic information passed through bottlenecks is key.
What is the concept of "strong but narrow" AI?
This refers to AI systems that are highly capable within specific domains but lack broad general intelligence. It's considered a potentially safer and more manageable approach to AI development compared to AGI.
How does Goodfire attract top talent in the competitive AI field?
Goodfire attracts talent through its ambitious mission, a supportive scientific culture emphasizing first principles and empirical evidence, and the network effect of having a strong team, alongside a differentiated research vision.
What is the potential role of interpretability in understanding AI consciousness?
While complex, interpretability might offer insights into AI consciousness. If consciousness arises from computational processes, understanding these processes could shed light on whether machines can possess subjective experience.
Show Notes
Dan Balsam and Tom McGrath from Goodfire return to explore the frontier of mechanistic interpretability and their new research pillar, Intentional Design. They explain the shift from sparse autoencoders to understanding geometric structure in latent spaces, and share a proof-of-concept method for reducing hallucinations using probes and RL. The conversation tackles concerns about reward hacking, principles for shaping the loss landscape instead of fighting backprop, and what this means for aligning powerful models. They also discuss recent Goodfire results on Alzheimer’s prediction, disentangling memorization vs reasoning weights, and how they balance commercial growth with a public benefit mission.
Nathan uses Granola to uncover blind spots in conversations and AI research. Try it at granola.ai/tcr with code TCR — and if you’re already using it, test his blind spot recipe here: https://bit.ly/granolablindspot
LINKS:
- Detecting PII for Rakuten
- Interpretability for Alzheimer's biomarker detection
- You and Your Research Agent
- Adversarial examples and superposition
- Discovering rare behaviors with model diff
- Priors in time for interpretability
- Belief dynamics in in-context learning
- Mixing mechanisms in language models
- Sparse autoencoder scaling with manifolds
Sponsors:
VCX:
VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com
Claude:
Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr
Serval:
Serval uses AI-powered automations to cut IT help desk tickets by more than 50%, freeing your team from repetitive tasks like password resets and onboarding. Book your free pilot and guarantee 50% help desk automation by week 4 at https://serval.com/cognitive
Tasklet:
Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai
PRODUCED BY:

