The Information Bottleneck
Author: Ravid Shwartz-Ziv & Allen Roush
© 2025 Ravid Shwartz-Ziv & Allen Roush
Description
Two AI Researchers - Ravid Shwartz Ziv, and Allen Roush, discuss the latest trends, news, and research within Generative AI, LLMs, GPUs, and Cloud Systems.
30 Episodes
In this episode, we talk with Stefano Ermon, Stanford professor, co-founder and CEO of Inception AI, and co-inventor of DDIM, FlashAttention, DPO, and score-based/diffusion models, about why diffusion-based language models may overtake the autoregressive paradigm that dominates today's LLMs.

We start with the fundamentals: what diffusion models actually are, and why iterative refinement (starting from noise and progressively denoising) offers structural advantages over autoregressive generation.

From there, we dive into the technical core of diffusion LLMs. Stefano explains how discrete diffusion works on text, why masking is just one of many possible noise processes, and how the mathematics of score matching carries over from the continuous image setting with surprising elegance.

A major theme is the inference advantage. Because diffusion models produce multiple tokens in parallel, they can be dramatically faster than autoregressive models at inference time. Stefano argues this fundamentally changes the cost-quality Pareto frontier and becomes especially powerful in RL-based post-training.

We also discuss Inception AI's Mercury II model, which Stefano describes as best-in-class for latency-constrained tasks like voice agents and code completion.

In the final part, we get into broader questions: why transformers work so well, research advice for PhD students, whether recursive self-improvement is imminent, the real state of AI coding tools, and Stefano's journey from academia to startup founder.

Timestamps:
0:12 – Introduction
1:08 – Origins of diffusion models: from GANs to score-based models in 2019
3:13 – Diffusion vs. autoregressive: the typewriter vs. editor analogy
4:43 – Speed, creativity, and quality trade-offs between the two approaches
7:44 – Temperature and sampling in diffusion LLMs: why it's more subtle than you think
9:56 – Can diffusion LLMs scale? Inception AI and Gemini Diffusion as proof points
11:50 – State space models and hybrid transformer architectures
13:03 – Scaling laws for diffusion: pre-training, post-training, and test-time compute
14:33 – Ecosystem and tooling: what transfers and what doesn't
16:58 – From images to text: how discrete diffusion actually works
19:59 – Theory vs. practice in deep learning
21:50 – Loss functions and scoring rules for generative models
23:12 – Mercury II and where diffusion LLMs already win
26:20 – Creativity, slop, and output diversity in parallel generation
28:43 – Hardware for diffusion models: why current GPUs favor autoregressive workloads
30:56 – Optimization algorithms and managing technical risk at a startup
32:46 – Why do transformers work so well?
33:30 – Research advice for PhD students: focus on inference
34:57 – Recursive self-improvement and AGI timelines
35:56 – Will AI replace software engineers? Real-world experience at Inception
37:54 – Professor vs. startup founder: different execution, similar mission
39:56 – The founding story of Inception AI: from ICML Best Paper to company
42:30 – The researcher-to-founder pipeline and big funding rounds
45:02 – PhD vs. industry in 2026: the widening financial gap
47:30 – The industry in 5-10 years: Stefano's outlook

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
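The parallel-decoding advantage described above can be illustrated with a toy sketch. Everything here is a stand-in (the `<mask>` token, the `toy_denoiser` that picks from a tiny fixed vocabulary), not Inception's actual method: the point is only the control flow of masked discrete diffusion, where sampling starts from all masks and commits several positions per denoising pass, so latency scales with the number of steps rather than the number of tokens.

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens):
    # Stand-in for a trained model: a real diffusion LM predicts a
    # distribution over the vocabulary for every masked position at once.
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    return [random.choice(vocab) if t == MASK else t for t in tokens]

def diffusion_decode(length=8, steps=4):
    """Start from all-masked text and unmask it over `steps` parallel passes."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        proposal = toy_denoiser(tokens)           # predict all positions in parallel
        target_unmasked = length * step // steps  # how many tokens to commit by now
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        to_commit = target_unmasked - (length - len(masked))
        for i in random.sample(masked, to_commit):
            tokens[i] = proposal[i]               # commit several tokens per step
    return tokens
```

An autoregressive decoder would need `length` forward passes to produce the same sequence; this loop needs only `steps`, which is the latency argument made in the episode.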
Naomi Saphra, Kempner Research Fellow at Harvard and incoming Assistant Professor at Boston University, joins us to explain why you can't do interpretability without understanding training dynamics, in the same way you can't do biology without evolution.

Naomi argues that many structures researchers find inside trained models are vestigial: they mattered early in training but are meaningless by the end. Grokking is one case of a broader phenomenon: models go through multiple consecutive phase transitions during training, driven by symmetry breaking and head specialization, but the smooth loss curve hides all of it.

We talk about why training is nothing like human learning, and why our intuitions about what's hard for models are consistently wrong: code in pretraining helps language reasoning, tokenization drives behaviors people attribute to deeper cognition, and language already encodes everything humans care about.

We also get into why SAEs are basically topic models, the Platonic representation hypothesis, using AI to decode animal communication, and why non-determinism across training runs is a real problem that RL and MoE might be making worse.

Timeline:
(00:12) Introduction and guest welcome
(01:01) Why training dynamics matter - the evolutionary biology analogy
(03:05) Jennifer Aniston neurons and the danger of biological parallels
(04:48) What is grokking and why it's one instance of a broader phenomenon
(08:25) Phase transitions, symmetry breaking, and head specialization
(11:53) Double descent, overfitting, and the death of classical train-test splits
(15:10) Training is nothing like learning
(16:08) Scaling axes - data, model size, compute, and why they're not interchangeable
(19:29) Data quality, code as reasoning fuel, and GPT-2's real contribution
(20:43) Multilingual models and the interlingua hypothesis
(25:58) The Platonic representation hypothesis and why image classification was always multimodal
(29:12) Sparse autoencoders, interpretability, and Marr's levels
(37:32) Can we ever truly understand what models know?
(43:59) The language modality chauvinist argument
(51:55) Vision, redundancy, and self-supervised learning
(57:18) World models - measurable capabilities over philosophical definitions
(1:00:14) Is coding really a solved task?
(1:04:18) Non-determinism, scaling laws, and why one training run isn't enough
(1:10:12) Naomi's new lab at BU and recruiting

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Stefano Soatto, VP for AI at AWS, Professor at UCLA, and the person responsible for agentic AI at AWS, joins us to explain why building reliable AI agents is fundamentally a control theory problem.

Stefano sees LLMs as stochastic dynamical systems that need to be controlled, not just prompted. He introduces "strands coding," a new framework AWS is building that sits between vibe coding and spec coding: you write a skeleton with AI functions constrained by pre- and post-conditions, verifying intent before a single line of code is generated. The surprising part: even as AI coding adoption goes up, developer trust in the output is going down.

We go deep into the philosophy of models and the world. Stefano argues that the dichotomy between "language models" and "world models" doesn't really exist: a reasoning engine trained on rich enough data is a world model. He walks us through why naive realism is indefensible, how reverse diffusion was originally intended to show that models can't be identical to reality, and why that matters now.

We also discuss three types of information (Shannon, algorithmic, and conceptual) and why algorithmic information is the one that actually matters to agents. Synthetic data doesn't add Shannon information, but it adds algorithmic information, which is why it works. Intelligence isn't about scaling to Solomonoff's universal induction; it's about learning to solve new problems fast.

Takeaways:
Vibe coding is local feedback control with high cognitive load; spec coding is open-loop global control with silent failures. Neither scales well alone.
Trust in AI-generated code is declining even as adoption rises.
The distinction between next-token prediction and world models is mostly nomenclature: reasoning engines operating on multimodal data are world models.
Algorithmic information, not Shannon information, is what matters in the agentic setting.
Intelligence isn't minimizing inference uncertainty; it's minimizing time to solve unforeseen tasks.
The intent gap between user and model cannot be fully automated or delegated.

Timeline:
(00:13) Introduction and guest welcome
(01:12) How the agentic era changed machine learning
(06:11) Vibe coding one year later
(07:23) Vibe vs. spec vs. strands coding
(14:30) Why English is not a programming language
(16:36) Constrained generation and agent choreography
(20:44) Diffusion models vs. autoregressive models
(25:59) The Platonic representation hypothesis and naive realism
(31:14) Synthetic data and the information bottleneck
(36:22) Three types of information: Shannon, algorithmic, conceptual
(38:47) Scaling laws and Solomonoff induction
(42:14) World models and the Goethian vs. Marrian approach
(49:00) Encoding vs. generation and JEPA-style training
(55:50) Are language models already world models?
(59:13) Closing thoughts on trust, education, and responsibility

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Tanishq Abraham, CEO and co-founder of Sophont.ai, joins us to talk about building foundation models specifically for medicine.

Sophont is trying to be something like an OpenAI or Anthropic for healthcare: training models across pathology, neuroimaging, and clinical text, to eventually fuse them into one multimodal system. The surprising part: their pathology model, trained on 12,000 public slides, performs on par with models trained on millions of private ones. Data quality beats data quantity.

We talk about what actually excites Tanishq, which is not replacing doctors but finding things doctors can't see: AI predicting gene mutations from a tissue slide, or cardiovascular risk from an eye scan.

We also talk about regulation, and how the picture is less scary than people assume. Text-based clinical decision support can ship without FDA approval. Pharma partnerships offer near-term impact. The five-to-ten-year timeline people fear is really about drug discovery, not all of medical AI.

Takeaways:
The real promise of medical AI is finding hidden signals in existing data, not just automating doctors.
Small, curated public datasets can rival massive private ones.
Multimodal fusion is the goal, but you need strong individual encoders first.
AI research itself might get automated sooner than biology or chemistry.
FDA regulation has more flexibility than most people think.

Timeline:
(00:12) Introduction and guest welcome
(02:32) Anthropic's ad about ChatGPT ads
(07:26) xAI merging into SpaceX
(13:32) Vibe coding one year later
(17:00) Claude Code and agentic workflows
(21:52) Can AI automate AI research?
(26:57) What is medical AI?
(31:06) Sophont as a frontier medical AI lab
(33:52) Public vs. private data: 12K slides vs. millions
(36:43) Domain expertise vs. scaling
(41:54) Cancer, diabetes, and personal stakes
(47:52) Classification vs. prediction in medicine
(50:36) When doctors disagree
(54:43) Quackery and AI
(57:15) Uncertainty in medical AI
(1:03:11) Will AI replace doctors?
(1:07:24) Self-supervised learning on sleep data
(1:10:10) Aligning modalities
(1:13:17) FDA regulation
(1:22:28) Closing

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Anastasios Angelopoulos, Co-Founder and CEO of Arena AI (formerly LMArena), joins us to talk about why static benchmarks are failing, how human preference data actually works under the hood, and what it takes to be the "gold standard" of AI evaluation.

Anastasios sits at a fascinating intersection: a theoretical statistician running the platform that every major lab watches when it releases a model. We talk about the messiness of AI-generated code slop (yes, he hides Claude's commits too), then dig into the statistical machinery that powers Arena's leaderboards and why getting evaluation right is harder than most people think.

We explore why style control is both necessary and philosophically tricky: you can regress away markdown headers and response length, but separating style from substance is a genuinely unsolved causal inference problem. We also get into why users are surprisingly good judges of model quality, how Arena serves as a pre-release testing ground for labs shipping stealth models under codenames, and whether the fragmentation of the AI market (Anthropic going enterprise, OpenAI going consumer, everyone going multimodal) is actually a feature, not a bug.

Plus, we discuss the role of rigorous statistics in the age of "just run it again," why structured decoding can hurt model performance, and what Arena's 2026 roadmap looks like.

Timeline:
(00:12) Introduction and Anastasios's Background
(00:55) What Arena Does and Why Static Benchmarks Aren't Enough
(02:26) Coverage of Use Cases - Is There Enough?
(04:22) Style Control and the Bradley-Terry Methodology
(08:35) Can You Actually Separate Style from Substance?
(10:24) Measuring Slop - And the Anti-Slop Paper Plug
(11:52) Can Users Judge Factual Correctness?
(13:31) Tool Use and Agentic Evaluation on Arena
(14:14) Intermediate Feedback Signals Beyond Final Preference
(15:30) Tool Calling Accuracy and Code Arena
(17:42) AI-Generated Code Slop and Hiding Claude's Commits
(19:49) Do We Need Separate Code Streams for Humans and LLMs?
(20:01) RL Flywheels and Arena's Preference Data
(21:16) Focus as a Startup - Being the Evaluation Company
(22:16) Structured vs. Unconstrained Generation
(25:00) The Role of Rigorous Statistics in the Age of AI
(29:23) LLM Sampling Parameters and Evaluation Complexity
(30:56) Model Versioning and the Frequentist Approach to Fairness
(32:12) Quantization and Its Effects on Model Quality
(33:10) Pre-Release Testing and Stealth Models
(34:23) Transparency - What to Share with the Public vs. Labs
(36:27) When Winning Models Don't Get Released
(36:59) Why Users Keep Coming Back to Arena
(38:19) Market Fragmentation and Arena's Future Value
(39:37) Custom Evaluation Frameworks for Specific Users
(40:03) Arena's 2026 Roadmap - Science, Methodology, and New Paradigms
(42:15) The Economics of Free Inference
(43:13) Hiring and Closing Thoughts

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
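The Bradley-Terry methodology mentioned in the timeline can be sketched in a few lines. This is a minimal toy fit by gradient ascent on made-up battle data, not Arena's pipeline; in particular it omits the style-control covariates (response length, markdown counts) discussed in the episode. The model says P(i beats j) depends only on the difference of latent scores:

```python
import math

def fit_bradley_terry(battles, n_models, lr=0.1, iters=500):
    """Fit latent Bradley-Terry scores from (winner, loser) index pairs.

    Model: P(i beats j) = 1 / (1 + exp(scores[j] - scores[i])).
    """
    scores = [0.0] * n_models
    for _ in range(iters):
        grad = [0.0] * n_models
        for winner, loser in battles:
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            grad[winner] += 1.0 - p_win   # push the winner up by the "surprise"
            grad[loser] -= 1.0 - p_win    # and the loser down symmetrically
        scores = [s + lr * g / len(battles) for s, g in zip(scores, grad)]
        mean = sum(scores) / n_models
        scores = [s - mean for s in scores]  # pin the mean: BT is shift-invariant
    return scores

# Hypothetical battle log: model 0 usually beats 1, which usually beats 2.
battles = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 7 + [(2, 1)] * 3
ratings = fit_bradley_terry(battles, n_models=3)
```

Style control, as described in the episode, extends this logistic model with extra regression terms for stylistic features, so the fitted model scores reflect substance after those features are accounted for.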
Fred Sala, Assistant Professor at UW-Madison and Chief Scientist at Snorkel AI, joins us to talk about why personalization might be the next frontier for LLMs, why data still matters more than architecture, and how weak supervision refuses to die.

Fred sits at a rare intersection, building the theory of data-centric AI in academia while shipping it to enterprise clients at Snorkel. We talk about the chaos of OpenClaw (the personal AI assistant that's getting people hacked the old-fashioned way, via open ports), then focus on one of the most important questions: how do you make a model truly yours?

We dig into why prompting your preferences doesn't scale, why even LoRA might be too expensive for per-user personalization, and why activation steering methods like ReFT could be the sweet spot. We also explore self-distillation for continual learning, the unsolved problem of building realistic personas for evaluation, and Fred's take on the data vs. architecture debate (spoiler: data is still undervalued). Plus, we discuss why the internet's "Ouroboros effect" might not doom pre-training as much as people fear, and what happens when models become smarter than the humans who generate their training data.

Takeaways:
Personalization requires ultra-efficient methods; even one LoRA per user is probably too expensive. Activation steering is the promising middle ground.
The "pink elephant problem" makes prompt-based personalization fundamentally limited: telling a model what not to do often makes it do it more.
Self-distillation can enable on-policy continual learning without expensive RL reward functions, dramatically reducing catastrophic forgetting.
Data is still undervalued relative to architecture and compute, especially high-quality post-training data, which is actually improving, not getting worse.
Weak supervision principles are alive and well inside modern LLM data pipelines, even if people don't call it that anymore.

Timeline:
(00:13) Introduction and Fred's Background
(00:39) OpenClaw - The Personal AI Assistant Taking Over Macs
(03:43) Agent Security Risks and the Privacy Problem
(05:13) Claude Code, Permissions, and Living Dangerously
(07:47) AI Social Media and Agents Talking to Each Other
(08:56) AI Persuasion and Competitive Debate
(09:51) Self-Distillation for Continual Learning
(12:43) What Does Continual Learning Actually Mean?
(14:12) Updating Weights on the Fly - A Grand Challenge
(15:09) The Personalization Problem - Motivation and Use Cases
(17:41) The Pink Elephant Problem with Prompt-Based Personalization
(19:58) Taxonomy of Personalization - Preferences vs. Tone vs. Style
(21:31) Activation Steering, ReFT, and Parameter-Efficient Fine-Tuning
(27:00) Evaluating Personalization - Benchmarks and Personas
(31:14) Unlearning and Un-Personalization
(31:51) Cultural Alignment as Group-Level Personalization
(41:00) Can LLM Personas Replace Surveys and Polling?
(44:32) Is Continued Pre-Training Still Relevant?
(46:28) Data vs. Architecture - What Matters More?
(52:25) Multi-Epoch Training - Is It Over?
(54:53) What Makes Good Data? Matching Real-World Usage
(59:23) Decomposing Uncertainty for Better Data Selection
(1:01:52) Mapping Human Difficulty to Model Difficulty
(1:04:49) Scaling Small Ideas - From Academic Proof to Frontier Models
(1:12:01) What Happens When Models Surpass Human Training Data?
(1:15:24) Closing Thoughts

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Bayan Bruss, VP of Applied AI at Capital One, joins us to talk about building AI systems that can make autonomous financial decisions, and why money might be the hardest problem in machine learning.

Bayan leads Capital One's AI Foundations team, where they're working toward a destination most people don't associate with banking: getting AI systems to perceive financial ecosystems, form beliefs about the future, and take actions based on those beliefs. It's a framework that sounds simple until you realize you're asking a model to predict whether someone will pay back a loan over 30 years while the world changes around them.

We get into why LLMs are a bad fit for ingesting 5,000 credit card transactions, why synthetic data works surprisingly well for time series, and the tension between end-to-end learning and regulatory requirements that demand you know exactly what your model learned. We also discuss reasoning in language vs. in latent space: if you wouldn't trust a self-driving car that translated images to words before deciding to turn, should you trust a financial system that does all its reasoning in token space?

Takeaways:
Money is a behavioral science problem: AI in finance requires understanding people, not just numbers.
Foundation models pre-trained on web text don't outperform purpose-built models for financial tasks. You're better off building a standalone encoder for financial data.
Synthetic data works surprisingly well for time series, possibly because real-world time series lives on a simpler manifold than we assume.
Explainability in ML is fundamentally unsatisfying because people want causality from non-causal models.
Financial AI needs world models that can imagine alternative futures, not just fit historical data.

Timeline:
(00:24) Introduction and Bayan's Background
(00:42) Claude Code, Vibe Coding - Hype or AGI?
(05:59) The Future of Software Engineering and Abstraction
(11:20) Abstraction Layers and Karpathy's Take
(13:54) Hamming, Kuhn, and Scientific Revolutions in AI
(19:24) Stack Overflow's Decline and Proof of Humanity
(23:07) Why We Still Trust Humans Over LLMs
(30:45) Deep Dive: AI in Banking and Consumer Finance
(34:17) Are Markets Efficient? Behavioral Economics vs. Classical Views
(37:14) The Components of a Financial Decision: Perception, Belief, Action
(42:15) Protected Variables, Proxy Features, and Fairness in Lending
(45:05) Explainability: Roller Skating on Marbles
(47:55) Sparse Autoencoders, Interpretability, and Turtles All the Way Down
(51:57) Foundation Models for Finance - Web Text vs. Purpose-Built
(53:09) Time Series, Synthetic Data, and TabPFN
(59:44) Feeding Tabular Data to VLMs - Graphs Beat Raw Numbers
(1:03:35) Reasoning in Language vs. Latent Space
(1:08:24) Is Language the Optimal Representation? Chinese Compression and Information Density
(1:13:37) Personalization and Predicting Human Behavior
(1:21:36) World Models, Uncertainty, and Professional Worrying
(1:24:07) Prediction Markets and Insider Betting
(1:26:33) Can LLMs Predict Stocks?
(1:29:11) Multi-Agent Systems for Financial Decisions

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
David Mezzetti, creator of txtai, joins us to talk about building open source AI frameworks as a solo developer, and why local-first AI still matters in the age of API-everything.

David's path, from running a 50-person IT company through acquisition to building one of the most well-regarded AI orchestration libraries, shows how constraints can breed better design. txtai started during COVID, when he was doing coronavirus literature research and realized semantic search could transform how we find information.

We get into the evolution of the AI framework landscape, from the early days of vector embeddings to RAG to LLM orchestration. David was initially stubborn about not supporting OpenAI's API, wanting to keep everything local. He admits that probably cost him some early traction compared to LangChain, but it also shaped txtai's philosophy: you shouldn't need permission to build with AI.

We also talk about small models and some genuinely practical insights: a 20-million-parameter model running on CPU might be all you need. On the future of coding with AI, David has come around on "vibe coding" and notes that well-documented frameworks with lots of examples are perfectly positioned for this new world.

Takeaways:
Local-first AI gives you control, reproducibility, and often better performance for your domain.
Small models (even 20M parameters) can solve real problems on CPU.
Good documentation and examples make your framework AI-coding friendly.
Open source should mean actually contributing, not just publishing code.
Solo developers can compete by staying focused and being willing to evolve.

Timeline:
(00:14) Introduction and David's Background
(07:44) txtai History and Evolution
(12:04) Framework Landscape: LangChain, LlamaIndex, Haystack
(15:16) Can AI Re-implement Frameworks?
(24:14) API Specs: OpenAI vs Anthropic
(26:46) Running an Open Source Consulting Business
(32:51) Origin Story: COVID, Kaggle, and Medical Literature
(43:08) Open Source Philosophy and Giving Back
(47:16) Ethics of Local AI and Developer Freedom
(01:06:44) Human in the Loop and AI-Generated Code
(01:09:31) The Future of Work and Automation

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Cody Blakeney from Datology AI joins us to talk about data curation: the unglamorous but critical work of figuring out what to actually train models on.

Cody's path from writing CUDA kernels to spending his days staring at weird internet text tells you something important: data quality can account for half or more of a model's final performance. That's on par with major architectural breakthroughs.

We get into the differences between pre-training, mid-training, and post-training data. Mid-training in particular has become a key technique for squeezing value out of rare, high-quality datasets. Cody's team stumbled onto it while solving a practical problem: how do you figure out whether a 5-billion-token dataset is actually useful when you can't afford hundreds of experimental runs?

We also talk about data filtering and some genuinely surprising findings: the documents that make the best training data are often short and dense with information. Those nicely written blog posts with personal anecdotes? It turns out models don't learn as well from them.

On synthetic data, Cody thinks pre-training is still in its early days, where most techniques are variations on a few core ideas, but there's huge potential. He's excited about connecting RL failures back to mid-training: when models fail at tasks, use that signal to generate targeted training data.

Takeaways:
Data work is high-leverage but underappreciated.
Mid-training helps extract signal from small, valuable datasets.
Good filters favor dense, factual text over polished prose.
Synthetic data for pre-training works surprisingly well, but remains primitive.
Optimal data mixtures depend on model scale; smaller models need more aggressive distribution shifts.

Timeline:
(00:12) Introduction to Data Curation in LLMs
(05:14) The Importance of Data Quality
(10:15) Pre-training vs Post-training Data
(15:22) Strategies for Effective Data Utilization
(20:15) Benchmarking and Model Evaluation
(28:28) Maximizing Perplexity and Coherence
(30:27) Measuring Quality in Data
(32:56) The Role of Filters in Data Selection
(34:19) Understanding High-Quality Data
(39:15) Mid-Training and Its Importance
(46:51) Future of Data Sources
(48:13) Synthetic Data's Role in Pre-Training
(53:10) Creating Effective Synthetic Data
(57:39) The Debate on Pure Synthetic Data
(01:00:25) Navigating AI Training and Legal Challenges
(01:02:34) The Controversy of AI in the Art Community
(01:05:29) Exploring Synthetic Data and Its Efficiency
(01:11:21) The Future of Domain-Specific vs. General Models
(01:22:06) Bias in Pre-trained Models and Data Selection
(01:28:27) The Potential of Synthetic Data Over Human Data

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Guest: Niloofar Mireshghallah (Incoming Assistant Professor at CMU, Member of Technical Staff at Humans and AI)

In this episode, we dive into AI privacy, frontier model capabilities, and why academia still matters.

We kick off by discussing GPT-5.2 and whether models rely more on parametric knowledge or context. Niloofar shares how reasoning models actually defer to context, even accepting obviously false information to "roll with it."

On privacy, Niloofar challenges conventional wisdom: memorization isn't the problem anymore. The real threats are aggregation attacks (finding someone's pet name in HTML metadata), inference attacks (models are expert geoguessers), and input-output leakage in agentic workflows.

We also explore linguistic colonialism in AI: how models fail for non-English languages, sometimes inventing cultural traditions.

The episode wraps with a call for researchers to tackle problems industry ignores: AI for science, education tools that preserve the struggle of learning, and privacy-preserving collaboration between small local models and large commercial ones.

Timeline:
[0:00] Intro
[1:03] GPT-5.2 first impressions and skepticism about the data cutoff claims
[4:17] Parametric vs. context memory: when do models trust training vs. the prompt?
[9:28] The messy problem of memory, weights, and online learning
[16:12] Tool use changes model behavior in unexpected ways
[17:15] OpenAI's "Advances in Sciences" paper and human-AI collaboration
[24:17] Why deep research is getting less useful
[28:17] Pre-training vs. post-training: which matters more?
[30:35] Non-English languages and AI failures
[33:23] Hilarious Farsi bugs: "I'll get back to you in a few days" and invented traditions
[37:56] Linguistic colonialism: ChatGPT changed how we write
[41:20] Why memorization isn't the real privacy threat
[47:14] The three actual privacy problems: inference, aggregation, input-output leakage
[54:33] Deep research stalking experiment: finding a cat's name in HTML
[1:01:13] Privacy solutions for agentic systems
[1:03:23] What Niloofar's excited about: AI for scientists, small models, niche problems
[1:08:31] AI for education without killing the learning process
[1:09:15] Closing: underrated life advice on health and sustainable habits

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Yann LeCun – Why LLMs Will Never Get Us to AGI

"The path to superintelligence - just train up the LLMs, train on more synthetic data, hire thousands of people to school your system in post-training, invent new tweaks on RL - I think is complete bullshit. It's just never going to work."

After 12 years at Meta, Turing Award winner Yann LeCun is betting his legacy on a radically different vision of AI. In this conversation, he explains why Silicon Valley's obsession with scaling language models is a dead end, why the hardest problem in AI is reaching dog-level intelligence (not human-level), and why his new company AMI is building world models that predict in abstract representation space rather than generating pixels.

Timestamps:
(00:00:14) Intro and welcome
(00:01:12) AMI: Why start a company now?
(00:04:46) Will AMI do research in the open?
(00:06:44) World models vs LLMs
(00:09:44) History of self-supervised learning
(00:16:55) Siamese networks and contrastive learning
(00:25:14) JEPA and learning in representation space
(00:30:14) Abstraction hierarchies in physics and AI
(00:34:01) World models as abstract simulators
(00:38:14) Object permanence and learning basic physics
(00:40:35) Game AI: Why NetHack is still impossible
(00:44:22) Moravec's Paradox and chess
(00:55:14) AI safety by construction, not fine-tuning
(01:02:52) Constrained generation techniques
(01:04:20) Meta's reorganization and FAIR's future
(01:07:31) SSI, Physical Intelligence, and Wayve
(01:10:14) Silicon Valley's "LLM-pilled" monoculture
(01:15:56) China vs US: The open source paradox
(01:18:14) Why start a company at 65?
(01:25:14) The AGI hype cycle has happened 6 times before
(01:33:18) Family and personal background
(01:36:13) Career advice: Learn things with a long shelf life
(01:40:14) Neuroscience and machine learning connections
(01:48:17) Continual learning: Is catastrophic forgetting solved?

Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Atlas Wang (UT Austin faculty, XTX Research Director) joins us to explore two fascinating frontiers: the foundations of symbolic AI and the practical challenges of building AI systems for quantitative finance.

On the symbolic AI side, Atlas shares his recent work proving that neural networks can learn symbolic equations through gradient descent, a surprising result given that gradient descent is continuous while symbolic structures are discrete. We talk about why neural nets learn clean, compositional mathematical structures at all, what mathematical tools are involved, and the broader implications for understanding reasoning in AI systems.

The conversation then turns to neuro-symbolic approaches in practice: agents that discover rules through continued learning, propose them symbolically, verify them against domain-specific checkers, and refine their understanding.

On the finance side, Atlas pulls back the curtain on what AI research looks like at a high-frequency trading firm. The core problem sounds simple (predict future prices from past data), but the challenge is extreme: markets are dominated by noise, predictions hover near zero correlation, and success means eking out tiny margins across astronomical numbers of trades. He explains why synthetic data techniques that work elsewhere don't translate easily, and why XTX is building time series foundation models rather than adapting language models.

We also discuss the convergence of hiring between frontier AI labs and quantitative finance, and why this is an exceptional moment for ML researchers to consider the finance industry.

Links:
Why Neural Network Can Discover Symbolic Structures with Gradient-based Training: An Algebraic and Geometric Foundation for Neurosymbolic Reasoning - arxiv.org/abs/2506.21797
Atlas's website - https://www.vita-group.space/

Guest: Atlas Wang (UT Austin / XTX)
Hosts: Ravid Shwartz-Ziv & Allen Roush

Music: “Kid Kodi” — Blue Dot Sessions. Source: Free Music Archive. Licensed CC BY-NC 4.0.
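The propose-verify-refine loop described above can be sketched as a toy program. This is a minimal illustration of the workflow, not the actual system from the episode; the candidate rules and checker below are hypothetical stand-ins for what a neural proposer and a domain-specific verifier would provide.

```python
# Toy sketch of a propose-verify-refine loop for symbolic rule discovery.
# In the real setting, a neural model proposes candidate rules and a
# domain-specific checker verifies them; here both are hand-coded stubs.

def propose_rules():
    # Hypothetical candidate pool; in practice these come from a model.
    return {
        "y = x + 1": lambda x: x + 1,
        "y = 2 * x": lambda x: 2 * x,
        "y = x ** 2": lambda x: x ** 2,
    }

def verify(rule_fn, data):
    # Domain checker: the symbolic rule must hold exactly on every example.
    return all(rule_fn(x) == y for x, y in data)

def discover(data):
    # Keep only the proposals that survive verification against the data.
    return [name for name, fn in propose_rules().items() if verify(fn, data)]

data = [(1, 2), (2, 4), (3, 6)]
print(discover(data))  # prints ['y = 2 * x']
```

In a refinement step, the failed candidates (and where they failed) would be fed back to the proposer to generate better hypotheses.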
In this episode, we hosted Judah Goldfeder, a PhD candidate at Columbia University and student researcher at Google, to discuss robotics, reproducibility in ML, and smart buildings.

Key topics covered:

Robotics challenges: We discussed why robotics remains harder than many expected, compared to LLMs. The real world is unpredictable and unforgiving, and mistakes have physical consequences. Sim-to-real transfer remains a major bottleneck because simulators are tedious to configure accurately for each robot and environment. Unlike text, robotics lacks foundation models, partly due to limited clean, annotated datasets and the difficulty of collecting diverse real-world data.

Reproducibility crisis: We discussed how self-reported benchmarks can lead to p-hacking and irreproducible results. Centralized evaluation systems (such as Kaggle or the ImageNet challenges), where researchers submit algorithms for testing on hidden test sets, seem to drive faster progress.

Smart buildings: Judah's work at Google focuses on using ML to optimize HVAC systems, potentially reducing energy costs and carbon emissions significantly. The challenge is that every building is different, which makes simulation configuration extremely labor-intensive. Generative AI could help by automating the conversion of floor plans or images into accurate building simulations.

Links:
Judah's website - https://judahgoldfeder.com/

Music:
"Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
"Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
Changes: trimmed
In this episode, we talk with Will Brown, a research lead at Prime Intellect, about his journey into reinforcement learning (RL) and multi-agent systems, exploring their theoretical foundations and practical applications. We discuss the importance of RL in the current LLM pipeline and the challenges it faces, applying agentic workflows to real-world applications, and the ongoing evolution of AI development.

Chapters
00:00 Introduction to Reinforcement Learning and Will's Journey
03:10 Theoretical Foundations of Multi-Agent Systems
06:09 Transitioning from Theory to Practical Applications
09:01 The Role of Game Theory in AI
11:55 Exploring the Complexity of Games and AI
14:56 Optimization Techniques in Reinforcement Learning
17:58 The Evolution of RL in LLMs
21:04 Challenges and Opportunities in RL for LLMs
23:56 Key Components for Successful RL Implementation
27:00 Future Directions in Reinforcement Learning
36:29 Exploring Agentic Reinforcement Learning Paradigms
38:45 The Role of Intermediate Results in RL
41:16 Multi-Agent Systems: Challenges and Opportunities
45:08 Distributed Environments and Decentralized RL
49:31 Prompt Optimization Techniques in RL
52:25 Statistical Rigor in Evaluations
55:49 Future Directions in Reinforcement Learning
59:50 Task-Specific Models vs. General Models
01:02:04 Insights on Random Verifiers and Learning Dynamics
01:04:39 Real-World Applications of RL and Evaluation Challenges
01:05:58 Prime RL Framework: Goals and Trade-offs
01:10:38 Open Source vs. Closed Source Models
01:13:08 Continuous Learning and Knowledge Improvement

Music:
"Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
"Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
Changes: trimmed
In this episode, we discuss various topics in AI, including the challenges of the conference review process, the capabilities of Kimi K2 Thinking, advancements in TPU technology, the significance of real-world data in robotics, and recent innovations in AI research. We also talk about the "Chain-of-Thought Hijacking" paper, how to scale RL with simple ideas, and the implications of the Kosmos project, which aims to enable autonomous scientific discovery through AI.

Papers and links:
Chain-of-Thought Hijacking - https://arxiv.org/pdf/2510.26418
Kosmos: An AI Scientist for Autonomous Discovery - https://t.co/9pCr6AUXAe
JustRL: Scaling a 1.5B LLM with a Simple RL Recipe - https://relieved-cafe-fe1.notion.site/JustRL-Scaling-a-1-5B-LLM-with-a-Simple-RL-Recipe-24f6198b0b6b80e48e74f519bfdaf0a8

Chapters
00:00 Navigating the Peer Review Process
04:17 Kimi K2 Thinking: A New Era in AI
12:27 The Future of Tool Calls in AI
17:12 Exploring Google's New TPUs
22:04 The Importance of Real-World Data in Robotics
28:10 World Models: The Next Frontier in AI
31:36 Nvidia's Dominance in AI Partnerships
32:08 Exploring Recent AI Research Papers
37:46 Chain of Thought Hijacking: A New Threat
43:05 Simplifying Reinforcement Learning Training
54:03 Kosmos: AI for Autonomous Scientific Discovery

Music:
"Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
"Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
Changes: trimmed
In this episode, we sit down with Alex Alemi, an AI researcher at Anthropic (previously at Google Brain and Disney), to explore the information bottleneck framework and its profound implications for modern machine learning.

We break down what the information bottleneck really means: a principled approach to retaining only the most informative parts of the data while compressing away the irrelevant. We discuss why compression still matters in our era of big data, how it prevents overfitting, and why it's essential for building models that generalize well.

We also dive into scaling laws: why they matter, what we can learn from them, and what they tell us about the future of AI research.

Papers and links:
Alex's website - https://www.alexalemi.com/
Scaling exponents across parameterizations and optimizers - https://arxiv.org/abs/2407.05872
Deep Variational Information Bottleneck - https://arxiv.org/abs/1612.00410
Layer by Layer: Uncovering Hidden Representations in Language Models - https://arxiv.org/abs/2502.02013
Information in Infinite Ensembles of Infinitely-Wide Neural Networks - https://proceedings.mlr.press/v118/shwartz-ziv20a.html

Music:
“Kid Kodi” — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
“Palms Down” — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
Changes: trimmed
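For listeners who want the formal statement: in the standard formulation (Tishby, Pereira & Bialek), the information bottleneck seeks a stochastic representation T of the input X that stays informative about a target Y while compressing X, by optimizing

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

where I(·;·) is mutual information and the trade-off parameter β sets how much predictive information about Y is worth per bit of retained input information: small β favors aggressive compression, large β favors keeping everything predictive.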
In this episode, we talked about AI news and recent papers. We explored the complexities of using AI models in healthcare through the Nature Medicine paper on GPT-5's fragile intelligence in medical contexts. We discussed the delicate balance between leveraging LLMs as powerful research tools and the risks of over-reliance, touching on hallucinations, medical disagreements among practitioners, and the need for better education on responsible AI use in healthcare.

We also talked about Stanford's "Cartridges" paper, which presents an innovative approach to long-context language models. The paper tackles the expensive computational cost of billion-token context windows by compressing KV caches through a clever "self-study" method using synthetic question-answer pairs and context distillation. We discussed the implications for personalization, composability, and making long-context models more practical.

Additionally, we explored the "Continuous Autoregressive Language Models" paper and touched on insights from the Smol Training Playbook.

Papers discussed:
The fragile intelligence of GPT-5 in medicine - https://www.nature.com/articles/s41591-025-04008-8
Cartridges: Lightweight and general-purpose long context representations via self-study - https://arxiv.org/abs/2506.06266
Continuous Autoregressive Language Models - https://arxiv.org/abs/2510.27688
The Smol Training Playbook - https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook

Music:
“Kid Kodi” — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
“Palms Down” — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
Changes: trimmed

This is an experimental format for us: just news and papers, with no guest interview. Let us know what you think!
In this episode, we host Jonas Geiping from the ELLIS Institute & Max Planck Institute for Intelligent Systems, Tübingen AI Center, Germany. We discuss his research on Recurrent-Depth Models and latent reasoning in large language models (LLMs): what these models can and can't do, the challenges and next breakthroughs in the field, world models, and the future of developing better models. We also talk about safety and interpretability, and the role of scaling laws in AI development.

Chapters
00:00 Introduction and Guest Introduction
01:03 Peer Review in Preprint Servers
06:57 New Developments in Coding Models
09:34 Open Source Models in Europe
11:00 Dynamic Layers in LLMs
26:05 Training Playbook Insights
30:05 Recurrent Depth Models and Reasoning Tasks
43:59 Exploring Recursive Reasoning Models
46:46 The Role of World Models in AI
48:41 Innovations in AI Training and Simulation
50:39 The Promise of Recurrent Depth Models
52:34 Navigating the Future of AI Algorithms
54:44 The Bitter Lesson of AI Development
59:11 Advising the Next Generation of Researchers
01:06:42 Safety and Interpretability in AI Models
01:10:46 Scaling Laws and Their Implications
01:16:19 The Role of PhDs in AI Research

Links and papers:
Jonas' website - https://jonasgeiping.github.io/
Scaling up test-time compute with latent reasoning: A recurrent depth approach - https://arxiv.org/abs/2502.05171
The Smol Training Playbook: The Secrets to Building World-Class LLMs - https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook
VaultGemma: A Differentially Private Gemma Model - https://arxiv.org/abs/2510.15001

Music:
“Kid Kodi” — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
“Palms Down” — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
Changes: trimmed
In this episode of the Information Bottleneck Podcast, we host Jack Morris, a PhD student at Cornell, to discuss adversarial examples (Jack created TextAttack, one of the first frameworks for adversarial attacks on NLP models), the Platonic representation hypothesis, the implications of inversion techniques, and the role of compression in language models.

Links:
Jack's Website - https://jxmo.io/
TextAttack - https://arxiv.org/abs/2005.05909
How much do language models memorize? - https://arxiv.org/abs/2505.24832
DeepSeek OCR - https://www.arxiv.org/abs/2510.18234

Chapters:
00:00 Introduction and AI News Highlights
04:53 The Importance of Fine-Tuning Models
10:01 Challenges in Open Source AI Models
14:34 The Future of Model Scaling and Sparsity
19:39 Exploring Model Routing and User Experience
24:34 Jack's Research: TextAttack and Adversarial Examples
29:33 The Platonic Representation Hypothesis
34:23 Implications of Inversion and Security in AI
39:20 The Role of Compression in Language Models
44:10 Future Directions in AI Research and Personalization
In this episode, we talk with Randall Balestriero, an assistant professor at Brown University, about the potential and challenges of Joint Embedding Predictive Architectures (JEPA). We explore the concept of JEPA, which aims to learn good data representations without reconstruction-based learning. We talk about the importance of compressing away irrelevant details, the role of prediction tasks, and the challenges of preventing representation collapse.
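As a rough numeric sketch of the core idea (illustrative only; the JEPA models discussed use learned deep encoders, and preventing collapse is exactly the hard part this toy ignores), the objective predicts the target's *embedding* from a context view, rather than reconstructing raw inputs:

```python
import numpy as np

# Minimal sketch of a JEPA-style objective: the loss lives in representation
# space, so pixel-level detail the encoder discards never has to be predicted.
# The encoder here is a fixed random linear map, a stand-in for a learned one.

rng = np.random.default_rng(0)
D_in, D_emb = 8, 4
W_enc = rng.normal(size=(D_in, D_emb))    # shared encoder (hypothetical)
W_pred = rng.normal(size=(D_emb, D_emb))  # predictor head (hypothetical)

def encode(x):
    return x @ W_enc

def jepa_loss(context, target):
    # Predict the target's embedding from the context's embedding.
    pred = encode(context) @ W_pred
    return float(np.mean((pred - encode(target)) ** 2))

x = rng.normal(size=D_in)
context = x + 0.1 * rng.normal(size=D_in)  # a corrupted view of the sample
target = x
print(jepa_loss(context, target))
```

A trained system would also need a mechanism (architectural or regularization-based) to stop the trivial solution where the encoder maps everything to a constant, which is the collapse problem mentioned above.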