Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post
Digest
This episode features Olive Song, a senior researcher at Minimax, discussing their advanced AI models, particularly the M2 model, which excels in coding and agentic tasks. The conversation delves into Minimax's development strategies, including in-house model and application building for rapid feedback loops, and techniques like interleaved thinking and perturbation pipelines for enhanced performance and generalization. Olive highlights the challenges and importance of reinforcement learning, reward hacking, and human alignment, especially for open-weight models. The discussion also touches upon the practicalities of AI development, such as GPU constraints, the role of AI agents in managing information, and the future trajectory of Minimax's research towards more capable and collaborative AI systems. The episode also includes sponsor messages from Granola, Claude AI, and Tasklet AI.
Outlines

Introduction to Minimax and AI Development Strategies
The episode introduces Olive Song from Minimax, a company focused on reinforcement learning and AI model development. Minimax's strategy involves building models and applications in-house for tight feedback loops, utilizing techniques like interleaved thinking for complex tasks and perturbation pipelines for robust generalization. They aim to develop models for the community, emphasizing open-weight releases.

Minimax M2: Capabilities and Enhancements
Olive details the Minimax M2 model, an open-weight, 10-billion parameter model designed for coding and agentic tasks. Its capabilities are enhanced through scaled environments, expert reward models, and interleaved thinking for long-horizon tasks. Robust generalization is achieved via data pipeline perturbations, and its small size enables multi-agent scalability.

Reinforcement Learning Challenges and Sponsor Messages
The discussion addresses challenges in reinforcement learning, such as reward hacking, and the engineering effort required to overcome them. This section also includes sponsor messages from Claude AI, highlighting its capabilities in drafting content and coding, and Tasklet AI, an AI agent for automating business tasks.

Interview Insights: Work-Life, Model Behavior, and AI Concepts
Olive shares insights into work-life balance at Minimax and discusses unexpected model behaviors in reinforcement learning, emphasizing the importance of human alignment. Concepts like Minimax Her and role-playing in AI are explored, along with Olive's interpretation of "intelligence with everyone."

Interconnected R&D, Safety, and Technical Details
The interconnected nature of research and development at Minimax is highlighted, with a strong focus on human alignment and safety for coding models. The importance of technical details like FP32 precision in RL training is also discussed, alongside a first-principles approach to problem-solving.

Open Weights: Benefits, Responsibilities, and Evaluation
The benefits and responsibilities associated with open-weight models are discussed, including ethical use and internal safety benchmarks. Failure modes in reinforcement learning and the challenges of generalization are examined, alongside Minimax's approach to specialization versus generalization.

Solving Long Horizons and Compute Optimization
Strategies for solving long-horizon agentic tasks are detailed, focusing on goal definition, robust environments, and RL infrastructure. GPU constraints and compute optimization are addressed by a dedicated team focused on efficient AI development.

Minimax's Open Source Strategy and Collaboration
Minimax's strategy of publishing open-weight models is driven by a belief in community-driven development. They extensively use and collaborate with other open-source tools, acknowledging the engineering discipline required for open models.

Personal Evaluation and Model Adaptability
Olive describes her personal method for evaluating open models on OpenRouter using a specific stack of questions. The challenges of open models adjusting to different environments are discussed, with ongoing work to improve adaptability.

M2.2 Overview, AI News Management, and Researcher Roles
An overview of M2.2 highlights improvements in coding, multilingual capabilities, and stability. Minimax uses an internal AI agent to manage AI news, and Olive reflects on the evolving role of researchers, moving beyond paper reading to fundamental problem-solving.

Goal Setting, Continual Learning, and AGI
Goal setting at Minimax ranges from company-wide objectives to individual researcher missions. The concepts of interleaved learning and continual learning are clarified, with an outline of the approach to achieving continual learning. The discussion touches upon the fluid definition of AGI.

Influences and Conclusion
Olive shares that "The Art of Creativity" influenced her perspective on problem-solving. The episode concludes with a call to action for listeners and information about the podcast network.
Keywords
Reinforcement Learning (RL)
A machine learning paradigm where agents learn to make optimal decisions through trial and error in an environment to maximize cumulative rewards. Crucial for training AI in complex, dynamic scenarios.
Minimax M2 Model
An open-weight, 10-billion parameter AI model developed by Minimax, excelling in coding and agentic tasks, designed for workplace applications and community use.
Open-Weight Models
AI models with publicly released trained parameters, enabling widespread use, modification, and collaborative development within the AI community.
Agentic Tasks
Tasks requiring AI agents to interact with environments, make decisions, and perform actions to achieve goals, often involving reasoning, planning, and tool utilization.
Interleaved Thinking
An AI technique where a model pauses after an action and feedback to reflect, improving adaptation and performance on long-horizon tasks in noisy or dynamic environments.
Human Alignment
The process of ensuring AI systems operate according to human values, intentions, and ethical principles, critical for developing safe and beneficial AI technologies.
Reward Hacking
A phenomenon in RL where agents exploit loopholes in reward functions to maximize rewards without achieving the intended outcome, posing a challenge for effective training.
Perturbation Pipeline
A training method involving systematic variations in the AI's environment or data to enhance model robustness and generalization across diverse conditions.
Continual Learning
The ability of an AI system to learn continuously over time from a stream of data, adapting to new information without forgetting previously learned knowledge.
AI Development Strategy
Minimax's approach to building foundation models and applications in-house to create tight feedback loops, fostering rapid iteration and improvement.
Q&A
What is Minimax's strategy for developing foundation models and applications?
Minimax develops both foundation models and user-facing applications in-house. This creates tight feedback loops, allowing their cross-functional research and engineering teams to quickly identify and address model weaknesses.
How does "interleaved thinking" improve AI model performance?
Interleaved thinking allows a model to take an action, receive feedback, pause to think, and then continue. This iterative process improves performance on long-horizon agentic tasks by enabling adaptation to environment noise and dynamic conditions.
What is the significance of using FP32 precision in reinforcement learning training?
Using FP32 precision can be crucial for aligning the implementation of reinforcement learning algorithms with their theoretical counterparts. This attention to detail can overcome limitations that prevent models from reaching their full potential.
How does Minimax ensure the safety of its open-weight models before release?
Minimax employs internal benchmarks for safety, evaluating aspects like sensitive content and alignment. They conduct scaled evaluations and alignments about one to two weeks before launching to assess the model's safety.
What are the main challenges in developing AI agents for long-horizon tasks?
Key challenges include defining clear, hard, and diverse goals, creating robust and scaled environments, and having outstanding RL infrastructure for efficient training and rollout over extended periods.
How does Minimax stay updated with the rapid pace of AI advancements?
They utilize an internal AI agent to track, summarize, and analyze AI news, articles, and papers. This filtered information is then distributed to researchers, who can further refine the system, ensuring they stay informed.
What is the difference between interleaved thinking and continual learning?
Interleaved thinking is a specific technique for improving agentic tasks by pausing to reflect. Continual learning is a broader concept about an AI's ability to learn continuously over time, with interleaved thinking being a step towards it.
What are the benefits of open-weight models?
Open-weight models offer benefits such as free use and fine-tuning, fostering community collaboration and innovation in AI development.
How does Minimax approach the challenge of GPU constraints and compute optimization?
Minimax has a dedicated team focused on maximizing GPU utilization and stabilizing training processes to overcome compute challenges and ensure efficient AI development.
What is Olive Song's personal approach to evaluating new open-source models?
Olive tests new open models on OpenRouter using a personal evaluation stack comprising questions across logical reasoning, mathematics, report writing, and agentic tasks to assess their capabilities and behavior.
Show Notes
Olive Song from MiniMax shares how her team trains the M series frontier open-weight models using reinforcement learning, tight product feedback loops, and systematic environment perturbations. This crossover episode weaves together her AI Engineer Conference talk and an in-depth interview from the Inference podcast. Listeners will learn about interleaved thinking for long-horizon agentic tasks, fighting reward hacking, and why they moved RL training to FP32 precision. Olive also offers a candid look at debugging real-world LLM failures and how MiniMax uses AI agents to track the fast-moving AI landscape.
Nathan uses Granola to uncover blind spots in conversations and AI research. Try it at granola.ai/tcr with code TCR — and if you’re already using it, test his blind spot recipe here: https://bit.ly/granolablindspot
LINKS:
Conference Talk (AI Engineer, Dec 2025) – https://www.youtube.com/watch?v=lY1iFbDPRlw
Interview (Turing Post, Jan 2026) – https://www.youtube.com/watch?v=GkUMqWeHn40
Sponsors:
Claude:
Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr
Tasklet:
Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai
CHAPTERS:
(00:00 ) About the Episode
(04:15 ) Minimax M2 presentation (Part 1)
(17:59 ) Sponsors: Claude | Tasklet
(21:22 ) Minimax M2 presentation (Part 2)
(21:26 ) Research life and culture
(26:27 ) Alignment, safety and feedback
(32:01 ) Long-horizon coding agents
(35:57 ) Open models and evaluation
(43:29 ) M2.2 and researcher goals
(48:16 ) Continual learning and AGI
(52:58 ) Closing musical summary
(55:49 ) Outro
PRODUCED BY:
SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
Youtube: https://youtube.com/@CognitiveRevolutionPodcast
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


![E32: [Bonus Episode - The AI Breakdown] Can OpenAI's New GPT Training Model Solve Math and AI Alignment At the Same Time? E32: [Bonus Episode - The AI Breakdown] Can OpenAI's New GPT Training Model Solve Math and AI Alignment At the Same Time?](https://megaphone.imgix.net/podcasts/680351f6-0179-11ee-a281-5bef084f2628/image/e57b08.png?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress)




















