Interconnects
Author: Nathan Lambert
© Interconnects AI, LLC
Description
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories.
www.interconnects.ai
92 Episodes
Post: https://www.interconnects.ai/p/gemma-3-olmo-2-32b-and-the-growing

Ever since the release of the original ChatGPT, much has been said about making a truly open-source version of it — with data, code, weights, etc., all available. Open-source versions increase transparency, access, long-term progress, security research, and lots more. Lots of people have used this claim to bring hype to their projects, but the substance of these releases has been rather shallow (i.e., often focusing on one evaluation).

This milestone was so long in coming that I entirely forgot about it as a target. Through 2024, and especially before DeepSeek, the impression was that scaling AI capabilities was just too expensive for the smaller players willing to do truly open-source development.

Truly open releases take a lot of effort, since there is more to release and maintain; they open up potential legal risks that preclude certain types of training data, and they completely undermine competition. The few organizations doing fully open-source research are non-profits, like Ai2 or EleutherAI; academics, like LLM360; or companies that benefit from long-term ecosystem growth, like Hugging Face.

I was poking through the results for our latest model when I realized that we finally did it! We have a fully open-source GPT-4 class model, i.e., one comparable with OpenAI's original release rather than the current version.

Today, we're releasing OLMo 2 32B, the biggest model we've trained from scratch yet. Here are the post-training evaluations, where it surpasses GPT-3.5, GPT-4o-mini, Qwen 2.5 32B Instruct, and the recent Mistral Small 24B, and comes close to the Qwen and Llama 70B Instruct models. And this recipe is extremely training efficient. (The original post has a plot showing the FLOP comparisons to peer base models.)

Most of this release isn't entirely new. OLMo 2 is the result of lots of small wins on data, architecture, post-training with the Tülu 3 recipe, and so on — we just let the GPUs hum for a lot longer. You can learn more about OLMo 2 in my original release announcement or in this podcast with the leads.

The new part of this release is a major milestone: any company can pick up our training stack and cook up exactly the model they need at nearly the GPT-4 level. Beating the latest GPT-3.5 and GPT-4o mini models feels like fair game for the claim. This capability will take time to diffuse, but it is a major moment in the arc of why we do what we do. Even without more progress on OLMo, which we obviously will continue this year, this will keep fundamental AI progress outside of the major AI labs going for multiple years. It's an optimistic day for open source.

Here are your links to more information on OLMo 2 32B:
* Blog with technical details and demo
* Base model: OLMo-2-0325-32B
* Instruct model: OLMo-2-0325-32B-Instruct, plus the intermediate SFT (OLMo-2-0325-32B-SFT) and DPO (OLMo-2-0325-32B-DPO) checkpoints
* Pretraining dataset: OLMo-mix-1124
* Mid-training dataset: Dolmino-Mix-1124
* Post-training datasets: Tülu 3 SFT Mix (updated), preference data for OLMo 2 32B, and the RLVR mix

Gemma 3 as the next point on a steep trend line

Yesterday, March 12th, Google released the next batch of their flagship open-weight models, Gemma (report, models, flagship model). They highlight the following capabilities in their documentation:
* Image and text input: Multimodal capabilities let you input images and text to understand and analyze visual data.
* 128K-token context: 16x larger input context for analyzing more data and solving more complex problems.
* Wide language support: Work in your language or expand your AI application's language capabilities with support for over 140 languages.
* Developer-friendly model sizes: Choose a model size (1B, 4B, 12B, 27B) and precision level that works best for your task and compute resources.

Some technical details of note:
* In open models, 32B dense models are convenient because they can be finetuned on one node of 8 H100s (slowly). Google's sizing at 27B is likely downstream of TPU considerations that don't map directly, like how knowledge distillation works at pretraining.
* The Gemma models continue to be trained extensively with teacher-student knowledge distillation (KD). This KD is different from the colloquial definition of distillation in leading AI models. The common use of distillation is training a model on any output of a much stronger model, most often done in post-training to learn from generated completions of the stronger model. KD is a subset of the general idea of distillation, where the model being trained learns to match the full output distribution of the teacher model (a minimal sketch of this objective appears after this section). Labs other than DeepMind have mentioned this KD technique, but Google has pushed it far further. This was discussed further in last summer's post on synthetic data.

Otherwise, the paper has some interesting information but nothing super groundbreaking. This is par for the course for most technical reports these days.

On to the evaluations, and therein the impact, of Gemma 3.

The best way to think about this model is as a "general chat model" like GPT-4o and Claude 3.7 rather than a reasoning model like R1. The rise of reasoning models has made comparing models tricky because there are multiple evaluation suites that people care about — broadly characterized as a reasoning suite and an instruct suite. They overlap, but strong capabilities on both are rare.

Gemma 3 27B's performance on some tasks like MATH and Bird-SQL (coding) matches the Gemini 1.5 Pro model from just a few months ago! The progress on small, open-weight models is simply insane. Small models can perform excellently on narrow tasks like math and some coding, but they lack the depth and world knowledge, as seen in GPQA or SimpleQA.

Yes, DeepSeek distills are better at smaller sizes on MATH, but not enough people evaluate those distills across all capabilities like ChatBotArena. Having it all in one model is very convenient and is still how most workflows are handled.

Most people are also fairly skeptical of evaluation scores like MATH stated by Gemma, DeepSeek distills, and the like, claiming they don't translate to real-world usefulness. This is why the ChatBotArena results were the most striking part of the Gemma 3 release. Gemma 3 falls in the top 15 of every category. It beats DeepSeek V3 with its 600B+ total parameters. It is outperformed in niche categories like math or coding by its peer models in the overall ranking, indicating some amount of superficial alignment, but getting into the top 10 of ChatBotArena during this period of immense competition in AI is a huge accomplishment.

How reliable chat evaluations like ChatBotArena are is an ever-evolving open question. These days, with RL training methods that maximize MATH evaluations so in vogue, the value of such chat evaluations is higher again.
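Before digging further into what ChatBotArena measures, here is a minimal sketch of the teacher-student KD objective described in the technical notes above: the student is trained to match the teacher's token-level output distribution with a KL-divergence loss, rather than just imitating sampled completions. This is an illustration of the general technique, not Gemma's actual training code; the temperature value and the HuggingFace-style `.logits` access are assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """Token-level knowledge distillation: KL(teacher || student).

    Both logit tensors have shape [batch, seq_len, vocab_size].
    A temperature > 1 softens both distributions, a common KD trick.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t ** 2)

def training_step(student, teacher, input_ids):
    """Hypothetical step: pull the student toward the teacher's distribution
    over the same token sequence, instead of toward one-hot next tokens."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    return kd_loss(student_logits, teacher_logits, temperature=2.0)
```

In pretraining-scale KD, this loss replaces (or mixes with) the standard cross-entropy on one-hot next tokens, which is what lets the student learn from a much richer signal per token.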
Is ChatBotArena representative of some subset of real-world use, such that the specific capabilities small models are able to excel at — math, general chat, etc. — can translate directly to real value? If so, tasks like SimpleQA and GPQA would indicate performance on more niche tasks that not many people encounter. We have a lot to learn as a field here.

With my use of leading models, I find this hard to believe — switching to something like GPT-4.5 feels like an immediate improvement in chat tasks. My conclusion is that the answer is somewhere in the middle: small open models can do super well on everyday tasks, but we don't know exactly how to measure it. ChatBotArena and SimpleQA don't tell us exactly what to expect from the models.

The fact that there isn't a cliff in performance with models this small is going to drive substantial business value — it'll be easier to find exactly the minimum model size needed for your distribution of tasks.

In the specific case of Gemma and most of the leading open-weight models right now (with DeepSeek R1 as a wonderful exception), licenses that restrict downstream use are a brake on open-weight model adoption. Without getting too much into the details, they expose companies looking to use the models to some amount of legal risk and add terms and conditions to finetuned models.

Largely, until feedback loops and use cases for open models are established, teams releasing these models don't have strong cases, other than future potential growth, to counter the safe option that comes from legal teams' recommendations. I hope that the evolution in the adoption of open-weight models for valuable applications applies pressure to make licensing less of a headache.

The state of the open-closed model gap

Three of the top 15 models on ChatBotArena are open weights. In a world where frontier labs have many minor model versions crowding the leaderboard, this is an incredible platform for accelerating progress in open model adoption. Even the gap from Gemma 3 to Google's flagship Gemini models is very small! The entire AI market is the most dynamic and competitive it has been in some time — maybe ever.

When it comes to "truly open," i.e., actually open-source models, the gap to closed models has remained somewhat consistent — I estimate it as about 18 months behind closed labs. With open models generally on the upswing, open-source access to code, data, etc. is likely to come with it. The likes of Llama, DeepSeek, etc. are some of the most important pieces in the complete open ecosystem, and approaches like Ai2's would struggle without them.

Related to this, in the coverage of DeepSeek R1, I noted:

This points to the first time since Stable Diffusion's release that the most relevant and discussed AI model is released with a very friendly license. Looking back at the journey "open-source" AI has been on over the last 2.5 years, this is a surprising moment in time marked in the history books.

A month later, this is still the case.

To understand the progress of the ope
Eugene Vinitsky is a professor in New York University's Department of Civil and Urban Engineering. He's one of my original reinforcement learning friends from when we were both doing our Ph.D.s in RL at UC Berkeley circa 2020. Eugene has extensive experience in self-driving, open-endedness, multi-agent reinforcement learning, and self-play with RL. In this conversation we focus on a few key topics:
* His latest results on self-play for self-driving and what they say about the future of RL,
* Why self-play is confusing and how it relates to the recent takeoff of RL for language models, and
* The future of RL in LMs and elsewhere.

This is a conversation where we take the time to distill very cutting-edge research directions down to their core essences. I felt like we were learning in real time what recent developments mean for RL, how RL has different scaling laws from the rest of deep learning, and what is truly salient about self-play.

The main breakthrough we discuss is scaling up self-play techniques for large-scale, simulated reinforcement learning. Previously, scaling RL in simulation had become economical in single-agent domains. Now, the door is open to complex, multi-agent scenarios where more diversity is needed to find solutions (in this case, that's what self-play provides).

Eugene's Google Scholar | Research Lab | LinkedIn | Twitter | BlueSky | Blog (with some great career advice).

Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.

Show outline & links

We cover many papers in this podcast. Also, as an experiment, here's a Deep Research report on "all the papers that appeared in this podcast transcript."

In this episode, we cover:
* Self-play for self-driving, mostly around the recent paper Robust Autonomy Emerges from Self-Play (Cusumano-Towner et al. 2025). The simulator they built powering this is Gigaflow. More discussion on HackerNews. (Here's another self-play for self-driving paper, and another from Eugene from earlier this year.)

A few highlights:

"All simulated agents use the same neural net with the same weights, albeit with randomized rewards and conditioning vector to allow them to behave as different types of vehicles with different types of aggressiveness. This is like driving in a world where everyone is different copies of you, but some of your copies are in a rush while others are patient. This allows backprop to optimize for a sort of global utility across the entire population."

"The resulting policy simulates agents that are human-like, even though the system has never seen humans drive."

* Large Language Models are In-context Preference Learners — how language models can come up with reward functions that will be applied to RL training directly. Related work from Stanford.
* Related literature from Interconnects! The first includes literature we mention on learning locomotion for quadrupeds with deep RL (special shoutout as usual to Marco Hutter's group).
* Recent and relevant papers: Value-based RL Scales Predictably, Magnetic control of tokamak plasmas through deep reinforcement learning.
* Other things we mention:
* Cruise, Tesla, and Waymo's autonomy stacks (speculation) and how the self-driving industry has changed since we were working in it / considering working in it.
* Evo 2 foundation model for biology.
* Eugene is working with a new startup on some LLM and RL stuff. If you're interested in this episode, ping eugene@aitco.dev.
Not a paid promotion.

Chapters
* 00:00:00 Introduction & RL Fundamentals
* 00:11:27 Self-Play for Self-Driving Cars
* 00:31:57 RL Scaling in Robotics and Other Domains
* 00:44:23 Language Models and In-Context Preference Learning
* 00:55:31 Future of RL and Grad School Advice

Transcript

I attempted to generate it with ElevenLabs' new Scribe tool, but found the formatting annoying and reverted back to Alessio's smol-podcaster. If you're interested in working part-time as an editorial aide to Interconnects, please get in touch.

Nathan Lambert [00:01:27]: Hey, Eugene. Welcome to the show.

Eugene Vinitsky [00:01:29]: Hey, Nathan. Thanks for having me. Excited to be here.

Nathan Lambert [00:01:32]: Yeah, so I'll have said this in the intro as well, but we definitely go way back, all the way to Berkeley days and RL days, I think. I will embarrass you a little bit now on the live read: when I was switching into RL, you were one of the people who seemed to have figured out how to get into AI from a different background, and that's what I was trying to do in 2017 and 2018. So that was kind of fun, and now we're just friends, which is good.

Eugene Vinitsky [00:02:01]: Yeah, we were both figuring it out. If I had any lead over you there, I was also frantically trying to figure it out, because I was coming from a weird background.

Nathan Lambert [00:02:11]: There are definitely a lot of people that do that now and over-attribute small time deltas to big strategic plans, which was probably what it was. And we're just going to do some of our normal conversations on RL and self-play. I think the backstory of this is you told me that your recent paper from some of your time at Apple — I don't want to date it too specifically — was, paraphrasing, the most exciting RL thing you've ever been a part of. And major RL projects are not that frequent. I think if you segment out all the language model excitement in the past 10 years, there are really only a few major milestones, and it's good to talk about them. So we can start with basic things, like how you define reinforcement learning, and it will build up to this self-driving project.

Eugene Vinitsky [00:03:05]: Yeah, so I think RL is kind of a big thing, but at a really basic level, you have this process of taking actions in the world. You're seeing the state of the world. As you're taking actions in the world, you sometimes receive a reward that tells you the value of that action, and you're trying to optimize your cumulative behavior over time. So that, you know, over long trajectories, you're optimizing those costs. That's both the hard thing and the exciting thing: if you do it well, you can really optimize really long-horizon behaviors.

Nathan Lambert [00:03:41]: Yeah, I agree. And it's funny because the language models are finally doing this long chain of thought now, and I don't really think that's the same. I think the interactive notion will come up a lot here, where these long-context behaviors are many, many actions interacting with the world, relative to one really, really long action, which is kind of odd.

Eugene Vinitsky [00:04:04]: Yeah, I guess it mixes things, right? Because it has a very long state, right? It's got very long contexts, and it's generating its own context. But in the end, there's really one action at the end that kind of determines how everything went, you know?

Nathan Lambert [00:04:23]: Yeah, yeah, yeah, we'll get into this.
And then the next thing that we need to set up is: what do you define self-play as? I think this word has been particularly broken in recent times with language models, and I'm hoping we can get a fairly specific criterion for what is self-play and what are related topics.

Eugene Vinitsky [00:04:42]: Yeah, I think even within the field, there's quite a bit of debate as to what constitutes self-play. So talking to experts, people will disagree about which methods are and aren't self-play. But what I will say is I generally define self-play as anything where an agent plays a copy of itself. So up to a bunch of different agents interacting with each other — as long as they're mostly, in some ways, copies of each other, we're doing self-play.

Nathan Lambert [00:05:12]: Yeah, and then do you think anything — I mean, your background's in multi-agent as well — do you think there is something fundamental to a game that has a really specific hill to climb, where it has this competitive nature, versus something like language?

Eugene Vinitsky [00:05:29]: Yeah, this is kind of the dream of, I think, some multi-agent researchers: this type of ratchet effect where you have a bunch of agents interacting with each other, and increasing complexity on the part of any agent creates new challenges that need to be solved and then forces you to learn new skills. And then you get this endless ratchet. Maybe that's what you meant. I may have misinterpreted.

Nathan Lambert [00:05:55]: We're going to revisit it. I think also it's like, how does the multi-agent nature of a lot of these things change what people think about with RL? This is kind of the last building block before we go into the self-driving stuff.

Eugene Vinitsky [00:06:07]: Yeah, yeah, yeah. So the way that the multi-agent thing changes things is it makes everything much harder and more interesting. You go away from this world where you have a clear score function, right? In the single-agent setting, you have some reward; if that reward is high, you're doing well. When you move into the multi-agent setting, it becomes: reward with respect to whom? It all of a sudden matters whom I'm playing. So if we go to a setting like two-player zero-sum games — a game of two-player poker — I train a poker bot, right? How do I know it's any good? I have to play another poker bot to decide that it's any good, right? And so all of a sudden, this challenge of what is a good policy becomes very fundamental. And you kind of lose even the notion of there being one clear good policy. A lot of the field of multi-agents is coming up with different definitions of what constitutes goodness.

Nathan Lambert [00:07:06]: So, and then back to the self-play thing with that — is all of the self-play that
Full post: https://www.interconnects.ai/p/elicitation-theory-of-post-training

If you look at most of the models we've received from OpenAI, Anthropic, and Google in the last 18 months, you'll hear a lot of "most of the improvements were in the post-training phase." The most recent example was Anthropic's CEO Dario Amodei explaining Claude 3.7 on the Hard Fork Podcast:

We are not too far away from releasing a model that's a bigger base model. Most of the improvements in 3.6/3.7 are in the post-training phase. We're working on stronger base models (perhaps that will be the Claude 4 series, perhaps not; those are coming in a relatively small number of time units [months?]).

Here's a simple analogy for how so many gains can be made on mostly the same base model. The intuition I've been using to understand the potential of post-training is called the elicitation interpretation of post-training: all we are doing is extracting and amplifying valuable behaviors in the base model.

Consider Formula 1 (F1): most of the teams show up at the beginning of the year with a new chassis and engine. Then, they spend all year on aerodynamics and systems changes (of course, this is a minor oversimplification), and can dramatically improve the performance of the car. The best F1 teams improve way more during a season than from chassis to chassis.

The same is true for post-training. The best post-training teams extract a ton of performance in a very short time frame. The set of techniques is everything after the end of most of pretraining. It includes "mid-training" like annealing / high-quality end-of-pretraining web data, instruction tuning, RLVR, preference tuning, etc. A good example is our change from the first version of OLMoE Instruct to the second — we improved our post-training evaluation average from 35 to 48 without touching the majority of pretraining.

Then, when you look at models such as GPT-4.5, you can see this as a much more dynamic and exciting base for OpenAI to build on. We also know that bigger base models can absorb far more diverse changes than their smaller counterparts. This is to say that scaling also allows post-training to move faster. Of course, to do this, you need the infrastructure to train the models. This is why all the biggest companies are still building gigantic clusters.

This theory folds in with the reality that the majority of gains users are seeing come from post-training, because it implies that there is more latent potential in a model pretrained on the internet than we can extract by simple teaching — such as by repeatedly passing in certain narrow samples during early types of post-training (i.e., only instruction tuning).

Throwback to the superficial alignment hypothesis

Another name for this theory is the Superficial Alignment Hypothesis, coined in the paper LIMA: Less is More for Alignment. This paper gets some important intuitions right, but for the wrong reasons in the big picture. The authors state:

A model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples [Kirstain et al., 2021].

All of the successes of deep learning should have taught you a deeply held belief that scaling data is important to performance.
Here, the major difference is that the authors are discussing alignment and style, the focus of academic post-training at the time. With a few thousand samples for instruction finetuning, you can change a model substantially and improve a narrow set of evaluations, such as AlpacaEval, MT-Bench, ChatBotArena, and the like. These do not always translate to more challenging capabilities, which is why Meta wouldn't train its Llama Chat models on just this dataset. Academic results have lessons, but they need to be interpreted carefully if you are trying to understand the big picture of the technological arc.

What this paper shows is that you can change models substantially with a few samples. We knew this, and it is important to the short-term adaptation of new models, but their argument about performance leaves casual readers with the wrong lessons. If we change the data, the impact on the model's performance and behavior can be far higher, but it is far from "superficial." Base language models today (with no post-training) can be trained on some mathematics problems with reinforcement learning, learn to output full chain-of-thought reasoning, and then score higher on a full suite of reasoning evaluations like BigBenchHard, Zebra Logic, AIME, etc.

The superficial alignment hypothesis is wrong for the same reason that people who think RLHF and post-training are just for vibes are still wrong. This was a field-wide lesson we had to overcome in 2023 (one many AI observers are still rooted in). Post-training has far outgrown that, and we are coming to see that the style of models operates on top of behavior — such as the now popular long chain of thought.

The counterargument to elicitation

The prevailing counterargument to the elicitation theory of post-training has been that post-training is teaching specific skills to the model. This can be seen in the very large finetuning datasets used in the early eras of open models. One of the last prominent research examples of this is OpenMathInstruct 2, which showed performance improvements from finetuning on up to 14.3 million instructions.

A rough industrial norm is that you can do excellent post-training with only 1 million prompts at instruction tuning. The way to interpret OpenMathInstruct 2's scaling results under the elicitation theory is that the vast majority of the gains come from the beginning of training.

The trade-offs discussed in this counterargument, such as scaling post-training methods, were specifically discussed in the paper Revisiting the Superficial Alignment Hypothesis, where it is clear that you can teach new skills to models in post-training. The hardest part today is knowing when the skills are entirely new — it is often safer to assume the model has seen them at pretraining. For now, it is far more efficient to elicit behaviors from the model than to teach them; eventually, the tide may shift to favor teaching models, but not in the near future.

Comparing this math training dataset to current best (and emerging) practices for reasoning models makes the strongest case for the elicitation theory yet.

RL's role in elicitation

The reinforcement learning (RL) training we've seen take off in this early era of reasoning models is often described as "sample efficient" and as "the model learning new behaviors." Both of these fit with the theory presented.
The astute way to view the model learning new behaviors is not that it is learning entirely new abilities, but rather that it is learning to reinforce behaviors that were already latent in the pretraining dataset. Compared to teaching the model math with millions of supervised samples, just a few thousand prompts of RL training can far surpass the MATH performance shown above (a minimal sketch of this verifiable-reward setup appears at the end of this post).

In many ways, RL training exploding in popularity and effectiveness is the ultimate endorsement of the elicitation theory. Where we used to try to teach the model math with millions of supervised samples, now we just let the model try different approaches on thousands of math problems, and it reaches far higher peak performance.

This is, of course, also linked to why people say that "stronger base models are better starting points for RL." All of this fits together, as the base model is the platform on which post-training is built.

A reductionist take here is to say that pretraining is not important — in reality, pretraining is just slow and hidden from most of the gains we are seeing. Still, excellent post-training and the performance improvements we enjoy today are all well downstream of pretraining. Pretraining is still arguably the most important part of the training stack, as it allows those with confidence in elicitation during post-training to thrive.

Thanks to Mohit Raghavendra for some email exchanges that helped this post.
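To make the elicitation-via-RL point concrete, here is a minimal sketch of the verifiable-reward (RLVR-style) loop described above: sample several attempts per math prompt, score each with a binary correctness check, and reinforce the successful ones. This is an illustration of the idea, not any lab's production code; the answer-extraction regex and the `generate`/`update` policy methods are hypothetical stand-ins.

```python
import re

def extract_answer(completion: str) -> str | None:
    """Pull a final answer out of a chain of thought.

    Assumes the model is prompted to end with 'Answer: <value>'.
    """
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 only if the extracted answer matches the reference."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer == gold_answer else 0.0

def rlvr_step(policy, prompts, gold_answers, num_samples: int = 8):
    """One RLVR-style step: sample attempts, verify them, reinforce the good ones."""
    batch = []
    for prompt, gold in zip(prompts, gold_answers):
        for _ in range(num_samples):
            completion = policy.generate(prompt)  # hypothetical API
            batch.append((prompt, completion,
                          verifiable_reward(completion, gold)))
    # A real implementation would now run a policy-gradient update
    # (e.g., PPO or GRPO) weighted by these rewards.
    policy.update(batch)  # hypothetical API
```

The striking part, relative to the supervised route, is how little data this needs: the reward only says whether an attempt was right, and the model supplies the reasoning itself, which is exactly what the elicitation theory predicts.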
Link: https://www.interconnects.ai/p/where-inference-time-scaling-pushes

There's a lot of noise about the current costs of AI models served to free users, mostly saying it's unsustainable — a take that sits poorly with those holding the historical perspective that the costs of technology always plummet. GPT-4.5's odd release of a "giant" model without a clear niche only amplified these critics. With inference-time compute being a new default mode, can we still have free AI products? Are we just in the VC-subsidized era of AI?

For normal queries to ChatGPT, the realistic expectation is that the cost of serving an average query will drop to be extremely close to zero, and the revenue from a future ad model will make the service extremely profitable. The most cohesive framework for understanding large-scale internet businesses built on the back of such zero marginal costs is Ben Thompson's Aggregation Theory.

Aggregation Theory posits that extreme long-term value will accrue to the few providers that gate access to information and services built on zero-marginal-cost dynamics. These companies aggregate user demand. It has been the mode of modern dominant businesses, with the likes of Google and Meta producing extremely profitable products. Naturally, many want to study how this will apply to new AI businesses that are software-heavy, user-facing platforms, of which OpenAI is the most prominent due to the size of ChatGPT. Having more users and attention enables aggregators to better monetize interactions and invest in providing better experiences, a feedback loop that often compounds.

Aggregators are often compared to platforms. Where the former rely on being an intermediary between users and other marketplaces, platforms serve as foundations on which others build businesses and value, such as Apple with the iPhone, AWS, or Stripe.

Businesses like ChatGPT or Perplexity will rely on the discovery of a profitable advertising model that works nicely for the dialogue format. ChatGPT interweaving previous discussions into the chat, as it started doing in the last few months, is encouraging for this, as it could also have preferred products or sources that it tends to reference first. Regardless, this will be an entirely new type of ad, distinct from Meta's targeted feed ads, Google's search ads, or the long history of general brand ads. Some of these past ad variants could work, just sub-optimally, in the new form factor.

An even easier argument is to see the current hyperscalers using low-cost inference on AI models to complement their existing businesses, in ways that fit components of Aggregation Theory — such as Meta serving extremely engaging AI content and ads. The biggest platform play here is following the lens through which language models are a new compute fabric for technology: the AWS of AI models.

All of these business models (ads, inference, and what is in between) were clear very soon after the launch of ChatGPT. As the AI industry matures, some harder questions have arisen:
* Who bears the cost of training the leading frontier models that other companies can distill or leverage in their products?
* How many multiples of existing inference paradigms (0-100s of tokens) will inference-time scaling motivate? What will this do to businesses?
This post addresses the second question: how does inference-time compute change the business models of AI companies?

The announcement of OpenAI's o3, with its inference cost on ARC-AGI growing beyond $5 per task, and the proliferation of new reasoning models raised the first substantive challenge to whether Aggregation Theory will hold with AI. The piece that linked this to inference-time compute, and that sparked this conversation around aggregators, was Fabricated Knowledge's 2025 AI and Semiconductor Outlook, which stated:

The era of aggregation theory is behind us, and AI is again making technology expensive. This relation of increased cost from increased consumption is anti-internet era thinking.

This is only true if increased thinking is required on every query and if it doesn't come with a proportionate increase in value provided. The fundamental operations of AI businesses will very much follow the lens of Aggregation Theory (or, in the case of established businesses, it'll reinforce the advantages of existing large companies), and more work is going to be needed to figure out business models for inference-heavy products.

We can break AI use today into two categories:
* ChatGPT and general-use chatbots.
* Domain-specific models, enterprise products, model APIs, and everything else that fits into the pay-for-work model (e.g., agents).

The first category is established and not going away, while the second is very much in flux. Inference-time scaling will affect these in different ways.

Consumers — well, most of them (and not most of you reading this, who are power users) — will never know how to select the right model. The market for super users is far smaller than the market for general use. The core requirement for consumer products is having the model know how much compute to spend. This is where RL training will likely be most important, and it is something notably missing from the release of Claude 3.7 Sonnet.

OpenAI's model offerings and the initial excitement around inference-time compute made many, myself included, excited about the idea of a compute dial shown to users so they can control the "thinking effort" for their query. The problem is that the rules for how well that translates to performance rely on a deep understanding of AI, and language model performance is very stochastic. The so-called dial is being reduced to simple reasoning buttons or always-on reasoning — extremes and yes/no decisions are much easier for users to work with. This is already how I engage with models: I start with a normal model, and if it doesn't work, I punt to o1 pro. Would my trying to guess the right spot on a dial for a new query really be a good experience? Please, the model should know its own limits and handle that on its own.

Today, the RL-trained reasoning models primarily serve as a trust and legibility enhancement for average users, rather than a performance improvement. This is leading the exposure of chains of thought (CoTs) to become an industry norm. At the same time, this sort of minor increase in context length will still be subsumed into a zero-marginal-cost style business, pending the assumed discovery of a functional ad model. This is all also just the tip of the iceberg for inference-time compute.
From my coverage of Claude 3.7:

RL training is a short path to inference time scaling laws being used, but in the long-term we will have more methods for eliciting the inference-time tradeoffs we need for best performance.

For power users and enterprises, RL training and one-model-fits-all are less important. Enterprises will want to benefit from clear trade-offs of performance vs. log(compute).

Many in the industry, including in the aforementioned Claude 3.7 release and o3's ARC-AGI performance, are discussing the use of parallel test-time compute rather than just increasing generation lengths. Inference-time scaling with parallel computation and strong verifiers will be essential to the long-term trajectory of the sub-area.

Where RL-trained models can increase the compute spent by a model by factors of 2, 4, or 10 for a question, parallel computation already uses factors of 1000 (see o3), and will go far higher. This is a far more direct way to continue scaling the log-compute plots for inference-time scaling. It's also more efficient due to the quadratic cost of generating longer contexts — in fact, most of the models we are using cannot scale output length indefinitely, as we can with the number of samples.

Better verifiers will increase the slope of the inference-time scaling plots we are seeing, as discussed in our coverage of Claude 3.7. Models will be trained to make the probability of a true answer appearing increase over many generations, and to maximize the probability that the extraction method can select it, rather than maximizing the probability that a single generation is correct out of the box. This is a very different way to finish the training of models than has been considered in some time. Here's a recent example of a research paper studying this, Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models, and more will surely come soon.

Verification as the limiter for inference-time scaling performance is not a new idea. It was the starting point of my coverage on inference-time scaling, before the release of OpenAI's o1 (and mentioned in the recent post on Claude 3.7). Ultimately, the challenge is robustness, not whether the models can generate the answer:

I wanted to highlight a result from the last paper, Large Language Monkeys, as to why inference time compute is feasible. That paper focuses on repeated sampling from a variety of language models to understand the distribution of answers in a model. They show that with an Oracle answer extractor (more on this later), even models as surprising as Pythia-70M have the true answer inside.

Remember, the domain of math answers is infinite. This shows that the underlying distribution of the models contains the right answer; we need to figure out how to extract the right ones. We need strong verifiers to make answer selection easy. The Snell et al. paper calls this the "Proposer and Verifier" perspective.

The understanding that the models we are using will almost always be able to generate the right answer, combined with the fact that training verifiers to exploit this has only just started, should increase optimism that inference-time scaling can work. This type of performance will not be cheap, but unlocking new potential applications is still worth way more than the few dollars these queries can cost. Noam Shazeer of Google explained this on his Dwarkesh Podcast appearance wit
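As a side note on the quadratic-cost claim above, here is the rough arithmetic (a simplified model that ignores KV caching and other implementation details):

```latex
% Rough attention-cost model for a length-L generation:
% one stream stretched k times longer vs. k parallel streams of length L.
\[
C_{\text{sequential}}(k) \propto (kL)^2 = k^2 L^2,
\qquad
C_{\text{parallel}}(k) \propto k\,L^2
\]
```

Under this approximation, 10x more sequential thinking costs roughly 100x, while 10 parallel samples cost roughly 10x, which is why parallel scaling can be pushed so much further.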
More: https://www.interconnects.ai/p/gpt-45-not-a-frontier-model

As GPT-4.5 was being released, the first material the public got access to was OpenAI's system card for the model, which details some capability evaluations and mostly safety estimates. Before the live stream and official blog post, we knew things were going to be weird because of this line:

GPT-4.5 is not a frontier model.

The updated system card in the launch blog post does not have this line; here's the original system card if you need a reference. Regardless, someone at OpenAI felt the need to put it in. The peculiarity here summarizes a lot of the release. Some questions are still really not answered, like "Why did OpenAI release this?" That game theory is not in my purview.

The main contradiction to the claim that it isn't a frontier model is that this is the biggest model the general public has ever gotten to test. Scaling to this size of model did NOT make a clear jump in the capabilities we are measuring. To summarize the arc of history: the jump from GPT-3.5 to GPT-4 made the experience with the models go from okay to good. The jump from GPT-4o (where we are now) to GPT-4.5 made the models go from great to really great.

Feeling out the differences in the latest models is so hard that many who are deeply invested in and excited by AI's progress are just as likely to lie to themselves about the model being better as they are to perceive real, substantive improvements. In this vein, I almost feel like I need to issue a mea culpa. I expected this round of scaling's impacts to still be obvious before the brutal economic trade-offs of scaling kicked in.

While we got this model, Anthropic has also unintentionally confirmed that their next models will be trained on an approximation of "10X the compute," via a correction on Ethan Mollick's post on Claude 3.7.

Note: After publishing this piece, I was contacted by Anthropic who told me that Sonnet 3.7 would not be considered a 10^26 FLOP model and cost a few tens of millions of dollars to train, though future models will be much bigger.

GPT-4.5 is a point on the graph showing that scaling is still coming, but trying to make sense of it in a day-by-day transition is hard. In many ways, zooming out, GPT-4.5 will be referred to in the same breath as o1, o3, and R1, where it was clear that scaling pretraining alone was not going to give us the same level of breakthroughs. Now we really know what Ilya saw.

All of this marks GPT-4.5 as an important moment in time for AI, rounding out other stories we've been seeing. GPT-4.5 likely finished training a long time ago — highlighted by how its date cutoff is still in 2023 — and OpenAI has been using it internally to help train other models, but didn't see much of a need to release it publicly.

What GPT-4.5 is good for

In the following, I am going to make some estimates of the parameter counts of GPT-4.5 and GPT-4o. These are not based on any leaked information and should be taken with big error bars, but they are very useful for context.

GPT-4.5 is a very big model. I'd bet it is well bigger than Grok 3. We have seen this story before. For example, GPT-4 was roughly known to be a very big mixture-of-experts model with over 1T parameters total and ~200B active parameters. Since then, rumors have placed the active parameters of models like GPT-4o or Gemini Pro as low as 60B parameters.
This type of reduction, along with infrastructure improvements, accounts for massive improvements in speed and price. Estimates place GPT-4.5 at about an order of magnitude more compute than GPT-4. These are not based on any released numbers, but given a combination of a bigger dataset and more parameters (5X parameters x 2X dataset size = 10X compute), the model could be in the ballpark of 5-7T parameters total, which, if it had a similar sparsity factor to GPT-4, would be ~600B active parameters (a worked version of this arithmetic appears at the end of this post).

With all of these new parameters, actually seeing performance improvements is hard. This is where things got very odd. The two "capabilities" OpenAI highlighted in the release are:
* Reduced hallucinations.
* Improved emotional intelligence.

Both of these have value but are hard to vibe test.

For example, SimpleQA is a benchmark we at Ai2 are excited to add to our post-training evaluation suite to improve the world knowledge of our models. OpenAI made and released this evaluation publicly. GPT-4.5 makes huge improvements here. In another one of OpenAI's evaluations, PersonQA, which consists of questions about individuals, the model is also state of the art. And finally, there is GPQA, the Google-proof knowledge evaluation that reasoning models actually excel at.

At the time of release, many prominent AI figures online were touting how GPT-4.5 is much nicer to use and better at writing. These takes should be taken in the context of your own testing; it's not that simple. GPT-4.5 is also being measured as middle of the pack in most code and technical evaluations relative to Claude 3.7, R1, and the like.

For an example on the writing and style side, Karpathy ran some polls comparing GPT-4.5's writing to GPT-4o-latest, and most people preferred the smaller, older model. Given what we know about post-training and the prevalence of distilling from the most powerful model you have access to, it is likely that GPT-4o-latest is distilled from this new model, previously called Orion, and its drastically smaller size gives it a night-and-day difference in iteration speed, allowing for better post-training. More on the character of that GPT-4o-latest model was covered in our previous post on character training.

All of this is a big price to pay to help OpenAI reclaim their top spot on ChatBotArena — I expect GPT-4.5 to do this, but the results are not out yet.

I've been using GPT-4.5 in preparation for this post. It took a second to get used to the slower speed, but it's fine. I will keep using it for reliability, but it's not worth paying more for. o1 Pro and the other paid offerings from OpenAI offer far more value than GPT-4.5.

Making sense of GPT-4.5's ridiculous price

When the original GPT-4 first launched, it was extremely expensive. In fact, GPT-4 was comparable in price to GPT-4.5 at launch. Here's a help post on the OpenAI forums, conveniently found by OpenAI Deep Research with GPT-4.5, that captures the context. GPT-4 launched in March 2023.

We are excited to announce GPT-4 has a new pricing model, in which we have reduced the price of the prompt tokens.

For our models with 128k context lengths (e.g. gpt-4-turbo), the price is:
* $10.00 / 1 million prompt tokens (or $0.01 / 1K prompt tokens)
* $30.00 / 1 million sampled tokens (or $0.03 / 1K sampled tokens)
For our models with 8k context lengths (e.g. gpt-4 and gpt-4-0314), the price is:
* $30.00 / 1 million prompt tokens (or $0.03 / 1K prompt tokens)
* $60.00 / 1 million sampled tokens (or $0.06 / 1K sampled tokens)

For our models with 32k context lengths (e.g. gpt-4-32k and gpt-4-32k-0314), the price is:
* $60.00 / 1 million prompt tokens (or $0.06 / 1K prompt tokens)
* $120.00 / 1 million sampled tokens (or $0.12 / 1K sampled tokens)

GPT-4.5's pricing launched at:
* Input: $75.00 / 1M tokens
* Cached input: $37.50 / 1M tokens
* Output: $150.00 / 1M tokens

OpenAI included language in the release saying they may not keep this model in the API, likely forecasting low demand, and that they wanted to hear from users whether it enabled entirely new use cases. Many analysts think that Nvidia's next generation of GPUs, Blackwell, which come with far more memory per FLOP (enabling bigger models to be stored), is not priced into this. We can expect to see the same arc of pricing with 4.5 as we did from GPT-4 to GPT-4 Turbo to GPT-4o:
* GPT-4 Turbo launched in November 2023 at $10 / 1M input and $30 / 1M output.
* GPT-4o launched in May 2024 at $2.50 / 1M input and $10 / 1M output.

These are huge reductions, about 10X. These are products that OpenAI makes a healthy margin on, and there are no signs that that isn't the case here. The AI community has collectively grown so accustomed to incredible progress in making the technology more efficient that even a blip in the process, where bigger models are available, feels potentially bubble-popping.

The future of scaling

Scaling language models is not dead. Still, reflecting on why this release felt so weird is crucial to staying sane through the arc of AI's progress. We've entered the era where trade-offs among different types of scaling are real. If forced to summarize all of this curtly, it would be: GPT-4.5 is, oddly, ahead of its time.

This means that the progression of AI needs to take a different tack, but we already knew this from the rapid progress of reasoning models. The true impact of GPT-4.5 will come when it is integrated with multiple lines of rapid progress. One of the flagship results in the DeepSeek R1 paper and related RL follow-up work in the AI community is that RL training scales better on bigger models. There is a lot of work to do to know all the domains that'll be absorbed into this umbrella. Future models like o4 could be distilled from a reasoning model trained on GPT-4.5. In fact, this may already be the case. OpenAI's current models likely would not be so good without GPT-4.5 existing.

In as soon as a year, most of the models we are working with will be GPT-4.5 scale, and they will be fast. The "well-rounded" improvements they offer are going to help make many more applications more robust, but OpenAI and others among the AI labs have pushed scaling a bit further than the current serving infrastructure can support.

Frontier labs are not taking enough risk if they're not trying to push the limits of every direction of scaling they have. Though releasing the model wasn't strictly needed, we have to guess why OpenAI actually wanted to do this. It's likely that GPT-4.5 is being used in other internal systems for now and other external products soon, so releasing it is a natural step on the way to the next thing, rather than a detour.
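As promised above, here is the worked version of the scaling arithmetic, using the standard training-compute approximation C ≈ 6ND. The 5X and 2X ratios are the post's speculative estimates, not confirmed numbers:

```latex
% Standard training-compute approximation: C \approx 6ND,
% with N = active parameters and D = training tokens.
% Applying the post's speculative ratios for GPT-4.5 vs. GPT-4:
\[
\frac{C_{\text{GPT-4.5}}}{C_{\text{GPT-4}}}
  = \frac{6\,(5N)(2D)}{6\,N\,D}
  = 5 \times 2
  = 10
\]
```

The key point is that the parameter and data factors multiply rather than add, which is how 5X parameters and 2X data land on the order-of-magnitude compute estimate cited above.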
https://www.interconnects.ai/p/character-training

The vast majority of evaluations used to measure progress on post-training at frontier laboratories are internal evaluations, rather than the evaluations you hear about all the time like MATH or GPQA. These well-known intra-industry evaluations are certainly important for ballparking behavior, but for every public evaluation, these frontier laboratories likely have 10+ fine-grained internal evaluations.

The internal evaluations these model providers keep cover a range of topics. Surely, most of them are basic, repetitive user behaviors that they need to make sure a new model doesn't roll back too many of. Of these, the vast majority are likely skills, while "character" remains more of an art than a hill to climb with careful data engineering. Leading post-training laboratories surely know how to reinforce more robust behavior within a specific character, as seen by the march of progress on evaluations like ChatBotArena, but crafting a specific personality from scratch is an open question.

The primary goal of this post is to start the conversation outside of frontier AI labs around character training. Character training is the subset of post-training designed around crafting traits within the model in the manner of its response, rather than the content. Character training, while important to the user experience with language model chatbots, is effectively non-existent on the web. We don't know the trade-offs of what character training does, we don't know how exactly to study it, we don't know how much it can improve user preferences on ChatBotArena, and we should.

The appearance of the AIs people are using is deeply coupled with how intelligent users will find them to be. Style of communication is crucial to how information is parsed. This is likely a very high priority for industrial labs, but something that almost no academic literature exists on. Even though I want to do research on this, I'm honestly not sure how to do so yet, other than a 1-of-1 technical report on findings.

ChatGPT gets character depth

Out of nowhere on Saturday, February 15th, Sam Altman tweeted about a new GPT-4o model that will serve as the foundation of ChatGPT. This is the biggest subjective change I've ever felt within intermediate model versions, from any primary provider — something closer in vibes to the shift from GPT-3.5 to GPT-4. The model immediately and consistently showed new behavior patterns. I found these very positive (Karpathy agrees), but they'll take some getting used to.

Where ChatGPT used to sound robotic and shallow, it's now very clearly leaning into a chipper assistant demeanor. Yes, for basic tasks, this new default model in ChatGPT is very Claude 3.5-like — more testing is needed to know if this GPT-4o, with its peer models like o3-mini, can dethrone Claude 3.7 Sonnet as a daily programming driver.

The biggest changes in the new GPT-4o model are:
* It now loves to reference past interactions in the chat (way more obviously than any other provider has done) — it was trying to flex that it knows my dog's breed (miniature schnauzer) or my book topic (RLHF).
This is very in line with the new roadmap to GPT-4.5 and GPT-5 that Altman posted, where ChatGPT is designed around a fluid experience rather than standalone, siloed, powerful models.
* The model is very chipper, sprinkles in more emojis, and is almost funny.
* The multi-turn conversation is more dynamic, with follow-up questions and added texture to longer back-and-forths.

The reasons are at a high level very complementary to those I listed when I switched to Claude as my daily driver model.

The shocking part of this is that the impact of this sweeping change is almost entirely undocumented. Yes, OpenAI updated the Model Spec (my previous coverage here and here), but that doesn't really capture how this model is different — it just clarifies the direction OpenAI is optimizing for. There are a few overlapping interpretations of this lack of transparency:
* OpenAI cannot precisely measure the differences as a few specific behavior traits; they can only see that the model performs better in high-level testing like ChatBotArena or other A/B testing, but they cannot capture the changes in score deltas between a few evaluations they could release.
* AI is moving so fast that taking the time to document these models is not worth it.
* Detailing the changes would make the character too easy to reproduce and would be another path of "distillation" of OpenAI's models.

The community of model users is extremely far from having clear ways to measure these differences. While there are vibe tests on Twitter, they will not be conclusive. ChatBotArena won't even come close to measuring the levels of these differences (and in the case of referencing past chats, it cannot). Character training is the sort of addition to a post-training stack that takes industrial training techniques from being reproducible, but expensive, to dark arts that are largely undocumented.

The most interesting part of the model spec for industry analysts is the plot where OpenAI shares the agreement rate of their newer models. This compares a reasoning model, o1, to a GPT-4o model, so there are questions of whether the gains are attributable to reasoning training.

Every frontier AI laboratory should have a model spec

Model specs are the sort of community norm where a race to the top is the goal. They're muddled if mandated — how would you actually check that a required model spec is accurate? — but if they are implemented by every lab carefully, with feedback from the community, it would be far easier for the development ecosystem to exist around models.

The model spec is an extremely useful document detailing how developers can expect your models to change over time. Model specs are also one of the few sources of insight we have into what the model providers are trying to get their models to do (which has regulatory advantages), and they let us know what is an intentional or unintentional behavior mode.

A model spec doesn't provide all the information we need to keep up with model versions. This new version of ChatGPT desperately needs to be accompanied by evaluations capturing the behavior change; otherwise, a lot of undocumented differences will be passed on to developers updating endpoints to it. This is another rendition of the same lack of transparency we're used to from leading AI laboratories.

The closest thing Anthropic has to a model spec is the mix of Claude's Constitution and this blog post on Claude's Character. Character training is a fairly new technique for the industry.
From Anthropic's post:

Claude 3 was the first model where we added "character training" to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.

The process is extremely synthetic-data-heavy but requires an artist's touch; as stated later in the blog post, it "[relies] on human researchers closely checking how each trait changes the model's behavior."

Character training being the focus of developments is the strongest endorsement that RLHF and related approaches have shifted from their philosophical motivations of alignment to being primarily an empirical tool. The models can capture so many different behaviors, but getting them to reliably behave how we want is the hardest part. Right now, it seems more likely that this is about capturing the upside of RLHF as a performance tool, rather than a safety one.

One of the few public discussions of character training came from Amanda Askell during her appearance on the Lex Fridman Podcast (taken from the transcript):

Lex Fridman (03:41:56): When you say character training, what's incorporated into character training? Is that RLHF or what are we talking about?

Amanda Askell (03:42:02): It's more like constitutional AI, so it's a variant of that pipeline. I worked through constructing character traits that the model should have. They can be shorter traits or they can be richer descriptions. And then you get the model to generate queries that humans might give it that are relevant to that trait. Then it generates the responses and then it ranks the responses based on the character traits. In that way, after the generation of the queries, it's very much similar to constitutional AI, it has some differences. I quite like it, because it's like Claude's training in its own character, because it doesn't have any… It's like constitutional AI, but it's without any human data.

In summary, Anthropic uses the same techniques they use for Constitutional AI and general post-training for capabilities to train these models' characters (a minimal sketch of that loop follows below). This is not surprising. This could be related to Askell's other tweet on how she designs system prompts, as system prompts are the easiest way to quickly change a model's character:

The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.

This is very in line with what we started this post on — internal AI lab evaluations.

How far can you push character training?

Ahead of the Grok 3 release, Elon Musk tweeted an example from Grok 3, saying it was "based." One of the predominant reactions to Grok 3 was, "Wait, so it isn't actually based?" This is one of the big questions of character training and lacking model specs. Did xAI not figure out how to make their model "based" and reliable? What model was Elon using here?

Whatever your politics, it's likely that the default personality of the models you encounter will eventually not be something you like. There's quite
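Returning to the pipeline Askell described above, here is a minimal, hypothetical sketch of that character-training loop: generate queries relevant to a trait, sample candidate responses, have the model itself rank them against the trait (no human data), and train on the resulting preferences. Every method name here is an assumption for illustration; this is the shape of the pipeline as described, not Anthropic's code.

```python
def character_training_round(model, trait: str, n_queries: int = 100):
    """One round of the constitutional-AI-style character pipeline:
    the model generates, ranks, and learns from its own data,
    with no human annotation in the loop.
    """
    preference_pairs = []
    for _ in range(n_queries):
        # 1. The model invents a user query relevant to the trait.
        query = model.generate(  # hypothetical API
            f"Write a question a user might ask that is "
            f"relevant to the trait: {trait}")
        # 2. Sample several candidate responses to that query.
        responses = [model.generate(query) for _ in range(4)]
        # 3. The model ranks responses by how well they express the trait.
        ranked = sorted(
            responses,
            key=lambda r: model.score(  # hypothetical trait-judging call
                f"Trait: {trait}\nQuery: {query}\nResponse: {r}\n"
                "How well does this response express the trait?"),
            reverse=True,
        )
        # Keep the best and worst as a (chosen, rejected) pair.
        preference_pairs.append((query, ranked[0], ranked[-1]))
    # 4. Train on the synthetic preferences (e.g., a DPO or RLHF step).
    model.train_on_preferences(preference_pairs)  # hypothetical API
```

The interesting design property, per Askell's description, is that after the initial trait definitions the pipeline is entirely self-supervised, which is what makes it "Claude training in its own character."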
On Monday, February 24th, 2025, Anthropic announced their latest model, Claude 3.7 Sonnet, which is their first model explicitly trained to use more inference-time tokens to improve performance. This is another reinforcement learning (RL) trained model (mentioned in the system card). With this model, they also released Claude Code as a limited research preview, which is a “command line tool for agentic coding.” Continuous improvements in models are enabling new modalities and domains addressable by the models, but assessing the impact of each new domain takes far more time than a quick model reaction post.
This is a tidy release, a solid improvement, but not a step change for Claude or the industry. Expect a lot of small changes to accumulate massively this year.
Claude 3.7 Sonnet is a clear improvement over Claude 3.5 Sonnet (New) and continues to push the limits in areas where users love Claude (e.g. read Ethan Mollick’s review here). The scores in those areas, such as software development (SWE-Bench) and tool use, are clearly state-of-the-art. For example, Claude 3.7 Sonnet is the highest-scoring “standard non-reasoning” language model on the Aider Polyglot benchmark. While models like o3 and Grok 3 DeepThink highlight superhuman performance on code benchmarks, seeing this sort of behavior integrated without extra inference-time compute is wonderful. The price for superhuman coding AI is plummeting.
Even with o1 Pro, I still find myself using Claude 3.5 Sonnet (New) on a regular basis. o1 Pro is the best model for doing succinct, one-off tasks like writing short scripts. It is extremely controllable and will often work out of the box. Though, when I’m doing tricky, iterative tasks, I still use Claude. Claude 3.7 Sonnet only makes these workflows stronger, and I’m stoked to play with it further.
The most useful piece of this release for those trying to understand the direction of the ecosystem, rather than just the status of today, is Anthropic’s post on Claude’s extended thinking, where they detail the product trade-offs, alignment, and future of inference-time compute in their models.
Anthropic’s offering of extended thinking to boost inference-time performance is far, far cleaner than OpenAI’s current model drop-down disaster. Anthropic’s thinking model is the same as their general-purpose model, much like xAI’s Grok 3, and what OpenAI teased will be the plan for GPT-5. Having just one model makes lots of infrastructure, product, and training decisions cleaner, but may come at the cost of shrinking the absolute Pareto front of performance for your organization. With the reasoning training embedded in one model with a standard inference mode, the reasoning benefits and behavior will feel closer to something like Gemini-Thinking, rather than OpenAI’s o1 or DeepSeek R1, which are designed solely for this reasoning mode of operation. It doesn’t mean that, in the limit, a single model will be weaker in performance, but rather that such models may currently be slower to iterate on in training than a “full” reasoning language model.
Focusing on deploying just one model that serves all users is one of many examples where leading AI companies need to make their offerings legible to users and easy to use — a sign of the industry maturing from a race to intelligence to a race to usefulness.
Still, Claude’s interface is not perfect by any means: the user still has to intentionally go to a drop-down menu to get extra performance when they need it.
The ideal mode is for the model to know on its own when more inference compute is needed. My hypothesis is that when training one model both with reasoning and without, having the model figure out how much compute to use is harder than a reasoning-only model like o1 figuring out its own compute budget. Or, Anthropic needed to keep a special flag that is turned on and off in the system prompt. This is a subtle potential trade-off of putting reasoning in just one model, but we’ll see where the final equilibrium is.
On the other hand, Claude 3.7 Sonnet shows the reasoning traces directly to users, like DeepSeek R1 and Grok 3. These organizations have different ways of saying why, but it is clear that users just enjoy seeing the traces and that doing so builds trust. Anthropic, understandably, is using the reasoning traces to monitor the alignment of the models. The reasoning chains in these models are how the general public is learning more about the internal representations of language models. Another interesting detail is that Anthropic “didn’t perform our standard character training on the model’s thought process.” This is how Claude thinks out of the box, and the actual answers have a different flavor to them. More research will study how far the reasoning chains can diverge from the answer language. We’ve seen research on latent reasoning within the model, but beyond this, we could have reasoning languages that are entirely ungrounded from human languages because they are a more token-efficient representation of information for the model. More on this soon.
The developer-facing version of Claude’s extended thinking is far cleaner and a sign of things to come — developers can request a specific number of thinking tokens for their response. How this works is that the model will stream thinking tokens until the number is reached, then shift to answer tokens. This is still one autoregressive stream, and no search is being used in the new products, yet.
This explicit control over the thinking and answering phases is a growing behavioral focus in training reasoning models — expect more here soon. Developers can tune the setting that works for them and keep it baked in, rather than relying on the user to pass in a query that just happens to get the model to think for a long time. Explicit test-time inference budget increases are much more covetable than needing to hit the gold mine in a prompt search.
The best place to see where this could be applied is by selecting performance on a task that scales nicely with inference-time compute. Anthropic ran a similar experiment on the challenging math evaluation AIME — the same one that OpenAI used in their original inference-time compute plot. There is a subtle difference from the developer experience here: in Anthropic’s internal tests, the model could exit early. In practice, this subtle difference shouldn’t change the usefulness of the deployment methodology.
Anthropic continues in their excellent post, saying:
Our researchers have also been experimenting with improving the model’s performance using parallel test-time compute. They do this by sampling multiple independent thought processes and selecting the best one without knowing the true answer ahead of time. One way to do this is with majority or consensus voting; selecting the answer that appears most commonly as the 'best' one. Another is using another language model (like a second copy of Claude) asked to check its work or a learned scoring function and pick what it thinks is best.
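To make the consensus-voting idea from that excerpt concrete, here is a minimal sketch of majority voting over parallel samples. This is a generic illustration, not Anthropic's implementation; the `sample_model` call in the usage comment is a hypothetical stand-in for one independent reasoning rollout.

```python
from collections import Counter

def majority_vote(candidate_answers: list[str]) -> str:
    """Return the answer that appears most often among N parallel samples."""
    # Normalize lightly so trivially different strings count as the same answer.
    normalized = [a.strip().lower() for a in candidate_answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

# Hypothetical usage, where sample_model(prompt) is one independent
# reasoning rollout of the same prompt (not a real API):
#   answers = [sample_model(prompt) for _ in range(16)]
#   final = majority_vote(answers)
```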
Strategies like this (along with similar work) have been reported in the evaluation results of several other AI models. To accompany this, they shared the following results. It is crucial to note here that the dashed red line — pass@N — is not an actual evaluation result, but a measure of whether the correct solution appears among the number of answers generated on the X-axis. The two lines below it show how good initial answer extraction methods are at selecting the right answer from the N candidates. As has been known for a long time in inference-time scaling research, models can often generate the correct answer to extremely hard questions, just not reliably.
They make it very clear this is not used yet in their products:
Parallel test-time compute scaling isn’t available in our newly-deployed model, but we're continuing to research these methods for the future.
Still, this is a direction other labs are already pursuing. The best reporting on o1 Pro indicates that it does a “search” of some sort over parallel generations. Other OpenAI employees have stated that o3 uses a learned verifier to extract answers, at least for coding domains. As progress in scaling single streams from the language model slows, this is the natural next place for scaling to turn. As it has been for some time, performance limits are largely different forms of infrastructure problems before models can be served to users.
Claude 3.7 Sonnet is here, and it reinforces that RL training is a short path to putting inference-time scaling laws to use, but in the long term we will have more methods for eliciting the inference-time trade-offs we need for best performance.
Thanks to Ross Taylor for some immediate feedback on this post.
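As a quick appendix to the pass@N discussion above: the standard unbiased estimator for pass@k comes from OpenAI's Codex paper (Chen et al., 2021) and is the usual way such dashed lines are computed, though Anthropic may compute theirs differently. It is a short function:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn from n total is correct, given that
    c of the n samples were correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```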
Full post: https://www.interconnects.ai/p/grok-3-and-an-accelerating-ai-roadmap
xAI launched their latest flagship model, Grok 3, last night via a live stream on X, which is a new take on the launch process, but it largely felt familiar. Grok 3 is a state-of-the-art model on some important benchmarks. The core caveat is that it is state-of-the-art relative to available models, and we know better models are out there. Only some of them have been announced, some of them have been teased, and others lie in waiting.
What feels different is how the broader AI industry is signaling rapid progress coming soon. xAI said on the livestream that they will be updating the model “daily.” An era of sitting on unreleased models could be ending.
Grok 3’s release is a reinforcement of trends people began reckoning with as of the release of DeepSeek V3 + R1 — AI progress is not held in the hands of a few companies, nor is it slowing down. 2023 and 2024 were defined by truly state-of-the-art AI being concentrated within OpenAI, Anthropic, and Google, where these companies could take a lot of time to package models from training to release and still have a substantial moat on capabilities relative to their peers.
At the time of R1’s launch, the “people’s choice” model was Claude 3.5 Sonnet, a model that had been trained “9-12 months ago,” and the best models like Claude 3.5 Opus or GPT-4.5 (a.k.a. Orion) were not available to users for a grab bag of reasons.
Competitive pressure from DeepSeek and Grok, combined with a shifting political environment for AI — both domestic and international — will make the established leading labs ship sooner. A large portion of the delays in delivering models is attributed to “safety testing,” but we don’t have exact details on how much of it was that and how much was cost-benefit tradeoffs (and other big-company hurdles such as legal departments). The brand, and culture, of “having the smartest model” is extremely important to these companies, but having a way smarter model was often financially too much to bear.
“Safety” is actively being removed from the spotlight of the AI discourse. It is possible that this overcorrection causes meaningful harm, as this is an extremely powerful and rapidly evolving technology, but the political capital to make safety a core tenet of the AI industry was spent too early relative to meaningful harm emerging.
Increased competition and decreased regulation make it likely that we, the users, will be given far more powerful AI on far faster timelines.
We’ve seen time and time again the value of having the best model first. The only way to onboard new users is to have some capability or behavior that your model differentiates on. With the pace of progress high, minimizing the time from training to release is the best way to maximize one’s chance of impact.
DeepSeek and xAI show how organizations with slightly trailing technical progress or resources can outshine the likes of OpenAI and Anthropic, who have voluntarily not shipped their latest models.
Grok 3 by the numbers
Benchmarks and vibe tests mark Grok 3 as one of the best models available today. As with any release, companies often choose evaluations that flatter their models. Yes, winning on these evaluations is extremely challenging, and much credit must be given to xAI for delivering a leading-edge model just about 19 months after its incorporation.
That being said, what is shown below is a total of 4 language model evaluations.
Given that models like DeepSeek R1 or Gemini Thinking launch with 10-20 evaluations detailing their performance relative to peers, this has to be taken with a grain of salt. It is very likely that Grok 3 doesn’t outperform its peers in every category, though there is a slim chance these other comparison evals simply weren’t run in the push for expedience.
To start, we can compare Grok 3 benchmarks versus available instruct models.
And versus available reasoning models (note how OpenAI’s announced o3 scores clearly exceed these).
An important detail, as we’ve seen with OpenAI’s reasoning model releases: what do the shaded regions on the above plots show? Without exact details, we don’t know the inference cost for each of the models on these reasoning plots. Pushing the frontier in absolute terms is important, but the field overall is getting messier before it’ll get clearer.
Regardless, in the above two plots Grok 3 is pushing progress both on standard model training and the new reasoning training. While reasoning training and RL are the hot new things in the AI field, simple scaling and optimization of existing techniques still deliver value.
And Grok’s score on ChatBotArena.
A model launching at the top of every category on ChatBotArena feels like something that should be rare (given it now encompasses many categories like Math, Coding, Style Control, Longer Queries, etc.), but it happened just a few weeks ago with Gemini 2.0 Pro!
ChatBotArena is known to favor models that are unlikely to refuse requests (we don’t know by how much), as evidenced by Claude 3.5 Sonnet (New)’s relatively low position on the leaderboard relative to its utility, but overall it is a hard evaluation to top. xAI’s stated goal of a “based” model should correlate well here.
A question we don't know the answer to: how many points of performance on evals do you gain by not caring about safety at all? Internal to the model, i.e. in behavior latent spaces, safety is pretty orthogonal to common high-utility behaviors like reasoning and code, and bigger models tend to do more things without a cost to other behaviors, but there has to be some safety performance margin. Did Grok 3 succeed because of this? It’s too early to tell.
At a technical level, Grok 3 is certainly a very big model. We don’t have specifics, but it’s reasonably safe to take it as a datapoint that scaling still helps performance (though maybe not costs). xAI’s approach and messaging has been to get the biggest cluster online as soon as possible. The Occam’s razor explanation, until we have more details, is that scaling helped, but it is possible that most of Grok’s performance comes from techniques other than naive scaling.
Grok 3 using sheer size to beat existing models feels like when Nemotron 340B beat Llama 3 70B, making it the leading open-weight model at the time, but uptake was slow because the cost relative to the performance gains wasn’t worth adopting. We’ll know more about this when Grok 3 is available in their API and we see the exact costs.
When models are approximately equal in performance, price and ease of use are the determining factors of adoption.
Overall, Grok 3 is a huge technical achievement but not one that indicates a substantial change in who is at the frontier of effective training. xAI is obviously closing in on OpenAI, Anthropic, and most of all Google, but all available data points put these labs ahead of xAI on the frontier of efficient model training.
It is good that they are being pressured to deliver more absolute intelligence and not just continue optimizing their frontier of performance per dollar.
Read some other reviews of Grok 3 here and here. Karpathy’s summary is particularly in line with my thinking (while potentially slightly overselling capabilities):
As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.
Where progress is heading
If these AI models, and the industry writ large, are accelerating, it is important to wonder where they are accelerating toward. Most of the evals we use now to launch leading models are not that representative; in many cases they’re actually 100% out of distribution relative to normal life. What is the value in solving a competition math problem like AIME or so-called “Google-proof” questions? Time will tell, but the case for usefulness to average users is definitely stretched.
Small ChatBotArena improvements are marginal gains in robustness, where something like a 20-point difference in Elo rankings — the relative difference between Grok 3 and the next top model — translates to the model winning roughly 53% of head-to-head match-ups (per the Elo win-probability formula, 1 / (1 + 10^(-20/400)) is about 0.53). This robustness adds up over time, but it is far from meaning that this model is more intelligent in an absolute sense.
In fact, in the case of some of the latest evaluations from the research community, it seems like evaluations are being designed more around being hard than being useful. It is a natural response to models being super powerful to try and find something to challenge them with, but it makes tracking progress and communication far harder.
Companies have many internal evaluations that are not shared. Increasing transparency on these would help contextualize what is and is not meaningful progress. Without them, the only benchmark we have for model changes is how deeply the models become integrated into products. Product-model synergy can enable extremely useful, new workflows, but it makes tracking the progress of AI a proxy measurement.
I do personally believe these somewhat arbitrary capabilities we are marching toward will generalize to extended and amplified value, but it takes some "feeling the AGI" to see that these models that are bett
The era we are living through in language modeling research is one characterized by complete faith that reasoning and new reinforcement learning (RL) training methods will work. This is well-founded. A day cannot go by without a new reasoning model, RL training result, or dataset distilled from DeepSeek R1.
The difference, compared to the last time RL was at the forefront of the AI world — when reinforcement learning from human feedback (RLHF) was needed to create ChatGPT — is that we have way better infrastructure than our first time through this. People are already successfully using TRL, OpenRLHF, veRL, and of course, Open Instruct (our tools for Tülu 3/OLMo) to train models like this.
When models such as Alpaca, Vicuña, Dolly, etc. were coming out, they were all built on basic instruction tuning. Even though RLHF was the motivation of those experiments, tooling limitations and a lack of datasets made complete and substantive replications rare. On top of that, every organization was trying to recalibrate its AI strategy for the second time in 6 months. The reaction and excitement around Stable Diffusion was all but overwritten by ChatGPT. This time is different. With reasoning models, everyone has already raised money for their AI companies, open-source tooling for RLHF exists and is stable, and everyone is already feeling the AGI.
Aside: For a history of what happened in the Alpaca era of open instruct models, watch my recap lecture here — it’s one of my favorite talks in the last few years.
The goal of this talk is to try and make sense of the story that is unfolding today:
* Given it is becoming obvious that RL with verifiable rewards works on old models — why did the AI community sleep on the potential of these reasoning models?
* How do we contextualize the development of RLHF techniques against the new types of RL training?
* What is the future of post-training? How far can we scale RL?
* How does today’s RL compare to historical successes of Deep RL?
And other topics. This is a longer-form recording of a talk I gave this week at a local Seattle research meetup (slides are here). I’ll get back to covering the technical details soon!
Some of the key points I arrived at:
* RLHF was necessary, but not sufficient, for ChatGPT. RL training like that used for reasoning could become the primary driving force of future LM developments. There’s a path for “post-training” to just be called “training” in the future.
* While this will feel like the Alpaca moment from 2 years ago, it will produce much deeper results and impact.
* Self-play, inference-time compute, and other popular terms related to this movement are more “side quests” than core to the RL developments. They’re either inspirations for, or side effects of, good RL.
* There is just so much low-hanging fruit for improving models with RL. It’s wonderfully exciting.
For the rest, you’ll have to watch the talk. Soon, I’ll cover more of the low-level technical developments we are seeing in this space.
00:00 The ingredients of an RL paradigm shift
16:04 RL with verifiable rewards
27:38 What DeepSeek R1 taught us
29:30 RL as the focus of language modeling
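To ground the talk's central idea of "RL with verifiable rewards": the reward is a programmatic check of the final answer rather than a learned preference model. A minimal sketch follows; the "Answer:" extraction convention is illustrative, not taken from any particular codebase.

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's final answer matches the reference,
    else 0.0. No learned reward model is involved."""
    # Assumes the model was prompted to end its response with "Answer: <value>"
    # (an illustrative convention, not a standard).
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```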
Article: https://www.interconnects.ai/p/deep-research-information-vs-insight-in-science
(sorry about some more audible breaths in this -- I'm going to work on it!)
We at Ai2 released a local LM iPhone app for our OLMoE model (1B active, 7B total params), with greatly improved scores! Let us know what you think, or read more here.
OpenAI’s Deep Research has largely been accepted as a super valuable tool for knowledge workers and analysts across the economy, but its real engine of economic progress is going to be changing the nature of scientific progress. Science is the fuel of technological revolutions.
Deep Research in its current form feels like a beta version of a next-generation piece of technology. It does what it is tasked with — searches the web and processes many resources to create a useful report with referenced sources. Some of my uses include researching model evaluations, recent robotic learning research, and AI-for-science breakthroughs.
Deep Research’s limitations mostly feel like problems of search (it is prone to returning SEO-optimized slop), style (it returns verbose, low-information-density writing), and modality (it does not have the ability to read, process, and return plots and diagrams). All of these are surely solvable and expected features if we look at the rollouts of other AI models in the last few years.
This isn’t a product review (you can read Stratechery or Turing Post for more of that) — the answer there is quite simple: if you work in a knowledge-intensive vocation, you should be using this — but rather an attempt to ask: so what comes next?
The place to start from within AI circles is to revisit the question of “When will AI make novel discoveries?” A good example of this is in the Dwarkesh Podcast episode with Dario Amodei:
One question I had for you while we were talking about the intelligence stuff was, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?
An example experiment we could do to test this is to train models on time-gated information and see if they can repeat a scientific discovery we already made (yes, this would be difficult to run, but not impossible). Ross Taylor described this on his Interconnects interview:
So an experiment I've never done because I didn't have [the] compute would be this. Imagine if you could train a language model on all documents up to 1905, which is the year when Einstein had his miraculous year of four seminal papers. With that model, which is trained up to 1905, could you prompt the model to come up with a good explanation of the photoelectric effect, special relativity, this kind of stuff? And what would it take to rediscover these things?
The dream is for AI to make breakthroughs, and the absence of evidence for this, even after the release of Deep Research, is driving a reckoning over what language models will ever be able to do. The fork in the road is either believing that scaling (either in parameters or in new training methods) will unlock “insights” or accepting that the current generation of models are very useful tools and nothing more supernatural. Likely the most powerful tool humanity has made yet. Our first power tool for information.
Much of science is not about making novel insights but about making progress within established problems of the field. In AI, these are the countless benchmarks we are saturating.
A very valuable contribution in AI as a field can be re-using known resources in a simpler way.
With AI, we are going to learn the boundary between true insight and scientific progress. A related form of scientific progress is the compression of noisy ideas and experiments into a cohesive trend, something that Deep Research can likely do, but not something that builds the allure of Einstein and the other scientific greats.
To understand this relationship between Deep Research, AI broadly, and the nature of science, we must address:
* how to interpret existing “AI for science” projects like AlphaFold in the bigger context of science,
* how reasoning models, AI research systems like Deep Research, and other forthcoming AIs revolutionize existing scientific practices,
* how recent developments in AI challenge Kuhn’s formulation of scientific revolutions, and
* how current institutions will need to change forever in the face of AI.
This (hopefully) series of posts is my attempt to create a worldview around what science means in the face of AI. Today, we focus on the first two — major AI-for-science projects and how normal science is being accelerated by AI — and hopefully raise urgency within the community to consider the final question.
The starting point — grand AI for science projects
There is a substantial overhang in computational infrastructure and fundamental deep learning capabilities relative to their impact on the broad class of sciences. In order to make a substantial leap in the application of AI to a specific domain, a team must mold the existing underlying capability of AI to the needs of trained groups of scientists.
The list of examples people think of in this mold ranges across domains: AlphaFold for protein folding, AlphaGeometry for mathematics, GraphCast and GenCast for weather, and more that lack such prominent branding. They leverage advancements in deep learning and transformer architectures, but tend to have X-factors specific to the domain of interest (see a Deep Research query summarizing this). Such added features are pulling forward AI capabilities to suit a narrow domain.
There’s a substantial craft to selecting suitable problems for this grand AI-for-science approach. It requires a field with central elements that are quantitatively focused. Even then, outcomes are more uncertain than standard AI research or standard research in the domain of choice.
The essay A new golden age of discovery from AI Policy Perspectives details how DeepMind sees the opportunity here and showcases some internal ingredients they found that make these projects more likely to be successful.
The fact that any of these projects have succeeded shows the overall potential of AI for science. The overall necessity of the approach depends on whether the grand AI-for-science breakthroughs are pulling forward progress by months or years, or whether these models are the single required breakthrough to approach entirely new areas of study.
As the broader scientific community embraces AI as “something that works,” more of these step changes will happen. They take a very large density of compute and talent on a single problem.
These projects fit more naturally into a classical view of science. They take substantial resources and are high risk. Meanwhile, the mass-market AI tools that everyone is adopting will dramatically shift the practice of doing science.
Towards instantaneous Ph.D.’s
We have two tools that dramatically shift the nature of scientific exploration.
They will only get better.
* AI models that excel at code, mathematics, and reasoning: OpenAI’s o3, DeepSeek R1, Gemini Deep Thinking, etc.
* AI systems to rapidly parse and summarize existing literature: OpenAI’s Deep Research, Gemini Deep Research, Ai2’s Scholar QA (specific to academic papers), and many more that will come soon.
These tools are dramatically accelerating the most time-consuming aspects of research, particularly in computationally intensive fields. In a few years, the only gating factors on the impact of a scientist will be their access to cutting-edge tools, their understanding of the gaps in AI, and their asking the right questions. The final point is well established as a trait of the most successful scientists, goes hand in hand with the idea of “insight,” and is where the differentiation among scientists will only increase.
Computational super-scientists
All scientific fields that rely heavily on computational infrastructure as a bottleneck for progress are going to experience a dramatic acceleration in the near future. In AI and closely related computer science fields, this is evident from the abundance of soon-to-be-superhuman coding assistants and an exponential (short-term) increase in available compute.
Most AI research is severely bottlenecked by the compute available, the time to implement the intervention, and the implicit efficiency of the idea-implementation interface. Future siblings of OpenAI’s o1 models are going to be used extensively to streamline this. This worldview barely accounts for the ability of these reasoning models to decide which problem to solve and to interpret the results. These sorts of research assistants running in the cluster are a central component of the vision in Anthropic CEO Dario Amodei’s Machines of Loving Grace, and it is one that requires far less optimism in magical breakthroughs than the grand AI-for-science projects.
Reasoning language models (RLMs) have, in their first year of existence, shown major progress on all of the evaluations the AI field put forward as fundamental challenges. Accumulating iterations of this should transfer to scientific decision-making, but we don’t exactly know how.
The fundamental unit of progress in science, which can be viewed as one Ph.D.’s worth of progress (the same goes for one paper), is shrinking so quickly as to redefine many methods of experimentation and decisions about what is or is not possible. Multiple efforts are already documenting how RLMs can be used to find errors in existing literature — a process that will likely be automated in the next few years. Rather than science proceeding with a high velocity, it feels as if science is proceeding with a high acceleration.
The pace of progress necessitates a reinvention of most of our scientific institutions. What happens when the time it takes to create a Ph.D.’s worth of knowledge is substantially smaller than the amount of time it takes to get peer review feedba
As many of you know, this weekend I appeared on the Lex Fridman Podcast with my friend Dylan Patel of SemiAnalysis to cover DeepSeek and the implications on the AI ecosystem. I recommend you check it out.
This post was tricky to pull together. I decided to share it anyway given the timeliness of the topic and other more exciting things I have to get to. The minor, thematic contradictions on motivations, costs, and trajectories are exactly indicative of why analysis and productionization of open-source AI is so hard. In that, it is a valuable lesson that building open-source AI will come with a lot of ups and downs, but now is the best time to do so.
The DeepSeek moment represents the end of the first chapter of AI's recent takeoff, as told through the emergence of ChatGPT. It reminds us that, while substantial resources, coalitions, brands, and trends have been established, the narratives we have been championing are not set in stone. DeepSeek, especially with R1, resets all the narratives around open vs. closed, US vs. China, scaling and commoditization, etc., as we prep for yet another acceleration in the diffusion, progress, and adoption of AI.
Of all of these debates, the focus on open vs. closed AI models is the one least driven by economic factors and most driven by vibes. The open-source AI community is driven by a future vision where AI is not held by a few super-rich companies, a future where more people get to partake in the building of AI, a future where AI is safer, and so on. These are ideals, and building the tools and systems that make this vision a reality is a monumental challenge. Building strong AI models is far, far easier than building a sustainable open-source ecosystem around AI.
Building a better, truly open ecosystem for AI has been my life’s work in recent years, and I obviously want it to flourish further, but the closer you are to the core of the current open-source ecosystem, the more you know that this is not a given, with the costs of doing relevant AI training skyrocketing (look, I know DeepSeek had a very low compute cost, but these organizations don’t just fall out of the tree) and many regulatory bodies moving fast to get ahead of AI in a way that could unintentionally hamper the open. Yes, efficiency is getting better and costs will come down, as shown with DeepSeek V3, but training truly open models at the frontier isn’t getting much easier.
Building the future ecosystem of open
As a perfect case in point, consider Meta. Meta, as a platform serving content to billions of users, is extremely well-positioned to use AI to make its services more engaging and more profitable for advertisers. The Llama project is not needed for that vision. Yes, it will be easier for them to integrate and optimize an AI that they train, but in a world where AI models are commoditized, what’s the point? The most compelling reasons for openly releasing the Llama models are not business reasons but rather ideological reasons. Mark Zuckerberg revisited this on the recent Meta earnings call:
I also just think in light of some of the recent news, the new competitor DeepSeek from China, I think it’s one of the things that we’re talking about is there’s going to be an open source standard globally. And I think for our kind of national advantage, it’s important that it’s an American standard.
So we take that seriously and we want to build the AI system that people around the world are using and I think that if anything, some of the recent news has only strengthened our conviction that this is the right thing for us to be focused on.
The pro-America messaging from Zuckerberg long predates the new administration (especially given that all of Meta’s major apps are banned in China), even if the language is amplified now. This is purely an argument of “we are doing this because we should.”
This argument is extremely similar to that used by DeepSeek AI’s CEO Liang Wenfeng. In an interview translated by ChinaTalk, Wenfeng described the need for Chinese leadership in open-source AI (in addition to a clear commitment to keep releasing models openly):
Liang Wenfeng: Because we believe the most important thing now is to participate in the global innovation wave. For many years, Chinese companies are used to others doing technological innovation, while we focused on application monetization — but this isn’t inevitable. In this wave, our starting point is not to take advantage of the opportunity to make a quick profit, but rather to reach the technical frontier and drive the development of the entire ecosystem.
…
We believe that as the economy develops, China should gradually become a contributor instead of freeriding. In the past 30+ years of the IT wave, we basically didn’t participate in real technological innovation. We’re used to Moore’s Law falling out of the sky, lying at home waiting 18 months for better hardware and software to emerge. That’s how the Scaling Law is being treated.
But in fact, this is something that has been created through the tireless efforts of generations of Western-led tech communities. It’s just because we weren’t previously involved in this process that we’ve ignored its existence.
The interview has many other comments making it clear that the way this will be done is by training powerful AI and releasing it for the world to use.
Both of these arguments, from Zuckerberg and Wenfeng, rely on the optimism that we, as a community of users of open AI models, will figure out how to create a valuable ecosystem around them. Right now, the vast majority of AI usage for applications comes through various API calls. Yes, some of this includes the usage of open-weight models like Llama and DeepSeek R1, but there is no clear positive attribution showing that a model being open was the reason it was used.
The nationalistic comments regarding open-source AI are only likely to grow stronger as governments more deeply integrate with their leading AI companies.
One of the main arguments why American AI leaders believe that the AI ecosystem should be built on a Western foundation is the risk of China “poisoning the well” of our future computational infrastructure. To be very clear — there is absolutely no evidence of this to date, but it is a simple proposition that the Chinese Communist Party (CCP) could build ties to the leading Chinese AI laboratories and require them to train for specific behaviors or train some sort of back door through model weights into American infrastructure. America has been reeling with the potential of this sort of influence on TikTok. If AGI is to be a real thing that can be steered to ideological outcomes, a bill titled Protecting Americans from Foreign Adversary Controlled Applications Act (the bill banning TikTok and forcing a divestiture) will operate at entirely the wrong level of abstraction.
American companies raced to host R1 in a competitive frenzy. This is how open-source works, and it will be far easier to incentivize better open models from Western labs than it will be to ban companies from adopting Chinese technology.
As of the release of DeepSeek R1, Chinese AI companies didn’t have clear links to the government, but soon after said release, DeepSeek’s CEO met with Chinese Premier Li Qiang (approximately second in command) to discuss their work.
AI is obviously far more on the radar of American leadership as a priority, and has been for some time. This is a major advantage that the U.S. has in terms of a fast reaction to changing needs for open models.
In a recent Reddit AMA soon after his appearance on stage with Trump for the announcement of the Stargate project, CEO of OpenAI Sam Altman even acknowledged that their strategy “may be on the wrong side of history” here with respect to openly sharing AI components. OpenAI should get no credit until their actions change, but DeepSeek and a new government administration have made many forces re-evaluate their relationship to the open ecosystem.
The current imperative of open-source AI is to create feedback loops where open models become more useful than their closed counterparts. Given that AI is very expensive and slow to train, this cannot look like the accumulation of small security and reliability improvements, as happened with open-source software. There’s a chance that some algorithmic innovation makes this possible, but for now, the solutions need to be more imaginative. Two examples I am currently interested in include:
* Feedback loops from data to model behavior. If exposing the data to users, either from pre-training or post-training, makes it easier to control a model, then open models can win.
* Finetuning advancements. Currently, finetuning any model to target a specific task is extremely hard. This is true with both open-source code and fine-tuning APIs. If open-source code can be made to enable feedback loops of cheap synthetic data with verifiers to make very targeted models, open models can win.
These are just two examples. We need more than these if we want open-source AI to continue once the bubble of AI advancement cracks. We don’t know when that comes, but if investment is driven by ideological reasons rather than monetary ones, public companies only have so much leeway to continue indefinitely.
These days I classify myself as an advocate for more openness for AI (which never means absolute openness), but I’m not going to describe myself as an optimist for it “winning.” As scaling continues to push the limits of multiple training regimes and recipes become more expensive, open models drift from the frontier. This DeepSeek moment has happened once in the 2+ years since the release of ChatGPT. We need to change incentives if we want it to happen regularly.
The reason DeepSeek R1 is so close to the frontier is that, on top of being extremely skilled, they have a way faster release process than the likes of OpenAI and Anthropic, who do extensive safety testing and even pre-screen releases with federal governments. Meanwhile, DeepSe
This post is early to accommodate some last minute travel on my end!
The new models trained to express extended chains of thought are going to generalize outside of their breakthrough domains of code and math. The “reasoning” process of language models that we use today is chain-of-thought reasoning. We ask the model to work step by step because it helps it manage complexity, especially in domains where the answer requires precision across multiple specific tokens. The domains where chain of thought (CoT) is most useful today are code, mathematics, and other “reasoning” tasks. These are the domains that models like o1, R1, Gemini-Thinking, etc. were designed for.
Different intelligences reason in different ways that correspond to how they store and manipulate information. Humans compress a lifetime of experience into our spectacular, low-power brains that draw on past experience almost magically. The words that follow in this blog are also autoregressive, like the output of a language model, but draw on hours and hours of background processing as I converge on this argument.
Language models, on the other hand, are extremely general and do not today have architectures (or use-cases) that continually re-expose them to relevant problems and fold information back in a compressed form. Language models are very large, sophisticated, parametric probability distributions. All of their knowledge and information-processing power is stored in the raw weights. Therefore, they need a way of processing information that matches this. Chain of thought is that alignment.
Chain-of-thought reasoning allows information to be naturally processed in smaller chunks, allowing the large, brute-force probability distribution to work one token at a time. Chain of thought, while allowing more compute per important token, also allows the models to store intermediate information in their context window without needing explicit recurrence.
Recurrence is required for reasoning, and it can happen either in the parameter space or in the state space. Chains of thought with transformers handle all of this in the state space of the problems. The humans we look at as the most intelligent have embedded information directly in the parameters of their brains that they can draw on.
Here is the only assumption of this piece — chain of thought is a natural fit for language models to “reason,” and therefore one should be optimistic about training methods designed to enhance it generalizing to many domains. By the end of 2025 we should have ample evidence of this, given the pace of technological development.
If the analogies between types of intelligence aren’t convincing enough, a far more practical way to view the new style of training is as a method that teaches the model to be better at allocating more compute to harder problems. If the skill is compute allocation, it is fundamental to the models handling a variety of tasks. Today’s reasoning models do not solve this perfectly, but they open the door for doing so precisely.
The nature of this coming generalization is not that these models are one-size-fits-all, best in all cases: speed, intelligence, price, etc. There’s still no free lunch.
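One way to see the compute-allocation framing concretely: for a decoder-only transformer, each generated token costs roughly 2N FLOPs for N parameters (a common rule of thumb that ignores attention's context-length term), so a model that thinks for thousands of tokens is spending orders of magnitude more compute on a hard problem than one that answers immediately. A back-of-the-envelope sketch with a hypothetical 70B-parameter model:

```python
def generation_flops(n_params: float, n_tokens: int) -> float:
    """Rough decoding cost: ~2 * n_params FLOPs per generated token
    (a standard approximation that ignores attention's context-length term)."""
    return 2 * n_params * n_tokens

N_PARAMS = 70e9  # hypothetical 70B-parameter model, for illustration
print(f"direct answer, 50 tokens: {generation_flops(N_PARAMS, 50):.1e} FLOPs")
print(f"long CoT, 4,000 tokens:   {generation_flops(N_PARAMS, 4000):.1e} FLOPs")
# An 80x difference in spend on one question -- the reasoning-trained model
# is, in effect, learning when a problem deserves that extra compute.
```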
A realistic outcome for reasoning-heavy models in the next 0-3 years is a world where:
* Reasoning-trained models are superhuman on tasks in verifiable domains, like those with initial progress: code, math, etc.
* Reasoning-trained models are well better in peak performance than existing autoregressive models in many domains we would not expect and that are not necessarily verifiable.
* Reasoning-trained models are still better in performance at the long tail of tasks, but worse in cost, given the high inference costs of long context.
Many of the leading figures in AI have been saying for quite some time that powerful AI is going to be “spikey" when it shows up — meaning that the capabilities and improvements will vary substantially across domains — but encountering this reality is very unintuitive.
Some evidence for the generalization of reasoning models already exists.
OpenAI has already published multiple safety-oriented research projects with their new reasoning models: Deliberative Alignment: Reasoning Enables Safer Language Models and Trading Inference-Time Compute for Adversarial Robustness. These papers show their new methods translating to various safety domains, i.e. model safety policies and jailbreaking. The deliberative alignment paper shows them integrating a softer reward signal into the reasoning training — having a language model check how the safety policies apply to outputs.
An unsurprising quote from the deliberative alignment release related to generalization:
we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios.
Safety, qualitatively, is very orthogonal to traditional reasoning problems. Safety is very sensitive to the information provided and subtle context, where math and coding problems are often about many small, forward processing steps towards a final goal. More behaviors will fit in between those.
This generative verifier for safety is not a ground-truth signal and could theoretically be subject to reward hacking, but that was avoided. Generative verifiers will be crucial to expanding this training to countless domains — they’re easy to use and largely a new development. The field of LLM-as-a-judge (and related synthetic data pipelines) only really became stable with models at the level of GPT-4. Reasoning models trained as judges are a very natural fit because the exact token for a predicted reward or ranking is crucial — CoT is essential. All of the progress here relies on continued progress on both generators and verifiers. o1 et al. were likely trained with mostly explicit code verifiers. They spawned far more powerful generators, which will enable new types of verifiers. Then, we can train better models (and so on).
Onto another example of unexpected performance from the new reasoning-trained models. DeepSeek-R1, the new open-weight o1 replication, has been showing up at the top of many random benchmarks as top overall, above Claude 3.5 Sonnet, Gemini, and GPT-4o, and alongside o1. Examples include a creative writing and humor leaderboard and the brand-new, extremely challenging benchmark from the Center for AI Safety and Scale AI — Humanity’s Last Exam. Oh, and yes, it’s best on both accuracy and the new metric “calibration error,” which is designed to have the model express its own uncertainty.
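For reference on that metric: one common formulation of calibration error is expected calibration error (ECE), which bins answers by the model's stated confidence and compares confidence to accuracy within each bin. Humanity's Last Exam may use a different variant; treat this as the generic form.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin answers by stated confidence, then average the
    |confidence - accuracy| gap across bins, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# A model that says "90% sure" but is right only 60% of the time scores ~0.3:
# expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
```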
Calibration is a long-sought behavior in traditional LMs, and it turns out maybe reasoning training helps it?
A lot of my friends find o1-pro to be clearly the most useful AI model in their daily workflows (one example here and a similar R1 example here). ChatBotArena has all of the new models — o1, Gemini-Thinking, and R1 — as some of the top models in the best “normal use” evaluation the AI community has. These reasoning models are definitely absorbing the other lessons learned in post-training across the AI industry.
The explosion of R1 caused arguably the biggest moment of general AI awareness since the original ChatGPT. DeepSeek’s app has been the number one overall free app in the U.S., and non-technical users are getting meaningful value out of seeing the reasoning process. What was a niche training process is bringing many more types of benefits than expected.
All of this is just on “day 1” of this technology. Reasoning models are going to proceed at a rate far, far faster than most expect.
These models will not be state-of-the-art in every domain, but probably in far more than you expect. Language models are a complex technology and they will never be one size fits all, but the ground is being reshaped under us.
Especially where the standard models match the reasoning models’ abilities, you’ll be paying way more for the same performance. At the same time, many domains are going to be open to the trade of “if you pay a little bit more, the reasoning model will get you a bit more performance,” which will accrue a lot of value over time.
These are trade-offs that many in the AI industry see at face value. Many ask where Anthropic’s reasoning model is, but they may never explicitly have one. Before o1 launched, Claude was already using extra tokens hidden from the user to improve the quality of responses. Anthropic CEO Dario Amodei commented on their approach in a recent interview with Joanna Stern of the WSJ:
To say a little about reasoning models, our perspective is a little different, which is that there’s been this whole idea of reasoning models and test-time compute as if they’re a totally different way of doing things. That’s not our perspective. We see it more as a continuous spectrum — the ability for models to think, reflect on their own thinking, and ultimately produce a result. If you use Sonnet 3.5, sometimes it already does that to some extent. But I think the change we’re going to see is a larger-scale use of reinforcement learning, and when you train the model with reinforcement learning, it starts to think and reflect more.
It’s not like reasoning or test-time compute — or whatever it’s called — is a totally new method. It’s more like an emergent property, a consequence of training the model in an outcome-based way at a larger scale. I think that will lead to something that continuously interpolates between reasoning and other tasks, fluidly combining reasoning with everything else models do.
As you’ve said, we’ve often focused on making sure using the model is a smooth experience, allowing people to get the most out of it. I think with reasoning models, we may take a similar approach and do something different from what others are doing.
The newest Claude 3.5 Sonnet models are very likely already trained to some extent with RL on verifiable outcomes. Just days before o1 was launched, Claude’s behavior of “I’m thinking about that” was the biggest indicator we had of consumer companies trading more com
We're here to share the story of building our Open Language Models (OLMos) and what we improved to build the OLMo 2 7B/13B models that are competitive with the Llama 3.1 8B model. This is all about building an effective, small language modeling team that can share all it learns with the scientific community. Dirk, Luca, and Kyle are some of the people I learn the most from and have more knowledge (and entertainment) to share than we have time for. Some questions were pulled from Twitter, but please comment or get in touch if you want us to cover anything in future episode(s)!
Main topics:
* Pretraining efficiency and our quest for stability after a not-so-secret failed 70B run early in 2024,
* What the role of OLMo is in the broader AI landscape and how that is, or is not, changing,
* The many little decisions that go into building language models and their teams (with a focus on NOT post-training, given I already talk about that a ton).
Play with the models we build here: playground.allenai.org/
For more history of open language models (OLMos) on Interconnects, see my first post on OLMo, my coverage of OLMoE, OLMo 2, and why I build open language models. If you have more questions or requests, please let us know (especially the researchers out there) and this can be one of N, rather than a one-off celebration.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Contacts
Dirk Groeneveld — https://x.com/mechanicaldirk // https://bsky.app/profile/mechanicaldirk.bsky.social
Kyle Lo — https://x.com/kylelostat // https://bsky.app/profile/kylelo.bsky.social
Luca Soldaini — https://twitter.com/soldni // https://bsky.app/profile/soldaini.net
General OLMo contact — olmo@allenai.org
Papers / models / codebases discussed
* OLMo 2 paper
* OLMo 1 paper
* OPT models and talk from Susan Zhang
* BLOOM
* Red Pajama V1 Dataset
* Falcon LLM
* C4: Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
* Maximal Update Parametrization (muP), from Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
* Spike No More: Stabilizing the Pre-training of Large Language Models
* LLM360: Towards Fully Transparent Open-Source LLMs — Amber model
* EfficientNet
* MegaBlocks
* A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Kyle said Hitchhikers)
* Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Chapters
Here is a list of major topics covered in the podcast, with timestamps for when the discussion starts:
* [00:00:00] Introduction
* [00:02:45] Early history of the OLMo project
* [00:15:27] The journey to stability
* [00:25:00] The evolving role of OLMo and pretraining research
* [00:29:00] Pretraining Q&A (µP, scaling laws, MoE, etc.)
* [00:40:40] How to think about pretraining data work
* [00:54:30] Role of pre-training vs mid-training vs post-training
* [01:02:19] Release strategy and wrapping up
Transcript
This is generated by AI and lightly edited for clarity. In particular, the per-speaker attribution was poor this time around.
Nathan Lambert [00:00:07]: Hey, welcome back to Interconnects. In this interview, we're bringing one that I've hinted at for a while, which is interviewing some of the other leads on the OLMo team at AI2.
So essentially, this covers the story of OLMo from its early days where we got our compute, kind of our path to stability and some failed runs along the way, the role of OLMo in the broader AI ecosystem, and really just a very long tale of technical details and decision making and considerations that you have when actually training language models that you're trying to have at the frontier of performance relative to peers like Llama, etc. This is a fun one. There's less post-training than normal because this is me interviewing some other co-leads at the Allen Institute for AI. So there are three people in addition to me: Dirk Groeneveld, who is the lead of training and handles most of engineering, and Kyle Lo and Luca Soldaini, who are the data leads. So we have a pre-training engineering lead and two data leads with me, who has done a lot of the post-training. This is just a part of the team. And I hope you enjoy this one. We can do more of these, and bear with the fact that I'm still expanding my podcasting tech equipment. But I think the audio is definitely good enough, and enjoy this episode with me, Kyle, Dirk, and Luca.
Hey, everyone. Welcome to the AI2 office. We're finally talking more about some of our OLMo things. Too much work to do to actually get all the... the information we want to share out into the world. So I'm here with Dirk, Kyle, and Luca. We can also talk so people can identify your voices, since people are not all on video.
Dirk Groeneveld [00:02:01]: Hi, I'm Dirk. I am the lead of the pre-training part of OLMo.
Kyle Lo: Hi, I'm Kyle. I work on data.
Luca Soldaini [00:02:08]: Hello, I'm Luca. I also work on data with Kyle.
Nathan Lambert [00:02:13]: Okay, so we're kind of going to maybe go through some of the story of OLMo to start. And then just get into as many nerdy details until we get tired of OLMo 2. Which, in my estimation, will probably be mostly about pre-training. You can ask me post-training questions as well. But I'm not going to sit here and, like, ask myself questions that I'm not going to answer. Because that is an absolutely ridiculous thing. You can ask me one question. Okay. One question. It's like, why shouldn't you do post-training with all the compute?
Nathan Lambert [00:02:45]: But I wasn't here for when OLMo actually started. So I think it'd be good to tell people, I mean, like, broadly what AI2 was like at the time, what language modeling was like at the time, and why it may or may not have been risky.
Kyle Lo [00:03:01]: Yeah, you should probably get this.
Dirk Groeneveld [00:03:03]: Yeah, I think it all started in the fall of 2022.
Dirk Groeneveld [00:03:10]: We were talking to AMD at the time about some sort of collaboration. We were scoping out some stuff. And at the time, we wanted to take the BLOOM model and put 300 billion extra tokens in. And we wrote up a proposal and we sent it to AMD, and it disappeared into a black hole. And we never heard from them again. And then ChatGPT came out a couple months after that. And suddenly everybody was very excited. And two, maybe one month after that, AMD came back to us and said, now let's do it. And that kicked off a very busy period for us. At least the three of us were involved at the time, plus some more people trying to scope out exactly what the project would be. Putting 300 billion tokens into BLOOM wasn't that cool anymore. The field had moved on. So we needed to find something else that would work both for us and for AMD.
Dirk Groeneveld [00:04:07]: And that's exactly what we did. We figured it out.
We figured out who would be on the team and how exactly to do it. We had to get the data and all of that stuff, and then started working on it.

Luca Soldaini [00:04:16]: I think it was, let's look it up. The official birthday of OLMo is February 2nd, 2023. That's when we had a big sort of half-day summit workshop, where a bunch of researchers self-organized a long discussion. It was maybe 40, 50 of us trying to scope down a potential language model project at Ai2.

Kyle Lo [00:04:48]: Yeah, it was also extremely bottom-up, because it was not on anyone's radar. Everyone was working on different projects that we had promised for the end of the year. This was very much just a side gig for us. We had no compute other than these mysterious AMD GPUs that just came. It was like, oh, it's possible. And everyone was just like, yeah, I'll work on this on the side. Let's just start hacking together some stuff.

Nathan Lambert [00:05:14]: How far along the line until you decided on 7B? Like, were these things obvious at the time?

Luca Soldaini [00:05:20]: I think the size of it is where Llama's size was. We started with seven because seven was the smallest Llama size. This was Llama 1. Llama 1 was in the first couple months of 2023. We started scoping before Llama 1, and then when Llama 1 came out, it made sense to have a configuration that was just close to what they were doing, so there's not too much reinventing. I think seven was…

Dirk Groeneveld [00:05:52]: Yeah, I mean, I think the original scope was to recreate Llama 1, which would be a 7B at 1.4 trillion tokens. What were we staring at? OPT.

Kyle Lo [00:06:03]: We were staring at OPT also, right? Around that time.

Dirk Groeneveld [00:06:07]: For inspiration, yeah. And for what not to do, in many cases. Was OPT even in the many-tokens regime, or was that still when people did the BLOOMs?

Luca Soldaini [00:06:18]: I think OPT and BLOOM were.

Luca Soldaini [00:06:22]: They were not overtrained; they were both scoped to Chinchilla. And they both had extensive logs, so they were very useful, because both of them have hundreds of pages of, like, whatever can go wrong during pre-training. Yeah, I mean, OPT was amazing as a resource for figuring out, you know, we knew nothing, so we needed to know what's important. And yeah, I remember there's also that talk Susan has.

Dirk Groeneveld: On the trials of training OPT, and yeah. I always feel it's kind of a shame, because the OPT models are not very good, but they were first. They figured all that stuff out for the first time. I have huge amounts of respect for that.

Nathan Lambert [00:07:11]: And what's the open source angle at the time? Or, like, had you already identified that there were no open pre-training datasets for these models?

Kyle Lo: There definitely wasn't any op…
Full post for links, images, etc: https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1

I have a few shows to share with you this week:
* On The Retort a week or two ago, we discussed the nature of AI and if it is a science (in the Kuhn'ian sense)
* I appeared on Dean W. Ball and Timothy B. Lee's new podcast AI Summer to discuss "thinking models" and the border between post-training and reasoning methods. Listen here.
* Finally, a talk I gave at NeurIPS on how I think about post-training for AI applications is now public.

This post is likely getting cut off in email inboxes — I recommend reading online by clicking on the title!

Yesterday, January 20th, China's open-weights frontier AI laboratory, DeepSeek AI, released their first full-fledged reasoning model. It came as:
* A flagship reasoning language model, R1, trained via a 4-stage, RL-heavy process. It is MIT-licensed, which means companies and researchers can build upon and train on its outputs to accelerate the development and deployment of reasoning language models (RLMs).
* An RL-only reasoning model trained directly from their V3 base model, R1-Zero (used to create training data for full R1).
* A suite of open-weight models finetuned with supervised finetuning (SFT) data derived from R1 (similar data to one of their intermediate training steps).
* A technical report detailing their RL training methods.
* Models are available at chat.deepseek.com (via DeepThink) and in their new app.

This post is less about the evaluation results (which, of course, are extremely good and shown below) and more about how training is done and what it all means.

This is a major transition point in the uncertainty in reasoning model research. Until now, reasoning models have been a major area of industrial research without a clear seminal paper. Before language models took off, we had the likes of the GPT-2 paper for pretraining or InstructGPT (and Anthropic's whitepapers) for post-training. For reasoning, we were staring at potentially misleading blog posts. Reasoning research and progress is now locked in — expect huge amounts of progress in 2025 and more of it in the open.

This again confirms that new technical recipes normally aren't moats — the motivation of a proof of concept, or leaks, normally gets the knowledge out.

For one, look at the pricing of these reasoning models. OpenAI was likely charging more for its model due to the costs of long-context serving and being the only model in town, but now o1's pricing at $15 per million input tokens / $60 output looks out of place relative to R1's pricing at $0.55 per million input tokens / $2.19 output (yes, o1-mini is cheaper at $3/$12 per million, but still almost a 10x difference). The price war that is coming for reasoning models will look like the Mixtral inference price war from 2023.

With o3, OpenAI is likely technically ahead, but it is not generally available nor will the weights be available anytime soon. This points to the first time since Stable Diffusion's release that the most relevant and discussed AI model is released with a very friendly license.
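To put the pricing gap above in concrete terms, here is a quick back-of-the-envelope sketch. The prices are the per-million-token figures quoted above; the workload mix is hypothetical:

```python
# Back-of-the-envelope cost comparison using the per-million-token
# prices quoted above (USD). Prices change often; treat as a snapshot.
PRICES = {
    "o1":      {"input": 15.00, "output": 60.00},
    "o1-mini": {"input": 3.00,  "output": 12.00},
    "R1":      {"input": 0.55,  "output": 2.19},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for a workload at per-million-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical reasoning-heavy workload: outputs dominate because of
# long chains of thought.
for model in PRICES:
    cost = job_cost(model, input_tokens=2_000_000, output_tokens=10_000_000)
    print(f"{model:8s} ${cost:,.2f}")
# o1       $630.00
# o1-mini  $126.00
# R1       $23.00   (roughly 27x cheaper than o1 on this mix)
```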
Looking back at the journey "open-source" AI has been on over the last 2.5 years, this is a surprising moment in time that will be marked in the history books.

We don't entirely know how these models will be used in the future beyond code and math, but noises are constantly bubbling up that OpenAI's o1-Pro is the best model for many more challenging tasks (I need to try it myself before making definitive recommendations).

The most useful post to write now is one that establishes the research area, the do's and don'ts, and the open questions. Let's get into the details.

The DeepSeek R1 training recipe for reasoning
The training of R1 comes in 4 stages:
* "Cold-start" of supervised finetuning on synthetic reasoning data from the R1-Zero model.
* Large-scale reinforcement learning training on reasoning problems "until convergence."
* Rejection sampling on 3/4 reasoning problems and 1/4 general queries to start the transition to a general-purpose model.
* Reinforcement learning training mixing reasoning problems (verifiable rewards) with general preference tuning reward models to polish the model.

Below, the post breaks down each training stage into its core components, insights, and open questions.

The winds of o1 replication have been blowing strongly away from any sort of explicit search (especially at inference time). It really was, and is, a language model with the new reasoning behaviors coming from a lot of RL training.

Before we start, remember that to do this reasoning training well you need a very strong base model with long-context capabilities. Much like for standard post-training, we don't really know what traits of a base model make for one that is more suited for direct RL training.

Step 0. Training R1-Zero to initialize R1 with synthetic data
DeepSeek R1-Zero will be best known as the first open model trained with "large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step." Rumors had mentioned this for o1, but understanding how it worked wasn't clear. This is a funky model that DeepSeek reports will sometimes change languages in reasoning or show signs of other reliability issues.

The minor usability issues with R1-Zero show why more than just large-scale RL is needed to train a fantastic reasoning model, but the RL part is the key to unlocking the reasoning behaviors we are searching for.

The report includes the most interesting results for R1-Zero, including the plot I've been asking for of RL-training-time scaling. Since o1's release, everyone has been obsessed with the plots showing how inference time is correlated with evaluation performance. Inference-time scaling is far easier to elicit (or force by using a framework like Monte Carlo Tree Search), but showing training-time improvements via RL is the real foundational result. This is the result I'm searching for in my research.

And an unsurprising, yet very satisfying plot of length growing with training. This could be mixed with the above plot to make one of the "inference time scaling" plots we have seen many versions of with less clear methods.

In both of these plots, it looks like the numbers could still be going up if they let the RL cook longer. With the pace of progress so high, these laboratories get more gains by ending the jobs near saturation and starting the next experiment instead of seeking that last 1%.

Most, if not all, researchers will skip the step of training an R1-Zero style model because they don't need to.
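As a concrete aside on how a verifier interacts with an R1-Zero-style model: the paper's template has the model wrap its reasoning and final answer in explicit tags, so grading only needs the answer span. A minimal sketch of that extraction step (the exact tag names follow the paper's think/answer template; the parsing details here are illustrative):

```python
import re

# The R1-Zero template asks the model to emit
# <think>...</think><answer>...</answer>. The verifier only needs the
# answer span; the think span is free-form reasoning.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_completion(text: str) -> tuple[str | None, str | None]:
    """Return (reasoning, answer) if the completion follows the template."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

completion = "<think>15% of 80 is 0.15 * 80 = 12</think><answer>12</answer>"
reasoning, answer = parse_completion(completion)
assert answer == "12"
```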
DeepSeek made it clear that their "cold start" of SFT reasoning traces makes the final R1 model better — this is unsurprising, as they want R1 to be a certain type of instruction-tuned model. It'll help avoid some of the "RL oddities" in R1-Zero that DeepSeek mentions, like changing language mid-generation.

Still, the area of RL-on-base-models should be studied further. The way that R1-Zero can be trained is quite clever, as most base models without any instruction tuning have major issues with rambling and never generating a stop token. R1-Zero avoids this with a system prompt telling the model to generate HTML tags. Additionally, I suspect this type of training wouldn't work on older base models that don't have some standard post-training style instruction data in the pretraining corpus. For example, in OLMo 2 we had some MATH instruction data in the annealing mix. Just a few instructions will let this system prompt work.

In fact, the trend of increasing generation length via RL training could be even stronger when training directly from a base model rather than a standard post-trained model that doesn't have a verbose chain-of-thought style. In order for RL to really start cranking up the response length in such an instruction-following model, it will have to unlearn a certain response length that was baked in. For example, in Tülu 3's final stage of RL finetuning, the phase where the response length first goes down could be the barrier of misalignment between a larger round of SFT training and a smaller RL setup.

Zooming in on the x-axes of these R1-Zero plots, you can see that they're doing 1000s of "RL steps." An RL step in this case refers to the model update step, which comes after multiple generations are made for the prompts in the batch and then answers are verified. This is a large amount of RL training, especially with such a large model. For reference, in our Tülu 3 work, we normally finetuned our models for 100s of steps, and the biggest models we are releasing soon only trained for ~50 steps of RL.

This is scaled-up RL relative to existing literature. R1 proper surely uses a similar setup, but DeepSeek did not include the same details, so the rest of this post relies more on explicit text in the paper.

Step 1. Reasoning SFT "Cold Start"
In order to improve the readability (i.e., help maintain formatting) and increase the final performance of the reasoning model, DeepSeek performs a small amount of supervised finetuning on the original base model with "a few thousand" filtered completions from the R1-Zero model. This involves a few tricks (none of which seem essential, you just need some of this data), such as:

Using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.

For replication efforts, any of these can be done. In fact, using DeepSeek-R1 itself is likely the easiest way.

This phase readies the loss landscape of the model to make the "emergent" behaviors like "wait, let me check my work" or "that was wrong" come forth more easily in RL training.

Step 2. Large-scale RL for reasoning
As a reminder, RL for reasoning models is built on a simple idea: you should reward the model for getting correct answers to problems where you can check if it has a correct answer.
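A minimal sketch of that loop, with generation and the policy update left as placeholders for whatever RL machinery (e.g., PPO or GRPO) is actually used; `policy.generate` and `policy.update` are hypothetical stand-ins, not DeepSeek's implementation:

```python
import re

def extract_answer(completion: str) -> str | None:
    """Pull the final answer out of an <answer>...</answer> span, if any."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else None

def verify(answer: str | None, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the answer checks out, else 0.0.
    For math this can be exact/numeric match; for code, passing unit tests."""
    return 1.0 if answer is not None and answer == reference.strip() else 0.0

def rl_step(policy, prompts: list[str], references: list[str], k: int = 8) -> float:
    """One 'RL step' as described above: sample several completions per
    prompt, verify each against the reference answer, then do a single
    policy update on the whole batch. Returns the mean reward."""
    batch, total = [], 0.0
    for prompt, ref in zip(prompts, references):
        completions = [policy.generate(prompt) for _ in range(k)]
        rewards = [verify(extract_answer(c), ref) for c in completions]
        batch.append((prompt, completions, rewards))
        total += sum(rewards)
    policy.update(batch)  # placeholder for the PPO/GRPO gradient step
    return total / (len(prompts) * k)
```

With 1000s of such steps and k completions per prompt, the total number of forward passes dwarfs what typical open post-training recipes have used so far.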
Full post for images, etc: https://www.interconnects.ai/p/to-meta-ray-ban-local-ai

With the Rabbit r1, the Humane pin, the Friend thing, the Sam Altman rumors, Meta Ray-Bans, and everything in between, it is obvious that we are going to get new devices in the near future driven by advancements in AI. Trying some of those that are already public makes this obvious from a functional perspective rather than a marketing perspective.

Even though many of these devices will have a shelf life drastically shortened by the underlying API access getting turned off when the parent company runs out of money, the call for these devices is very strong. AI is going to be more than a chat window we use for work; we just don't know what that will feel like. AI should be fun, flexible, and available.

Meta's Ray-Bans were first launched in 2021, long before any of this ChatGPT-inspired interest in AI began. Having tried them, I believe the form factor would have caught on eventually, but AI was the catalyst to accelerate adoption. AI expanded our expectations for the range of exciting outcomes that could be coming our way.

Using the AI in the Ray-Bans is much like using a protolithic chatbot. If I had never used ChatGPT, it would have been transformative, but today it feels slightly outdated. We should be more impressed by these generally and contextualize the AI they're delivering. The product excitement cumulatively feels unexpectedly like what AirPods had on day 1. I was not expecting this fondness.

The form factor for the Meta Ray-Bans is fantastic and drives this connection. I've been legitimately excited to use them (albeit much more during sunny Seattle summers relative to now), and it immediately made sense when taking them out of the packaging. My best uses have been outdoor activities, taking photos and videos without needing to fuss with a phone, and communications. An example video is below; like most things, it has a learning curve. Here's a photo from that outing, and a video. Clearly, they're fine.

What I want to use them for today has nothing to do with AI. In some ways, this makes me more bullish on the form factor, but it makes it clear that Meta is in a precarious position. Ironically, I would've been more reluctant to buy them if not for the excitement about AI.

As of writing this, I would much rather have "Apple Ray-Bans" because of a seamless integration with the rest of my information ecosystem. However, Apple may not be willing to take the risk to build them (as I avoid an Apple Vision Pro digression).

This does not mean the long-term story of many new devices won't be the AI.

AI, in the recent past (and likely in the near future), has left most electronic devices with an eerie, bland sameness. My sunglasses can answer basic questions about my day just like Siri. At the same time, my appliances try to talk to me. The hard-to-visualize step is how this changes (and overcomes the same integration dead ends that agents face). AI in 5 years (or way less) will actually know the context of our lives and be able to execute basic web tasks.

When the AI is good, Meta Ray-Ban type devices will be indispensable. Reminders, calls, reasoning, integration, all on the go. Much like the sensation products like AirPods provide, AI devices (and services) done right will make us free to be in the world naturally.

Meta now has a real hill to climb for AI. They just need to focus on building one more useful feature at a time rather than building a god.
They have a tangible goal and a real product that is going to get better in the normal march of progress. If only we had an ecosystem of people who wanted to do this work and keep hill-climbing the AI part for them.

The AI of the Meta Ray-Bans (and the other devices I started with) being primarily in the cloud is a drag, but is needed for these first generations of glasses to maintain battery life. The cloud-centric nature of the AI is the largest perceivable reason Meta cannot open a Software Development Kit (SDK) for the glasses — all the developers would be doing is changing Meta's internal Llama API calls, rather than uploading new and improved models to the glasses.

AI models in the cloud are consistently the first ones to cross the frontier of new capabilities. As we figure out what we want to use new AI devices for, using the cloud models will make us more likely than not to find useful applications. Now that we have things that people actually like, we need to optimize and specialize these models out of the cloud.

What's the state of local LMs?
The AI angle for this post is to prompt the question: What do people actually use local, or on-device, language models for? What innovation are they driving?

The local model ecosystem is composed of a distribution of tinkerers, researchers, and those whose use cases API models refuse. Most people doing this are not directly innovating on local models in a way that drives meaningful improvements to the underlying AI. Yes, companies surely monitor progress and observe lessons, but there are far bigger markets at play for why local models are needed in the future of AI than the tinkerers that get visibility.

Local language models are crucial for maintaining privacy (not everyone can afford fancy inference data centers like Apple), optimizing inference speed, and providing access in situations with no web connectivity. The Meta Ray-Bans stand to benefit from all of these.

Phrasing the reasoning starting from the frontier cloud models most people are used to, rather than from what we want, it goes as: local models shouldn't try to be our general use-case model. Outsource that to the cloud. Use local models for efficient, specific tasks out in the world.

What local model enthusiasts are doing is building an ecosystem around optimization, latency, and task specialty that drives a lot of value. This value is captured by companies with no feedback loops to the tinkerers. Having SDKs and other direct places where those evolving local models can benefit in real ways is the goal. The models themselves will actually get better too — an actual potential feedback loop from open AI models.

Just about a year ago I wrote a very similar take on local models, on how they have different trade-offs and trajectories. Apple Intelligence, Google's new models / Pixel phones, and the Meta Ray-Bans are showing us that this future is coming.

What is left to be understood is the manner in which local models are developed for new devices. Will any major technology companies let us run our own models with deep integrations? How can open-source principles and local models synergize?

Hillclimbing with open, local language models
Giving developers ways to integrate their own AI models into the operating system (OS) hooks used by the Meta Ray-Bans would immediately spawn a platform for local, open-weight language models. I first learned how locked down the Ray-Ban developer ecosystem was because I was excited to try to get our multimodal LM Molmo on them.
That attempt didn't make it far.

Other companies, like Apple, could conceivably have SDKs that let users point their language models at OS hooks. Creating operating systems that allow users to integrate certain open models (even if only those approved by the companies) would completely change the (lack of) incentives for iterating on language models in the open.

While we still don't have the new Apple Intelligence version of Siri that can plug into multiple applications, we know this works by letting an AI model generate tokens that correspond to actions in other applications. Letting users choose AI models (maybe their own), even if they are only useful in a subset of the tasks, would be wonderful. I would love to sacrifice whatever the AI situation is on my version of the Ray-Bans by default and get just the best vision model for explaining my environment, the best model for cooking ideas, or the best conversational model to just push the limits for AI devices in any of these promising directions. It would be so fun to try different AI models on a real device.

The open language modeling ecosystem desperately needs these types of feedback loops (and it is totally natural for excitement about a type of technological development like this to exist before the proof cases of its value).

Getting to the point where Meta has an AI SDK for devices along with the leading open language models will make their entire strategy value additive (rather than just destroying the advantages of competitors). In fact, Meta likely needs to do so, or else Apple's product competitor may dominate the market. Only different strategies and feedback loops can dislodge Apple's integration.

On the modeling side, there's no doubt we have step-change improvements coming to those used on the Ray-Bans. On ChatBotArena, we have many models with a few billion parameters that beat the first versions of ChatGPT. The same type of performance gain — where a 100X smaller model can match or surpass performance in a few years — will come for the Ray-Bans and all other sorts of AI applications.

The big picture arc of technology
Starting in 2025, I'm excited about the breadth and quantity of profound, new technological experiences I'm having. Some of them, like ChatGPT Advanced Voice Mode, haven't really landed for me (even though they're extremely impressive to non-tech, non-AI friends and family). Meta Ray-Bans, Waymos, Codex, and standard ChatGPT all feel like technologies that were immediately obvious as something I needed. I need to get a Starlink hub in one of the remote locations my hobbies bring me to, and I'm sure I can add reusable rockets to the transformations I've embraced.

The last technologies sparking these joys were the likes of the iPod and the iPad.

Every person I take to ride a Waymo for the first time has a similar experience of joy.

This year we may also have new models that solve arbitrary internet tasks for us in the background.

The future is here and we're li…
Original post: https://www.interconnects.ai/p/deepseek-v3-and-the-actual-cost-of

Chapters
00:00 Opening
03:15 DeepSeek's learning efficiency
06:49 DeepSeek's compute transparency and reality

Figures
Fig 1: Benchmark Results
Fig 2: ChatBotArena Results
Fig 3: Compute Usage Table

Get full access to Interconnects at www.interconnects.ai/subscribe
Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
* 00:00 Introduction
* 10:00 Prompts & Skill Selection
* 14:19 Instruction Finetuning
* 21:45 Preference Finetuning
* 36:17 Reinforcement Finetuning
* 45:28 Open Questions
* 52:02 Wrap Up

Psssst… we just recently released our technical report for OLMo 2 — 2 OLMo 2 Furious, check it out for tons of training details and tips!

This post has some good content, but if you just want to watch the tutorial on YouTube, it's here.

I'm far more optimistic about the state of open recipes for and knowledge of post-training starting 2025 than I was starting 2024. Last year, one of my first posts was on how open post-training won't match the likes of GPT-4. This is still the case, but now we at least better understand the scope of things we will be working with.

It's a good time to record an overview of what post-training looks like today. I gave a version of this tutorial talk for the first time in 2023 (at ICML), and it felt like a review of the InstructGPT paper rather than something based on reproduced literature knowledge. In 2024, the scientific community made substantial progress in actually training these models and expanding the frontier of knowledge. Doing one of these talks every year feels like a good way to keep tabs on the state of play (whereas last year, I just had a bunch of links to add to the conversation on where to start).

With the talk, I wanted to add more context on where I see post-training generally.

The most important thing people need to know, given the excitement around OpenAI's o1 series of models, is that post-training alone is nowhere near a complete enough lens or taxonomy to study training reasoning language models. It's a step.

Back to processes for all modern AI models. There are a lot of post-training methods to improve models and, more importantly, they can be segmented so the scientific community can make progress on each of them individually. The new state of finetuning stages is satisfying, with three groups of training methods:
* Instruction finetuning (a.k.a. supervised finetuning),
* Preference finetuning (the generalization of reinforcement learning from human feedback), and
* Reinforcement finetuning (the new abstraction for improving performance on specific tasks).

Some of the long-tail methods like rejection sampling, knowledge distillation, and extensive filtering aren't studied well, but you can still do excellent post-training without them. We have options for studying post-training in 2025.

Where last year we were settling debates such as "DPO vs. PPO" or "does AI feedback for RLHF work," now we are focused on just making the best practices better.

Similarly, the stress around doing research on outputs from foundation model providers, i.e., whether research violates the OpenAI terms of service on training competitor models, has dropped further, and such training is common practice. In fact, distilling from strong models is a fundamental part of successful post-training.

Interconnects is a reader-supported publication. Consider becoming a subscriber.

To summarize the state of post-training, there are a few things to keep in mind:

1. Post-training techniques are more impactful on the final performance of models
Some caveats before I toot the horn of post-training as all you need today. Given that "scaling as we know it is ending," this is not entirely a controversial take.
Finally, it is obviously self-serving, as I'm someone who is going to benefit from post-training being more important.

All of this aside, it's very logical that post-training will be the next domain for scaling model compute and performance. Predicting the next token accurately is not something that a user cares about — correct answers and how the answer is presented are. All through 2024, there were way more discussions of how post-training is more important.

If we look at the Elo ratings of models on ChatBotArena, we can see progress has accelerated even though the models haven't been getting noticeably bigger. Pretraining on these architectures is improving, yes, but the biggest and best models are used as tools and supervision for better post-training.

Post-training got more popular because there was more low-hanging fruit on model performance. A lot of that potential has been realized and, in doing so, entirely new types of models are being made, akin to o1.

To interpret these numbers:
* a 100 Elo margin over another model means ~2/3 win probability over the lower,
* 200 Elo gives ~76% win probability,
* 300 Elo gives ~85% win probability, and so on.

You can play with these numbers here.

2. Post-training can be very expensive
While still far cheaper than pretraining due to the price of GPUs, post-training costs have been growing rapidly. If we estimate the costs of post-training the Llama models, we could guess that the all-in costs for the models were about the following (note: numbers are based primarily on a combination of headcount and data costs, with compute driving them even higher):
* LLaMA (Q1 2023) <<$1M: instruction tuning only.
* Llama 2 (Q3 2023) ~$10-20M: 1.4M preference pairs, RLHF, IFT, safety, etc., and other costs not in the paper.
* Llama 3.1 (Q3 2024) >$50M: similar preference data to Llama 2, a ~200-person post-training team, larger models, etc. The number could be much higher.

Post-training costs come from large data bills and extensive inference to generate, clean, and verify multiple types of synthetic training data. More complex loss functions, e.g., RL optimizers, use a lot of memory to train, but far fewer FLOPs than pretraining for general instruct models. This is all growing rapidly and is expected to change.

This culminates in the o1-style models, where the compute spent on post-training loss functions can account for 40% or more of the overall compute of the model. Even Tülu 3, our major post-training project at Ai2 that didn't buy any human data, cost >$1M by my estimate, a large bill for an academic project.

3. Post-training is less reliant on human data
While all the frontier laboratories still rely on human data for parts of their post-training pipeline (including both training and evaluation), AI can be substituted at most stages and get a "good enough" outcome. For example, given the costs above, they can be slashed by moving from human preference data at ~$5-20 per preference point to AI feedback at <$0.01 per sample. The optionality of synthetic data, driven by having models that are good enough for supervision, makes the pace of post-training progress far higher. In my experience, AI feedback for RLHF only became possible with GPT-4 tier models, and the academic community reaps extreme benefits from the plummeting cost of inference.
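The Elo-to-win-probability conversions quoted in point 1 come from the standard Elo logistic curve; a minimal sketch to check the numbers:

```python
def win_probability(elo_margin: float) -> float:
    """Standard Elo logistic: P(higher-rated model wins) for a rating margin."""
    return 1.0 / (1.0 + 10 ** (-elo_margin / 400))

for margin in (100, 200, 300):
    print(f"{margin} Elo -> {win_probability(margin):.0%}")
# 100 Elo -> 64%  (~2/3)
# 200 Elo -> 76%
# 300 Elo -> 85%
```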
4. Post-training ability is the door to advanced reasoning models
Doing post-training well and having mastery of the techniques seems crucial to making progress on reasoning models like o1, because the infrastructure for RL finetuning of an instruct model is the same as what is used for large-scale RL training (at least, you want it to be).

Given the above trends — we know more, it is easier to study, we have cheaper alternatives, etc. — there is cause for optimism in open replications of o1. It should still be expected that the first "replications" of o1 are more relatives than replications — scaled-up post-training on reasoning rather than the special pretraining + scaled RL that OpenAI does. We will learn a lot soon.

The talk on YouTube
Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
* 00:00 Introduction
* 10:00 Prompts & Skill Selection
* 14:19 Instruction Finetuning
* 21:45 Preference Finetuning
* 36:17 Reinforcement Finetuning
* 45:28 Open Questions
* 52:02 Wrap Up

Get full access to Interconnects at www.interconnects.ai/subscribe
In 2025 we need to disambiguate three intertwined topics: post-training, reasoning, and inference-time compute. Post-training is going to quickly become muddied with the new Reasoning Language Models (RLMs — is that a good name?), given that loss functions we studied via advancements in post-training are now being leveraged at a large scale to create new types of models. I would not call the reinforcement learning training done for OpenAI's o1 series of models post-training. Training o1 is large-scale RL that enables better inference-time compute and reasoning performance.

Today, I focus on reasoning. Technically, language models definitely do a form of reasoning. This claim does not need to go in the direction of the AGI debate — we can clearly scope a class of behavior rather than a distribution of explicit AI capability milestones. It'll take work to get an agreement here. Getting some members of the community (and policymakers) to accept that language models do their own form of reasoning by outputting and manipulating intermediate tokens will take time. I enjoy Ross Taylor's definition:

Reasoning is the process of drawing conclusions by generating inferences from observations.

This is a talk I gave at NeurIPS at the Latent Space unofficial industry track. I wanted to directly address the question of whether language models can reason and what o1 and the reinforcement finetuning (RFT) API tell us about it. It's somewhat rambly, but it asks the high-level questions on reasoning that I haven't written about yet and is a good summary of my coverage of o1's implementation and the RFT API.

Thanks swyx & Alessio for having me again! You can access the slides here (e.g., if you want to access the links on them). For more on reasoning, I recommend you read/watch:
* Melanie Mitchell's series on ARC at AI: A Guide for Thinking Humans: first, second, third, and final. And her post on reasoning proper.
* Miles Brundage's thread summarizing the prospects of generalization.
* Ross Taylor's (previous interview guest) recent talk on reasoning.
* The inference-time compute tag on Interconnects.

Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts.

Transcript + Slides
Nathan [00:00:07]: Hey, everyone. Happy New Year. This is a quick talk that I gave at NeurIPS, the Latent Space unofficial industry event. Swyx tried to have people talk about the major topics of the year: scaling, open models, synthetic data, agents, etc. And he asked me to fill in a quick slot on reasoning. A couple notes. This was before o3 was announced by OpenAI, so I think you can take everything I said and run with it with even more enthusiasm and expect even more progress in 2025. And second, there were some recording issues, so I re-edited the slides to match up with the audio, so you might see that they're slightly off. But it mostly reads like a blog post, and it should do a good job getting the conversation started around reasoning on Interconnects in the new year. Happy New Year, and I hope you like this. Thanks.

I wouldn't say my main research area is reasoning. I would say that I came from a reinforcement learning background into language models, and reasoning is now getting subverted into that as a method rather than an area. And a lot of this is probably transitioning these talks into more provocative forms to prime everyone for the debate that is why most people are here. This is called the state of reasoning. This is by no means a comprehensive survey.
To continue, I wanted to make sure that I was not off base in thinking about this, because there are a lot of debates on reasoning, and I wanted to revisit a very basic definition. This is a dictionary definition: the action of thinking about something in a logical, sensible way, which is actually sufficiently vague that I would agree with it. As we'll see in a lot of this talk, I think people are going crazy about whether or not language models reason. We've seen this with AGI before, and now reasoning kind of seems like the same thing, which to me is pretty ridiculous, because reasoning is a very general skill, and I will provide more support for the argument that these language models are doing some sort of reasoning when you give them problems. I don't need to share a ton of examples of what are just ill-formed arguments for what language models are not doing, but it's tough that this is the case. And I think there are some very credible arguments that reasoning is a poor direction to pursue for language models, because language models are not going to be as good at it as humans. But to say that they can't do reasoning, I don't see a lot of proof for, and I'll go through a few examples. And the question is: why should language model reasoning be constrained to look like what humans do? I think language models are very different, and they are stochastic. The stochastic parrots thing is true for many reasons. We should embrace this, and we should continue. And I think a big trend of the year is that we're seeing new types of language model reasoning that look less human. And that can be good for separating the discourse from expecting a really narrow type of behavior. I did an interview with Ross Taylor, who was a reasoning lead at Meta, which I thought was a very good education for me on this. And this is just a direct pull from the transcript. But essentially what it's saying is: if you do chain of thought with a language model, what it is doing is essentially outputting its intermediate steps. If I were to ask you all a math problem right now, you could do most of them in your head, and you are doing some sort of intermediate storage of variables. Language models have no ability to do this. They are kind of per-token computation devices, where each token is outputted after doing one forward pass. And within that, there's no explicit structure to hold these intermediate states. So I think embracing chain of thought and these kinds of intermediate values for the language models is extremely reasonable. And it's showing that they're doing something that actually gets to valuable outputs.

Nathan [00:04:10]: So this is one of the many ways that we can kind of lead towards o1, which is that language models have randomness built into them. And a lot of what people see as failures in reasoning are these language models following very static chains and making very specific mistakes along the way, with really no ability to correct for that. This is really not something that we see in human reasoning. If a human makes a mistake, they will normally catch it on the next step. But we need to handle language models differently.

Nathan [00:04:41]: And why o1 is exciting is because it's a new type of language model that is going to maximize on this view of reasoning.
Which is that chain of thought, and kind of a forward stream of tokens, can actually do a lot to achieve better outcomes when you're doing a reasoning-like ability or reasoning-like action, which is just repeatedly outputting tokens to make progress on some sort of intelligence-defined task. So it's just making forward progress by spending more compute, and the token stream is the equivalent of some intermediate state.

Nathan [00:05:18]: What o1 is has been a large debate since its release. I'm not going to spend a lot of this talk on it. But the more time I've spent on it, the more I think you should take OpenAI at their face value, which is that they are doing very large-scale RL on verifiable outcomes ("on verifiable outcomes" is what I've added, especially in context of the RL API that they've released, which I'll talk about more). Most of the reasons to believe in more complicated things like process reward models, self-play, or Monte Carlo Tree Search are based on previous literature and things that we would have expected advanced reasoning to look like for language models, and not based on evidence that they have given us or the model's behavior, whether you're looking at evaluations or how inference is actually done when serving the model. This takes us to replications, or, as I would probably call them, relatives of o1 coming from the community. These are wonderful to see. We are exploring the boundaries of what we can do with chain of thought in models. The two I've highlighted are from DeepSeek and Qwen, and a lot of people in this room have probably seen them. And I think that these models are really substantially narrower than the full o1 models from OpenAI. With OpenAI's o1, you can use it for a lot more tasks. I was using the DeepSeek model, which is supposed to be for math or code, but they've tried to keep the model so narrow that even within that, if you ask a code question, sometimes it'll say it's only supposed to work on math or code. And a lot of the success of o1 and the future models like it is going to be about being able to handle more tasks and more domains. SemiAnalysis wrote a post that I haven't read in full, but even if you look at the paywalled headings, you can kind of make some intelligent claims about what o1 is or is not. I think these are two of the things from the table of contents that you can see without paying. I'm due to pay at some point, but I have not. One is incredible amounts of forward passes during training. I think you'll see this as I discuss RL finetuning models maybe more in a little bit. When you're doing RL, there are two ways that you see data many times, and that will result in many forward passes. One is that when you're doing RL on a prompt, you can sample many completions to then grade them or use them in different ways to update your policy. So if I ask one math problem, I could look at eight completions and choose the best one, or do some contrastive thing between the best and the worst one.
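A minimal sketch of that eight-completions pattern, with best-of-n selection and a best-vs-worst contrastive pair; `policy.generate` and the grader are hypothetical stand-ins, not any lab's actual implementation:

```python
def grade(completion: str, reference: str) -> float:
    """Stand-in grader: 1.0 if the completion ends with the reference
    answer, else 0.0. A reward model score could be used instead."""
    return 1.0 if completion.strip().endswith(reference) else 0.0

def sample_and_select(policy, prompt: str, reference: str, n: int = 8):
    """Sample n completions for one prompt and grade each; this repeated
    generation is where the 'incredible amounts of forward passes' go."""
    completions = [policy.generate(prompt) for _ in range(n)]
    scored = sorted(
        ((grade(c, reference), c) for c in completions),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best = scored[0][1]    # best-of-n: e.g., keep for SFT-style updates
    worst = scored[-1][1]  # (best, worst): a contrastive pair, DPO-style
    return best, worst
```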
Original post: https://www.interconnects.ai/p/2024-interconnects-year-in-review

Get full access to Interconnects at www.interconnects.ai/subscribe
Original post: https://www.interconnects.ai/p/openais-o3-the-2024-finale-of-ai

Chapters
00:00 Introduction
02:51 o3 overview
05:57 Solving the Abstraction and Reasoning Corpus (ARC)
10:41 o3's architecture, cost, and training (hint: still no tree search)
16:36 2024: RL returns

Figures
Fig 1, Frontier Math results
Fig 2, Coding results
Fig 3, ARC AGI results
Fig 4, ARC AGI result details
Fig 5, ARC AGI example 1
Fig 6, ARC AGI example in text
Fig 7, ARC AGI example "easy"

Get full access to Interconnects at www.interconnects.ai/subscribe