The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

Update: 2024-11-28

Description

We have announced our first speaker, friend of the show Dylan Patel, and topic slates for Latent Space LIVE! at NeurIPS. Sign up for IRL/Livestream and to debate!

We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!

The vibe shift we observed in July in favor of Claude 3.5 Sonnet (first introduced in June) has been remarkably long-lived, surviving multiple subsequent updates of 4o, o1 and Gemini. Anthropic’s Claude ends 2024 as the preferred model for AI Engineers and is even the exclusive choice for new code agents like bolt.new (our next guest on the pod!), which unlocked so much performance from Claude Sonnet that it went from $0 to $4m ARR in 4 weeks when it launched last month.

Anthropic has now raised an additional $4b from Amazon and made an incredibly well received update of Claude 3.5 Sonnet (and Haiku), making significant improvements in performance over its predecessors:

Solving SWE-Bench

As part of the October Sonnet release, Anthropic teased a blink-and-you’ll-miss-it result:

The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.

This was followed up by a blogpost a week later from today’s guest, Erik Schluntz, the engineer who implemented and scored this SOTA result using a simple, non-overengineered version of the SWE-Agent framework (you can see the submissions here). We have previously covered the SWE-Bench story extensively:

* Speaking with SWEBench/SWEAgent authors at ICLR

* Speaking with Cosine Genie, the previous SOTA (43.8%) on SWEBench Verified (with brief update at DevDay 2024)

* Speaking with Shunyu Yao on SWEBench and the ReAct paradigm driving SWE-Agent

One notable inclusion in this blogpost is the set of tools that Erik decided to give Claude, e.g. the “Edit Tool”:
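The blogpost's actual tool definition is not reproduced in these notes. As a rough illustration only, here is a minimal sketch of what a string-replacement style edit tool for a coding agent might look like. The function name `str_replace_edit`, its parameters, and its error messages are all hypothetical and are not Anthropic's implementation. The idea it shows is that the tool refuses ambiguous edits and returns plain-text feedback the model can act on.

```python
import tempfile
from pathlib import Path


def str_replace_edit(path: Path, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in the file at `path`.

    Hypothetical helper for illustration, not Anthropic's tool.
    Returns a short message intended to be read by the model.
    """
    text = path.read_text()
    count = text.count(old)
    if count == 0:
        # Nothing to replace: tell the model instead of silently succeeding.
        return f"Error: no match found in {path.name}"
    if count > 1:
        # Ambiguous edit: ask for more surrounding context.
        return f"Error: {count} matches in {path.name}; include more context"
    path.write_text(text.replace(old, new, 1))
    return f"Edited {path.name}"


if __name__ == "__main__":
    demo = Path(tempfile.mkdtemp()) / "demo.py"
    demo.write_text("x = 1\ny = 1\n")
    print(str_replace_edit(demo, "x = 1", "x = 2"))  # unique match, edit applied
    print(str_replace_edit(demo, " = ", " == "))     # two matches, edit refused
```

Requiring a unique match is one possible design choice: it pushes the model to quote enough context to pin down the edit location, rather than relying on line numbers that shift as the file changes.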

The tools teased in the SWEBench submission/blogpost were then polished up and released with Computer Use…

And you can also see even more computer use tools given in the new Model Context Protocol servers:

Claude Computer Use

Because it is one of the best received AI releases of the year, we recommend watching the 2 minute Computer Use intro (and related demos) in its entirety:

Erik also worked on Claude’s function calling, tool use, and computer use APIs, so we discuss that in the episode.

Erik [00:53:39 ]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.

As you’ll see, this is very top of mind for Erik as a former Robotics founder whose company basically used robots to interface with human physical systems like elevators.

Full Video episode

Please like and subscribe!

Show Notes

* Erik Schluntz

* “Raising the bar on SWE-Bench Verified”

* Cobalt Robotics

* SWE-Bench

* SWE-Bench Verified

* HumanEval & other benchmarks

* Anthropic Workbench

* Aider

* Cursor

* Fireworks AI

* E2B

* Amanda Askell

* Toyota Research

* Physical Intelligence (Pi)

* Chelsea Finn

* Josh Albrecht

* Eric Jang

* 1X

* Dust

* Bolt

Timestamps

* [00:00:00 ] Introductions

* [00:03:39 ] What is SWE-Bench?

* [00:12:22 ] SWE-Bench vs HumanEval vs others

* [00:15:21 ] SWE-Agent architecture and runtime

* [00:21:18 ] Do you need code indexing?

* [00:24:50 ] Giving the agent tools

* [00:27:47 ] Sandboxing for coding agents

* [00:29:16 ] Why not write tests?

* [00:30:31 ] Redesigning engineering tools for LLMs

* [00:35:53 ] Multi-agent systems

* [00:37:52 ] Why XML so good?

* [00:42:57 ] Thoughts on agent frameworks

* [00:45:12 ] How many turns can an agent do?

* [00:47:12 ] Using multiple model types

* [00:51:40 ] Computer use and agent use cases

* [00:59:04 ] State of AI robotics

* [01:04:24 ] Robotics in manufacturing

* [01:05:01 ] Hardware challenges in robotics

* [01:09:21 ] Is self-driving a good business?

Transcript

Alessio [00:00:00 ]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners. And today we're in the new studio with my usual co-host, Shawn from Smol AI.

Swyx [00:00:14 ]: Hey, and today we're very blessed to have Erik Schluntz from Anthropic with us. Welcome.