This episode of the TWIML AI Podcast features Arvind Narayanan, a professor at Princeton University, discussing the challenges and opportunities of AI agents. Arvind highlights the "capability-reliability gap," where agents demonstrate impressive potential but struggle to achieve consistent real-world performance. He explores the need for better benchmarking and evaluation methods, emphasizing the importance of cost considerations alongside accuracy.

Arvind shares his background as a computer scientist and director of the Center for Information Technology Policy at Princeton. He discusses his research interests, including AI agents, AI bias, and advising policymakers on AI regulation. He emphasizes the importance of academic research in areas like benchmarking, where companies may not have the incentive to invest.

Arvind delves into the complexities of benchmarking AI agents, particularly in the context of foundation models. He highlights the difficulty of creating realistic simulations of real-world tasks and the risk that benchmarks encourage brittle agents that perform well in simulation but fail in real-world applications.

Arvind discusses the paradox of AI agents: while their capabilities are impressive, achieving reliable performance in real-world settings remains a significant challenge. He uses the example of self-driving cars to illustrate how, even with promising prototypes, widespread adoption requires overcoming numerous practical hurdles.

Arvind emphasizes the need to consider cost alongside accuracy when evaluating AI agents. He argues that focusing on accuracy alone can be misleading, because a system can reach high accuracy by repeatedly sampling solutions until a correct one is found. He advocates using Pareto curves to visualize the trade-off between cost and performance.

Arvind shares a personal anecdote about using an AI-generated app to teach his five-year-old daughter how to tell time.
This example highlights the potential of AI agents for one-time-use applications, where perfect reliability is not essential.

Arvind discusses the concept of "verifiers" as a way to create guardrails for AI agents. He explains how verifiers, such as unit tests for coding agents, can be used to check agent outputs before they are accepted. He suggests that domain-specific verifiers could be used to build agents that achieve high reliability in specific domains.

Arvind argues that the development of AI agents is as much a political problem as a technical one. He discusses companies' resistance to letting agents access their websites and data, highlighting the need for societal changes to facilitate the integration of AI agents into existing systems.

Arvind explores the definition of AI agents, acknowledging the ambiguity surrounding the term. He argues that the concept is meaningful because it represents a shift from simple zero-shot prompting to more complex AI systems that require scaffolding and autonomy. He identifies three key factors in defining agents: environmental complexity, task difficulty, and design patterns.

Arvind discusses his team's work on a benchmark called CORE-Bench, which focuses on computational reproducibility in scientific research. The benchmark aims to evaluate AI agents' ability to automate the process of reproducing scientific results, a task that is often time-consuming and challenging for human researchers.

Arvind shares his observations about the strengths and weaknesses of current AI agents. He suggests that agents are better at tasks that are well represented on the web, while tasks involving long sequences of interactions, such as shell commands or turn-taking, pose greater challenges.

Arvind discusses three approaches to improving reasoning capabilities in AI models: scaling up models, using inference-time methods like neuro-symbolic AI, and fine-tuning models during training.
He highlights the potential of each approach and suggests that the hybrid approach employed by models like OpenAI's o1 (code-named Strawberry) is particularly promising.

Arvind discusses the concept of "AI snake oil" and how it applies to the current state of AI. He acknowledges that while many AI products are based on legitimate research and development, some companies make exaggerated claims about their capabilities. He highlights examples of problematic claims, such as those made about GPT-4's performance on standardized tests, and the use of basic statistical models rebranded as AI.

Arvind discusses the challenges of regulating AI, acknowledging the rapid pace of technological development and the lack of technical expertise among policymakers. He argues that regulation should focus on human behavior around AI rather than the technical details of specific models. He emphasizes the importance of enforcing existing laws and the need for evidence-based policymaking.

Arvind discusses the debate surrounding AI safety, arguing that while researchers should study potential risks, policy should focus on known risks and be evidence-based. He cautions against overreacting to hypothetical risks and suggests that more evidence is needed before taking drastic measures to curb AI development.

Arvind distinguishes between catastrophic and non-catastrophic AI risks. He argues that while AI can be used to discriminate against individuals, this is not a catastrophic risk from a policy perspective. He considers the potential for adversaries to take over critical infrastructure a more serious and potentially catastrophic risk, requiring a precautionary approach.
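
The point above about weighing cost alongside accuracy can be made concrete with a small Pareto-frontier computation. The following is only a minimal sketch: the agent names, costs, and accuracies are invented for illustration and do not come from the episode or from any benchmark, and `AgentResult` and `pareto_frontier` are hypothetical helper names.

```python
from typing import List, NamedTuple


class AgentResult(NamedTuple):
    name: str
    cost_usd: float   # average cost per task (made-up numbers)
    accuracy: float   # fraction of tasks solved (made-up numbers)


def pareto_frontier(results: List[AgentResult]) -> List[AgentResult]:
    """Return the agents that no cheaper-or-equal agent matches or beats on accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher accuracy first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.accuracy)):
        # Keep an agent only if it is strictly more accurate than every
        # agent that costs the same or less.
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Hypothetical results, purely illustrative.
RESULTS = [
    AgentResult("zero_shot", 0.10, 0.60),
    AgentResult("retry_5x", 0.50, 0.75),
    AgentResult("wasteful", 1.00, 0.55),
    AgentResult("complex_agent", 2.00, 0.74),
    AgentResult("debate_agent", 5.00, 0.80),
]

for r in pareto_frontier(RESULTS):
    print(f"{r.name}: ${r.cost_usd:.2f} per task, {r.accuracy:.0%} accuracy")
```

In this made-up data, `complex_agent` is nearly as accurate as `retry_5x` but costs four times as much, so it falls off the frontier. A leaderboard that ranks by accuracy alone would hide that difference.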


Today, we're joined by Arvind Narayanan, professor of Computer Science at Princeton University, to discuss his recent works, AI Agents That Matter and AI Snake Oil. In "AI Agents That Matter", we explore the range of agentic behaviors, the challenges in benchmarking agents, and the 'capability and reliability gap', which creates risks when deploying AI agents in real-world applications. We also discuss the importance of verifiers as a technique for safeguarding agent behavior. We then dig into the AI Snake Oil book, which uncovers examples of problematic and overhyped claims in AI. Arvind shares various use cases of failed applications of AI, outlines a taxonomy of AI risks, and shares his insights on AI's catastrophic risks. We also touch on different approaches to LLM-based reasoning, his views on tech policy and regulation, and his work on CORE-Bench, a benchmark designed to measure AI agents' accuracy in computational reproducibility tasks.
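
As a rough illustration of the verifier idea mentioned above, the sketch below wraps a stand-in "agent" in a unit-test style check and counts how many attempts it takes before an output is accepted. Everything here is an assumption made for illustration: `fake_agent` is a hard-coded generator rather than a real model, and `verifier` and `solve_with_verifier` are hypothetical names, not code from the episode or from CORE-Bench.

```python
from itertools import islice
from typing import Callable, Iterable, Optional, Tuple

Candidate = Callable[[int, int], int]


def verifier(candidate: Candidate) -> bool:
    """Unit-test style check: accept a candidate only if every case passes."""
    cases = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]
    try:
        return all(candidate(*args) == expected for args, expected in cases)
    except Exception:
        # A candidate that crashes is treated as a failed attempt.
        return False


def fake_agent() -> Iterable[Candidate]:
    """Stand-in for a model asked to write add(a, b); the first two tries are wrong."""
    yield lambda a, b: a - b
    yield lambda a, b: a * b
    yield lambda a, b: a + b


def solve_with_verifier(
    candidates: Iterable[Candidate],
    verify: Callable[[Candidate], bool],
    max_attempts: int = 5,
) -> Tuple[Optional[Candidate], int]:
    """Return the first verified candidate and the number of attempts used."""
    attempts = 0
    for candidate in islice(candidates, max_attempts):
        attempts += 1
        if verify(candidate):
            return candidate, attempts
    # Nothing passed within the budget: report failure instead of guessing.
    return None, attempts


solution, attempts = solve_with_verifier(fake_agent(), verifier)
print("verified:", solution is not None, "| attempts used:", attempts)
```

The attempt count is returned alongside the result because resampling until a check passes raises the success rate while also raising cost, which is why the two are worth reporting together.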

