Deep Papers

<p>Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. </p>

LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods

We discuss a major survey of recent work and research on LLM-as-Judge. "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods" systematically examines the LLMs-as-Judge framework across five dimensions: functionality, methodology, applications, meta-evaluation, and limitations. The survey gives us a bird's-eye view of the framework's advantages and limitations, along with methods for evaluating its effectiveness. Read a breakdown on our blog: https://arize.com/blog/llm-a...

12-23
28:57

Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies

LLMs have revolutionized natural language processing, showcasing remarkable versatility and capabilities. But individual LLMs often exhibit distinct strengths and weaknesses, influenced by differences in their training corpora. This diversity poses a challenge: how can we maximize the efficiency and utility of LLMs? A new paper, "Merge, Ensemble, and Cooperate: A Survey on Collaborative Strategies in the Era of Large Language Models," highlights collaborative strategies to address this challen...

12-10
28:47

Agent-as-a-Judge: Evaluate Agents with Agents

This week, we break down the “Agent-as-a-Judge” framework—a new agent evaluation paradigm that’s kind of like getting robots to grade each other’s homework. Where typical evaluation methods focus solely on outcomes or demand extensive manual work, this approach uses agent systems to evaluate agent systems, offering intermediate feedback throughout the task-solving process. With the power to unlock scalable self-improvement, Agent-as-a-Judge could redefine how we measure and enhance agent perf...

11-23
24:54

Introduction to OpenAI's Realtime API

We break down OpenAI’s realtime API. Learn how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we walk through the API’s capabilities, potential use cases, and best practices for implementation. Learn more about AI observability and evaluation in our course, join the Arize AI Slack community or get the late...

11-12
29:56

Swarm: OpenAI's Experimental Approach to Multi-Agent Systems

As multi-agent systems grow in importance for fields ranging from customer support to autonomous decision-making, OpenAI has introduced Swarm, an experimental framework that simplifies the process of building and managing these systems. Swarm, a lightweight Python library, is designed for educational purposes, stripping away complex abstractions to reveal the foundational concepts of multi-agent architectures. In this podcast, we explore Swarm’s design, its practical applications, and how it ...

10-29
46:46

KV Cache Explained

In this episode, we dive into the intriguing mechanics behind why chat experiences with models like GPT often start slow but then rapidly pick up speed. The key? The KV cache. This essential but under-discussed component enables the seamless and snappy interactions we expect from modern AI systems. Harrison Chu breaks down how the KV cache works, how it relates to the transformer architecture, and why it's crucial for efficient AI responses. By the end of the episode, you'll have a clearer und...
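As a rough illustration of the idea discussed in the episode (our own toy sketch, not code from the show): with a KV cache, the keys and values for all earlier tokens are saved and reused, so each new token only costs one query row of attention.

```python
import numpy as np

def cached_attention_step(q_new, k_new, v_new, K_cache, V_cache):
    """One decoding step with a KV cache.

    q_new, k_new, v_new: (d,) projections for the newest token only.
    K_cache, V_cache: (t, d) keys/values saved from all earlier tokens.
    Returns the attention output for the new token plus updated caches.
    """
    K = np.vstack([K_cache, k_new[None, :]])  # reuse old keys, append one row
    V = np.vstack([V_cache, v_new[None, :]])
    scores = K @ q_new / np.sqrt(len(q_new))  # (t+1,) attention logits
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over all positions
    return w @ V, K, V                        # output for the new token
```

Without the cache, every step would recompute keys and values for the whole prefix; with it, the initial prefill pass is the slow part (the lag you notice at the start of a chat) and each subsequent token is cheap (the speed-up that follows).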

10-24
04:19

The Shrek Sampler: How Entropy-Based Sampling is Revolutionizing LLMs

In this byte-sized podcast, Harrison Chu, Director of Engineering at Arize, breaks down the "Shrek Sampler," an innovative entropy-based sampling technique that is transforming LLMs. Harrison talks about how this method improves upon traditional sampling strategies by leveraging entropy and varentropy to produce more dynamic and intelligent responses. Explore its potential to enhance open-source AI models and enable human-like reasoning in smaller language model...
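The core idea can be sketched in a few lines. This is a simplified illustration under our own assumptions (the threshold and temperature rule are invented for the example), not the actual sampler's logic:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def entropy_aware_sample(logits, rng, low_entropy=0.5, explore_temp=1.5):
    """Pick the next token based on how uncertain the model is.

    Low entropy  -> the model is confident: take the argmax token.
    High entropy -> the model is unsure: sample at a higher temperature.
    Varentropy (variance of surprisal) measures how uneven that
    uncertainty is; fuller samplers branch on it as well.
    """
    p = softmax(logits)
    surprisal = -np.log(p + 1e-12)
    entropy = float((p * surprisal).sum())
    varentropy = float((p * (surprisal - entropy) ** 2).sum())
    if entropy < low_entropy:
        return int(np.argmax(p)), entropy, varentropy
    q = softmax(logits / explore_temp)   # flatten the distribution a bit
    return int(rng.choice(len(q), p=q)), entropy, varentropy
```

The intuition: when the model already "knows" the next token, decode greedily; when the distribution is flat, explore rather than committing to a possibly wrong guess.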

10-16
03:31

Google's NotebookLM and the Future of AI-Generated Audio

This week, Aman Khan and Harrison Chu explore NotebookLM’s unique features, including its ability to generate realistic-sounding podcast episodes from text (but this podcast is very real!). They dive into some technical underpinnings of the product, specifically the SoundStorm model used for generating high-quality audio, and how it leverages a hierarchical residual vector quantization (RVQ) approach to maintain consistency in speaker voice and tone throughout long audio durations. The discussion...

10-15
43:28

Exploring OpenAI's o1-preview and o1-mini

OpenAI recently released its o1-preview, which it claims outperforms GPT-4o on a number of benchmarks. These models are designed to think more before answering and to handle complex tasks, especially science and math questions, better than OpenAI's other models. We take a closer look at the latest crop of o1 models, and we also highlight some research our team did to see how they stack up against Claude 3.5 Sonnet using a real-world use case. Read it on our blog: https://arize.com/blog/exp...

09-27
42:02

Breaking Down Reflection Tuning: Enhancing LLM Performance with Self-Learning

A recent announcement on X boasted a tuned model with pretty outstanding performance, and claimed these results were achieved through Reflection Tuning. However, people were unable to reproduce the results. We dive into this recent drama in the AI community as a jumping-off point for a discussion about Reflection 70B. In 2023, a paper was written about Reflection Tuning, and this new model (Reflection 70B) draws concepts from it. Reflection tuning is an optimization technique where models l...

09-19
26:54

Composable Interventions for Language Models

This week, we're excited to be joined by Kyle O'Brien, Applied Scientist at Microsoft, to discuss his most recent paper, Composable Interventions for Language Models. Kyle and his team present a new framework, composable interventions, that allows for the study of multiple interventions applied sequentially to the same language model. The discussion will cover their key findings from extensive experiments, revealing how different interventions—such as knowledge editing, model compression, and...

09-11
42:35

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

This week’s paper presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which they find to have a high inter-annotator agreement. The study includes nine judge models and nine exam-taker models – both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge p...

08-16
39:05

Breaking Down Meta's Llama 3 Herd of Models

Meta just released Llama 3.1 405B–according to them, it’s “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We’ll take a look at what the...

08-06
44:40

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic “prompt engineering.” The paper this week introduces LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. The researchers integrated their constructs into the recent DSPy programming model for LMs and present new strategies that allow DSPy to compile programs with LM Assertions...

07-23
33:57

RAFT: Adapting Language Model to Domain Specific RAG

Adapting LLMs to specialized domains (e.g., recent news, enterprise private documents) is essential for many applications, so we discuss a paper that asks how to adapt pre-trained LLMs for RAG in such domains. SallyAnn DeLucia is joined by Sai Kolasani, researcher at UC Berkeley’s RISE Lab (and Arize AI intern), to talk about his work on RAFT: Adapting Language Model to Domain Specific RAG. RAFT (Retrieval-Augmented FineTuning) is a training recipe that improves an LLM’s ability to answer questions ...

06-28
44:01

LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic

It’s been an exciting couple of weeks for GenAI! Join us as we take a closer look at the latest research from OpenAI and Anthropic: a significant step forward in understanding how LLMs work, with implications for a deeper understanding of the neural activity of language models. These two recent papers both focus on the sparse autoencoder--an unsupervised approach for extracting interpretabl...
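As a rough sketch of the idea (our own toy code, not either lab's implementation): a sparse autoencoder maps model activations into a wider, mostly-zero feature space and reconstructs them, with an L1 penalty encouraging each activation to be explained by only a few features.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into sparse features, then reconstruct them."""
    h = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps few features active
    x_hat = h @ W_dec + b_dec
    return h, x_hat

def sae_loss(x, h, x_hat, l1_coef=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    recon = float(((x - x_hat) ** 2).sum(axis=-1).mean())
    sparsity = float(np.abs(h).sum(axis=-1).mean())
    return recon + l1_coef * sparsity
```

The hidden layer is deliberately overcomplete (more features than activation dimensions); the sparsity penalty is what pushes individual features toward being monosemantic and human-interpretable.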

06-14
44:00

Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment

We break down the paper--Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment. Ensuring alignment (aka making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, this paper presents a comprehensive ...

05-30
48:07

Breaking Down EvalGen: Who Validates the Validators?

Due to the cumbersome nature of human evaluation and the limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators often inherit the problems of the LLMs they evaluate, requiring further human validation. This week’s paper explores EvalGen, a mixed-initiative approach to aligning LLM-generated evaluation functions with human preferences. EvalGen assists users in developing both criteria acc...

05-13
44:47

Keys To Understanding ReAct: Synergizing Reasoning and Acting in Language Models

This week we explore ReAct, an approach that enhances the reasoning and decision-making capabilities of LLMs by combining step-by-step reasoning with the ability to take actions and gather information from external sources in a unified framework. Learn more about AI observability and evaluation in our course, join the Arize AI Slack community or get the latest on LinkedIn and X.
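A minimal sketch of the ReAct loop (our own illustration with hypothetical tool and model names; real implementations add prompting, output parsing, and error handling): the model alternates between emitting an Action, whose tool result is fed back as an Observation, and a Final Answer.

```python
def react_loop(question, model, tools, max_steps=5):
    """Interleave reasoning and acting: the model emits either an
    Action (run a tool, feed back an Observation) or a Final Answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)          # e.g. "Action: search <query>"
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            tool_name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            observation = tools[tool_name](arg)   # gather external info
            transcript += f"Observation: {observation}\n"
    return None  # gave up within the step budget
```

The unified transcript is the key design choice: because reasoning traces, actions, and observations share one context, the model can revise its plan based on what it just learned from the outside world.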

04-26
45:07

Demystifying Chronos: Learning the Language of Time Series

This week, we’re covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained with billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts matching or exceeding purpose-built models. We dive into time series forecasting, some recent research our te...

04-04
44:40
