LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods

Update: 2024-12-23

Description

We discuss a major survey of LLM-as-Judge research from the last few years. "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods" systematically examines the LLMs-as-Judges framework across five dimensions: functionality, methodology, applications, meta-evaluation, and limitations. The survey gives a bird's-eye view of the framework's advantages and limitations, and of the methods for evaluating its effectiveness.

Read a breakdown on our blog: https://arize.com/blog/llm-as-judge-survey-paper/

To learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.

In Channel

TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

2025-11-24 · 23:44

Meta AI Researcher Explains ARE and Gaia2: Scaling Up Agent Environments and Evaluations

2025-11-10 · 22:34

Georgia Tech's Santosh Vempala Explains Why Language Models Hallucinate, His Research With OpenAI

2025-10-14 · 31:24

Atropos Health’s Arjun Mukerji, PhD, Explains RWESummary: A Framework and Test for Choosing LLMs to Summarize Real-World Evidence (RWE) Studies

2025-09-22 · 26:22

Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon Walks Us Through His New Paper

2025-09-06 · 48:11

Small Language Models are the Future of Agentic AI

2025-09-05 · 31:15

Watermarking for LLMs and Image Models

2025-07-30 · 42:56

Self-Adapting Language Models: Paper Authors Discuss Implications

2025-07-08 · 31:26

The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning

2025-06-20 · 30:35

Accurate KV Cache Quantization with Outlier Tokens Tracing

2025-06-04 · 25:11

Scalable Chain of Thoughts via Elastic Reasoning

2025-05-16 · 28:54

Sleep-time Compute: Beyond Inference Scaling at Test-time

2025-05-02 · 30:24

LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection

2025-04-18 · 27:19

AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam

2025-04-04 · 26:11

Model Context Protocol (MCP)

2025-03-25 · 15:03

AI Roundup: DeepSeek’s Big Moves, Claude 3.7, and the Latest Breakthroughs

2025-03-01 · 30:23

How DeepSeek is Pushing the Boundaries of AI Development

2025-02-21 · 29:54

Multiagent Finetuning: A Conversation with Researcher Yilun Du

2025-02-04 · 30:03

Training Large Language Models to Reason in Continuous Latent Space

2025-01-14 · 24:58

LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods

2024-12-23 · 28:57

Arize AI