SimpleQA

Update: 2024-10-31

Description

❓ Measuring short-form factuality in large language models

The paper introduces SimpleQA, a benchmark for evaluating the factuality of large language models. It consists of over 4,000 short, fact-seeking questions that are designed to be challenging for advanced models and that each have a single, indisputable answer. The authors argue that SimpleQA is useful for assessing whether models "know what they know", that is, whether they answer correctly when they express high confidence. They also study calibration in two ways: the correlation between a model's confidence and its accuracy, and the consistency of its responses when the same question is posed multiple times. The authors conclude that SimpleQA provides a practical framework for evaluating factuality and hope it encourages the development of more trustworthy and reliable models.
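The two calibration checks mentioned above can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function names, the `(confidence, is_correct)` record format, the bin count, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def calibration_by_bin(records, n_bins=5):
    """Group answers by stated confidence and compare with accuracy.

    records: iterable of (stated_confidence in [0, 1], is_correct bool).
    Returns {bin_index: (mean_confidence, accuracy, count)}.
    Hypothetical helper, not part of any SimpleQA release.
    """
    bins = defaultdict(list)
    for confidence, is_correct in records:
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    summary = {}
    for index, items in sorted(bins.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        summary[index] = (mean_conf, accuracy, len(items))
    return summary


def answer_consistency(answers):
    """Fraction of repeated samples that match the most common answer."""
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


if __name__ == "__main__":
    # Toy data (assumed): two confident answers, two unconfident ones.
    toy_records = [(0.9, True), (0.95, True), (0.1, False), (0.15, True)]
    for index, (conf, acc, count) in calibration_by_bin(toy_records).items():
        print(f"bin {index}: mean confidence {conf:.2f}, accuracy {acc:.2f}, n={count}")

    # The same question asked four times (assumed sample answers).
    print(answer_consistency(["Paris", "paris ", "Lyon", "Paris"]))
```

For a well-calibrated model, the mean confidence and the accuracy within each bin would be close to each other, and questions with a high consistency score would be answered correctly more often than questions with a low one.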

📎 Link to paper
🌐 Read their blog

In Channel

Marco-o1

2024-11-23 (14:47)

Scaling Laws for Precision

2024-11-18 (18:39)

Test-Time Training

2024-11-14 (14:38)

Qwen2.5-Coder

2024-11-12 (24:03)

Attacking Vision-Language Computer Agents via Pop-ups

2024-11-09 (21:39)

Number Cookbook

2024-11-08 (16:11)

Jigsaw Puzzles

2024-11-07 (16:44)

Multi-expert Prompting with LLMs

2024-11-05 (12:41)

Investigating the Role of Prompting and External Tools in Hallucination Rates of LLMs

2024-11-03 (16:03)

Mind Your Step (by Step)

2024-11-02 (16:44)

SimpleQA

2024-10-31 (17:33)

GPT-4o System Card

2024-10-30 (24:23)

Mixture of Parrots

2024-10-29 (10:51)

Improve Vision Language Model Chain-of-thought Reasoning

2024-10-28 (15:44)

Breaking the Memory Barrier

2024-10-27 (15:33)

LLMs Reflect the Ideology of their Creators

2024-10-26 (11:09)

LongRAG

2024-10-25 (18:07)

A Theoretical Understanding of Chain-of-Thought

2024-10-24 (09:56)

A Survey on Data Synthesis and Augmentation for Large Language Models

2024-10-23 (21:21)

Revealing the Barriers of Language Agents in Planning

2024-10-22 (08:56)

Shahriar Shariati
