Using the Smartest AI to Rate Other AI

Update: 2025-04-19

Description

In this episode, I walk through a Fabric Pattern that assesses how well a given model does on a task relative to humans. This system uses your smartest AI model to evaluate the performance of other AIs—by scoring them across a range of tasks and comparing them to human intelligence levels.

I talk about:

1. Using One AI to Evaluate Another
The core idea is simple: use your most capable model (like Claude 3 Opus or GPT-4) to judge the outputs of another model (like GPT-3.5 or Haiku) against a task and input. This gives you a way to benchmark quality without manual review.

2. A Human-Centric Grading System
Models are scored on a human scale—from “uneducated” and “high school” up to “PhD” and “world-class human.” Stronger models consistently rate higher, while weaker ones rank lower—just as expected.

3. Custom Prompts That Push for Deeper Evaluation
The rating prompt includes instructions to emulate a 16,000+ dimensional scoring system, using expert-level heuristics and attention to nuance. The system also asks the evaluator to describe what would have been required to score higher, making this a meta-feedback loop for improving future performance.

Note: This episode was recorded a few months ago, so the AI models mentioned may not be the latest—but the framework and methodology still work perfectly with current models.

Subscribe to the newsletter at:
https://danielmiessler.com/subscribe

Join the UL community at:
https://danielmiessler.com/upgrade

Follow on X:
https://x.com/danielmiessler

Follow on LinkedIn:
https://www.linkedin.com/in/danielmiessler

See you in the next one!

Become a Member: https://danielmiessler.com/upgrade

See omnystudio.com/listener for privacy information.

Comments

In Channel

A Conversation with Michael Brown About Designing AI Systems

2025-08-2250:06

UL NO. 494: STANDARD EDITION | AI Finds a P1, I Missed Chartbeat So I Made My Own, XBow Open-Sources Their AI Bot, and more...

2025-08-2101:38:09

A Conversation With Sarit Tager from Prisma Cloud

2025-07-2925:31

UL NO. 489: STANDARD EDITION | My personal toolchain updates, Google tracking through DuckDuckGo, Anthropic’s Pentagon Deal, Grok4 NSFW, Substack Crushes WSJ, and more...

2025-07-1722:01

UL NO. 488: STANDARD EDITION | Google Granting Confusing Access to Gemini, A New Favorite Creator, Russia's new Autonomous Drones, Claude Code Madness and Neovim Config, and more...

2025-07-1030:11

UL NO. 487: STANDARD EDITION: Iranian Critical Infra Attacks, Insane Recent Productivity, A Chinese Mosquito Drone, Marcus's Response to Our AI Debate, "Context Engineering" Ain't It, and more...

2025-07-0241:31

An AI Debate with Marcus Hutchins

2025-06-2602:00:04

UL NO. 486 STANDARD EDITION: Fully Automated AI Malware (Binary and Web), My Debate with Marcus Hutchins on AI and more

2025-06-2655:03

UL NO. 485: STANDARD EDITION: Netflix RCE, My Current AI Stack, All-in on Claude Code, and more...

2025-06-1936:45

UL NO. 484: STANDARD EDITION: OpenAI's Malicious AI Report, Disappointed with WWDC, AI's First Actual Science Breakthrough, and more...

2025-06-1243:31

UL NO. 483 | STANDARD EDITION: A Chrome 0-Day, Meta Automates Security Assessments, New Essays, My New Video on Hacking with AI, Ukraine's Asymmetrical Attack, Thoughts on My AI Skeptical Friends, The Dangers of Winning the Wrong Game, and more...

2025-06-0531:39

The Future of Hacking is Context

2025-06-0333:45

UL NO. 482 | STANDARD EDITION: AI Finds an 0-Day!, Postman Leaking Secrets, High Agency Mental Model, My Unified Entity Context Video, Github MCP Leaks Private Repos, Google vs. OpenAI vs. Apple on AI Vision, and more...

2025-05-3031:33