Everything you need to know about LLM benchmarks: Turing Test, OpenAI's HealthBench, ARC Prize, LM Arena

Update: 2025-08-22

Description

For as long as there has been AI, there have been benchmarks. From the Turing test, to society-changing benchmarks like MNIST and ImageNet, to modern challenges like the ARC Prize, benchmarks have served a vital purpose: measuring the performance of AI models. But something has shifted. In the LLM era, have benchmarks lost their utility and become mere advertising for big tech?

Even seemingly more sophisticated benchmarks like LM Arena can be gamed by tech giants. We also take a deep dive into healthcare benchmarks like OpenAI's HealthBench (deeply problematic) and Microsoft's MAI-DxO orchestrator agent for diagnosis. Where is this all going? How do we make the perfect benchmark? Or is the real work done afterwards, in the real world?

👋 Hey! If you are enjoying our conversations, reach out, share your thoughts and journey with us. Don't forget to subscribe whilst you're here :)

---

Timestamps
00:00 Intro - The OG benchmarks - Turing test, MNIST, ImageNet
06:40 Are large language model benchmarks similar to humans taking tests?
10:05 Are we testing model capability or production readiness?
12:00 LLM era - data contamination
15:30 LM Arena - The leaderboard illusion paper - how big tech games benchmarks
28:35 Goodhart's law - When a measure becomes a target, it ceases to be a good measure
32:05 Some good benchmarks - games - Pokemon, ARC prize, Minecraft
34:35 Medical benchmarks - OpenAI's HealthBench has some big problems
46:50 Microsoft's MAI-DxO orchestrator for case reports

---

Connect with Us

Your Hosts:
👨🏻‍⚕️ Doc - Dr. Joshua Au Yeung - LinkedIn
🤖 Dev - Zeljko Kraljevic - Twitter

Follow & Subscribe:
YT: https://youtube.com/@DevAndDoc
Spotify: Follow us on Spotify
Apple Podcasts: Listen on Apple Podcasts
Substack: https://aiforhealthcare.substack.com/

For enquiries:
📧 Devanddoc@gmail.com

---

Production Credits
🎞️ Editor: Dragan Kraljević - Instagram
🎨 Brand & Art: Ana Grigorovici - Behance
