Teaching AI to See: A Technical Deep-Dive on Vision Language Models with Will Hardman of Veratai

Update: 2025-01-03

Description

In this episode of The Cognitive Revolution, Nathan hosts Will Hardman, founder of AI advisory firm Veratai, for a comprehensive technical survey of vision language models (VLMs). We explore the evolution of VLMs from early vision transformers to state-of-the-art architectures like InternVL and Llama3V, examining key innovations and architectural decisions. Join us for an in-depth discussion covering multimodality in AI systems, evaluation frameworks, and practical applications with one of the field's leading experts.

Here's the link to one of the most comprehensive reference documents for VLMs, prepared by Will Hardman: https://dust-mailbox-c73.notion.site/Vision-Language-Models-11b675d75dd480af994cc474a754bb26

Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse

SPONSORS:

Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive

80,000 Hours: 80,000 Hours is dedicated to helping you find a fulfilling career that makes a difference. With nearly a decade of research, they offer in-depth material on AI risks, AI policy, and AI safety research. Explore their articles, career reviews, and a podcast featuring experts like Anthropic CEO Dario Amodei. Everything is free, including their Career Guide. Visit https://80000hours.org/cognitiverevolution to start making a meaningful impact today.

CHAPTERS:

(00:00:00) Teaser

(00:00:55) About the Episode

(00:05:45) Introduction

(00:09:16) VLM Use Cases

(00:13:47) Vision Transformers (Part 1)

(00:17:48) Sponsors: Oracle Cloud Infrastructure (OCI)

(00:19:00) Vision Transformers (Part 2)

(00:24:58) OpenAI's CLIP Model

(00:33:44) DeepMind's Flamingo (Part 1)

(00:33:44) Sponsors: 80,000 Hours

(00:35:17) DeepMind's Flamingo (Part 2)

(00:48:29) Instruction Tuning with LLaVA

(01:09:25) MMMU Benchmark

(01:14:42) Pre-training with Qwen-VL

(01:32:13) InternVL Model Series

(01:52:33) Cross-Attention vs. Self-Attention

(02:14:33) Hybrid Architectures

(02:31:08) Early vs. Late Fusion

(02:34:50) VQA and DocVQA Benchmarks

(02:40:08) The BLINK Benchmark

(03:05:37) Generative Pre-training

(03:15:26) Multimodal Generation

(03:37:00) Frontier Labs & Benchmarks

(03:47:45) Conclusion

(03:53:28) Outro

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://www.linkedin.com/in/nathanlabenz/

YouTube: https://www.youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk

Erik Torenberg, Nathan Labenz