Episode 7: Guest Kurt Bollacker from MLCommons on how to evaluate AI systems
Description
The podcast returns this week as Brian and Sarah interview Kurt Bollacker, a director of engineering at MLCommons. Ever wonder how to begin testing an AI model? What standards matter? The discussion spans history and model types, with a quick look at recent work out of Stanford.
In this week’s news, we get into OpenAI’s relevance for media and wearables with its new Axel Springer deal and reported talks about working with Snap. Listen in, and play along with this week’s festive game of “Two truths and l’AI.”
Plus, thanks and full credit to Cory Skaaren at Skaaren Design for our new visual branding.
Timestamps for this episode:
[0:00] Intro
[0:20] Welcome
[0:50] OpenAI developments recap
[2:23] Axel Springer deal with OpenAI
[15:40] Snap’s reported talks with OpenAI to use its object recognition software for smart glasses
[17:49] “Two truths and l’AI”
[25:59] Interview with Kurt Bollacker
[27:50] Kurt’s journey to working on AI data sets and system evaluation
[32:14] What are the best standards for evaluating AI?
[38:00] Establishing what level of trust to put in AI output
[43:00] AI and machine learning history
[46:16] Predictions about the future of large language models
[50:05] The future of general-purpose AI
[53:35] The value of transparency and explainability
[55:30] Reviewing the Foundation Model Transparency Index from the Center for Research on Foundation Models at Stanford University
[1:00:46] MLCommons work on AI safety testing
[1:05:06] LLMs as “plausibility machines,” not “truth machines”
[1:07:42] How should the average person use this information?
[1:14:26] How the general public views data use for training AI
[1:15:58] Likelihood of a new AI winter soon
[1:21:49] The AI Safety working group
Links for topics referenced in this episode:
Axel Springer reaches deal with OpenAI for ChatGPT to use its news content (Axios): https://www.axios.com/2023/12/13/openai-chatgpt-axel-springer-news-deal
OpenAI talking to Snap about object recognition software use in smart glasses (The Information): https://www.theinformation.com/articles/tech-giants-chase-wearable-ai
The AI-powered Curio and Grimes toy called “Grok” (New York Post): https://nypost.com/2023/12/15/business/grimes-says-grok-toy-has-nothing-to-do-with-elon-musks-ai-bot/
AI-enabled robot pets helping vets (PetFoodIndustry.com): https://www.petfoodindustry.com/pet-ownership-statistics/article/15660175/aianimated-robot-pets-dont-eat-pet-food-or-want-treats
MLCommons: https://mlcommons.org/
Internet Archive: https://archive.org/
Water use when training AI models (Brian Warmoth): https://warmoth.org/2023/09/10/how-much-water-ai-is-going-to-drink/
The Foundation Model Transparency Index (Stanford University): https://crfm.stanford.edu/fmti/
The Holistic Evaluation of Language Models (HELM) paper (arxiv.org): https://arxiv.org/abs/2211.09110
MLCommons announcement about the formation of the AI Safety Working Group (MLCommons): https://mlcommons.org/2023/10/mlcommons-announces-the-formation-of-ai-safety-working-group/
The AI Artifacts Podcast’s visual branding is by Skaaren Design.
Music used in this podcast comes from "Vanishing Horizon" by Jason Shaw and is licensed under an Attribution 3.0 United States License.
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.aiartifacts.net