Running Generative AI Models In Production

Update: 2024-10-28

Description

Summary
In this episode, Philip Kiely from Baseten talks about the intricacies of running open models in production. Philip shares his journey into AI and ML engineering, highlighting the importance of understanding product-level requirements and selecting the right model for deployment. The conversation covers the operational aspects of deploying AI models, including model evaluation, compound AI, and model serving frameworks such as TensorFlow Serving and AWS SageMaker. Philip also discusses the challenges of model quantization, rapid model evolution, and monitoring and observability in AI systems, and offers insights into future trends in AI, including local inference and the competition between open source and proprietary models.

Announcements

Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems.
Your host is Tobias Macey, and today I'm interviewing Philip Kiely about running open models in production.

Interview

Introduction
How did you get involved in machine learning?
Can you start by giving an overview of the major decisions to be made when planning the deployment of a generative AI model?
How does the model selected in the beginning of the process influence the downstream choices?
In terms of application architecture, the major patterns that I've seen are RAG, fine-tuning, multi-agent, or large model. What are the most common methods that you see? (and any that I failed to mention)
- How has the rapid succession of model generations impacted the ways that teams think about their overall application? (capabilities, features, architecture, etc.)
In terms of model serving, I know that Baseten created Truss. What are some of the other notable options that teams are building with?
- What is the role of the serving framework in the context of the application?
There are also a large number of inference engines that have been released. What are the major players in that arena?
- What are the features and capabilities that they are each basing their competitive advantage on?
For someone who is new to AI Engineering, what are some heuristics that you would recommend when choosing an inference engine?
Once a model (or set of models) is in production and serving traffic, it's necessary to have visibility into how it is performing. What are the key metrics that are necessary to monitor for generative AI systems?
- In the event that one (or more) metrics are trending negatively, what are the levers that teams can pull to improve them?
When running models constructed with, e.g., linear regression or deep learning, there was a common issue with "concept drift". How does that manifest in the context of large language models, particularly when coupled with performance optimization?
What are the most interesting, innovative, or unexpected ways that you have seen teams manage the serving of open gen AI models?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with generative AI model serving?
When is Baseten the wrong choice?
What are the future trends and technology investments that you are focused on in the space of AI model serving?

Contact Info

Parting Question

From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

Baseten
- Podcast Episode
Copyleft
Llama Models
Nomic
Olmo
Allen Institute for AI
Playground 2
The Peace Dividend Of The SaaS Wars
Vercel
Netlify
RAG == Retrieval Augmented Generation
- Podcast Episode
Compound AI
Langchain
Outlines: structured output for AI systems
Truss
Chains
Llamaindex
Ray
MLFlow
Cog (Replicate): containers for ML
BentoML
Django
WSGI
uWSGI
Gunicorn
Zapier
vLLM
TensorRT-LLM
TensorRT
Quantization
LoRA: Low Rank Adaptation of Large Language Models
Pruning
Distillation
Grafana
Speculative Decoding
Groq
Runpod
Lambda Labs

The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0


