Running Generative AI Models In Production
Update: 2024-10-28
Description
Summary
In this episode, Philip Kiely from Baseten talks about the intricacies of running open models in production. Philip shares his journey into AI and ML engineering, highlighting the importance of understanding product-level requirements and selecting the right model for deployment. The conversation covers the operational aspects of deploying AI models, including model evaluation, compound AI, and model serving frameworks such as TensorFlow Serving and AWS SageMaker. Philip also discusses the challenges of model quantization, rapid model evolution, and monitoring and observability in AI systems, and offers insight into future trends in AI, including local inference and the competition between open source and proprietary models.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Your host is Tobias Macey and today I'm interviewing Philip Kiely about running open models in production
Interview
- Introduction
- How did you get involved in machine learning?
- Can you start by giving an overview of the major decisions to be made when planning the deployment of a generative AI model?
- How does the model selected in the beginning of the process influence the downstream choices?
- In terms of application architecture, the major patterns that I've seen are RAG, fine-tuning, multi-agent, or large model. What are the most common methods that you see? (and any that I failed to mention)
- How has the rapid succession of model generations impacted the ways that teams think about their overall application? (capabilities, features, architecture, etc.)
- In terms of model serving, I know that Baseten created Truss. What are some of the other notable options that teams are building with?
- What is the role of the serving framework in the context of the application?
- There are also a large number of inference engines that have been released. What are the major players in that arena?
- What are the features and capabilities that they are each basing their competitive advantage on?
- For someone who is new to AI Engineering, what are some heuristics that you would recommend when choosing an inference engine?
- Once a model (or set of models) is in production and serving traffic it's necessary to have visibility into how it is performing. What are the key metrics that are necessary to monitor for generative AI systems?
- In the event that one (or more) metrics are trending negatively, what are the levers that teams can pull to improve them?
- When running models constructed with e.g. linear regression or deep learning there was a common issue with "concept drift". How does that manifest in the context of large language models, particularly when coupled with performance optimization?
- What are the most interesting, innovative, or unexpected ways that you have seen teams manage the serving of open gen AI models?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with generative AI model serving?
- When is Baseten the wrong choice?
- What are the future trends and technology investments that you are focused on in the space of AI model serving?
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- Baseten
- Copyleft
- Llama Models
- Nomic
- Olmo
- Allen Institute for AI
- Playground 2
- The Peace Dividend Of The SaaS Wars
- Vercel
- Netlify
- RAG == Retrieval Augmented Generation
- Compound AI
- LangChain
- Outlines: structured output for AI systems
- Truss
- Chains
- LlamaIndex
- Ray
- MLflow
- Cog (Replicate): containers for ML
- BentoML
- Django
- WSGI
- uWSGI
- Gunicorn
- Zapier
- vLLM
- TensorRT-LLM
- TensorRT
- Quantization
- LoRA: Low-Rank Adaptation of Large Language Models
- Pruning
- Distillation
- Grafana
- Speculative Decoding
- Groq
- Runpod
- Lambda Labs