Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // Bernie Wu // #270

Update: 2024-10-22

Description

Bernie Wu is VP of Business Development for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft.

Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // MLOps Podcast #270 with Bernie Wu, VP Strategic Partnerships/Business Development of MemVerge.

// Abstract
Limited memory capacity hinders the performance and potential of research and production environments that use Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint.

We will highlight some recent work we’ve done in integrating this novel class of memory into LLM/RAG/vector database frameworks and workflows.

Disaggregated shared memory is envisioned to offer high-performance, low-latency caches for model/pipeline checkpoints of LLMs, KV caches during distributed inferencing, LoRA adapters, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months.

// Bio
Bernie is VP of Strategic Partnerships/Business Development for MemVerge. His focus has been building partnerships in the AI/ML, Kubernetes, and CXL memory ecosystems. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA.

// MLOps Swag/Merch
https://mlops-community.myshopify.com/

// Related Links
Website: www.memverge.com
Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL: https://memverge.com/accelerating-data-retrieval-in-rag-pipelines-using-cxl/
Do Re MI for Training Metrics: Start at the Beginning // Todd Underwood // AIQCON: https://youtu.be/DxyOlRdCofo
Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast #228: https://youtu.be/6MY-IgqiTpg

Compute Express Link (CXL) FPGA IP: https://www.intel.com/content/www/us/en/products/details/fpga/intellectual-property/interface-protocols/cxl-ip.html

Ultra Ethernet Consortium: https://ultraethernet.org/

Unified Acceleration (UXL) Foundation: https://www.intel.com/content/www/us/en/developer/articles/news/unified-acceleration-uxl-foundation.html

RoCE networks for distributed AI training at scale: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/

--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/

Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Bernie on LinkedIn: https://www.linkedin.com/in/berniewu/

Timestamps:
[00:00] Bernie's preferred coffee
[00:11] Takeaways
[01:37] First principles thinking focus
[05:02] Memory Abundance Concept Discussion
[06:45] Managing load spikes
[09:38] GPU checkpointing challenges
[16:29] Distributed memory problem solving
[18:27] Composable and Virtual Memory
[21:49] Interactive chat annotation
[23:46] Memory elasticity in AI
[27:33] GPU networking tests
[29:12] GPU Scheduling workflow optimization
[32:18] Kubernetes Extensions and Tools
[37:14] GPU bottleneck analysis
[42:04] Economical memory strategies
[45:14] Elastic memory management strategies
[47:57] Problem solving approach
[50:15] AI infrastructure elasticity evolution
[52:33] RDMA and RoCE explained
[54:14] Wrap up

Demetrios Brinkmann
