Ep168: Scaling Agentic Workloads: Why Reliable Infrastructure is Non-Negotiable for Enterprise AI by Anyscale
Update: 2025-11-07
Description
** AWS re:Invent 2025 Dec 1-5, Las Vegas - Register Here! **
Learn how Anyscale's Ray platform enables companies like Instacart to supercharge their model training, and how Amazon achieves significant savings by shifting multimodal workloads to Ray.
Topics Include:
- Ray originated at UC Berkeley when PhD students spent more time building clusters than ML models
- Anyscale now launches 1 million clusters monthly, with Ray contributions from OpenAI, Uber, Google, and Coinbase
- Instacart achieved a 10-100x increase in model training data using Ray's scaling capabilities
- ML evolved from single-node Pandas/NumPy to distributed Spark, now Ray for multimodal data
- Ray Core transforms simple Python functions into distributed tasks across massive compute clusters (see the Ray Core sketch after this list)
- Higher-level Ray libraries simplify data processing, model training, hyperparameter tuning, and model serving
- Anyscale platform adds production features: auto-restart, logging, observability, and zone-aware scheduling
- Unlike Spark's CPU-only approach, Ray handles both CPUs and GPUs for multimodal workloads
- Ray enables LLM post-training and fine-tuning using reinforcement learning on enterprise data
- Multi-agent systems can scale automatically, with Ray Serve handling thousands of requests per second (see the Ray Serve sketch after this list)
- Anyscale leverages AWS infrastructure while keeping customer data within customers' own VPCs
- Ray supports EC2, EKS, and HyperPod with features like fractional GPU usage and auto-scaling
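A minimal sketch of the Ray Core pattern described above, assuming a toy workload (the preprocess and embed functions are illustrative, not from the episode); the num_gpus=0.5 request also shows the fractional GPU usage noted in the last bullet:

    import ray

    ray.init()  # connect to an existing Ray cluster, or start a local one

    @ray.remote(num_cpus=1)
    def preprocess(batch):
        # an ordinary Python function, now schedulable anywhere in the cluster
        return [x * 2 for x in batch]

    @ray.remote(num_gpus=0.5)
    def embed(batch):
        # fractional GPU request: two of these tasks can share a single GPU
        return batch

    futures = [preprocess.remote(list(range(8))) for _ in range(4)]
    print(ray.get(futures))  # blocks until the distributed tasks complete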
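Similarly, a minimal Ray Serve sketch for the multi-agent bullet (the AgentEndpoint class and autoscaling bounds are illustrative assumptions): the deployment is replicated within the configured bounds and requests are load-balanced across replicas.

    from ray import serve

    @serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 16})
    class AgentEndpoint:
        # each replica can hold its own agent state or model client
        async def __call__(self, request):
            payload = await request.json()
            return {"reply": f"agent handled: {payload}"}

    serve.run(AgentEndpoint.bind())  # serves HTTP traffic across autoscaled replicas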
Participants:
- Sharath Cholleti – Member of Technical Staff, Anyscale
See how Amazon Web Services gives you the freedom to migrate, innovate, and scale your software company at https://aws.amazon.com/isv/