Next in AI: Your Daily News Podcast

Perplexity MoE Deployment Deep Dive: The Custom Kernels and Network Secrets That Make Massive AI Models Run 5X Faster

Update: 2025-11-06

Description

The episode describes the development of high-performance, portable communication kernels designed for the demanding sparse expert-parallelism (EP) communication phases, Dispatch and Combine, of large-scale Mixture-of-Experts (MoE) models such as DeepSeek R1 and Kimi-K2. An initial open-source NVSHMEM-based library achieved performance up to 10x faster than standard All-to-All communication, using GPU-initiated communication (IBGDA) and a split-kernel architecture for computation-communication overlap that delivered 2.5x lower latency in single-node deployments. Specialized hybrid CPU-GPU kernels were then developed to reach state-of-the-art latencies for inter-node deployments over ConnectX-7 and AWS Elastic Fabric Adapter (EFA), which is crucial for serving trillion-parameter models. This multi-node approach uses high EP degrees to reduce per-GPU memory bandwidth pressure, allowing MoE models to achieve higher throughput and lower latency simultaneously across various configurations, an effect often contrary to dense-model scaling.
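
For context, below is a minimal sketch of the Dispatch/Combine pattern the episode refers to, written against PyTorch's standard all_to_all_single collective, i.e. the baseline that the custom NVSHMEM- and EFA-based kernels are built to outperform. The function name, the top_k/num_experts parameters, and the expert_fn callback are illustrative assumptions for this sketch, not Perplexity's actual API or configuration.

# Conceptual sketch of MoE expert-parallel Dispatch/Combine using the standard
# all_to_all collective. Assumes torch.distributed is already initialized with
# one expert group per rank (the EP degree). Names and shapes are illustrative.
import torch
import torch.distributed as dist

def moe_dispatch_combine(tokens, router_logits, num_experts, top_k, expert_fn):
    """tokens: [num_tokens, hidden]; router_logits: [num_tokens, num_experts]."""
    world_size = dist.get_world_size()
    experts_per_rank = num_experts // world_size

    # 1. Routing: pick the top-k experts (and their weights) for each token.
    weights, expert_ids = torch.topk(router_logits.softmax(-1), top_k, dim=-1)

    # 2. Dispatch: bucket token copies by destination rank and exchange them.
    dest_rank = expert_ids // experts_per_rank            # [num_tokens, top_k]
    flat_dest = dest_rank.flatten()
    flat_tokens = tokens.repeat_interleave(top_k, dim=0)
    order = torch.argsort(flat_dest)                      # group copies by destination
    send_counts = torch.bincount(flat_dest, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)      # exchange bucket sizes first

    recv_buf = flat_tokens.new_empty(int(recv_counts.sum()), tokens.size(-1))
    dist.all_to_all_single(
        recv_buf, flat_tokens[order],
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )

    # 3. Local expert computation on the tokens this rank received.
    expert_out = expert_fn(recv_buf)

    # 4. Combine: reverse the exchange and weight-sum the top-k expert outputs.
    back_buf = expert_out.new_empty(flat_tokens.size(0), tokens.size(-1))
    dist.all_to_all_single(
        back_buf, expert_out,
        output_split_sizes=send_counts.tolist(),
        input_split_sizes=recv_counts.tolist(),
    )
    unsorted = torch.empty_like(back_buf)
    unsorted[order] = back_buf                            # undo the dispatch-side sort
    combined = (unsorted.view(-1, top_k, tokens.size(-1))
                * weights.unsqueeze(-1)).sum(dim=1)
    return combined                                       # [num_tokens, hidden]

The kernels discussed in the episode replace these host-launched collectives with GPU-initiated transfers (IBGDA) and split kernels, so communication for one batch of tokens can overlap with expert computation instead of serializing behind it.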
