vLLM

Update: 2025-05-04

Description

vLLM is a high-throughput serving system for large language models. Existing serving systems manage key-value (KV) cache memory inefficiently: fragmentation and the lack of sharing across requests waste memory, which limits batch size. vLLM uses PagedAttention, an attention algorithm inspired by virtual memory and paging in operating systems, to store the KV cache in fixed-size, non-contiguous blocks. This minimizes memory waste and enables flexible sharing of blocks within and across requests, so vLLM can batch significantly more requests. As a result, vLLM achieves 2-4x higher throughput at the same level of latency compared to state-of-the-art systems such as FasterTransformer and Orca.
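
To make the paging idea concrete, here is a minimal sketch of the bookkeeping only, not vLLM's actual implementation. It assumes a hypothetical block size of 4 tokens, and the names `BlockAllocator`, `Sequence`, `append_token`, and `fork` are invented for illustration. Each sequence keeps a block table that maps its logical blocks to physical blocks located anywhere in the pool, and forked sequences share blocks through reference counts until one of them writes to a shared block.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value for illustration)


class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks reference counts."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """Maps each logical block index of a sequence to a physical block id."""

    allocator: BlockAllocator
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # The last block is full (or there is none): take any free block.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the partially filled block is shared, so this
                # sequence switches to a private block before writing to it.
                # (A real system would also copy the cached tensors.)
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # The child reuses every existing block instead of duplicating it.
        for block in self.block_table:
            self.allocator.share(block)
        return Sequence(self.allocator, list(self.block_table), self.num_tokens)

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
prompt = Sequence(allocator)
for _ in range(6):
    prompt.append_token()      # 6 tokens occupy 2 blocks, not a max-length slab
sample = prompt.fork()         # e.g. a second sample decoded from the same prompt
sample.append_token()          # triggers copy-on-write for the shared last block
```

In this toy model a sequence wastes at most one partially filled block, and the first block of the prompt stays shared between `prompt` and `sample`. Those are the two effects the description credits for the larger batch sizes.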

In Channel

Kimi K2 (2025-07-22 15:30)
MeanFlow (2025-07-10 06:47)
Mamba (2025-07-10 08:14)
LLM Alignment (2025-06-14 20:06)
Why We Think (2025-05-20 14:20)
Deep Research (2025-05-12 11:35)
vLLM (2025-05-04 13:06)
DeepSeek-Prover-V2 (2025-05-01 11:04)
DeepSeek-Prover (2025-05-01 08:37)
Agent AI Overview (2025-03-17 21:06)
FlashAttention-3 (2025-03-07 13:43)
FlashAttention-2 (2025-03-05 10:50)
FlashAttention (2025-03-05 10:55)

AI-Talk