NeurIPS 2025: MoBA: Mixture of Block Attention for Long-Context LLMs

Update: 2025-11-29

Description

This paper introduces Mixture of Block Attention (MoBA) to address the quadratic computational cost of standard attention when scaling large language models (LLMs) to long contexts. MoBA applies the Mixture of Experts (MoE) principle to the attention mechanism itself: instead of attending to the entire sequence, it partitions the context into blocks and uses a dynamic gating mechanism to route each query to only the most relevant blocks of keys and values. This block-sparse approach achieves sub-quadratic complexity, with reported speedups of up to 16 times when processing sequences of up to 10 million tokens. The paper reports that MoBA maintains performance comparable to full attention in scaling-law experiments and on real-world benchmarks. The architecture also allows switching between sparse MoBA and full attention layers during both training and inference.
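The routing idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation. It makes several assumptions that the description does not spell out: each key block is summarized by the mean of its keys, the gate score is the dot product of the query with that summary, each query always keeps its own block, and causality is enforced at both block and token level. The function name `moba_attention` and the parameters `block_size` and `top_k` are hypothetical. For readability it builds a dense mask, so it shows which keys each query attends to but does not realize the efficiency gain.

```python
import numpy as np

def moba_attention(q, k, v, block_size, top_k):
    """Single-head causal block-sparse attention (illustrative sketch only).

    q, k, v: float arrays of shape (n, d); n must be divisible by block_size.
    """
    n, d = q.shape
    assert n % block_size == 0, "n must be a multiple of block_size"
    num_blocks = n // block_size

    # Assumption: summarize each key block by the mean of its keys.
    block_keys = k.reshape(num_blocks, block_size, d).mean(axis=1)  # (B, d)
    gate = q @ block_keys.T                                         # (n, B)

    pos = np.arange(n)
    own_block = pos // block_size
    block_ids = np.arange(num_blocks)
    not_future = block_ids[None, :] <= own_block[:, None]           # (n, B)

    # Future blocks can never win the gate; the query's own block always does.
    gate = np.where(not_future, gate, -np.inf)
    gate[pos, own_block] = np.inf

    # Select the top_k highest-scoring blocks for every query.
    k_eff = min(top_k, num_blocks)
    top = np.argpartition(-gate, k_eff - 1, axis=1)[:, :k_eff]
    selected = np.zeros((n, num_blocks), dtype=bool)
    selected[pos[:, None], top] = True
    # Early queries may have filled spare slots with future blocks; drop them.
    selected &= not_future

    # Expand the block mask to token level and apply the causal mask.
    token_mask = np.repeat(selected, block_size, axis=1)            # (n, n)
    token_mask &= pos[None, :] <= pos[:, None]

    # Ordinary softmax attention restricted to the selected keys.
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(token_mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

In this sketch, setting `top_k` to the number of blocks recovers ordinary causal attention, which gives one way to picture the switch between sparse and full attention mentioned above. A practical implementation would gather only the selected blocks instead of masking a full score matrix.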

Source: https://openreview.net/pdf?id=RlqYCpTu1P

mcgrof
