NeurIPS 2025: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Description
The paper systematically investigates the effects of integrating various gating mechanisms into standard softmax attention, comparing over thirty configurations across dense and Mixture-of-Experts (MoE) large language models. The central finding is that applying an elementwise, head-specific sigmoid gate immediately after the Scaled Dot-Product Attention (SDPA) output consistently yields the largest performance gains. This gating variant also improves training stability, allowing models to converge under larger learning rates and suppressing loss spikes during optimization. The authors attribute the improvement to two factors: the gate introduces non-linearity into the otherwise low-rank attention mapping, and it produces input-dependent, sparse gating scores. Crucially, this sparsity normalizes attention dynamics and eliminates the 'attention sink' phenomenon, in which early tokens dominate attention scores, enabling markedly better long-context extrapolation. These benefits led to the adoption of this gated attention design in the forthcoming Qwen3-Next models.
Source:
https://openreview.net/pdf?id=1b7whO4SfY
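
Below is a minimal PyTorch sketch of the gating variant described above: an elementwise, head-specific sigmoid gate, computed from the layer input and applied to the SDPA output before the output projection. The module and parameter names (GatedAttention, gate_proj, etc.) are illustrative assumptions, not taken from the paper's released code, and details such as biases, normalization, and masking may differ from the actual Qwen3-Next implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedAttention(nn.Module):
    """Multi-head attention with an elementwise sigmoid gate on the SDPA output (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Gate is computed from the same layer input; one value per
        # head-dimension element, so gating is elementwise and head-specific.
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Split projections into heads: (batch, heads, seq, d_head).
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Standard causal scaled dot-product attention.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Input-dependent sigmoid gate applied to the SDPA output *before*
        # the output projection -- the placement the description identifies
        # as most effective.
        g = torch.sigmoid(self.gate_proj(x)).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        y = y * g
        y = y.transpose(1, 2).contiguous().view(b, t, d)
        return self.o_proj(y)


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    out = GatedAttention(d_model=256, n_heads=8)(x)
    print(out.shape)  # torch.Size([2, 16, 256])
```

Because the gate is a function of the layer input rather than the attention scores, it can drive individual output elements toward zero, which is how the paper explains the observed sparsity and the removal of the attention sink.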




