FlashAttention-3

Update: 2025-03-07

Description

FlashAttention-3 accelerates attention on NVIDIA Hopper GPUs through three key innovations. First, it exploits producer-consumer asynchrony by splitting warps into producer roles (data loading with the Tensor Memory Accelerator, TMA) and consumer roles (computation on asynchronous Tensor Cores), so that data movement overlaps with computation. Second, it hides softmax latency by interleaving softmax operations with asynchronous GEMMs, using pingpong scheduling and intra-warpgroup pipelining. Third, it uses hardware-accelerated low-precision FP8 GEMM, applying block quantization and incoherent processing to raise throughput while limiting accuracy loss.
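To make the block quantization idea concrete, here is a minimal pure-Python sketch. It is not FlashAttention-3's kernel code: the rounding function is a simplified stand-in for the FP8 e4m3 format, the data is a hypothetical 1-D list rather than attention tiles, and every name in it is an assumption made for illustration. It compares one scale for the whole tensor against one scale per block when a single outlier is present.

```python
import math

FP8_MAX = 448.0   # largest finite value in the e4m3 format
MIN_EXP = -5      # frexp exponent of the smallest normal e4m3 value, 2**-6


def round_fp8_like(x):
    """Round x to a simplified e4m3-style grid (3 mantissa bits, subnormals, clamping)."""
    x = max(-FP8_MAX, min(FP8_MAX, x))
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)      # |x| lies in [2**(e-1), 2**e)
    e = max(e, MIN_EXP)       # below the normal range the grid spacing stops shrinking
    step = 2.0 ** (e - 4)     # 8 grid steps per power of two
    return round(x / step) * step


def quantize_dequantize(values, block_size):
    """Scale each block so its largest magnitude maps to FP8_MAX, round, then rescale."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / FP8_MAX if amax > 0 else 1.0
        out.extend(round_fp8_like(v / scale) * scale for v in block)
    return out


def max_abs_error(values, approx, indices):
    """Largest absolute error over the given positions."""
    return max(abs(values[i] - approx[i]) for i in indices)


# Hypothetical data: four small values, then a block containing one large outlier.
values = [0.001, 0.002, 0.003, 0.004, 0.5, -0.25, 1000.0, 0.75]
small = range(4)

per_tensor = quantize_dequantize(values, block_size=len(values))  # one shared scale
per_block = quantize_dequantize(values, block_size=4)             # one scale per block

err_tensor = max_abs_error(values, per_tensor, small)
err_block = max_abs_error(values, per_block, small)
print(f"max error on small values, per-tensor scale: {err_tensor:.6f}")
print(f"max error on small values, per-block scale:  {err_block:.6f}")
```

With a single shared scale, the outlier pushes the small values toward the bottom of the representable range, where some of them round to zero. With per-block scales, the outlier only affects its own block, which is the effect block quantization relies on.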
