FlashAttention

Update: 2025-03-05

Description

FlashAttention is an IO-aware exact attention algorithm designed to be fast and memory-efficient, especially for long sequences. Its core idea is tiling: the query, key, and value matrices are split into blocks that are processed within fast on-chip SRAM, which sharply reduces reads and writes to the slower high-bandwidth memory (HBM). Standard attention, by contrast, materializes the full attention matrix in HBM. By minimizing HBM access and recomputing the attention matrix in the backward pass instead of storing it, FlashAttention speeds up Transformer training and keeps memory usage linear in sequence length, outperforming many approximate attention methods that overlook memory-access costs.
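
The sketch below illustrates the tiling idea in plain NumPy: each query block is held "on chip" while key/value blocks stream past, and an online-softmax rescaling keeps the result exact without ever forming the full attention matrix. Block sizes, the function name, and the single-head layout are illustrative assumptions, not the paper's fused CUDA kernel.

```python
# Minimal NumPy sketch of block-wise (tiled) attention with online softmax,
# the trick underlying FlashAttention. Illustrative only; the real kernel
# fuses these loops on the GPU and also handles the backward pass.
import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_k=64):
    """Compute softmax(Q K^T / sqrt(d)) V one tile at a time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)

    for qs in range(0, n, block_q):
        q = Q[qs:qs + block_q]                 # query tile kept in fast memory
        m = np.full(q.shape[0], -np.inf)       # running row-wise max
        l = np.zeros(q.shape[0])               # running softmax denominator
        acc = np.zeros_like(q)                 # running (unnormalized) output

        for ks in range(0, n, block_k):
            k = K[ks:ks + block_k]
            v = V[ks:ks + block_k]
            s = (q @ k.T) * scale              # scores for this tile only

            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])     # tile softmax numerator
            correction = np.exp(m - m_new)     # rescale earlier partial sums
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ v
            m = m_new

        O[qs:qs + block_q] = acc / l[:, None]
    return O

# Sanity check against standard attention that materializes the full matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

Because each tile's softmax statistics are corrected as new key/value blocks arrive, the output matches standard attention exactly while only O(block) scores exist at any time.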
