
Accelerating Large Language Model Decoding with Speculative Sampling


Update: 2026-02-26


Description

The DeepMind paper "Accelerating Large Language Model Decoding with Speculative Sampling" (February 3, 2023) introduces speculative sampling, an algorithm that speeds up Large Language Model (LLM) decoding without changing the distribution of the generated text. A smaller, faster draft model proposes a short continuation of several tokens, which the larger target model then scores in parallel. A modified rejection sampling scheme decides which drafted tokens to keep and ensures that the output follows exactly the same distribution as sampling from the target model alone. Benchmarked with Chinchilla, a 70 billion parameter model, the technique achieved a 2 to 2.5 times decoding speedup in a distributed setup. The method works because standard autoregressive generation is limited by memory bandwidth rather than compute, so scoring a short continuation in parallel has latency similar to sampling a single token from the target model. The result is a practical way to reduce latency in large-scale LLM serving without sacrificing sample quality or modifying the target model.
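For listeners who prefer code, the draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the "models" here are hypothetical toy functions (`toy_draft`, `toy_target`) that return a fixed next-token distribution over a three-token vocabulary, and the target model's parallel scoring pass is simulated with an ordinary loop. The accept test and the resampling from the normalised positive part of (target minus draft) follow the modified rejection sampling rule.

```python
import random


def sample(probs, rng):
    """Draw a token id from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def residual(target, draft):
    """Normalised max(0, target - draft): the distribution used after a rejection."""
    diff = [max(0.0, t - d) for t, d in zip(target, draft)]
    total = sum(diff)
    return [x / total for x in diff]


def speculative_step(prefix, draft_model, target_model, k, rng):
    """Run one draft-then-verify round and return between 1 and k + 1 new tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    drafted, draft_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        p = draft_model(ctx)
        tok = sample(p, rng)
        drafted.append(tok)
        draft_dists.append(p)
        ctx.append(tok)

    # 2. The target model scores all k + 1 positions.
    #    (A real system does this in one parallel forward pass; here it is a loop.)
    target_dists = [target_model(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept each drafted token with probability min(1, target / draft).
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = draft_dists[i], target_dists[i]
        if rng.random() < min(1.0, q[tok] / p[tok]):
            accepted.append(tok)
        else:
            # First rejection: resample from the residual distribution and stop.
            accepted.append(sample(residual(q, p), rng))
            return accepted

    # 4. Every draft was accepted, so one extra token comes from the target for free.
    accepted.append(sample(target_dists[k], rng))
    return accepted


# Hypothetical toy models: context-independent distributions over 3 tokens.
def toy_target(ctx):
    return [0.7, 0.2, 0.1]


def toy_draft(ctx):
    return [0.2, 0.5, 0.3]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = []
    while len(tokens) < 12:
        tokens += speculative_step(tokens, toy_draft, toy_target, k=4, rng=rng)
    print(tokens)
```

Even though the toy draft model disagrees strongly with the toy target, the tokens that come out are distributed according to the target: a drafted token x survives with probability min(draft(x), target(x)), and the residual resampling supplies exactly the missing mass max(0, target(x) - draft(x)). A closer draft model only changes how many tokens are accepted per round, which is where the speedup comes from.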

Source:

February 3 2023

Accelerating Large Language Model Decoding with Speculative Sampling

DeepMind

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper

https://arxiv.org/pdf/2302.01318


mcgrof