When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance

Update: 2025-10-01

Description

🤗 Upvotes: 29 | cs.CL

Authors:

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo

Title:

When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance

Arxiv:

http://arxiv.org/abs/2509.22193v1

Abstract:

Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite its empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes, on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.

Jingwen Liang, Gengyu Wang
