More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Update: 2025-10-02

Description

🤗 Upvotes: 29 | cs.CV, cs.AI

Authors:

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang

Title:

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Arxiv:

http://arxiv.org/abs/2509.25848v1

Abstract:

Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
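
For readers unfamiliar with GRPO, the sketch below illustrates its core idea of scoring each sampled response relative to the other responses drawn for the same prompt. It is a minimal illustration, not the paper's training code: the function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made here. It does not implement VAPO, whose details are not given in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. Using the population std and adding `eps` are illustrative
    choices; implementations differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses better than the group average get a positive advantage,
    # worse ones a negative advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 for a correct answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.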

Jingwen Liang, Gengyu Wang
