LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Update: 2025-09-04

Description

🤗 Upvotes: 63 | cs.CV, cs.LG

Authors:

Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang

Title:

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Arxiv:

http://arxiv.org/abs/2509.00676v1

Abstract:

In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
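The abstract's central step, turning preference-labeled critic data into a verifiable training signal, can be illustrated with a minimal sketch. The abstract does not give the prompt template, answer format, or reward definition, so everything below is an assumption made for illustration: the `PreferencePair` record, the `\boxed{A}`/`\boxed{B}` answer convention, and the binary reward are hypothetical, and the image input is omitted for brevity.

```python
import re
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical record from a preference-labeled critic dataset."""
    question: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B", taken from the dataset's preference label


def to_verifiable_example(pair: PreferencePair) -> dict:
    """Recast a preference pair as a prompt with a checkable ground-truth answer."""
    prompt = (
        "Question: " + pair.question + "\n"
        "Response A: " + pair.response_a + "\n"
        "Response B: " + pair.response_b + "\n"
        "Reason about which response is better, then finish with "
        "\\boxed{A} or \\boxed{B}."
    )
    return {"prompt": prompt, "answer": pair.preferred}


# Assumed answer format: the final \boxed{A} or \boxed{B} in the model output.
_ANSWER_RE = re.compile(r"\\boxed\{\s*([AB])\s*\}")


def preference_reward(model_output: str, answer: str) -> float:
    """Return 1.0 if the last boxed choice matches the preference label, else 0.0."""
    matches = _ANSWER_RE.findall(model_output)
    if not matches:
        return 0.0  # no parseable judgment, so no reward
    return 1.0 if matches[-1] == answer else 0.0


example = to_verifiable_example(
    PreferencePair(
        question="How many dogs are in the image?",
        response_a="Two dogs.",
        response_b="Three dogs.",
        preferred="A",
    )
)
```

Because the reward depends only on whether the final judgment matches the stored label, it can be checked automatically, which is what makes such data usable as a reinforcement learning signal for a generative model.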




Jingwen Liang, Gengyu Wang
