Language Models Can Learn from Verbal Feedback Without Scalar Rewards

Update: 2025-09-30

Description

🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG



Authors:

Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang



Title:

Language Models Can Learn from Verbal Feedback Without Scalar Rewards



Arxiv:

http://arxiv.org/abs/2509.22638v1



Abstract:

LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
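
The abstract describes FCP at a high level: maximum-likelihood training of a policy conditioned on verbal feedback, followed by online bootstrapping in which the policy generates under positive feedback conditions. The sketch below is one minimal, illustrative reading of that idea, not the authors' implementation (see their repository for that): the [FEEDBACK]/[PROMPT]/[RESPONSE] formatting, the gpt2 placeholder model, and the toy example are assumptions; it simply prepends the feedback string as a conditioning prefix and computes cross-entropy loss on the response tokens only.

# Minimal sketch of feedback-conditional maximum-likelihood training (FCP-style).
# Assumptions (not from the paper): feedback is prepended to the prompt as a plain
# text conditioning prefix, the model/tokenizer and the tiny dataset are placeholders,
# and loss is computed on response tokens only.
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Offline (prompt, response, verbal feedback) triples -- hypothetical examples.
data = [
    {"prompt": "Explain overfitting.",
     "response": "Overfitting is when a model memorizes training noise ...",
     "feedback": "Clear and correct, but could use an example."},
]

def make_batch(examples):
    input_ids, labels = [], []
    for ex in examples:
        # Condition on feedback by prepending it to the prompt.
        prefix = f"[FEEDBACK] {ex['feedback']}\n[PROMPT] {ex['prompt']}\n[RESPONSE] "
        prefix_ids = tok(prefix, return_tensors="pt").input_ids[0]
        resp_ids = tok(ex["response"] + tok.eos_token, return_tensors="pt").input_ids[0]
        ids = torch.cat([prefix_ids, resp_ids])
        # Mask the conditioning prefix so the loss covers response tokens only.
        lbl = torch.cat([torch.full_like(prefix_ids, -100), resp_ids])
        input_ids.append(ids)
        labels.append(lbl)
    return (pad_sequence(input_ids, batch_first=True, padding_value=tok.pad_token_id),
            pad_sequence(labels, batch_first=True, padding_value=-100))

ids, lbls = make_batch(data)
loss = model(input_ids=ids, labels=lbls).loss  # maximum-likelihood (cross-entropy) step
loss.backward()
optim.step()

# At generation time, condition on a desired positive feedback string and sample.
gen_prefix = "[FEEDBACK] Excellent, thorough answer.\n[PROMPT] Explain overfitting.\n[RESPONSE] "
out = model.generate(**tok(gen_prefix, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

In the online bootstrapping stage the abstract describes, responses generated under such a positive-feedback condition would receive fresh verbal feedback, and the resulting response-feedback pairs could feed another round of the same maximum-likelihood step.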

Hosts: Jingwen Liang, Gengyu Wang