NeurIPS 2025: Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Updated: 2025-11-29

Description

This research examines the data efficiency of Reinforcement Learning with Verifiable Rewards (RLVR) when applied to large language models for mathematical reasoning tasks. The paper's most significant finding is the success of 1-shot RLVR: performance comparable to training on a full dataset can be achieved with just a single, carefully selected example. This result suggests that RLVR is effective primarily because it activates the strong latent reasoning capabilities already present in the base model, rather than imparting new domain knowledge. An interesting phenomenon observed during training is "post-saturation generalization," where the model's test performance continues to rise long after training accuracy has saturated and the model has begun overfitting the single example. Ablation studies indicate that while the policy gradient loss is the main source of improvement, the entropy loss is essential for encouraging the exploration needed to realize this long-term generalization.
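To make the two loss terms concrete, here is a minimal sketch in PyTorch of an objective combining a policy gradient term with an entropy bonus, assuming GRPO-style group-normalized advantages over rollouts sampled from the single training example. The function name, tensor shapes, and entropy_coef value are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def one_shot_rlvr_loss(logits, actions, rewards, entropy_coef=0.01):
    # logits:  [G, T, V] per-token logits for G rollouts of length T
    #          over a vocabulary of size V, all sampled from the one example
    # actions: [G, T]    sampled token ids for each rollout
    # rewards: [G]       verifiable 0/1 reward per rollout
    log_probs = F.log_softmax(logits, dim=-1)

    # Log-probability of each sampled token under the current policy.
    action_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Group-normalized advantages (GRPO-style, an assumption here):
    # compare each rollout's reward to the mean over the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Policy gradient term: reinforce tokens from above-average rollouts.
    pg_loss = -(advantages.unsqueeze(-1) * action_logp).mean()

    # Entropy bonus: keeps the policy stochastic so it keeps exploring
    # even after it can already solve the single training example.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()

    return pg_loss - entropy_coef * entropy

# Toy usage with random tensors (G=8 rollouts, T=16 tokens, V=100 vocab):
logits = torch.randn(8, 16, 100, requires_grad=True)
actions = torch.randint(0, 100, (8, 16))
rewards = torch.randint(0, 2, (8,)).float()
one_shot_rlvr_loss(logits, actions, rewards).backward()

The paper's ablations only establish that the policy gradient term drives the accuracy gains while the entropy term sustains exploration; the specific normalization and coefficient above are placeholder choices for the sketch.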


Source:

https://openreview.net/pdf?id=IBrRNLr6JA
