RL's Razor: Why Online RL Forgets Less

Update: 2025-09-07

Description

This paper explores why **Reinforcement Learning (RL) fine-tuning leads to less catastrophic forgetting** in models compared to **Supervised Fine-Tuning (SFT)**, even when both achieve similar performance on new tasks. The authors introduce **"RL's Razor,"** a principle stating that **RL is implicitly biased towards solutions that cause minimal change (KL divergence) from the original model's policy** when learning new tasks. Empirical and theoretical evidence supports this, demonstrating that **KL divergence on the new task is a strong predictor of forgetting**, regardless of the training algorithm. The core reason for RL's advantage is its **on-policy training**, which samples from the model's current distribution and reweights those samples, leading to more conservative and KL-minimal updates compared to SFT's reliance on fixed external annotations.
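The paper's central measurement is the KL divergence between the fine-tuned policy and the original model's policy, evaluated on the new task's distribution. Below is a minimal, illustrative sketch (not the paper's code) of how one might estimate that quantity for a causal language model; the model identifiers, the KL direction, and the `new_task_prompts` list are placeholders chosen for illustration.

```python
# Illustrative sketch: estimate the average per-token KL divergence between a
# fine-tuned policy and its base policy on new-task prompts, the quantity the
# paper reports as a strong predictor of forgetting. Model ids are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("base-model")         # placeholder id
tuned = AutoModelForCausalLM.from_pretrained("fine-tuned-model")  # placeholder id
tok = AutoTokenizer.from_pretrained("base-model")                 # placeholder id

@torch.no_grad()
def mean_token_kl(prompts):
    """Average KL(pi_tuned || pi_base) per token over the given prompts."""
    total_kl, total_tokens = 0.0, 0
    for text in prompts:
        ids = tok(text, return_tensors="pt").input_ids
        logp_tuned = F.log_softmax(tuned(ids).logits, dim=-1)
        logp_base = F.log_softmax(base(ids).logits, dim=-1)
        # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), summed over the vocabulary
        kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(-1)
        total_kl += kl.sum().item()
        total_tokens += kl.numel()
    return total_kl / total_tokens

new_task_prompts = ["..."]  # held-out prompts from the new task (placeholder)
print("mean per-token KL on new task:", mean_token_kl(new_task_prompts))
```

Under RL's Razor, two fine-tuning runs with similar new-task accuracy can differ sharply in this KL score, and the lower-KL run (typically the on-policy RL one) is the one that forgets less.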

Enoch H. Kang